Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Mailing Lists -> Oracle-L -> RE: de-dup process

RE: de-dup process

From: Kerber, Andrew <Andrew.Kerber_at_umb.com>
Date: Thu, 14 Dec 2006 07:31:37 -0600
Message-ID: <D40740337A3B524FA81DB598D2D7EBB304A93CD6@x6009a.umb.corp.umb.com>


This will probably work, but keep in mind that by adding a column you run the risk of breaking any applications that rely on the table structure to remain constant (ie, anything that relies on a select *.. ). Admittedly, it is bad programming practice to do this, but I have seen it done quite a bit.  

Andrew W. Kerber
Oracle DBA
UMB     "If at first you dont succeed, dont take up skydiving"

-----Original Message-----
From: oracle-l-bounce_at_freelists.org
[mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Tim Gorman Sent: Thursday, December 14, 2006 7:19 AM To: ebadi01_at_yahoo.com
Cc: mgogala_at_verizon.net; tboss_at_bossconsulting.com; oracle-l_at_freelists.org
Subject: Re: de-dup process  

When an UPDATE or DELETE won't scale (and it never will!), convert it into a parallel nologging direct-path INSERT.

Use a query similar to Tom's to insert only the first "copy" of a row (i.e. use "RN = 1" instead of "RN <> 1" the outermost WHERE clause) to INSERT the rows to be retained into a new table. Make the new table NOLOGGING, use APPEND and PARALLEL hints on the INSERT portion and PARALLEL hints on the SELECT portion of the INSERT ... SELECT statement.

Here's the cute part. If you can't simply RENAME the two tables (i.e. old and new) to swap them, then make the newly-created table look exactly like the original table, but make it range-partitioned on a dummy numeric column (call it DUMMY or PARTKEY), give this column a constant value of zero, and name the single partition PZERO (or PDUMMY or whatever). Also, add this DUMMY or PARTKEY column to the original table as well. Then, use "ALTER TABLE <new-table> EXCHANGE PARTITION PZERO WITH TABLE <original-table>" to swap the new data with the old data.

A direct-path parallel INSERT...SELECT will scale far better than any possible optimization on an UPDATE or DELETE statement. The comparison is not even close.

Hope this helps...

-Tim

A Ebadi wrote:

Biggest problem we've faced in coming up with a solution is none of the solutions so far scale. In other words, things are fine if we have a 20 million row table with 2-3 million duplicates - runs in 10-15 minutes. However, trying it for 100+ million row table - it runs for hrs!  

We've even had another tool (Informatica) select out the ROWIDs of the duplicates into a separate table then we are using PL/SQL cursor to delete those rows from the large table, but this doesn't scale either!  

I'm currently looking at some of the suggestions I got from this list and seeing if any of them will work with us.

Thanks for all who replied and I'll let you know how it goes!

Mladen Gogala <mgogala_at_verizon.net> <mailto:mgogala_at_verizon.net> wrote:         

	On 12/12/2006 08:42:38 PM, tboss_at_bossconsulting.com wrote:

> >From asktom, the best way I've found is to use Tom's little
code snippet below:
>
> delete from table your_huge_table
> where rowid in
> (select rid
> from
> (select rowid rid,
> row_number() over
> (partition by varchar_that_defines_duplicates
> order by rowid ) rn
> from your_huge_table
> )
> where rn <> 1
> )
> /
Good luck with that if cardinality is +60M rows. I would also add a condition like WHERE ROWID in (SELECT ROW_ID from EXCEPTIONS) to your query. You only need to populate exceptions table with duplicates and if those duplicates are a small percentage of the total number of records, your task will be done two order of magnitude faster then without the exceptions table. -- Mladen Gogala http://www.mladen-gogala.com -- http://www.freelists.org/webpage/oracle-l ________________________________

Everyone is raving about the all-new Yahoo! Mail beta. <http://us.rd.yahoo.com/evt=42297/*http:/advision.webevents.yahoo.com/ma ilbeta>


NOTICE: This electronic mail message and any attached files are confidential. The information is exclusively for the use of the individual or entity intended as the recipient. If you are not the intended recipient, any use, copying, printing, reviewing, retention, disclosure, distribution or forwarding of the message or any attached file is not authorized and is strictly prohibited. If you have received this electronic mail message in error, please advise the sender by reply electronic mail immediately and permanently delete the original transmission, any attachments and any copies of this message from your computer system. Thank you.
--
http://www.freelists.org/webpage/oracle-l
Received on Thu Dec 14 2006 - 07:31:37 CST

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US