Re: de-dup process

From: jaromir nemec <jaromir_at_db-nemec.com>
Date: Wed, 13 Dec 2006 22:22:12 +0100 (CET)
Message-ID: <1726.81.189.73.191.1166044932.bloek@pwebmail.utanet.at>

Hi,

> (select rowid rid,
> vrow_number() over
> (partition by varchar_that_defines_duplicates
> order by rowid ) rn

a good thing on this approach is that you can control which of the duplicated rows will be discarded and which preserved. You may order not only by rowid but by any other attribute(s). (In case that dups are defined based on a key attribute there is no guarantee that all attributes in duplicated rows are identical).

A common approach by incremental loading is to check for dups in the newly loaded data and to eliminate them.

Optionally, if relevant, a second step is performed. In the second step the new data is checked against you base table. A big optimisation can be reached if there is some business rule saying that the dups must be timely coupled (i.e. they can appear only within a limited interval of time). This allowed to limit the check only to the some most recent partitions of your base table and it is not necessary to consider the whole base table. Provided that the base table is timely partitioned, of course.

Both step can be combined and performed in one pass.

Regards,

Jaromir

> --
> http://www.freelists.org/webpage/oracle-l

--
http://www.freelists.org/webpage/oracle-l

Received on Wed Dec 13 2006 - 15:22:12 CST