Re: Non-duplicate test on LARGE table

From: DA Morgan <damorgan_at_psoug.org>
Date: Tue, 27 May 2008 09:24:31 -0700
Message-ID: <1211905484.642977@bubbleator.drizzle.com>


mattsteel wrote:
> Hello.
> My question is a little broad but maybe you already saw something like
> this ;-)
>
> Here is my scenario:
> Let us have a large table which contains over 20 million records and
> search for rows that are duplicated under some rule.
> The column involved in that comparison rule, named AREA, is declared
> VARCHAR2(4000) NOT NULL.
>
> Seeing that an usual index built on column AREA would not be so good,
> I tried to add a column, HASHVALUE, which always contains a suitable
> hash-value given by the standard "dbms_utility.get_hash_value"
> function.
>
> Now that I have a much smaller column to search through for
> duplicates, should I put an index on column HASHVALUE, do you think
> overall performance would be improved or not?
>
> Thank you in advance.
>
> Matt.

Let's see what we are missing:

1. Version information
2. DDL
3. DML
4. Explain Plan output

About the only thing I can say is that 20M rows is not that large and the answer to your question with respect to overall performance is complex and can only be solved through testing.

That said ... if all you want to do is find duplicates create a disabled unique constraint, use DBMS_ERRLOG to create an error table and try enabling it ... or do a simple query using GROUP BY with HAVING COUNT(*) > 1.

-- 
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan_at_x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org
Received on Tue May 27 2008 - 11:24:31 CDT

Original text of this message