Re: Selecting SIMILAR, not the same records (PROBABLE) duplicates

From: kroger <kroger_at_vp.pl>
Date: Wed, 6 Sep 2006 18:07:20 +0200
Message-ID: <edmrns$821$1@news.onet.pl>

> And if you have this what do you do?
>
> ID
> 1 aaa
> 2 aaa h
> 3 h
> 4 h aaa
>
> The request makes no business sense. Here's what I would suggest:

In that case (1,2) and (3,4) are candidate duplicates. It makes business sense, looking at a simple example (data entered by dumb or dumber user into name field):

-JohnSmith
-John
-John Smith Jr

In the app we are developing, we must be able to display that those three entries MAY refer to the same person...

>
> SELECT DISTINCT name
> FROM table;
>
> Spool the output, send it to the manager of the department, and ask
> them to sort it out.

This wouldn't make sense because
1. above
2. data changing approximately 5-20k records per day

>
> Alternatively ask for a list of legal values. Create a foreign key
> constraint and blow everything that doesn't match out of the table
> so they can sort it out.

The point is not to blow duplicates, but to pass the candidates to manual verification.

>
> This is not a programming problem ... this is a business problem
> and there really is no solution other than a subject matter expert
> looking over the garbage value by value.

I wouldn't agree.. In my particular case there is a (reasonable) business request, which I need to implement, with or without magic ;)

I solved the issue though. I still need to polish the code, however the basic idea is to choose ids of the records with the shortest name having candidate duplicates (in your sample records 1 and 3) and then I am able to display all others, with <pseudo-sql> name like(chosen_shortest_name_by_id||'%') </pseudo-sql>

> PS: rename your columns.

Those are just samples, the real table has ca. 15 columns with fine names..

Thanks anyway!
BR,
Kroger Received on Wed Sep 06 2006 - 11:07:20 CDT