Re: finding duplicate records with typo's

From: Bob Badour <bbadour_at_pei.sympatico.ca>
Date: Sun, 05 Aug 2007 22:25:00 -0300
Message-ID: <46b67836$0$4038$9a566e8b_at_news.aliant.net>


tom wrote:

> hello,
>
> can someone tell me (or point me in the right direction) what the
> right way of finding duplicates in dirty data (caused by typos) is?
>
> is there something like a 'hashing' or 'rating' of text that will give
> you a number that you can compare ?
>
> for example
>
> hash( "hello") => 4323
> hash( "helo") => 4334
> hash("tree") => 7326
>
> i'm not sure what direction i should look in, this is just an idea
> that i had, but any ideas are very welcome.
>
> thanks,
> tom
>

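If you are looking for duplicates, I assume you want to note the similarity between "hello" and "helo". The function usually used for that is called Soundex: a phonetic hash that gives words that sound alike the same short code, so the codes can be compared directly.

Received on Mon Aug 06 2007 - 03:25:00 CEST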
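
To make the Soundex suggestion concrete, here is a minimal sketch of the classic American Soundex rules in Python. The function name `soundex` and the helper table `_CODES` are just names chosen for this illustration, not taken from any particular database or library, and many SQL products ship their own built-in version that may differ in edge cases.

```python
# Consonant groups of classic American Soundex.
# Vowels, h, w and y carry no code.
_CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Return a 4-character Soundex code, or "" if word has no ASCII letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()          # the first letter is kept as-is
    prev = _CODES.get(first, "")    # code of the previous significant letter
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:   # skip uncoded letters and repeated codes
            result += code
        if ch not in "hw":          # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]     # pad with zeros, truncate to 4 characters


# Group candidate duplicates by comparing codes.
for w in ("hello", "helo", "tree"):
    print(w, soundex(w))
```

With this sketch, "hello" and "helo" get the same code while "tree" gets a different one, so rows can be grouped on the code to find candidate duplicates. Soundex matches by sound, not by spelling, so a typo in the first letter or a transposition can still produce a different code; treat matches as candidates to review, not as confirmed duplicates.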
