Re: finding duplicate records with typo's
From: Bob Badour <bbadour_at_pei.sympatico.ca>
Date: Sun, 05 Aug 2007 22:25:00 -0300
Message-ID: <46b67836$0$4038$9a566e8b_at_news.aliant.net>
tom wrote:
> hello,
>
> can someone tell me (or point me in the right direction) of what the
> right way of finding duplicates in dirty data (caused by typo's) ?
>
> is there something like a 'hashing' or 'rating' of text that will give
> you a number that you can compare ?
>
> for example
>
> hash( "hello") => 4323
> hash( "helo") => 4334
> hash("tree") => 7326
>
> i'm not sure what direction i should look in, this is just an idea
> that i had, but any idea's are very welcome.
>
> thanks,
> tom
>
If you are looking for duplicates, I assume you want to detect the similarity between "hello" and "helo". The function usually used for that is Soundex, which maps words that sound alike to the same short code.
Received on Mon Aug 06 2007 - 03:25:00 CEST
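For illustration, here is a minimal sketch of the Soundex idea in Python. It follows the common American Soundex rules as I understand them (keep the first letter, encode the remaining consonants as digits, collapse repeats, pad to four characters). The function name `soundex` and the table `CODES` are my own choices, not from any particular database or library, and built-in implementations may differ in edge cases.

```python
# Consonant groups and their Soundex digits; vowels, y, h and w get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex-style code, or "" if word has no a-z letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]  # pad with zeros, then truncate to 4 characters


print(soundex("hello"), soundex("helo"), soundex("tree"))
```

With this sketch "hello" and "helo" get the same code and "tree" gets a different one, so grouping records by the code surfaces candidate duplicates. Because the code is phonetic and always keeps the first letter, a typo in the first letter or one that changes the consonant pattern will not match, so treat the result as a list of candidates to review rather than confirmed duplicates.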