Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
![]() |
![]() |
Home -> Community -> Usenet -> c.d.o.server -> Re: Detecting typos
CE wrote:
> CE wrote:
> > Hi,
> >
> > I'm working on a project which involves matching customers against
an
> > existing database using (amonst other things) address, date of
birth
> > and name.
> >
> > One problem I have to overcome is allowing for typographical
errors.
> > E.g. matching "CATHERINE" with "CSTHERINE". So what I though I'd
do
> is
> > have a little function to compare 2 strings and return the number
of
> > differences:
> >
> > CREATE OR REPLACE function string_compare
> > (string1 IN VARCHAR2,
> > string2 IN VARCHAR2) RETURN NUMBER IS
> > diffs BINARY_INTEGER := 0;
> > BEGIN
> > IF least(length(string1),length(string2)) > 0 THEN
> > FOR i IN 1..least(length(string1),length(string2)) LOOP
> > IF substr(string1,i,1) <> substr(string2,i,1) THEN
> > diffs := diffs + 1;
> > END IF;
> > END LOOP;
> > END IF;
> > RETURN diffs;
> > END;
> > /
> >
> > This works ok (I've got to watch out for comparing MARY and MARK
etc,
> > though), but doesn't handle a comparison where there might be a
> missing
> > letter (e.g. "CATHERINE" and "CTHERINE"). Can anybody think of
> > anything a little bit cleverer?
> >
> > Also if anyone could recommend a book on the rationale/logic behind
> > such customer matching/de-duplicating, I'd also be grateful.
> > Thanks
> >
> > CE
>
> For completeness sake, this is the function I knocked up in the end.
> It returns a 1 (as opposed to a 0) if it's an exact match of if there
> is a single typographical error which I've defined as one of
> 1) Missing or added character (e.g. CHARLIE and CHARLI)
> 2) One character incorrect (e.g. CHARLIE and CHARLIW)
> 3) Pair of characters transposed (e.g. CHARLIE and CHALRIE)
>
> CREATE OR REPLACE function is_typo
> (p_string1 IN VARCHAR2,
> p_string2 IN VARCHAR2) RETURN NUMBER IS
> v_differences BINARY_INTEGER := 0;
> v_string1 VARCHAR2(100) := p_string1;
> v_string2 VARCHAR2(100) := p_string2;
> v_is_typo NUMBER := 0;
> BEGIN
> IF v_string1 = v_string2 THEN
> v_is_typo := 1;
> ELSIF LENGTH(v_string1) = LENGTH(v_string2) THEN
> FOR i IN 1..LENGTH(v_string1) LOOP
> IF substr(v_string1,i,1) <> substr(v_string2,i,1) THEN
> v_differences := v_differences + 1;
> IF v_differences > 1 THEN
> EXIT; -- Quit the loop as soon as > 1 difference
> END IF;
> END IF;
> END LOOP;
> IF v_differences < 2 THEN
> v_is_typo := 1;
> END IF;
> IF v_is_typo <> 1 THEN
> FOR i IN 1..LENGTH(v_string1)-1 LOOP
> IF SUBSTR(v_string1,1,i-1)||
> SUBSTR(v_string1,i+1,1)||
> SUBSTR(v_string1,i,1)||
> SUBSTR(v_string1,i+2) = v_string2 THEN
> v_is_typo := 1;
> EXIT; -- Quit the loop as soon as a match is found
> END IF;
> END LOOP;
> END IF;
> ELSIF ABS(LENGTH(v_string1) - LENGTH(v_string2)) = 1 THEN
> -- Put the longer string in v_string1
> IF LENGTH(v_string1) < LENGTH(v_string2) THEN
> v_string1 := p_string2;
> v_string1 := p_string1;
> END IF;
> FOR i IN 1..LENGTH(v_string1) LOOP
> IF SUBSTR(v_string1,1,i-1)||
> SUBSTR(v_string1,i+1) = v_string2 THEN
> v_is_typo := 1;
> EXIT; -- Quit the loop as soon as a match is found
> END IF;
> END LOOP;
> END IF;
> RETURN v_is_typo;
> END is_typo;
> /
>
> Regards
>
> CE
Maybe you should have incorporated SOUNDEX as originally suggested by Sybrand. Soundex will not pick up any gross spelling mistakes but if it sounds similar then it will pick it up. In your example above, 'CHARLIE' does not sound similar to 'CHALRIE' so soundex will not pick it up.
SQL> select 1 from dual
2 where soundex('CHARLIE') = soundex('CHARLI');
1
SQL> select 1 from dual
2 where soundex('CHARLIE') = soundex('CHARLIW');
1
SQL> select 1 from dual
2 where soundex('CHARLIE') = soundex('CHARL');
1
SQL> select 1 from dual
2 where soundex('CHARLIE') = soundex('CHALRIE');
1
SQL>
Regards
/Rauf
Received on Fri Jan 07 2005 - 05:57:00 CST
![]() |
![]() |