Home » Server Options » Text & interMedia » Fuzzy Matching Using Contains (Oracle 11g)
Fuzzy Matching Using Contains [message #589509] Mon, 08 July 2013 01:47 Go to next message
sdCraig
Messages: 2
Registered: July 2013
Location: San Diego
Junior Member
Apologies for the general nature of this question, but from searching this forum I can't find an answer that provides a basis of understanding that I need, where I can then branch into more specifics. Perhaps someone can point me there.

I would like to understand the "fuzzy" search feature - specifically when used with the "contains" query:

CONTAINS(TEXT, 'fuzzy(government, 70, 6, weight)', 1) > 0

All the Oracle documentation I find seems to show the above example, with little detail or explanation. I need to be clear on the concepts of this query. I'm not sure how the components of this query (score, num terms, label value) work and their relationship to each other and the effect on the result set (ie, increasing score versus number of terms and vice versa; also the effect of the "weight" feature and what leaving this variable out of the equation does).

I would like to see examples and the effect of changing the variables (score, num terms, label value). Can someone point me in the general direction or perhaps offer there own summary?
Re: Fuzzy Matching Using Contains [message #589622 is a reply to message #589509] Tue, 09 July 2013 03:28 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9100
Registered: November 2002
Location: California, USA
Senior Member
I see this is your first post, so welcome to the OraFAQ forums.

Fuzzy is designed to find similarly spelled words, which is different from soundex which finds similar sounding words.

In the following:

select score(label), column_name
from   table_name
where  contains 
        (column_name, 
         'fuzzy (term, similarity_score, numterms, weight)', 
         label) > 0;


The Oracle Text score and the fuzzy similarity_score are two different things and Oracle Text has complex algorithms for both. The general algorithm for the Oracle Text score is listed here:

http://docs.oracle.com/cd/E11882_01/text.112/e24436/ascore.htm#CCREF1475

The similarity_score is based on how similar the term is to the words in the column. The lower that you set the similarity score, the more results you get. The higher the similarity_score, the fewer results you get.

The numterms is the number of similarly matching words that you limit your search to. If you limit it to 1, then you will only get results that contain one or more occurrences of the closest matching term. The higher the numterms, the more results you get.

The weight affects the score by factoring in how similar the words are to the one searched for, instead of just how many times a matching word occurs.

The label is not part of the fuzzy syntax. Label is part of the contains syntax. It is used to allow you to display different scores, for example:

select score(1), score(2)
from   table_name
where  contains (column_name, 'word1', 1) > 0
or     contains (column_name, 'word2', 2) > 0
/


Suppose we have the following test environment:

SCOTT@orcl_11gR2> CREATE TABLE test_tab (text  VARCHAR2(45))
  2  /

Table created.

SCOTT@orcl_11gR2> INSERT ALL
  2  INTO test_tab (text) VALUES ('government government government')
  3  INTO test_tab (text) VALUES ('government government')
  4  INTO test_tab (text) VALUES ('government')
  5  INTO test_tab (text) VALUES ('overnment overnment')
  6  INTO test_tab (text) VALUES ('overnment')
  7  INTO test_tab (text) VALUES ('governmen')
  8  INTO test_tab (text) VALUES ('governor governor')
  9  INTO test_tab (text) VALUES ('governor')
 10  INTO test_tab (text) VALUES ('governing')
 11  INTO test_tab (text) VALUES ('govrment')
 12  INTO test_tab (text) VALUES ('government overnment')
 13  SELECT * FROM DUAL
 14  /

11 rows created.

SCOTT@orcl_11gR2> INSERT INTO test_tab (text) SELECT object_name FROM all_objects
  2  /

75046 rows created.

SCOTT@orcl_11gR2> CREATE INDEX test_idx ON test_tab (text) INDEXTYPE IS CTXSYS.CONTEXT
  2  /

Index created.


The following query sets the numterms to 1, so it selects only rows containing the closest matching word, which in this case is the exact match of "government". The weight is not factored in, so the score is based on the number of occurrences of the term. The numeric value of the score is based on the algorithm, which is dependent on what percentage of the rows in the table contain the term.

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 1, 1)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
        47 government government government
        32 government government
        16 government
        16 government overnment

4 rows selected.


If we modify the above query only by increasing the numterms to 2, then it returns all rows containing any of the two closest matching terms, "government" and "overnment". Since there is still no weight, the score is based on the number of occurrences of the two matching terms. All rows with the same number of matching terms get the same score, the value of which is determined by the Oracle Text scoring algorithm.

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 1, 2)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
        47 government government government
        31 government government
        31 government overnment
        31 overnment overnment
        16 overnment
        16 government

6 rows selected.


If we increase the numterms to 3, then we get the same sort of result for all rows containing any of the closest 3 matching words, "government", "overnment", and "govrment".

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 1, 3)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
        46 government government government
        31 government government
        31 overnment overnment
        31 government overnment
        15 overnment
        15 govrment
        15 government

7 rows selected.


Increasing the numterms to the maximum 5000 yields the maximum matching terms below.

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 1, 5000)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
        46 government government government
        31 government government
        31 overnment overnment
        31 government overnment
        15 overnment
        15 governmen
        15 govrment
        15 government

8 rows selected.


If we apply weight, then the Oracle Text score is affected by the otherwise hidden similarity score that is determined by how closely the terms match. So, you may get different scores with the same number of matching terms, or the same score with different numbers of matching terms, as shown below.

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 1, 5000, WEIGHT)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
       100 government government government
       100 government government
       100 government
       100 government overnment
       100 overnment overnment
        98 overnment
        80 governmen
        80 govrment

8 rows selected.


If you increase the similarity score, as in the two examples below, then you limit the rows returned to only those whose similarity score (not Oracle Text score) are above that value. The similarity score is hidden and I don't know if the algorithm is documented.

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 65, 5000, WEIGHT)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
       100 government government government
       100 government government
       100 government overnment
       100 overnment overnment
       100 government
        99 overnment

6 rows selected.

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 80, 5000, WEIGHT)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
       100 government government government
       100 government overnment
       100 government
       100 government government

4 rows selected.


If you remove the weight, the hidden similarity score still causes the same rows to be returend, but does not affect the values of the displayed Oracle Text score, as shown below.

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 65, 5000)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
        47 government government government
        31 government government
        31 government overnment
        31 overnment overnment
        16 overnment
        16 government

6 rows selected.

SCOTT@orcl_11gR2> SELECT SCORE(1), text
  2  FROM   test_tab
  3  WHERE  CONTAINS (text, 'FUZZY (government, 80, 5000)', 1) > 0
  4  ORDER  BY SCORE(1) DESC
  5  /

  SCORE(1) TEXT
---------- ---------------------------------------------
        47 government government government
        32 government government
        16 government
        16 government overnment

4 rows selected.

[Updated on: Tue, 09 July 2013 03:32]

Report message to a moderator

Re: Fuzzy Matching Using Contains [message #590685 is a reply to message #589622] Mon, 22 July 2013 02:12 Go to previous messageGo to next message
sdCraig
Messages: 2
Registered: July 2013
Location: San Diego
Junior Member
Hi Barbara:

Quick question if I may: Unless I missed something, I'm confused by a part of the examples provided above. Specifically, the oracle text score versus similarity score examples. I noticed that none of the examples show results where the similarity score is increased - just where the numterms value is increased. You mention that when the similarity score is lowered, the number of results increases. So, setting the similarity score at "1" will result in the maximum number of results? But the examples above all show similarity score of "1". I guess I need to see/understand the results of all combinations, etc. (ie, increase both; increase one while decreasing the other; decrease one while increasing the other; decrease both).

I've come across client environments (legal environment where search terms are applied to databases containing boxes of documents that may be relevant to a particular litigation), where they have set parameters in their queries (ie, similarity score of 15 and numterms set at 25). Parameters like this confuse me, as I'm not sure how one would derive such specific paramenters (why not a score of 20 and numterms of 15, 30, 40?). But these parameter values are set and are run on all fuzzy searches - and NO weight is applied either. I would imagine coming up with such specific search parameters would entail some kind of reasoning and I'd like to understand whey (ie, we have a database of size "x" and we want to return no more than "y" percent of records, therefore use these parameters).

Thank you.

C

Re: Fuzzy Matching Using Contains [message #590687 is a reply to message #590685] Mon, 22 July 2013 02:41 Go to previous message
Barbara Boehmer
Messages: 9100
Registered: November 2002
Location: California, USA
Senior Member
Quote:

I noticed that none of the examples show results where the similarity score is increased.


There are some examples where I increased the similarity score to 65.

Quote:

You mention that when the similarity score is lowered, the number of results increases. So, setting the similarity score at "1" will result in the maximum number of results?


Yes, unless the results are limited by other factors, such as the numterms.

Quote:

I guess I need to see/understand the results of all combinations, etc. (ie, increase both; increase one while decreasing the other; decrease one while increasing the other; decrease both).


The similarity score is how similar individual words are, not the score for the whole string of words. If you had a string of words in your column like "There are many government offices." and you searched for "governments", then the similarity score is how similar the word "governments" is to the word "government". The closer the match, the higher the score. When you set the similarity score parameter, it eliminates any results that are lower than that hidden score. So, the lower the similarity score is set, the more results you get. The higher the similarity score is set, the fewer results you get.

The numterms is the opposite. The lower the numterms is set, the fewer the results. The higher the numterms is set, the greater the results.

For maximum results, set the similarity score low and the numterms high. For limited results, set the similarity score high and the numterms low. If you set both low or both high, then one method will be limiting rows according to one criteria and the other will increase rows according to another criteria.

Quote:

I've come across client environments (legal environment where search terms are applied to databases containing boxes of documents that may be relevant to a particular litigation), where they have set parameters in their queries (ie, similarity score of 15 and numterms set at 25). Parameters like this confuse me, as I'm not sure how one would derive such specific paramenters (why not a score of 20 and numterms of 15, 30, 40?). But these parameter values are set and are run on all fuzzy searches - and NO weight is applied either. I would imagine coming up with such specific search parameters would entail some kind of reasoning and I'd like to understand whey (ie, we have a database of size "x" and we want to return no more than "y" percent of records, therefore use these parameters).


They may have determined that any similarity scores lower than 15 tend to be not similar enough to belong in the result set. Similarly they may only want the closest 25 matches in the expansion of terms. It usually takes some experimentation to find what works best. My inclination is to set the similarity score low and numterms high, so that you get maximum results, and then limit the number of rows based on descending score if desired. The biggest drawback to that method is the time it takes to return the results and the resources consumed. Not setting the weight just means that the hidden similarity score will not be used to effect the value of the displayed Oracle Text score.

Setting up your own test cases and experimenting may help clear up some of your confusion.
Previous Topic: while updating error recursive SQL level 1
Next Topic: Oracle Text Index Sync Problem on Update
Goto Forum:
  


Current Time: Sat Nov 23 06:21:53 CST 2024