Home » Server Options » Text & interMedia » CTXSYS fuzzy searching (DB 11.2, windows 2008 R2)
CTXSYS fuzzy searching [message #615972] |
Wed, 11 June 2014 07:31 |
|
JamesAlcroft
Messages: 2 Registered: June 2014
|
Junior Member |
|
|
Been suggested to post this here.
Could someone explain the following and whether we have done something wrong please.
Have a table, dict_prod, with a column (among other) of fuzzy_brand. We have the following index on this column:
create index IDX_TXT_DICT_BRAND
on DICT_PROD(FUZZY_BRAND)
indextype is ctxsys.context
parameters ('DATASTORE CTXSYS.DEFAULT_DATASTORE');
We are using the following SQL:
select fuzzy_brand from dict_prod WHERE CONTAINS(fuzzy_brand,'fuzzy({xxx},,,weight)', 1) > 20;
We get the following results (replacing xxx with the text below), which doesn't make much sense to me:
D-WORM doesn't fuzzy match
but
1-WORM and DW-WORM does
D%WORM doesn't fuzzy match
but
1%WORM does
D WORM doesn't fuzzy match
but
5 WORM and DW WORM does
Any advice/suggestion appreciated.
James
|
|
|
Re: CTXSYS fuzzy searching [message #616027 is a reply to message #615972] |
Wed, 11 June 2014 12:57 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
It would help to know what values exist in your data that you expect to find and not find using each of your search values.
There are various things that can affect the results:
1. You can increase the likelihood of matching by setting the parameters like so:
CONTAINS (fuzzy_brand, 'fuzzy ({xxx},1,5000,weight)', 1) > 0
You can at least do that initially to see what is returned, then determine appropriate limits.
2. Stopwords are not indexed. By default, if you do not specify a stoplist, then the ctxsys.default_stoplist is used. It may help to use ctxsys.empty_stoplist in your index parameters. You can at least do this initially to see if that is part of the problem. You can always replace the empty_stoplist with the default_stoplist or a user-defined stoplist.
3. By default, when you index "d-worm", the hyphen is considered a punctuation character, so it is indexed as two separate tokens "d" and "worm". If you set the hyphen as a printjoin, then it is indexed as one token "d-worm". If you set the hyphen as a skipjoin, then it is indexed as one token "dworm". These attributes can be set in a lexer that is used in the index parameters.
4. When you enclose your search string between { and } it treats all characters between the brackets as regular characters to be searched for. Without the brackets, the hyphen is considered a minus. With the brackets the hyphen and the percent sign are just regular characters. I don't know if this is what you want or not. You can use \ to escape individual characters in the search string.
You can see what is tokenized and indexed by selecting from the token_text column of the dr$your_index_name$i domain index table.
I have demonstrated different results below with no lexer, lexer with the hyphen as prinjoin, and lexer with hyphen as skipjjoin.
-- test environment:
SCOTT@orcl> SELECT banner FROM v$version
2 /
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE 11.2.0.1.0 Production
TNS for 64-bit Windows: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production
5 rows selected.
SCOTT@orcl> CREATE TABLE dict_prod
2 (fuzzy_brand VARCHAR2(60))
3 /
Table created.
SCOTT@orcl> INSERT ALL
2 INTO dict_prod (fuzzy_brand) VALUES ('D-WORM')
3 INTO dict_prod (fuzzy_brand) VALUES ('1-WORM')
4 INTO dict_prod (fuzzy_brand) VALUES ('DW-WORM')
5 INTO dict_prod (fuzzy_brand) VALUES ('D%WORM')
6 INTO dict_prod (fuzzy_brand) VALUES ('1%WORM')
7 INTO dict_prod (fuzzy_brand) VALUES ('D WORM')
8 INTO dict_prod (fuzzy_brand) VALUES ('5 WORM')
9 INTO dict_prod (fuzzy_brand) VALUES ('DW WORM')
10 INTO dict_prod (fuzzy_brand) VALUES ('de-worm')
11 INTO dict_prod (fuzzy_brand) VALUES ('deworm')
12 SELECT * FROM DUAL
13 /
10 rows created.
-- without lexer:
SCOTT@orcl> create index IDX_TXT_DICT_BRAND
2 on DICT_PROD(FUZZY_BRAND)
3 indextype is ctxsys.context
4 parameters
5 ('DATASTORE CTXSYS.DEFAULT_DATASTORE
6 STOPLIST CTXSYS.EMPTY_STOPLIST')
7 /
Index created.
SCOTT@orcl> SELECT token_text FROM dr$idx_txt_dict_brand$i
2 /
TOKEN_TEXT
----------------------------------------------------------------
1
5
D
DE
DEWORM
DW
WORM
7 rows selected.
SCOTT@orcl> -- with brackets:
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
D%WORM
D WORM
3 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({1-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
1%WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW-WORM
DW WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D%WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
D%WORM
D WORM
3 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({1%WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
1%WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
D%WORM
D WORM
3 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({5 WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
5 WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW-WORM
DW WORM
2 rows selected.
SCOTT@orcl> -- without brackets:
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (D\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
D%WORM
D WORM
3 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (1\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
1%WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (DW\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW-WORM
DW WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (D%WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
1-WORM
DW-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
de-worm
deworm
10 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (1%WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
1-WORM
DW-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
de-worm
9 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
D%WORM
D WORM
3 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({5 WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
5 WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW-WORM
DW WORM
2 rows selected.
-- with lexer setting hyphen as printjoin:
SCOTT@orcl> BEGIN
2 CTX_DDL.CREATE_PREFERENCE ('test_lex', 'BASIC_LEXER');
3 CTX_DDL.SET_ATTRIBUTE ('test_lex', 'PRINTJOINS', '-');
4 END;
5 /
PL/SQL procedure successfully completed.
SCOTT@orcl> DROP INDEX idx_txt_dict_brand
2 /
Index dropped.
SCOTT@orcl> create index IDX_TXT_DICT_BRAND
2 on DICT_PROD(FUZZY_BRAND)
3 indextype is ctxsys.context
4 parameters
5 ('DATASTORE CTXSYS.DEFAULT_DATASTORE
6 LEXER test_lex
7 STOPLIST CTXSYS.EMPTY_STOPLIST')
8 /
Index created.
SCOTT@orcl> SELECT token_text FROM dr$idx_txt_dict_brand$i
2 /
TOKEN_TEXT
----------------------------------------------------------------
1
1-WORM
5
D
D-WORM
DE-WORM
DEWORM
DW
DW-WORM
WORM
10 rows selected.
SCOTT@orcl> -- with brackets:
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({1-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW-WORM
de-worm
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D%WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D%WORM
D WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({1%WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1%WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D%WORM
D WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({5 WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
5 WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW WORM
1 row selected.
SCOTT@orcl> -- without brackets:
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (D\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (1\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (DW\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW-WORM
de-worm
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (D%WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
DW-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
de-worm
deworm
9 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (1%WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
6 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D%WORM
D WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({5 WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
5 WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW WORM
1 row selected.
-- with lexer setting hyphen as skipjoin:
SCOTT@orcl> BEGIN
2 CTX_DDL.DROP_PREFERENCE ('test_lex');
3 CTX_DDL.CREATE_PREFERENCE ('test_lex', 'BASIC_LEXER');
4 CTX_DDL.SET_ATTRIBUTE ('test_lex', 'SKIPJOINS', '-');
5 END;
6 /
PL/SQL procedure successfully completed.
SCOTT@orcl> DROP INDEX idx_txt_dict_brand
2 /
Index dropped.
SCOTT@orcl> create index IDX_TXT_DICT_BRAND
2 on DICT_PROD(FUZZY_BRAND)
3 indextype is ctxsys.context
4 parameters
5 ('DATASTORE CTXSYS.DEFAULT_DATASTORE
6 LEXER test_lex
7 STOPLIST CTXSYS.EMPTY_STOPLIST')
8 /
Index created.
SCOTT@orcl> SELECT token_text FROM dr$idx_txt_dict_brand$i
2 /
TOKEN_TEXT
----------------------------------------------------------------
1
1WORM
5
D
DEWORM
DW
DWORM
DWWORM
WORM
9 rows selected.
SCOTT@orcl> -- with brackets:
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
DW-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
7 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({1-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
6 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW-WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
DW-WORM
de-worm
deworm
4 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D%WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D%WORM
D WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({1%WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1%WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D%WORM
D WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({5 WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
5 WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW WORM
1 row selected.
SCOTT@orcl> -- without brackets:
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (D\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
DW-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
7 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (1\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
6 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (DW\-WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
DW-WORM
de-worm
deworm
4 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (D%WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D-WORM
DW-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
de-worm
deworm
9 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy (1%WORM,1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
1-WORM
D%WORM
1%WORM
D WORM
5 WORM
DW WORM
6 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({D WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
D%WORM
D WORM
2 rows selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({5 WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
5 WORM
1 row selected.
SCOTT@orcl> select fuzzy_brand from dict_prod
2 WHERE CONTAINS (fuzzy_brand, 'fuzzy ({DW WORM},1,5000,weight)', 1) > 0
3 /
FUZZY_BRAND
------------------------------------------------------------
DW WORM
1 row selected.
[Updated on: Wed, 11 June 2014 13:03] Report message to a moderator
|
|
|
|
|
|
|
Goto Forum:
Current Time: Fri Dec 27 00:14:25 CST 2024
|