Home » Server Options » Text & interMedia » Stoplists and Nickname lists (Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production)
Stoplists and Nickname lists [message #670903] |
Wed, 01 August 2018 08:12 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
Now that I've got a working NDATA name matching solution, I am trying to introduce some additional functionality into it, namely the use of stopword lists and nickname substitution lists. Despite performing the steps I think are necessary, these lists don't seem to be taken into account, so I'm guessing the same issue exists for both. I've pasted examples below that demonstrate the issue, using similar steps to the tests from my previous post, assuming Barbara is the one who'll be responding.
I created a name in one table "John Smith" and "John almost Smith" under the assumption that 'almost' wouldn't be considered because it is part of the default English stoplist (according to the following link: https://docs.oracle.com/cd/B28359_01/text.111/b28304/astopsup.htm#CCREF1400 ). When that didn't occur, I tried to explicitly add it to my own stoplist, but experienced the same issue:
-- tables and data:
DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John Smith');
DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John almost Smith', 'P');
-- index on candidate_name_stage:
exec ctx_ddl.drop_preference ('c_ds');
exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name');
-- add stoplist
exec ctx_ddl.drop_stoplist('c_sl');
exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST');
exec ctx_ddl.add_stopword('c_sl', 'almost');
CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('datastore c_ds section group c_secgrp stoplist c_sl');
-- query:
column candidate_name format a25 word_wrapped;
column target_name format a25 word_wrapped;
select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order by candidate_name, score(1) desc;
This results in "no rows selected" because the score isn't above 70 (I was expecting it to be 100 or close to 100 since without "almost" the names would match). For completeness, I'm also pasting how I tried to use the default stoplist, which also resulted in not the match I was expecting:
-- tables and data:
DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John Smith');
DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John almost Smith', 'P');
-- index on candidate_name_stage:
exec ctx_ddl.drop_preference ('c_ds');
exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name');
CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('datastore c_ds section group c_secgrp stoplist CTXSYS.DEFAULT_STOPLIST');
-- query:
column candidate_name format a25 word_wrapped;
column target_name format a25 word_wrapped;
select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order by candidate_name, score(1) desc;
Finally, I will move this to a separate topic if it's not related, but I'm also not seeing my loading of nicknames being taken into account and I'm wondering if it's the same underlying issue. I'm trying to do something like:
-- tables and data:
DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('Jonathan Smith');
DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John Smith', 'P');
-- index on candidate_name_stage:
exec ctx_ddl.drop_preference ('c_ds');
exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name');
-- Load thesaurus
HOST ctxload -user testuser/testuser -thes -name nicknames -file nicknames.txt
exec ctx_ddl.drop_preference('name_wl');
exec ctx_ddl.create_preference('name_wl', 'BASIC_WORDLIST');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_ALTERNATE_SPELLING', 'FALSE');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_BASE_LETTER', 'TRUE');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_THESAURUS', 'nicknames');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_JOIN_PARTICLES',
'de:di:la:da:el:del:qi:abd:los:la:dos:do:an:li:yi:yu:van:jon:un:sai:ben:al');
CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('datastore c_ds section group c_secgrp wordlist name_wl');
-- query:
column candidate_name format a25 word_wrapped;
column target_name format a25 word_wrapped;
select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order by candidate_name, score(1) desc;
My nicknames.txt thesaurus contains:
Jonathan
SYN John (Jonathan)
But, I get "no rows returned" because "John" and "Jonathan" are not being considered as a match despite the thesaurus entry.
Any ideas as to why this might be? Thank you in advance.
|
|
|
|
|
Re: Stoplists and Nickname lists [message #670907 is a reply to message #670904] |
Wed, 01 August 2018 13:30 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
I will address the stoplist issue first in this reply, then the other issue in a separate reply later.
I am still scratching my head a bit on the stoplist issue. I can demonstrate how to confirm that "almost" is in the default stoplist and "almost" is in your user-defined stoplist and that your stoplist with the word "almost" is part of your index. However, I can then also demonstrate that the word "almost" is tokenized and indexed. Then, I can demonstrate that searching for "John Smith" or "John almost Smith" or "John however Smith" will find all three with lower scores for some. I tested these because, under normal circumstances, there is a pecuiliarity with stopwords, such that even though it does not index it, it still expects a word, any word, in its place. This did not seem to hold true here as it did not consider "almost" and "however" equivalent. So, it appears that if a stopword is not in the search string, then it ignores any stopwords in the results and scores them the same. However, it also appears that if a stopword is in the search string, then it includes that stopword in the search and in the scoring.
Although I was unable to locate any section of the online documentation that mentions it, according to Oracle Text product manager Roger Ford, "NDATA fields do not use the stopword list." He stated this in his response in the following thread:
https://community.oracle.com/message/14582153#14582153
Please see my test below. I made several modifications to what you posted, in order to test and demonstrate some things.
-- check for almost in default stoplist:
SCOTT@orcl_12.1.0.2.0> select spw_word from ctx_stopwords where spw_stoplist = 'DEFAULT_STOPLIST' and spw_word like 'al%' order by spw_word
2 /
SPW_WORD
--------------------------------------------------------------------------------
all
almost
also
although
4 rows selected.
-- add stoplist with almost and check for it:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_stoplist('c_sl');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_stopword('c_sl', 'almost');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> select spw_word from ctx_user_stopwords where spw_stoplist = 'C_SL'
2 /
SPW_WORD
--------------------------------------------------------------------------------
almost
1 row selected.
-- tables and data:
SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John almost Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John however Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> DROP TABLE target_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
Table created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John Smith', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John almost Smith', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John however Smith', 'P');
1 row created.
-- index on candidate_name_stage:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference ('c_ds');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group ('c_secgrp');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('datastore c_ds section group c_secgrp stoplist c_sl');
Index created.
-- describe index and confirm that stoplist and stopword almost are included:
SCOTT@orcl_12.1.0.2.0> select ctx_report.describe_index ('CANDIDATE_NAME_IDX') from dual
2 /
CTX_REPORT.DESCRIBE_INDEX('CANDIDATE_NAME_IDX')
--------------------------------------------------------------------------------
===========================================================================
INDEX DESCRIPTION
===========================================================================
index name: "SCOTT"."CANDIDATE_NAME_IDX"
index id: 6398
index type: context
base table: "SCOTT"."CANDIDATE_NAME_STAGE"
primary key column:
text column: CANDIDATE_NAME
text column type: VARCHAR2(350)
language column:
format column:
charset column:
configuration column:
Query Stats Enabled: NO
status: INDEXED
full optimize token:
full optimize count:
docid count: 3
nextid: 4
===========================================================================
INDEX OBJECTS
===========================================================================
datastore: MULTI_COLUMN_DATASTORE
columns: candidate_name name
filter: NULL_FILTER
section group: BASIC_SECTION_GROUP
ndata section: NAME
section tag: NAME
lexer: BASIC_LEXER
wordlist: BASIC_WORDLIST
stemmer: ENGLISH
fuzzy_match: GENERIC
stoplist: BASIC_STOPLIST
stop_word: almost
storage: BASIC_STORAGE
r_table_clause: lob (data) store as (cache)
i_index_clause: compress 2
1 row selected.
-- what words are tokenized and indexed:
SCOTT@orcl_12.1.0.2.0> select token_text from dr$candidate_name_idx$i order by token_text
2 /
TOKEN_TEXT
----------------------------------------------------------------
^alm
^alo
^amo
^hoe
^how
^hwe
^jhn
^joh
^jon
^lmo
^mit
^ohn
^owe
^sit
^smi
^smt
almo
alms
alos
amos
eer$
eve$
ever
evr$
hoev
howe
howv
hwev
ith$
jhn$
joh$
john
jon$
lmos
lmot
lmst
lost
mih$
mit$
mith
mos$
most
mot$
mst$
mth$
oeve
ohn$
ost$
owee
owev
owve
sith
smih
smit
smth
ver$
weer
weve
wevr
wver
60 rows selected.
-- query:
SCOTT@orcl_12.1.0.2.0> column searched_for format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> column found format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> select distinct target_name searched_for, candidate_name found, type_cd, score(1) as fuzzy_match_score
2 from candidate_name_stage, target_name_stage
3 where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 0
4 order by target_name, score(1) desc
5 /
SEARCHED_FOR FOUND T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John Smith John Smith P 100
John Smith John almost Smith P 100
John Smith John however Smith P 100
John almost Smith John almost Smith P 100
John almost Smith John Smith P 65
John almost Smith John however Smith P 65
John however Smith John however Smith P 100
John however Smith John Smith P 62
John however Smith John almost Smith P 62
9 rows selected.
|
|
|
Re: Stoplists and Nickname lists [message #670908 is a reply to message #670907] |
Wed, 01 August 2018 14:02 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
I'm thinking perhaps you posted your response before you saw my follow-ups indicating that I got the results I expected -- I apologize that you spent time on this if so (though I'm still interested in hearing your thoughts on the other issue). The following works:
-- tables and data:
DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John almost Smith');
DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John Smith', 'P');
-- index on candidate_name_stage:
exec ctx_ddl.drop_preference ('c_ds');
exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name');
-- add stoplist
exec ctx_ddl.drop_stoplist('c_sl');
exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST');
exec ctx_ddl.add_stopword('c_sl', 'almost');
CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('datastore c_ds section group c_secgrp stoplist c_sl');
-- query:
column candidate_name format a25 word_wrapped;
column target_name format a25 word_wrapped;
select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order by candidate_name, score(1) desc;
Result:
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John almost Smith John Smith P 100
|
|
|
|
Re: Stoplists and Nickname lists [message #670910 is a reply to message #670908] |
Wed, 01 August 2018 14:42 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Please compare what you did to what I did below. In the nicknames.txt the second line begins with a tab. I made sure that any pre-existing nicknames thesaurus was droppedd. I loaded the thesaurus as the user ctxsys and added it to the knowledge base using ctxkbtc. The query after including that in the index returns a higher score than the query without the thesaurus.
-- c:\my_oracle_files\nicknames.txt:
Jonathan
SYN John (Jonathan)
-- tables and data:
SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('Jonathan Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> DROP TABLE target_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
Table created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John Smith', 'P');
1 row created.
-- index on candidate_name_stage without thesaurus:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference ('c_ds');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group ('c_secgrp');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference('name_wl');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference('name_wl', 'BASIC_WORDLIST');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute('name_wl', 'NDATA_ALTERNATE_SPELLING', 'FALSE');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute('name_wl', 'NDATA_BASE_LETTER', 'TRUE');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute('name_wl', 'NDATA_JOIN_PARTICLES', 'de:di:la:da:el:del:qi:abd:los:la:dos:do:an:li:yi:yu:van:jon:un:sai:ben:al');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('datastore c_ds section group c_secgrp wordlist name_wl');
Index created.
-- query without thesaurus:
SCOTT@orcl_12.1.0.2.0> column candidate_name format a25 word_wrapped;
SCOTT@orcl_12.1.0.2.0> column target_name format a25 word_wrapped;
SCOTT@orcl_12.1.0.2.0> select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
2 from candidate_name_stage, target_name_stage
3 where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
4 order by candidate_name, score(1) desc;
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
Jonathan Smith John Smith P 88
1 row selected.
-- Load thesaurus
SCOTT@orcl_12.1.0.2.0> connect ctxsys/ctxsys
Connected.
CTXSYS@orcl_12.1.0.2.0> exec ctx_thes.drop_thesaurus ('NICKNAMES')
PL/SQL procedure successfully completed.
CTXSYS@orcl_12.1.0.2.0> host ctxload -user ctxsys/ctxsys -thes -name nicknames -file C:\my_oracle_files\nicknames.txt
Connecting...
Creating thesaurus nicknames...
Thesaurus nicknames created...
Processing...
2 lines processed successfully
Beginning insert...2 lines inserted successfully
Disconnected
CTXSYS@orcl_12.1.0.2.0> host ctxkbtc -user ctxsys/ctxsys -name nicknames
Oracle Text knowledge base extension: thesaurus compiler
Connecting...
Deleting old extended knowledge base.
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\drelsUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\dreldUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\dreliUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\drelkUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\drefdUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\drefiUS.dat
Processing thesaurus: NICKNAMES
Processed 2 terms.
Done processing thesaurus: NICKNAMES
Compiling and writing new flat files.
Writing extended knowledge base to files.
.............................................
.......................................................................................................................
Oracle Text knowledge base successfully extended.
Disconnected
CTXSYS@orcl_12.1.0.2.0> connect scott/tiger
Connected.
SCOTT@orcl_12.1.0.2.0> select thp_phrase from ctx_thes_phrases where thp_thesaurus ='NICKNAMES';
THP_PHRASE
--------------------------------------------------------------------------------
JOHN
JONATHAN
2 rows selected.
-- add thesaurus to index:
SCOTT@orcl_12.1.0.2.0> drop index candidate_name_idx
2 /
Index dropped.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute('name_wl', 'NDATA_THESAURUS', 'NICKNAMES');
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('datastore c_ds section group c_secgrp wordlist name_wl');
Index created.
-- query with thesaurus:
SCOTT@orcl_12.1.0.2.0> column candidate_name format a25 word_wrapped;
SCOTT@orcl_12.1.0.2.0> column target_name format a25 word_wrapped;
SCOTT@orcl_12.1.0.2.0> select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
2 from candidate_name_stage, target_name_stage
3 where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
4 order by candidate_name, score(1) desc;
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
Jonathan Smith John Smith P 97
1 row selected.
|
|
|
|
Re: Stoplists and Nickname lists [message #671772 is a reply to message #670903] |
Tue, 18 September 2018 20:59 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
Again I have to resurrect an old thread, as I appear to have discovered something related to this thread. When we were discussing stopwords, you had pointed out that Roger Ford had said "NDATA fields do not use the stopword list." I think this is correct, and I think because of some other peculiarity it appeared as if our test was actually using the stopword list, but I don't think it actually was. We had compared "John Smith" and "John almost Smith" and assumed that because "almost" was part of the default stoplist, the reason the score came back as 100 is that "almost" was being removed from the comparison. However, I discovered that regardless of that "almost," the fact that "John" and "John" match and "Smith" and "Smith" match is what is causing that score to come back as 100. In other words, run the test with "John Smith" and "John garbage Smith" The score still comes back as 100! But this isn't because "garbage" is a stopword. For some other reason it isn't being considered in the comparison, resulting in a 100 score. Do you know why this is? To demonstrate this another way, try a test with 2 tokens that don't match at all, along with a stopword, e.g. "M almost" compared to "O almost" You would expect a score of 0 (since "almost" would be ignored, and "M vs "O" don't match at all). But instead, you get a score of 71, because "almost" is actually taken into account!
|
|
|
Re: Stoplists and Nickname lists [message #671773 is a reply to message #671772] |
Tue, 18 September 2018 22:02 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Part of what you are seeing is what I described in an answer in a previous thread. "John" is contained in "John almost Smith" and "John garbage Smith" and "Smith" is contained in "John almost Smith" and "John garbage Smith", so searching for "John Smith" using a "contains" query that uses ndata matches both 100%. It checks whether those words are contained, not whether the entire group of words matches.
You can create an index without ndata and then drop and recreate the same index with ndata and see the differences in the tokens created. Without ndata, stopwords like almost are not tokenized. With ndata, stopwords like almost are tokenized and therefore included in searching. So, if you compare "m almost" and "o almost", the two "almost" match, but the m and o do not match each other, so it is a partial match.
When tokenization using ndata is done, it creates multiple strings where ^ indicates starting and $ indicates ending and others may be in the middle. So, when you search for something using ndata, it checks to see if what you searched for is similar to any of those little strings.
Please see the demonstration below that shows indexing and resulting tokens without ndata, then indexing and resulting tokens with ndata.
-- table and data:
SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John almost Smith');
1 row created.
-- index on candidate_name_stage using stop list with "almost" WITHOUT ndata, and resulting tokens:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_stoplist('c_sl')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_stopword('c_sl', 'almost')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('stoplist c_sl');
Index created.
SCOTT@orcl_12.1.0.2.0> select token_text from dr$candidate_name_idx$i;
TOKEN_TEXT
----------------------------------------------------------------
JOHN
SMITH
2 rows selected.
-- index on candidate_name_stage using stop list with "almost" WITH ndata, and resulting tokens:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference ('c_ds')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group ('c_secgrp')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> drop index candidate_name_idx;
Index dropped.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('datastore c_ds section group c_secgrp stoplist c_sl');
Index created.
SCOTT@orcl_12.1.0.2.0> select token_text from dr$candidate_name_idx$i;
TOKEN_TEXT
----------------------------------------------------------------
^alm
^alo
^amo
^jhn
^joh
^jon
^lmo
^mit
^ohn
^sit
^smi
^smt
almo
alms
alos
amos
ith$
jhn$
joh$
john
jon$
lmos
lmot
lmst
lost
mih$
mit$
mith
mos$
most
mot$
mst$
mth$
ohn$
ost$
sith
smih
smit
smth
39 rows selected.
|
|
|
|
Re: Stoplists and Nickname lists [message #671785 is a reply to message #671784] |
Wed, 19 September 2018 08:46 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
vka2b wrote on Wed, 19 September 2018 06:26Thank you for your response. Your example clearly illustrates what you have stated: "Without ndata, stopwords like almost are not tokenized. With ndata, stopwords like almost are tokenized and therefore included in searching." So does the fact that with ndata stopwords are tokenized mean that the following statement is true: "NDATA fields do not use the stopword list." If so, that changes my solution slightly as I thought I could use the stopword list with ndata. As I mentioned, I think I was thrown off because I took the fact that "John almost Smith" and "John Smith" matched at 100 to mean that "almost" was being ignored, but as you have stated, the reason for the 100 is because of the CONTAINS, not because "almost" is not being tokenized. Correct?
As Roger Ford said, "NDATA fields do not use the stopword list" and we can see that this is true.
With NDATA, stopwords like "almost" are not being ignored and are being tokenized and included in searches.
It is the combination of CONTAINS and NDATA that causes "John Smith" to match "John anything Smith" 100%. It is like searching for "John AND Smith" without NDATA.
So, although you can create a stoplist and include it in indexing along with NDATA without raising an error, the stoplist is ignored. So, functionally you cannot use a stoplist with NDATA.
|
|
|
|
Goto Forum:
Current Time: Sat Nov 23 03:40:15 CST 2024
|