Home » Server Options » Text & interMedia » Space in Lexer (Oracle, 10g, Win XP)
Space in Lexer [message #362158] |
Mon, 01 December 2008 03:57 |
getritesh
Messages: 235 Registered: July 2006 Location: INDIA
|
Senior Member |
|
|
Hello,
Check out the following scenario in which i have created Table, inserted rows in table and created a Domain index on one column
SQL> Create Table POLI(poli_no Varchar2(20));
Table created.
SQL> Insert Into POLI Values('ABC_009');
1 row created.
SQL> Insert Into POLI Values('B00010054207-001');
1 row created.
SQL> Insert Into POLI Values('B00010054207-003');
1 row created.
SQL> Insert Into POLI Values('CA 0005379-02-001');
1 row created.
SQL> Insert Into POLI Values('CA 0005444-03-002');
1 row created.
SQL> Commit;
Commit complete.
SQL> Select *
2 From poli;
POLI_NO
--------------------
ABC_009
B00010054207-001
B00010054207-003
CA 0005379-02-001
CA 0005444-03-002
SQL> CREATE INDEX I_TXT_POL_NO ON POLI
2 (POLI_NO)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS ('SYNC (ON COMMIT)');
Index created.
SQL> Select poli_no
2 From POLI
3 Where Contains(POLI_NO,'ABC_009') > 0;
no rows selected
SQL> Select poli_no
2 From POLI
3 Where Contains(POLI_NO,'CA 0005444-03-002') > 0;
no rows selected
SQL> exec ctx_ddl.create_preference('my_lexer','BASIC_LEXER');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.set_attribute('my_lexer','printjoins','-_~!@#$%^&*(){}[],=?\;|><.');
PL/SQL procedure successfully completed.
SQL> exec CTX_DDL.SET_ATTRIBUTE ('my_lexer', 'WHITESPACE', ' ');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.set_attribute('my_lexer','punctuations','.!?');
PL/SQL procedure successfully completed.
SQL> EXEC CTX_DDL.CREATE_STOPLIST ('my_stoplist', 'BASIC_STOPLIST');
PL/SQL procedure successfully completed.
SQL> Drop Index I_TXT_POL_NO;
Index dropped.
SQL> CREATE INDEX I_TXT_POL_NO ON POLI
2 (POLI_NO)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS ('LEXER my_lexer STOPLIST my_stoplist SYNC (ON COMMIT)');
Index created.
SQL> Select poli_no
2 From POLI
3 Where Contains(POLI_NO,'ABC_009') > 0;
POLI_NO
--------------------
ABC_009
SQL> Select poli_no
2 From POLI
3 Where Contains(POLI_NO,'CA 0005444-03-002') > 0
4 /
no rows selected
In poli_no 'CA 0005444-03-002' space after CA is causing the problem, please give me solution for this.How to add space in my_lexer
Thanks,
Ritesh
[Updated on: Mon, 01 December 2008 05:46] Report message to a moderator
|
|
|
Re: Space in Lexer [message #362272 is a reply to message #362158] |
Mon, 01 December 2008 11:54 |
|
Barbara Boehmer
Messages: 9104 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The problem has nothing to do with the space after CA. The problem is the hyphens that have special meaning in Oracle Text. In Oracle Text, a hyphen means minus. So, it subtracts the score returned by finding the value after the hyphen in the document from the score returned by finding the value before the hyphen, and if a document contains both values such score is likely <= 0. In order to have Oracle treat the hyphen like a regular character, you need to escape it, either by putting a backslash before each hyphen or by enclosing the entire token with curley brackets. Please see the demonstration below.
-- test environment you provided:
SCOTT@orcl_11g> Create Table POLI(poli_no Varchar2(20))
2 /
Table created.
SCOTT@orcl_11g> BEGIN
2 Insert Into POLI Values('ABC_009');
3 Insert Into POLI Values('B00010054207-001');
4 Insert Into POLI Values('B00010054207-003');
5 Insert Into POLI Values('CA 0005379-02-001');
6 Insert Into POLI Values('CA 0005444-03-002');
7 END;
8 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> BEGIN
2 ctx_ddl.create_preference('my_lexer','BASIC_LEXER');
3 ctx_ddl.set_attribute('my_lexer','printjoins','-_~!@#$%^&*(){}[],=?\;|><.');
4 CTX_DDL.SET_ATTRIBUTE ('my_lexer', 'WHITESPACE', ' ');
5 ctx_ddl.set_attribute('my_lexer','punctuations','.!?');
6 CTX_DDL.CREATE_STOPLIST ('my_stoplist', 'BASIC_STOPLIST');
7 END;
8 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> CREATE INDEX I_TXT_POL_NO ON POLI (POLI_NO)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 PARAMETERS
4 ('LEXER my_lexer
5 STOPLIST my_stoplist
6 SYNC (ON COMMIT)')
7 /
Index created.
-- reproduction of problem:
SCOTT@orcl_11g> Select poli_no
2 From POLI
3 Where Contains(POLI_NO,'CA 0005444-03-002') > 0
4 /
no rows selected
-- corrected query using backslash to escape individual hyphens:
SCOTT@orcl_11g> Select poli_no
2 From POLI
3 Where Contains(POLI_NO,'CA 0005444\-03\-002') > 0
4 /
POLI_NO
--------------------
CA 0005444-03-002
-- corrected query using curley brackets around entire token:
SCOTT@orcl_11g> Select poli_no
2 From POLI
3 Where Contains(POLI_NO,'CA {0005444-03-002}') > 0
4 /
POLI_NO
--------------------
CA 0005444-03-002
|
|
|
|
|
Re: Space in Lexer [message #397003 is a reply to message #362158] |
Wed, 08 April 2009 10:34 |
mayur316
Messages: 4 Registered: April 2009
|
Junior Member |
|
|
Hi Barbara
I am facing almost similar problem and applied solution suggested by you but it's still not giving expected result(as per previously existing functionality of the application) for following test scenario in which wildcard(%) is placed in between search parameter.
I had one alternate solution ready using skipjoins but that would require buy in from client and seniors since i am new to oracle i don't know if such step is necessary or not.
Please suggest solution to following problem in which contains query returns only two rows while original like clause is returning five.
Thanks.
SQL> create table test (text varchar2(50));
Table created.
SQL> insert into test values ('ABC 10001-11 1777');
1 row created.
SQL> insert into test values ('ABC 10001-11 1777');
1 row created.
SQL> insert into test values ('ABC-20002-22-2888');
1 row created.
SQL> insert into test values ('UMB0001968-01-001');
1 row created.
SQL> insert into test values ('ABC@10001-11#1777');
1 row created.
SQL> insert into test values ('ABC 10001 1779');
1 row created.
SQL> insert into test values ('ABC-20002-22-17772');
1 row created.
SQL> commit;
Commit complete.
SQL> BEGIN
2 ctx_ddl.drop_preference('lextest');
3 ctx_ddl.create_preference('lextest','BASIC_LEXER');
4 ctx_ddl.set_attribute('lextest','printjoins','-_~!@#$^&*(){}[],=?\;|><.');
5 CTX_DDL.SET_ATTRIBUTE ('lextest', 'WHITESPACE', ' ');
6 ctx_ddl.set_attribute('lextest','punctuations','.!?');
7 CTX_DDL.CREATE_STOPLIST ('test_stoplist', 'BASIC_STOPLIST');
8 END;
9 /
PL/SQL procedure successfully completed.
SQL> create index test_text_idx on test (text)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 parameters ( 'LEXER lextest WORDLIST max_wordlist sync (on commit) STOPLIST test_stoplist' );
Index created.
SQL>
SQL> select text
2 from test
3 where contains(text,'%'||'ABC%177'||'%') > 0;
SQL>
SQL> select text
2 from test
3 where contains(text,'%'||'ABC%177'||'%') > 0;
TEXT
--------------------------------------------------
ABC@10001-11#1777
ABC-20002-22-17772
SQL>
SQL>
SQL> select text
2 from test
3 where text like '%'||'ABC%177'||'%'
4 /
TEXT
--------------------------------------------------
ABC 10001-11 1777
ABC 10001-11 1777
ABC@10001-11#1777
ABC 10001 1779
ABC-20002-22-17772
|
|
|
Re: Space in Lexer [message #397013 is a reply to message #397003] |
Wed, 08 April 2009 11:13 |
|
Barbara Boehmer
Messages: 9104 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Queries using CONTAINS are not the same as queries using LIKE. In the example that you provided, you got what you searched for. Context indexes separate the text in the column into tokens (words or groups of characters separated by spaces) and you can search for things within tokens and combinations of things within tokens. So, when you search for '%ABC%177%' you are searching for any row that contains any one set of characters between spaces that includes 'ABC' somewhere within that token, then '177' somewhere after that, within the same token. So, the two rows that you got are the only rows that meet that criteria. If you want rows that have a token with 'ABC' in it and the same or different token with '177' in it, then you need to search for '%ABC% AND %177%', as demonstrated below. Also, notice in the example below, that you can list the token_text values in the dr$...$i domain index table to see what values have been tokenized, indexed, and are searchable.
SCOTT@orcl_11g> create table test (text varchar2(50))
2 /
Table created.
SCOTT@orcl_11g> BEGIN
2 insert into test values ('ABC 10001-11 1777');
3 insert into test values ('ABC 10001-11 1777');
4 insert into test values ('ABC-20002-22-2888');
5 insert into test values ('UMB0001968-01-001');
6 insert into test values ('ABC@10001-11#1777');
7 insert into test values ('ABC 10001 1779');
8 insert into test values ('ABC-20002-22-17772');
9 END;
10 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> BEGIN
2 ctx_ddl.create_preference('lextest','BASIC_LEXER');
3 ctx_ddl.set_attribute('lextest','printjoins','-_~!@#$^&*(){}[],=?\;|><.');
4 CTX_DDL.SET_ATTRIBUTE ('lextest', 'WHITESPACE', ' ');
5 ctx_ddl.set_attribute('lextest','punctuations','.!?');
6 CTX_DDL.CREATE_STOPLIST ('test_stoplist', 'BASIC_STOPLIST');
7 END;
8 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> create index test_text_idx on test (text)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 parameters ( 'LEXER lextest STOPLIST test_stoplist')
4 /
Index created.
SCOTT@orcl_11g> -- These are the tokens that can be searched for:
SCOTT@orcl_11g> SELECT token_text FROM dr$test_text_idx$i
2 /
TOKEN_TEXT
----------------------------------------------------------------
10001
10001-11
1777
1779
ABC
ABC-20002-22-17772
ABC-20002-22-2888
ABC@10001-11#1777
UMB0001968-01-001
9 rows selected.
SCOTT@orcl_11g> -- If you want to find these results:
SCOTT@orcl_11g> select text from test
2 where text like '%'||'ABC%177'||'%'
3 /
TEXT
--------------------------------------------------
ABC 10001-11 1777
ABC 10001-11 1777
ABC@10001-11#1777
ABC 10001 1779
ABC-20002-22-17772
5 rows selected.
SCOTT@orcl_11g> -- then you need to use a search string like this one
SCOTT@orcl_11g> select text from test
2 where contains (text, '%ABC% AND %177%') > 0
3 /
TEXT
--------------------------------------------------
ABC 10001-11 1777
ABC 10001-11 1777
ABC@10001-11#1777
ABC 10001 1779
ABC-20002-22-17772
5 rows selected.
SCOTT@orcl_11g>
|
|
|
|
Goto Forum:
Current Time: Thu Jan 30 19:09:46 CST 2025
|