Home » Server Options » Text & interMedia » Stem operator ($) problem
Stem operator ($) problem [message #269471] |
Sat, 22 September 2007 12:46 |
amazing
Messages: 46 Registered: September 2007 Location: Venezuela
|
Member |
|
|
Hi. I have a problem with Oracle Text stem operator($). I’m developing my own “Spanish” text search application by the Oracle® Text Reference 10g Release 2 (10.2) B14218-01 (June 2005) documentation. So, I have the following structure:
A base table to store de text with two fields: the key (NUMBER) and another field (VARCHAR) to store the text.
These are my own preferences:
ctx_ddl.create_preference('FNTC_Lexer', 'BASIC_LEXER');
ctx_ddl.set_attribute('FNTC_Lexer', 'base_letter', 'yes');
ctx_ddl.set_attribute('FNTC_Lexer', 'base_letter_type', 'GENERIC');
ctx_ddl.set_attribute('FNTC_Lexer', 'override_base_letter', 'false');
ctx_ddl.set_attribute('FNTC_Lexer', 'mixed_case', 'NO');
ctx_ddl.set_attribute('FNTC_Lexer', 'index_themes', 'NO');
ctx_ddl.set_attribute('FNTC_Lexer', 'index_stems', 'SPANISH');
ctx_ddl.set_attribute('FNTC_Lexer', 'index_text', 'YES');
-----------------------------------------------------------------
ctx_ddl.create_preference('FNTC_Word_List', 'BASIC_WORDLIST');
ctx_ddl.set_attribute('FNTC_Word_List', 'stemmer', 'SPANISH');
ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_match', 'SPANISH');
ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_score', '60');
ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_numresults', '100');
ctx_ddl.set_attribute('FNTC_Word_List', 'substring_index', 'true');
ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_index', 'true');
ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_min_length', '1');
ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_max_length', '50');
ctx_ddl.set_attribute('FNTC_Word_List', 'wildcard_maxterms', '50');
-----------------------------------------------------------------
ctx_ddl.create_stoplist('FNTC_Stop_List', 'BASIC_STOPLIST');
In this last case I use an empty stop list because hay I do not want to use any stop words.
This is the index structure:
create index IDX_DENOMINACION on dbusqueda(descripcion)
indextype is ctxsys.context
parameters('LEXER FNTC_Lexer WORDLIST FNTC_Word_List STOPLIST FNTC_Stop_List SYNC (ON COMMIT)');
I have this data in the base table:
ID DESCRIPCION
-----------------
1 PEPITO
2 PEPITA
3 PEPITE
4 PEPITIN
5 PEPITU
6 PEPE
So here is de problem. When I use the stem operator ($):
select score(1), t.*
from dbusqueda t
where contains(descripcion, '$pepito', 1) > 0 order by score(1) desc;
it should return all rows. Doesn’t it??
But only returns
ID DESCRIPTION
-----------------
1 PEPITO
6 PEPE
So I did the book’s stem example (chapter 3, Oracle Text CONTAINS Query Operators, page 3-36) to make me sure is problem of my solution. And I change my preferences attributes INDEX_STEMS, STEMMER and FUZZY_MATCH values to ENGLISH, change my base table data to values same as the example:
ID DESCRIPCION
-----------------
1 scream
2 screaming
3 screamed
then I rebuild the index
ALTER INDEX IDX_DENOMINACION REBUILD;
So, when I use the stem operator once again:
select score(1), t.*
from dbusqueda t
where contains(descripcion, '$scream', 1) > 0 order by score(1) desc;
it should return all rows. Doesn’t it??
But only returns
ID DESCRIPCION
-----------------
1 scream
When would have to be all the rows, just like the example.
So…what is wrong here…?
|
|
|
Re: Stem operator ($) problem [message #269488 is a reply to message #269471] |
Sat, 22 September 2007 19:34 |
|
Barbara Boehmer
Messages: 9104 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The stem operator selects words that stem from the same root word, not just words that start with the same letters. I believe that they must match as to whether they are nouns or verbs and masculine or feminine as well. So, I believe you got the expeted results for the spanish words that you used. I added some additional words to the test below.
When you change the attributes of your preferences, it is not sufficient to rebuild the index. You need to drop the index and recreate it. When the index is created, it fetches the current attribute values and stores those values, not the preference names. So, if you only rebuild the index, it uses the old stored values. It only fetches the new values if you drop and recreate the index. I have demonstrated that below.
-- initial test environment:
SCOTT@10gXE> SELECT banner FROM v$version
2 /
BANNER
----------------------------------------------------------------
Oracle Database 10g Express Edition Release 10.2.0.1.0 - Product
PL/SQL Release 10.2.0.1.0 - Production
CORE 10.2.0.1.0 Production
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production
SCOTT@10gXE> CREATE TABLE dbusqueda
2 (key NUMBER,
3 descripcion VARCHAR2(30))
4 /
Table created.
SCOTT@10gXE> INSERT ALL
2 INTO dbusqueda VALUES (1, 'PEPITO')
3 INTO dbusqueda VALUES (2, 'PEPITA')
4 INTO dbusqueda VALUES (3, 'PEPITE')
5 INTO dbusqueda VALUES (4, 'PEPITIN')
6 INTO dbusqueda VALUES (5, 'PEPITU')
7 INTO dbusqueda VALUES (6, 'PEPE')
8 INTO dbusqueda VALUES (7, 'El chilla.') -- verbo
9 INTO dbusqueda VALUES (8, 'El chilló.') -- verbo
10 INTO dbusqueda VALUES (9, 'Oí un chillido.') -- nombre
11 INTO dbusqueda VALUES (10, 'Oí algunos chillidos.') -- nombre
12 INTO dbusqueda VALUES (11, 'scream')
13 INTO dbusqueda VALUES (12, 'screaming')
14 INTO dbusqueda VALUES (13, 'screamed')
15 SELECT * FROM DUAL
16 /
13 rows created.
SCOTT@10gXE> BEGIN
2 ctx_ddl.create_preference('FNTC_Lexer', 'BASIC_LEXER');
3 ctx_ddl.set_attribute('FNTC_Lexer', 'base_letter', 'yes');
4 ctx_ddl.set_attribute('FNTC_Lexer', 'base_letter_type', 'GENERIC');
5 ctx_ddl.set_attribute('FNTC_Lexer', 'override_base_letter', 'false');
6 ctx_ddl.set_attribute('FNTC_Lexer', 'mixed_case', 'NO');
7 ctx_ddl.set_attribute('FNTC_Lexer', 'index_themes', 'NO');
8 ctx_ddl.set_attribute('FNTC_Lexer', 'index_stems', 'SPANISH');
9 ctx_ddl.set_attribute('FNTC_Lexer', 'index_text', 'YES');
10 ctx_ddl.create_preference('FNTC_Word_List', 'BASIC_WORDLIST');
11 ctx_ddl.set_attribute('FNTC_Word_List', 'stemmer', 'SPANISH');
12 ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_match', 'SPANISH');
13 ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_score', '60');
14 ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_numresults', '100');
15 ctx_ddl.set_attribute('FNTC_Word_List', 'substring_index', 'true');
16 ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_index', 'true');
17 ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_min_length', '1');
18 ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_max_length', '50');
19 ctx_ddl.set_attribute('FNTC_Word_List', 'wildcard_maxterms', '50');
20 ctx_ddl.create_stoplist('FNTC_Stop_List', 'BASIC_STOPLIST');
21 END;
22 /
PL/SQL procedure successfully completed.
SCOTT@10gXE> create index IDX_DENOMINACION on dbusqueda(descripcion)
2 indextype is ctxsys.context
3 parameters
4 ('LEXER FNTC_Lexer
5 WORDLIST FNTC_Word_List
6 STOPLIST FNTC_Stop_List
7 SYNC (ON COMMIT)')
8 /
Index created.
-- tests of spanish words:
SCOTT@10gXE> select score(1), t.*
2 from dbusqueda t
3 where contains(descripcion, '$pepito', 1) > 0
4 order by score(1) desc
5 /
SCORE(1) KEY DESCRIPCION
---------- ---------- ------------------------------
5 6 PEPE
5 1 PEPITO
SCOTT@10gXE> select score(1), t.*
2 from dbusqueda t
3 where contains(descripcion, '$chilla', 1) > 0 -- verbo
4 order by score(1) desc
5 /
SCORE(1) KEY DESCRIPCION
---------- ---------- ------------------------------
5 8 El chilló.
5 7 El chilla.
SCOTT@10gXE> select score(1), t.*
2 from dbusqueda t
3 where contains(descripcion, '$chilló', 1) > 0 -- verbo
4 order by score(1) desc
5 /
SCORE(1) KEY DESCRIPCION
---------- ---------- ------------------------------
5 8 El chilló.
5 7 El chilla.
SCOTT@10gXE> select score(1), t.*
2 from dbusqueda t
3 where contains(descripcion, '$chillido', 1) > 0 -- nombre
4 order by score(1) desc
5 /
SCORE(1) KEY DESCRIPCION
---------- ---------- ------------------------------
6 10 Oí algunos chillidos.
6 9 Oí un chillido.
SCOTT@10gXE> select score(1), t.*
2 from dbusqueda t
3 where contains(descripcion, '$chillidos', 1) > 0 -- nombre
4 order by score(1) desc
5 /
SCORE(1) KEY DESCRIPCION
---------- ---------- ------------------------------
6 10 Oí algunos chillidos.
6 9 Oí un chillido.
SCOTT@10gXE>
-- change attributes:
SCOTT@10gXE> BEGIN
2 ctx_ddl.set_attribute('FNTC_Lexer', 'index_stems', 'ENGLISH');
3 ctx_ddl.set_attribute('FNTC_Word_List', 'stemmer', 'ENGLISH');
4 ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_match', 'ENGLISH');
5 END;
6 /
PL/SQL procedure successfully completed.
-- just rebuilding the index is not sufficient:
SCOTT@10gXE> ALTER INDEX idx_denominacion REBUILD
2 /
Index altered.
SCOTT@10gXE> select score(1), t.*
2 from dbusqueda t
3 where contains(descripcion, '$scream', 1) > 0
4 order by score(1) desc
5 /
SCORE(1) KEY DESCRIPCION
---------- ---------- ------------------------------
6 11 scream
-- dropping and recreating the index works:
SCOTT@10gXE> DROP INDEX idx_denominacion
2 /
Index dropped.
SCOTT@10gXE> create index IDX_DENOMINACION on dbusqueda(descripcion)
2 indextype is ctxsys.context
3 parameters
4 ('LEXER FNTC_Lexer
5 WORDLIST FNTC_Word_List
6 STOPLIST FNTC_Stop_List')
7 /
Index created.
SCOTT@10gXE> select score(1), t.*
2 from dbusqueda t
3 where contains(descripcion, '$scream', 1) > 0
4 order by score(1) desc
5 /
SCORE(1) KEY DESCRIPCION
---------- ---------- ------------------------------
5 11 scream
5 13 screamed
5 12 screaming
SCOTT@10gXE>
|
|
|
|
Goto Forum:
Current Time: Thu Jan 30 18:58:01 CST 2025
|