MULTI_STOPLIST using WORLD_LEXER [message #391152]
Wed, 11 March 2009 04:15
leon_buijsman
Messages: 13 Registered: March 2009 Location: Rotterdam
Junior Member
We are indexing documents in different languages and will be using WORLD_LEXER, which identifies the language of a document automatically.
That is fine, but what should we do about the DEFAULT_STOPLIST? By default it covers a single language (English in our case). There is an option called MULTI_STOPLIST. According to the manual, it requires that you know the language of a document up front (before indexing), and how it behaves during queries is not clear to us. The manual says only: "At query time, the session language setting determines the active stopwords, like it determines the active lexer when using the multi-lexer."
Nothing is stated about the use of MULTI_STOPLIST in combination with WORLD_LEXER.
My questions:
a) Does the MULTI_STOPLIST work together with WORLD_LEXER, and does it use the language that the WORLD_LEXER has determined?
If not, is there any alternative apart from building a stoplist ourselves that combines stopwords from different languages in one list?
b) What happens at query time when you are using a web application? Will it default to the session language, which is English? Or is there some way we can influence that?
Re: MULTI_STOPLIST using WORLD_LEXER [message #391337 is a reply to message #391246]
Wed, 11 March 2009 18:10
Barbara Boehmer
Messages: 9104 Registered: November 2002 Location: California, USA
Senior Member
If you search the internet, you can find some language-identification products and ideas that you might be able to use, such as:
http://www.lextek.com/langid/li/
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
However, I think I would be inclined to use the world_lexer to get the benefits of automatic language detection, but also provide a language column that could be used by the multi_stoplist, and request, but not require, that the uploader of a document provide the language of the document.
With the multi_stoplist, you can specify any words, such as "a", that you want to be stopwords in all languages, including for documents for which no language was specified. Then you can specify other words for individual languages only. These language-specific stopwords take effect in two ways. First, they apply to documents whose language has been identified in the language column and matches the stopword language, regardless of the session language. Second, they apply to all documents, whether or not a language is identified, when the stopword language matches the session language.
For example, the German word "die" is the counterpart of the English word "the", so you would want it to be a German stopword, but not an English stopword. So, you would get more accurate results if the document language and/or session language is specified, and results that include some stopwords when they are not specified. Please see the demonstration below that illustrates this.
SCOTT@orcl_11g> CREATE TABLE test_tab
2 (id_col NUMBER,
3 data_col VARCHAR2 (30),
4 lang_col VARCHAR2 (10))
5 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO test_tab VALUES (1, 'Live and Let Die', 'english')
3 INTO test_tab VALUES (2, 'Die Katze im Hut', 'german')
4 INTO test_tab VALUES (3, 'Live and Let Die', NULL)
5 INTO test_tab VALUES (4, 'Die Katze im Hut', NULL)
6 INTO test_tab VALUES (5, 'a', NULL)
7 INTO test_tab VALUES (6, 'a', 'english')
8 INTO test_tab VALUES (7, 'a', 'german')
9 SELECT * FROM DUAL
10 /
7 rows created.
SCOTT@orcl_11g> BEGIN
2 CTX_DDL.CREATE_PREFERENCE ('test_lex', 'WORLD_LEXER');
3 CTX_DDL.CREATE_STOPLIST ('test_stop', 'MULTI_STOPLIST');
4 CTX_DDL.ADD_STOPWORD ('test_stop', 'Die','german');
5 CTX_DDL.ADD_STOPWORD ('test_stop', 'a','all');
6 END;
7 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> CREATE INDEX test_idx ON test_tab (data_col)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 PARAMETERS
4 ('LEXER test_lex
5 STOPLIST test_stop
6 LANGUAGE COLUMN lang_col')
7 /
Index created.
SCOTT@orcl_11g> SELECT token_text FROM dr$test_idx$i
2 /
TOKEN_TEXT
----------------------------------------------------------------
AND
DIE
HUT
IM
KATZE
LET
LIVE
7 rows selected.
SCOTT@orcl_11g> ALTER SESSION SET NLS_LANGUAGE = 'ENGLISH'
2 /
Session altered.
SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CONTAINS (data_col, 'die') > 0
2 /
ID_COL DATA_COL LANG_COL
---------- ------------------------------ ----------
1 Live and Let Die english
3 Live and Let Die
4 Die Katze im Hut
SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CONTAINS (data_col, 'a') > 0
2 /
no rows selected
SCOTT@orcl_11g> ALTER SESSION SET NLS_LANGUAGE = 'GERMAN'
2 /
Session altered.
SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CONTAINS (data_col, 'die') > 0
2 /
no rows selected
SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CONTAINS (data_col, 'a') > 0
2 /
no rows selected
SCOTT@orcl_11g>
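If you want to rerun the demonstration from scratch, or try it with an additional language-specific stopword, a script along the following lines should do it. Please treat it as an untested sketch: the object names are the ones from the demo above, the English stopword 'the' is only an illustrative assumption of mine, and the DROP statements will raise errors if the objects do not exist yet.

```sql
-- Remove the demo index first, then the preferences it depends on.
DROP INDEX test_idx;

BEGIN
  CTX_DDL.DROP_STOPLIST ('test_stop');
  CTX_DDL.DROP_PREFERENCE ('test_lex');
END;
/

-- Recreate the lexer and the multi-language stoplist.
BEGIN
  CTX_DDL.CREATE_PREFERENCE ('test_lex', 'WORLD_LEXER');
  CTX_DDL.CREATE_STOPLIST ('test_stop', 'MULTI_STOPLIST');
  CTX_DDL.ADD_STOPWORD ('test_stop', 'Die', 'german');
  -- Illustrative assumption: an English-only stopword, not in the original demo.
  CTX_DDL.ADD_STOPWORD ('test_stop', 'the', 'english');
  -- 'all' makes the word a stopword for every language.
  CTX_DDL.ADD_STOPWORD ('test_stop', 'a', 'all');
END;
/

-- Rebuild the index with the language column so the stoplist can use it.
CREATE INDEX test_idx ON test_tab (data_col)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS
    ('LEXER test_lex
      STOPLIST test_stop
      LANGUAGE COLUMN lang_col');

-- Set the session language before querying, as in the demo above.
ALTER SESSION SET NLS_LANGUAGE = 'GERMAN';

SELECT id_col, data_col, lang_col
FROM   test_tab
WHERE  CONTAINS (data_col, 'die') > 0;
```

For the web application case, the idea would be to have the application issue the ALTER SESSION itself, based on whatever language the user has selected, before it runs the CONTAINS query. How you do that will depend on how your application handles its database connections.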