Home » Server Options » Text & interMedia » Problem with Catsearch and special characters in UTF8 (Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod)
Problem with Catsearch and special characters in UTF8 [message #364835] |
Wed, 10 December 2008 03:54 |
sbrkic
Messages: 16 Registered: December 2008
|
Junior Member |
|
|
Hi,
We have the following problems with catsearch:
In: searchAllSongs Sqlcode: -29902 SqlErrm: ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-51030: wildcard query expansion resulted in too many terms;
The searchString that is being sent to the DB is: note™*
When removing the copyright sign the query works fine.
I then decided put the trademarksign in the skipjoin and recreated the textindex. But the error is still being produced.
select dump('™') from dual;
Typ=96 Len=3: 226,132,162
Code which is generating the skipjoin:
ctx_ddl.set_attribute(mainViewName || '_' || suffix || '_SpecChLex', 'skipjoins', '`-=[];''\,./~!@#$%^&*()_+{}:"|<>?§´¨½¼¾¤£¤©™®');
When i check the user_source it seems that it has been compiled properly (see bold part which is representing the trademark sign):
SELECT DUMP(text, 1010) DMP FROM user_source e where text like '%skipjoin%'
Typ=1 Len=150 CharacterSet=AL32UTF8: 32,32,32,32,99,116,120,95,100,100,108,46,115,101,116,95,97,116,116,114,105,98,117,116,101,40,109,97,105,110,86,105,101,119,78,97,109, 101,32,124,124,32,39,95,39,32,124,124,32,115,117,102,102,105,120,32,124,124,32,39,95,83,112,101,99,67,104,76,101,120,39,44,32,39,115, 107,105,112,106,111,105,110,115,39,44,32,39,96,45,61,91,93,59,39,39,92,44,46,47,126,33,64,35,36,37,94,38,42,40,41,95,43,123,125,58,34 ,124,60,62,63,194,167,194,180,194,168,194,189,194,188,194,190,194,164,194,163,226,130,172,194,169,226,132,162,194,174,39,41,59,10
If a characther is in a skipjoin I thought that it would be removed before the search is executed. Is this wrong?
Hope somebody could help me or know if I should contact support for this matter.
Thanks,
Slavko
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #364918 is a reply to message #364835] |
Wed, 10 December 2008 08:16 |
sbrkic
Messages: 16 Registered: December 2008
|
Junior Member |
|
|
More on above issue:
My investigations shows following:
if i search on:
note™*
it translates it to following search:
note *
which means we search on note AND "everything else" the everything else causing the "wildcard query
expansion error"
I would have thought if I did not have it in the skipjoin it would have searched on note™*.
After putting it in the skipjoin and recreating the index I expected it to search on note*.
However in both cases it is as if it replaces the trademark sign with a space.
This happens with other signs as well. Such as:
©¤®
Any idea ?
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #364967 is a reply to message #364918] |
Wed, 10 December 2008 12:55 |
|
Barbara Boehmer
Messages: 9104 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
What does the following query return? (substituting your index name):
SELECT COUNT (DISTINCT dr$token)
FROM dr$your_index_name$i
WHERE dr$token LIKE 'NOTE%';
and what is the value of wildcard_maxterms that you are using? If you have not assigned a value, then the default is 5000 for your version. In order for the query to succeed the wildcard_maxterms must be greater than the count returned by the above query. I would recommend creating a wordlist and setting the wildcard_maxterms to the maximum value of 15000 for your version. If that does not solve the problem then you will have to find a way to make your queries more restrictive. I am unable to reproduce anything that indicates that your query is searching for "NOTE *" with a space instead of "NOTE*" without a space. That should expand to all tokens that begin with "NOTE". However, I am using 11g, so there could be a 10g bug that I am not encountering. Please see my sample code below, using as much of your code as was provided and adding the basic_wordlist and wildcard_maxterms. I included one word with the trademark in the middle, just to show that the skipjoins is treating it properly by removing it, not replacing it with a space when indexing. However, it sounds like you are saying your indexing is working properly, just that the search string may not be interpreted properly.
Perhaps you could provide the results of:
SELECT CTX_REPORT.CREATE_INDEX_SCRIPT ('your_index_name') FROM DUAL;
so that I could do a more thorough check. There may be some conflicting parameters or some such thing.
SCOTT@orcl_11g> CREATE TABLE test_tab
2 (test_col VARCHAR2(60))
3 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO test_tab VALUES ('word1 word2 word3')
3 INTO test_tab VALUES ('word1 note™ word3')
4 INTO test_tab VALUES ('word1 note™d word3')
5 INTO test_tab VALUES ('word1 noted word3')
6 INTO test_tab VALUES ('word1 notes word3')
7 INTO test_tab VALUES ('word1 noteworthy word3')
8 SELECT * FROM DUAL
9 /
6 rows created.
SCOTT@orcl_11g> SET DEFINE OFF
SCOTT@orcl_11g> DECLARE
2 mainViewName VARCHAR2(4) := 'test';
3 suffix VARCHAR2(3) := 'idx';
4 BEGIN
5 CTX_DDL.CREATE_PREFERENCE
6 (mainViewName || '_' || suffix || '_SpecChLex',
7 'BASIC_LEXER');
8 ctx_ddl.set_attribute
9 (mainViewName || '_' || suffix || '_SpecChLex',
10 'skipjoins',
11 '`-=[];''\,./~!@#$%^&*()_+{}:"|<>?§´¨½¼¾¤£¤©™®');
12 CTX_DDL.CREATE_PREFERENCE
13 (mainViewName || '_' || suffix || '_wordlist',
14 'BASIC_WORDLIST');
15 CTX_DDL.SET_ATTRIBUTE
16 (mainViewName || '_' || suffix || '_wordlist',
17 'WILDCARD_MAXTERMS', 15000);
18 END;
19 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> CREATE INDEX test_idx ON test_tab (test_col)
2 INDEXTYPE IS CTXSYS.CTXCAT
3 PARAMETERS
4 ('LEXER test_idx_SpecChLex
5 WORDLIST test_idx_wordlist')
6 /
Index created.
SCOTT@orcl_11g> COLUMN wmt HEADING 'WILDCARD_MAX_TERMS|MUST BE GREATER THAN THIS'
SCOTT@orcl_11g> SELECT COUNT (DISTINCT dr$token) AS wmt
2 FROM dr$test_idx$i WHERE dr$token LIKE 'NOTE%'
3 /
WILDCARD_MAX_TERMS
MUST BE GREATER THAN THIS
-------------------------
4
SCOTT@orcl_11g> SELECT DISTINCT dr$token FROM dr$test_idx$i
2 /
DR$TOKEN
----------------------------------------------------------------
NOTE
NOTED
NOTES
NOTEWORTHY
WORD1
WORD2
WORD3
7 rows selected.
SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CATSEARCH (test_col, 'note™*', NULL) > 0
2 /
TEST_COL
------------------------------------------------------------
word1 note™ word3
word1 note™d word3
word1 noted word3
word1 notes word3
word1 noteworthy word3
SCOTT@orcl_11g>
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #365204 is a reply to message #364835] |
Thu, 11 December 2008 01:45 |
sbrkic
Messages: 16 Registered: December 2008
|
Junior Member |
|
|
Hi Barbara and thanks for your quick reply,
I am pretty sure that it is replacing the sign with a space.
I did following search:
select * from mv_song_se
where catsearch(searchText, 'bruce©springsteen*', 'order by sortorder') > 0;
returning 487 rows;
select * from mv_song_se
where catsearch(searchText, 'bruce springsteen*', 'order by sortorder') > 0;
returning 487 rows;
select * from mv_song_se
where catsearch(searchText, 'brucespringsteen*', 'order by sortorder') > 0;
no rows returned;
It only happens with a few signs as I mentioned earlier. All other signs we are using are being handled properly. There is no difference having the sign in the skipjoin or not.
This behaviour only gives us the problem when the user searches with such a sign at the end of his searchstring (as we always append wildcard searches for strings longer than 2 characters)
user is entering the following searchstring: bruce©
The front end is appending the wildcard search which gives: bruce©*
The bug is making following happen: bruce *
What do you get from following query?
select dump('™') from dual;
As we have more than 2 million songs in our table concataneted with album name and artist we are getting the "wildcard query expansion" error.
SELECT CTX_REPORT.CREATE_INDEX_SCRIPT ('MV_SONG_SE_BATMAN_IND') FROM DUAL;
begin
ctx_ddl.create_preference('"MV_SONG_SE_BATMAN_IND_LEX"','BASIC_LEXER');
ctx_ddl.set_attribute('"MV_SONG_SE_BATMAN_IND_LEX"','SKIPJOINS','`-=[];'\,./~!@#$%^&*()_+{}:"|<>?§´¨½¼¾¤£¿©¿®');
end;
/
begin
ctx_ddl.create_preference('"MV_SONG_SE_BATMAN_IND_WDL"','BASIC_WORDLIST');
ctx_ddl.set_attribute('"MV_SONG_SE_BATMAN_IND_WDL"','STEMMER','ENGLISH');
ctx_ddl.set_attribute('"MV_SONG_SE_BATMAN_IND_WDL"','FUZZY_MATCH','GENERIC');
end;
/
begin
ctx_ddl.create_stoplist('"MV_SONG_SE_BATMAN_IND_SPL"','BASIC_STOPLIST');
end;
/
begin
ctx_ddl.create_preference('"MV_SONG_SE_BATMAN_IND_STO"','BASIC_STORAGE');
ctx_ddl.set_attribute('"MV_SONG_SE_BATMAN_IND_STO"','R_TABLE_CLAUSE','lob (data) store as (cache)');
ctx_ddl.set_attribute('"MV_SONG_SE_BATMAN_IND_STO"','I_INDEX_CLAUSE','compress 2');
end;
/
begin
ctx_ddl.create_index_set('"MV_SONG_SE_BATMAN_IND_IXS"');
ctx_ddl.add_index('"MV_SONG_SE_BATMAN_IND_IXS"','GENREID');
ctx_ddl.add_index('"MV_SONG_SE_BATMAN_IND_IXS"','SORTORDER');
end;
/
begin
ctx_output.start_log('MV_SONG_SE_BATMAN_IND_LOG');
end;
/
create index "MEDIA_CORE"."MV_SONG_SE_BATMAN_IND"
on "MEDIA_CORE"."MV_SONG_SE_BATMAN"
("SEARCHTEXT")
indextype is ctxsys.ctxcat
parameters('
lexer "MV_SONG_SE_BATMAN_IND_LEX"
wordlist "MV_SONG_SE_BATMAN_IND_WDL"
stoplist "MV_SONG_SE_BATMAN_IND_SPL"
storage "MV_SONG_SE_BATMAN_IND_STO"
index set "MV_SONG_SE_BATMAN_IND_IXS"
')
/
begin
ctx_output.end_log;
end;
/
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #365218 is a reply to message #364835] |
Thu, 11 December 2008 02:35 |
sbrkic
Messages: 16 Registered: December 2008
|
Junior Member |
|
|
Also following your test I am not getting the same results which leads me to believe that it really is a 10g bug:
SELECT DISTINCT dr$token FROM dr$test_idx$i;
[COLOR=red]D[/COLOR]
NOTE
NOTED
NOTES
NOTEWORTHY
WORD1
WORD2
WORD3
SELECT * FROM test_tab WHERE CATSEARCH (test_col, 'note™*', NULL) > 0;
word1 note™ word3
word1 note™d word3
Which suggests it has indexed the words according:
word1 note d word3
instead of
word1 noted word3
Bug or what do you think?
Slavko
SELECT CTX_REPORT.CREATE_INDEX_SCRIPT ('test_idx') FROM DUAL;
begin
ctx_ddl.create_preference('"TEST_IDX_LEX"','BASIC_LEXER');
ctx_ddl.set_attribute('"TEST_IDX_LEX"','SKIPJOINS','`-=[];'\,./~!@#$%^*()_+{}:"|<>?§£¤©™®');
end;
/
begin
ctx_ddl.create_preference('"TEST_IDX_WDL"','BASIC_WORDLIST');
ctx_ddl.set_attribute('"TEST_IDX_WDL"','WILDCARD_MAXTERMS','15000');
end;
/
begin
ctx_ddl.create_stoplist('"TEST_IDX_SPL"','BASIC_STOPLIST');
{removed the stopwords}
end;
/
begin
ctx_ddl.create_preference('"TEST_IDX_STO"','BASIC_STORAGE');
ctx_ddl.set_attribute('"TEST_IDX_STO"','R_TABLE_CLAUSE','lob (data) store as (cache)');
ctx_ddl.set_attribute('"TEST_IDX_STO"','I_INDEX_CLAUSE','compress 2');
end;
/
begin
ctx_ddl.create_index_set('"TEST_IDX_IXS"');
end;
/
begin
ctx_output.start_log('TEST_IDX_LOG');
end;
/
create index "MEDIA_CORE"."TEST_IDX"
on "MEDIA_CORE"."TEST_TAB"
("TEST_COL")
indextype is ctxsys.ctxcat
parameters('
lexer "TEST_IDX_LEX"
wordlist "TEST_IDX_WDL"
stoplist "TEST_IDX_SPL"
storage "TEST_IDX_STO"
index set "TEST_IDX_IXS"
')
/
begin
ctx_output.end_log;
end;
/
|
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #366311 is a reply to message #365218] |
Fri, 12 December 2008 13:46 |
|
Barbara Boehmer
Messages: 9104 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
I saw your posts on asktom.oracle.com that included your trace files:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:440419921146#1365474700346010440
I can see that behind the scenes, catsearch is doing a select from the domain index tables using a bind variable something like this:
SCOTT@orcl_11g> VARIABLE lkexpr VARCHAR2(30)
SCOTT@orcl_11g> EXEC :lkexpr := 'NOTE%'
PL/SQL procedure successfully completed.
LKEXPR
--------------------------------
NOTE%
SCOTT@orcl_11g> SELECT/*+DYNAMIC_SAMPLING(0) INDEX(T "DR$TEST_IDX$X")*/ DISTINCT DR$TOKEN FROM
2 "DR$TEST_IDX$I" T WHERE DR$TOKEN LIKE :lkexpr ESCAPE '\' and
3 (DR$TOKEN_TYPE IN (0, 4, 6, 604) OR (DR$TOKEN_TYPE BETWEEN 16 AND 74) OR
4 (DR$TOKEN_TYPE BETWEEN 616 AND 674))
5 /
DR$TOKEN
----------------------------------------------------------------
NOTE
NOTED
NOTES
NOTEWORTHY
SCOTT@orcl_11g>
I suspect that your index is being created properly but something is happening to the :lkexpr bind variable being used, such that it does not contain the value 'NOTE%' as it should. This could be due to some overriding parameter such as whitespace attribute or a conflict with the character set or a bug.
I see that Tom Kyte referred you to the OTN forums. I recommend that you post your problem on the OTN Text forum:
http://forums.oracle.com/forums/forum.jspa?forumID=71
Oracle employee Roger Ford regularly responds there. He has been involved in Oracle Text for many years and I believe he is the current product manager. He should be able to determine whether the problem is due to a bug or not. You might want to include links to this thread and asktom for his reference to save yourself some repetition.
[Updated on: Fri, 12 December 2008 13:50] Report message to a moderator
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #375777 is a reply to message #366311] |
Sun, 14 December 2008 09:02 |
sbrkic
Messages: 16 Registered: December 2008
|
Junior Member |
|
|
Hi Barbara,
I did some test last friday and this seems to have solved the problem:
ctx_ddl.set_attribute ('TEST_IDX_LEX', 'WHITESPACE', ' ');
I did the test on the small test table I am getting correct results from
SELECT DISTINCT dr$token FROM dr$test_idx$i;
Why do you think the index was being populated properly before? The SELECT DISTINCT dr$token FROM dr$test_idx$i; gave me wrong results before.
However thanks a lot. It seems to have solved my problem. I will update askTom once I have done the test on my real table. Still thinks it is a bug which probably will be solved when we upgrade to 11g. I will install 11g as well just to test as well.
Once again thanks a lot !!
Best regards,
Slavko
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #375791 is a reply to message #375777] |
Sun, 14 December 2008 13:20 |
|
Barbara Boehmer
Messages: 9104 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
I am glad your problem is solved. I thought the index was O.K. because I thought your posted results were the same as mine. I missed the D. So, It looks like the problem was just that, by default, it was treating the trademark symbol as whitespace. It is documented that when there are conflicts between parameters such as whitespace and skipjoins, priority is given to one and the other is ignored. However, unless you have changed the default parameters, only the space character and the tab character should be treated as whitespace.
The default lexer can be changed using ctx_adm. You can confirm whether your system is using the original default lexer or not like so:
SCOTT@orcl_11g> SELECT par_value
2 FROM ctx_parameters
3 WHERE par_name = 'DEFAULT_LEXER'
4 /
PAR_VALUE
--------------------------------------------------------------------------------
CTXSYS.DEFAULT_LEXER
You can also check whether any values within the default lexer have been changed, like so:
SCOTT@orcl_11g> SELECT prv_attribute, prv_value
2 FROM ctx_preference_values
3 WHERE prv_owner || '.' || prv_preference =
4 (SELECT par_value
5 FROM ctx_parameters
6 WHERE par_name = 'DEFAULT_LEXER')
7 /
PRV_ATTRIBUTE
------------------------------
PRV_VALUE
--------------------------------------------------------------------------------
WHITESPACE
SCOTT@orcl_11g>
|
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #381888 is a reply to message #364835] |
Tue, 20 January 2009 04:10 |
sbrkic
Messages: 16 Registered: December 2008
|
Junior Member |
|
|
Hi Barabra,
Its been a while but have ben busy with other tasks unfortunately. However the WHITESPACE does not work. I did this test on my laptop which has WE and not UTF8 character set (works well on WE but not UTF8). I will post my problem on the OTN forum you suggested. Another totally different question if its ok. When people search for Beyoncé I want them to find Beyoncé and Beyonce. What is the easiest way to implement this? I cannot change all data to Beyoncé.
Thanks,
|
|
|
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #382117 is a reply to message #364835] |
Wed, 21 January 2009 00:40 |
sbrkic
Messages: 16 Registered: December 2008
|
Junior Member |
|
|
We have offered the customer to introduce BASIC_LEXER attribute BASE_LETTER to YES. So they want to know exactly which letters will be transformed (to see if it will cover their needs). So I have tried to find this table to give them a report of the letters. I have not been able to find it in the documentation nor through googling.
Slavko
|
|
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #382136 is a reply to message #382121] |
Wed, 21 January 2009 01:49 |
|
Barbara Boehmer
Messages: 9104 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
You can derive all of the conversions by loading all possible ascii characters into a test table, indexing them using a lexer with base_letter, then selecting from a join of the test table and the domain index table, as demonstrated below.
SCOTT@orcl_11g> CREATE TABLE test_tab (test_col VARCHAR2(1))
2 /
Table created.
SCOTT@orcl_11g> INSERT INTO test_tab
2 SELECT CHR (LEVEL)
3 FROM DUAL
4 CONNECT BY LEVEL <= 256
5 /
256 rows created.
SCOTT@orcl_11g> BEGIN
2 CTX_DDL.CREATE_PREFERENCE ('test_lex', 'BASIC_LEXER');
3 CTX_DDL.SET_ATTRIBUTE ('test_lex', 'BASE_LETTER', 'YES');
4 CTX_DDL.SET_ATTRIBUTE ('test_lex', 'MIXED_CASE', 'YES');
5 END;
6 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> CREATE INDEX test_idx ON test_tab (test_col)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 PARAMETERS
4 ('LEXER test_lex
5 STOPLIST CTXSYS.EMPTY_STOPLIST')
6 /
Index created.
SCOTT@orcl_11g> COLUMN base_letter FORMAT A11
SCOTT@orcl_11g> COLUMN original_values FORMAT A30
SCOTT@orcl_11g> SELECT i.token_text AS base_letter,
2 WM_CONCAT (t.test_col) AS original_values
3 FROM test_tab t, dr$test_idx$i i
4 WHERE CONTAINS (t.test_col, i.token_text) > 0
5 AND t.test_col <> i.token_text
6 GROUP BY i.token_text
7 ORDER BY i.token_text
8 /
BASE_LETTER ORIGINAL_VALUES
----------- ------------------------------
A À,Á,Ã,Â,Ä,Å
C Ç
E È,Ê,É,Ë
I Ì,Ï,Í,Î
N Ñ
O Ò,Õ,Ø,Ö,Ô,Ó
S ¦
U Ù,Ú,Û,Ü
Y ¾,Ý
Z ´
a à,á,ã,å,ä,â
c ç
e è,ê,ë,é
i ì,í,î,ï
n ñ
o ò,õ,ô,ó,ö,ø
s ¨
u ù,ú,ü,û
y ý,ÿ
z ¸
20 rows selected.
SCOTT@orcl_11g>
|
|
|
|
Re: Problem with Catsearch and special characters in UTF8 [message #382199 is a reply to message #364835] |
Wed, 21 January 2009 06:01 |
sbrkic
Messages: 16 Registered: December 2008
|
Junior Member |
|
|
Fixed it. I had to modify according to the following. Thanks a lot.
INSERT INTO t_tab
SELECT CHR (LEVEL), level
FROM DUAL
WHERE length (CHR (LEVEL)) = 1
CONNECT BY LEVEL <= 100000 /* lot of characters in the 50000 region */
;
SELECT i.token_text AS base_letter,
WM_CONCAT (t.test_col) AS original_values
FROM t_tab t, dr$t_idx$i i
WHERE CONTAINS (t.test_col, i.token_text) > 0
AND ascii(t.test_col) <> ascii(i.token_text)
GROUP BY i.token_text
ORDER BY i.token_text;
|
|
|
|
Goto Forum:
Current Time: Thu Jan 30 13:26:44 CST 2025
|