Home » Server Options » Text & interMedia » Problem with special chars in BLOB data type using contains keyword (oracle,9i, windows XP professional 2002)
Problem with special chars in BLOB data type using contains keyword [message #446334] |
Sun, 07 March 2010 22:38 |
hasham.alis
Messages: 1 Registered: March 2010
|
Junior Member |
|
|
Facing problem, when part searching with special chars in BLOB datatype. It is considering the non alpha-numeric chars as a separtor in a provided string
EX:
SELECT *
FROM RESUME_TEST P,grst_candidate d
WHERE d.candidate_id = p.candidate_id
AND CONTAINS(P.CAND_RESUME,'%VB.NET%',1) > 0
Strings: , VB.NET , PL/SQL AS/400 , C etc..
Followed the below approaches
1) created a table:
Syntax: create table resume_Test(cand_id number(10),cand_resume blob);
2) inserted the values into this table upto 60,000
3) created a context index
3.1 created preferences
Syntax:
BEGIN
ctx_ddl.create_preference('try_lexer3','BASIC_LEXER');
ctx_ddl.set_attribute('try_lexer3','printjoins','-_~!@#$%^&*(){}[],=?\;|><.+');
END;
3.2 created context index
Syntax:
CREATE INDEX CANDRESUME_CTX_IDX ON resume_test (cand_resume)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER try_lexer3 memory 500M');
4) while executing this index, it is taking much time approx 6 hrs(plz explain why it is taking time)
5) Problems:
5.1 when searching with string(VB.NET , PL/SQL AS/400 , C etc..) it is considering the special char as a separator
5.2 used escape char (\) also, but no effect
5.3 when searching with single char, it is giving error (ORA-29902,ORA-20000,DRG-51030)
5.4 getting the above error with wild card chars (& ,_, (),{},[])
So, please explain the clear scenarios, why am getting this error , and how to get the proper results.
|
|
|
Re: Problem with special chars in BLOB data type using contains keyword [message #446449 is a reply to message #446334] |
Mon, 08 March 2010 08:20 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Since the two tables in your query are joined, there should be indexes on the columns (cand_id) used in joining, or even better primary and foreign keys if appropriate. That enables the query to execute faster.
You left the forward slash "/" out of your printjoins values, so that would be why the "PL" and "SQL" of "PL/SQL" were tokenized separately, since it considered "/" as a separator.
If you want to search on single characters, then you need to use an empty stoplist when creating the index. If you do not specify a stoplist, then the default stoplist is used, which considers all single letters as stopwords, so searching for just a single letter, would be like searching for an empty string, which would cause an error.
Please see the following section of the online documentation for tips on how long to expect indexing to take and how to decrease the time:
http://download.oracle.com/docs/cd/B10501_01/text.920/a96517/aoptim.htm#1008973
After indexing, you can select the token_text column from the dr$...$i domain index table that is created by the indexing, to see how things are tokenized and check whether your sinlge characters are tokenized and whether your printjoins are working as expected.
You should gather current statistics on both tables, to enable the optimizer to choose the best execution plan. You can use dbms_stats.gather_table_stats to do this.
When searching, it is not necessary to use %. If you search for 'VB.NET' without the %, it will locate the row with 'VB.NET' anywhere in the text of the column. The % would be for searching for words that have that somwhere in the middle, like 'AVB.NETC'. The %, especially the leading one, slows down the query greatly. If you do need that leading wildcard, then you shold look into substring indexing to optimize those querie.
Instead of "select *", you should select the individual columns or your blob column will not be legible. You can use utl_raw.cast_to_varchar2 or ctx_doc.snippet or some such thing to display the value of the blob column in a legible manner.
You should use a bind variable for your search and assign the different values to be searched for to your bind variable, instead of searching for the literale values. That way the query in the SGA can be re-used, instead of re-parsing, which can save a lot of execution time.
Pleae see the demonstration below.
SCOTT@orcl_11g> -- create tables with primary and foreign keys
SCOTT@orcl_11g> -- (the keys will speed queries that join the tables by them):
SCOTT@orcl_11g> CREATE TABLE grst_candidate
2 (cand_id NUMBER (10),
3 cand_name VARCHAR2 (10),
4 CONSTRAINT cand_pk PRIMARY KEY (cand_id))
5 /
Table created.
SCOTT@orcl_11g> CREATE TABLE resume_Test
2 (cand_id NUMBER (10),
3 cand_resume BLOB,
4 CONSTRAINT cand_fk FOREIGN KEY (cand_id)
5 REFERENCES grst_candidate (cand_id))
6 /
Table created.
SCOTT@orcl_11g> -- insert data:
SCOTT@orcl_11g> INSERT ALL
2 INTO grst_candidate VALUES (1, 'cand1')
3 INTO grst_candidate VALUES (2, 'cand2')
4 INTO grst_candidate VALUES (3, 'cand3')
5 INTO grst_candidate VALUES (4, 'cand4')
6 SELECT * FROM DUAL
7 /
4 rows created.
SCOTT@orcl_11g> INSERT ALL
2 INTO resume_test VALUES (1, UTL_RAW.CAST_TO_RAW ('word1 VB.NET word3'))
3 INTO resume_test VALUES (2, UTL_RAW.CAST_TO_RAW ('word1 PL/SQL word3'))
4 INTO resume_test VALUES (3, UTL_RAW.CAST_TO_RAW ('word1 AS/400 word3'))
5 INTO resume_test VALUES (4, UTL_RAW.CAST_TO_RAW ('word1 C word3'))
6 SELECT * FROM DUAL
7 /
4 rows created.
SCOTT@orcl_11g> -- create lexer preference and set printjoins attribute
SCOTT@orcl_11g> -- (added / to prinjoin values):
SCOTT@orcl_11g> BEGIN
2 CTX_DDL.CREATE_PREFERENCE
3 ('try_lexer3',
4 'BASIC_LEXER');
5 CTX_DDL.SET_ATTRIBUTE
6 ('try_lexer3',
7 'PRINTJOINS',
8 '-_~!@#$%^&*(){}[],=?\;|><.+/');
9 END;
10 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> -- create index using lexer
SCOTT@orcl_11g> -- and empty stoplist (so it won't treat single letters as stopwords):
SCOTT@orcl_11g> CREATE INDEX candresume_ctx_idx
2 ON resume_test (cand_resume)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS
5 ('LEXER try_lexer3
6 STOPLIST CTXSYS.EMPTY_STOPLIST')
7 /
Index created.
SCOTT@orcl_11g> -- view what is indexed;
SCOTT@orcl_11g> SELECT token_text
2 FROM dr$candresume_ctx_idx$i
3 /
TOKEN_TEXT
----------------------------------------------------------------
AS/400
C
PL/SQL
VB.NET
WORD1
WORD3
6 rows selected.
SCOTT@orcl_11g> -- gather statistics
SCOTT@orcl_11g> -- (will enable the optimizer to choose the best execution plan):
SCOTT@orcl_11g> EXEC DBMS_STATS.GATHER_TABLE_STATS (USER, 'GRST_CANDIDATE')
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> EXEC DBMS_STATS.GATHER_TABLE_STATS (USER, 'RESUME_TEST')
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> -- select without % (% is not necessary and slows it down greatly)
SCOTT@orcl_11g> -- (* is not selected; cand_resume is converted for display)
SCOTT@orcl_11g> -- (a bind variable is used, so that the query in the SGA can be re-used with re-parsing):
SCOTT@orcl_11g> VARIABLE search_string VARCHAR2 (100)
SCOTT@orcl_11g> EXEC :search_string := 'VB.NET'
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> COLUMN cand_resume FORMAT A45 WORD_WRAPPED
SCOTT@orcl_11g> SELECT p.cand_id,
2 d.cand_name,
3 UTL_RAW.CAST_TO_VARCHAR2 (p.cand_resume) AS cand_resume
4 FROM resume_test p, grst_candidate d
5 WHERE d.cand_id = p.cand_id
6 AND CONTAINS (p.cand_resume, :search_string, 1) > 0
7 /
CAND_ID CAND_NAME CAND_RESUME
---------- ---------- ---------------------------------------------
1 cand1 word1 VB.NET word3
SCOTT@orcl_11g> EXEC :search_string := 'PL/SQL'
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> /
CAND_ID CAND_NAME CAND_RESUME
---------- ---------- ---------------------------------------------
2 cand2 word1 PL/SQL word3
SCOTT@orcl_11g> EXEC :search_string := 'AS/400'
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> /
CAND_ID CAND_NAME CAND_RESUME
---------- ---------- ---------------------------------------------
3 cand3 word1 AS/400 word3
SCOTT@orcl_11g> EXEC :search_string := 'C'
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> /
CAND_ID CAND_NAME CAND_RESUME
---------- ---------- ---------------------------------------------
4 cand4 word1 C word3
SCOTT@orcl_11g>
|
|
|
|
|
Re: Problem with special chars in BLOB data type using contains keyword [message #478309 is a reply to message #478307] |
Thu, 07 October 2010 15:08 |
ppatel
Messages: 3 Registered: October 2010
|
Junior Member |
|
|
Thanks for you quick reply.
I did the following:
1. Created the lexer preference and attribute
BEGIN
CTX_DDL.CREATE_PREFERENCE ('LEXER1', 'BASIC_LEXER');
CTX_DDL.SET_ATTRIBUTE ('LEXER1',
'PRINTJOINS',
'-_~!@#$%^&*(){}[]=?\;|><.+/');
END;
2. Created the index
CREATE INDEX tab_idx03
ON test_table (test_data)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS
('LEXER LEXER1
STOPLIST CTXSYS.EMPTY_STOPLIST');
3. I query the index for certain value:
select * from dr$tab_idx03$i where TOKEN_TEXT = ( '#05202-BLUE');
This is what I get:
TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
#05202-BLUE 0 140525 140525 1
4. I run the queries across my test table:
Here is what I ran:
SELECT * from test_table
WHERE CONTAINS (test_data,
'#05202-BLUE',1)>0;
--
No rows returned
--
SELECT * from test_table
WHERE test_data like '%#05202-BLUE%';
output:
----
PAINT-DABBER ARRO-MARK #05202-BLUE, 2
----
Appreciate all of your help.
Thanks.
|
|
|
Re: Problem with special chars in BLOB data type using contains keyword [message #478310 is a reply to message #478309] |
Thu, 07 October 2010 15:55 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The hyphen has special meaning in Oracle Text; It means MINUS. So, if you search for 'a-b' it takes the score from a and subtracts the score from b from it, so if the string contains a and b, then the score is 0, so using contains ... > 0 does not find it. In order for Oracle Text to treat such special characters in a search string as normal text, you have to escape them, either by placing a \ in front of each such character or putting { and } around the whole string. Please see the reproduction and solutions below.
-- test environment:
SCOTT@orcl_11gR2> CREATE TABLE test_table
2 (test_data VARCHAR2 (60))
3 /
Table created.
SCOTT@orcl_11gR2> INSERT INTO test_table (test_data)
2 VALUES ('PAINT-DABBER ARRO-MARK #05202-BLUE, 2')
3 /
1 row created.
SCOTT@orcl_11gR2> BEGIN
2 CTX_DDL.CREATE_PREFERENCE ('LEXER1', 'BASIC_LEXER');
3 CTX_DDL.SET_ATTRIBUTE
4 ('LEXER1',
5 'PRINTJOINS',
6 '-_~!@#$%^&*(){}[]=?\;|><.+/');
7 END;
8 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11gR2> CREATE INDEX tab_idx03
2 ON test_table (test_data)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS
5 ('LEXER LEXER1
6 STOPLIST CTXSYS.EMPTY_STOPLIST')
7 /
Index created.
SCOTT@orcl_11gR2> select token_text from dr$tab_idx03$i
2 /
TOKEN_TEXT
----------------------------------------------------------------
#05202-BLUE
2
ARRO-MARK
PAINT-DABBER
4 rows selected.
-- reproduction of problem:
SCOTT@orcl_11gR2> SELECT * from test_table
2 WHERE CONTAINS (test_data, '#05202-BLUE', 1) > 0
3 /
no rows selected
-- solutions:
SCOTT@orcl_11gR2> SELECT * from test_table
2 WHERE CONTAINS (test_data, '#05202\-BLUE', 1) > 0
3 /
TEST_DATA
------------------------------------------------------------
PAINT-DABBER ARRO-MARK #05202-BLUE, 2
1 row selected.
SCOTT@orcl_11gR2> SELECT * from test_table
2 WHERE CONTAINS (test_data, '{#05202-BLUE}', 1) > 0
3 /
TEST_DATA
------------------------------------------------------------
PAINT-DABBER ARRO-MARK #05202-BLUE, 2
1 row selected.
SCOTT@orcl_11gR2>
|
|
|
|
Goto Forum:
Current Time: Tue Nov 26 22:20:55 CST 2024
|