Home » Server Options » Text & interMedia » Oracle Text Index using Soundex (11g)
Oracle Text Index using Soundex [message #501168] |
Sat, 26 March 2011 14:06 |
|
kymmi13
Messages: 9 Registered: March 2011
|
Junior Member |
|
|
Hi,
I'm a bit confused on why my soundex query is not returning what I expect. From what I have read, soundex uses the English phonetic language to return similar sounding searches.
The index I have is a CTXSYS.CTXCAT
Here's a couple of examples:
This works fine:
select distinct search_term from search s
where
catsearch(s.search_term, '<query><textquery grammar="context">!smythe</textquery></query>', '')>0;
RETURNS:
Smith
Smitt
However, if I try "lough" (looking for Rob Lowe):
RETURNS:
Rouse
Lorch
Longo
Can someone explain that to me? How do I configure the query or the index in order for "lough" to be interpreted as "low" or "lowe"??
Thanks in advance.
|
|
|
|
Re: Oracle Text Index using Soundex [message #501172 is a reply to message #501169] |
Sat, 26 March 2011 15:11 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The traditional soundex algorithm is described here:
http://en.wikipedia.org/wiki/Soundex
The soundex of a word consists of the first letter of the word, followed by numbers substituted for consonants except h and w, padded with zeroes, until there are three numbers. So, for "Lowe", the first letter is L and there are no other consonants that are coded, so it is padded with 3 zeroes and the soundex for "Lowe" is L000. For "Lough", the first letter is L, 2 is substituted for g and there are no other consonants that are coded, so it is padded with 2 zeroes, and the soundex for "Lough" is L200. Since the soundex L000 for Lowe is not the same as the soundex L200 for Lough, searching for one does not find the other. The soundex that is used in SQL uses this algorithm. The soundex that is used in Oracle Text context grammar is different, and requires that the two words be much more similar, not just in the first letter and first 3 coded consonants. Please see the demonstration below that shows traditional soundex values and the results of searches using traditional soundex values and the results of searches using context grammar.
-- test table and data:
SCOTT@orcl_11gR2> create table search
2 (search_term varchar2 (30))
3 /
Table created.
SCOTT@orcl_11gR2> insert all
2 into search values ('Lewis')
3 into search values ('Laske')
4 into search values ('Lowsky')
5 into search values ('Leckie')
6 into search values ('Luce')
7 into search values ('Lock')
8 into search values ('Lacayo')
9 into search values ('Lashay')
10 into search values ('Lakha')
11 into search values ('Laws')
12 into search values ('Lake')
13 into search values ('Lacy')
14 into search values ('Lowe')
15 into search values ('Lorch')
16 into search values ('Longo')
17 select * from dual
18 /
15 rows created.
-- traditional soundex values:
SCOTT@orcl_11gR2> select soundex (search_term),
2 search_term
3 from search
4 order by 1, 2
5 /
SOUN SEARCH_TERM
---- ------------------------------
L000 Lowe
L200 Lacayo
L200 Lacy
L200 Lake
L200 Lakha
L200 Lashay
L200 Laske
L200 Laws
L200 Leckie
L200 Lewis
L200 Lock
L200 Lowsky
L200 Luce
L520 Longo
L620 Lorch
15 rows selected.
-- search using traditional soundex algorithm:
SCOTT@orcl_11gR2> select search_term
2 from search
3 where soundex (search_term) = soundex ('Lough')
4 order by 1
5 /
SEARCH_TERM
------------------------------
Lacayo
Lacy
Lake
Lakha
Lashay
Laske
Laws
Leckie
Lewis
Lock
Lowsky
Luce
12 rows selected.
-- searches using context grammar:
SCOTT@orcl_11gR2> create index context_idx
2 on search (search_term)
3 indextype is ctxsys.context
4 /
Index created.
SCOTT@orcl_11gR2> select search_term
2 from search
3 where contains (search_term, '!Lough') > 0
4 /
SEARCH_TERM
------------------------------
Lorch
Longo
2 rows selected.
SCOTT@orcl_11gR2> create index ctxcat_idx
2 on search (search_term)
3 indextype is ctxsys.ctxcat
4 /
Index created.
SCOTT@orcl_11gR2> select search_term
2 from search
3 where catsearch
4 (search_term,
5 '<query>
6 <textquery grammar=''context''>
7 !Lough
8 </textquery>
9 </query>', '') > 0
10 /
SEARCH_TERM
------------------------------
Lorch
Longo
2 rows selected.
SCOTT@orcl_11gR2>
|
|
|
|
|
|
Re: Oracle Text Index using Soundex [message #501186 is a reply to message #501182] |
Sat, 26 March 2011 16:02 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The Jaro-Winkler algorithm was originally created for the U.S. census to try to identify duplicate names, so people were not counted twice. It is described here:
http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Jaro_Winkler_Similarity takes that and expresses it as a percentage of 100.
There are other algorithms, such as Levenshtein distance, Damereau-Levenshtein, and metaphone. You may want to experiment with them and combine them. All of these methods are going to be slower than a search using a context or ctxcat index. You can use a very general search with a context index to return a small result set, then order that result set using one or more of the other methods. You and your users need to recognize that no search algorithm is going to be perfect and, if they don't find what they are looking for, then they may need to broaden their search. If you are searching for first and last names, you can use things like ACCUM with a context grammar search, to cause results that include both names to score higher, but include results that only contain one, so that if you search for "Rob Lough" you still might find "Rob Lowe".
[Updated on: Sat, 26 March 2011 16:04] Report message to a moderator
|
|
|
|
|
Re: Oracle Text Index using Soundex [message #501191 is a reply to message #501190] |
Sat, 26 March 2011 20:01 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
To search for soundex or fuzzy:
SCOTT@orcl_11gR2> select search_term
2 from search
3 where catsearch
4 (search_term,
5 '<query>
6 <textquery grammar=''context''>
7 !Lough OR ?Lough
8 </textquery>
9 </query>', '') > 0
10 /
SEARCH_TERM
------------------------------
Lorch
Longo
2 rows selected.
SCOTT@orcl_11gR2>
The third parameter of catsearch on a ctxcat index is for a structured query if you have a subindex created on a structured column. It is similar to the new sdata feature with context indexes and contains.
If you were using a context index, contains would return a score and you could use > some number to limit the result set or order by the descending score in an inner query and limit the rows using rownum in an outer query. These things and many others are not available with ctxcat indexes and catsearch. There is also a problem with ctxcat indexes and catsearch, when the optimizer chooses functional invocation, resulting in an error without returning rows. I always recommend context and contains instead of ctxcat and catsearch. With the new features, you can make the context index transactional and use sdata for structured data, so it has most of the advantages of catsearch and a whole lot more capabilities, without resorting to a query template to enable some of the context features.
|
|
|
|
Re: Oracle Text Index using Soundex [message #501207 is a reply to message #501203] |
Sun, 27 March 2011 12:25 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Quote:
1. Can I still use the lexer skipjoins?
Yes, as demonstrated in the example at the bottom of this post. The search finds L'owe as well as Lowe.
Quote:
2. Will there be a performance degradation? I've read that ctxcat is for varchar columns whereas context is for clobs.
I don't think there should be a performance degradation, but you should test and see. If you find there is a degradation, there are probably things that can be done to speed it up.
A ctxcat index cannot be used to index a clob. A context index can be used on either varchar2 or clob and other data types. I used a varchar2 column in the example at the bottom of this post.
Quote:
3. How do I make the context index transactional?
By adding the word transactional to your index parameters, as demonstrated in the example at the bottom of this post.
Quote:
4. Can you give me an example (or point to me where online I can read up on) of sdata for structured data?
I have included usage of an sdata column in the example at the bottom of this post. In the example, I added a "division" column of number datatype, which is structured data. Without sdata, if you wanted to filter your search by such a column, you would have a separate non-text index on that column, and the optimizer would likely choose one index or the other to execute your query. By adding "filter by division" during index creation, it makes that column part of the index, so that any search on the text column and the sdata column only requires one index hit and results in very fast searches. In the example below, it finds only those rows in division 1, by adding the sdata criteria to the contains clause. There are many more options. This is just a very simple example. I have added 75,978 rows of data by selecting from all_objects, in order to provide a realistic example, and demonstrated, using autotrace, how it uses only one index hit.
You can find more information about sdata in the Oracle Text Reference and Oracle Text Application Developer's Gudie of the searchable online documentation, for example:
http://download.oracle.com/docs/cd/E11882_01/text.112/e16593/cqoper.htm#CCREF2089
In the example below, I have also combined usage of accum and soundex and fuzzy and jaro_winkler_similarity to show how you can use the first three in a contains clause to return a small result set with one hit on a context index in an inner sub-query, then use the last one in an outer sub-query to order the rows from the inner query.
SCOTT@orcl_11gR2> create table search
2 (division number,
3 search_term varchar2 (30))
4 /
Table created.
SCOTT@orcl_11gR2> insert all
2 into search values (1, 'Lewis')
3 into search values (1, 'Laske')
4 into search values (1, 'Lowsky')
5 into search values (1, 'Leckie')
6 into search values (1, 'Luce')
7 into search values (1, 'Lock')
8 into search values (1, 'Lacayo')
9 into search values (1, 'Lashay')
10 into search values (1, 'Lakha')
11 into search values (1, 'Laws')
12 into search values (1, 'Lake')
13 into search values (1, 'Lacy')
14 into search values (1, 'Rob Lowe')
15 into search values (1, 'Rob L''owe')
16 into search values (1, 'Lorch')
17 into search values (1, 'Longo')
18 into search values (1, 'Doug')
19 into search values (1, 'Hugh')
20 into search values (2, 'Rob Lowe')
21 into search values (2, 'Rob L''owe')
22 into search values (2, 'Lorch')
23 into search values (2, 'Longo')
24 select * from dual
25 /
22 rows created.
SCOTT@orcl_11gR2> insert into search
2 select object_id, object_name
3 from all_objects
4 /
75978 rows created.
SCOTT@orcl_11gR2> commit
2 /
Commit complete.
SCOTT@orcl_11gR2> begin
2 ctx_ddl.create_preference ('mylex', 'BASIC_LEXER');
3 ctx_ddl.set_attribute
4 ('mylex',
5 'skipjoins',
6 '`?#\[]{}~!@$%^&*()-_=+|;:,"<>./''');
7 ctx_ddl.create_preference ('mywordlist', 'BASIC_WORDLIST');
8 end;
9 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11gR2> create index context_idx
2 on search (search_term)
3 indextype is ctxsys.context
4 filter by division
5 parameters
6 ('lexer mylex
7 wordlist mywordlist
8 transactional')
9 /
Index created.
SCOTT@orcl_11gR2> set autotrace on explain
SCOTT@orcl_11gR2> select utl_match.jaro_winkler_similarity
2 (search_term, 'Rob Lough') jws,
3 score (1), division, search_term
4 from (select score (1), division, search_term
5 from search
6 where contains
7 (search_term,
8 '(Rob accum (!Lough OR ?Lough))
9 and sdata (division = 1)',
10 1) > 0)
11 order by jws desc, score (1) desc
12 /
JWS SCORE(1) DIVISION SEARCH_TERM
---------- ---------- ---------- ------------------------------
88 8 1 Rob Lowe
86 8 1 Rob L'owe
54 8 1 Longo
43 8 1 Lorch
4 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2149790340
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 40 | 1680 | 13 (8)| 00:00:01 |
| 1 | SORT ORDER BY | | 40 | 1680 | 13 (8)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| SEARCH | 40 | 1680 | 12 (0)| 00:00:01 |
|* 3 | DOMAIN INDEX | CONTEXT_IDX | | | 4 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("CTXSYS"."CONTAINS"("SEARCH_TERM",'(Rob accum (!Lough OR ?Lough))
and sdata (division = 1)',1)>0)
Note
-----
- dynamic sampling used for this statement (level=2)
SCOTT@orcl_11gR2>
|
|
|
Re: Oracle Text Index using Soundex [message #501210 is a reply to message #501207] |
Sun, 27 March 2011 13:06 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
You might be able to get the same execution plan without a sub-query, as shown below. However, sometimes Oracle Text is a little quirky and it is best to separate the text query from other things. What you want to avoid is having it apply the jaro_winkler_similarity to all of the rows, before it limits the rows using the text index.
SCOTT@orcl_11gR2> select utl_match.jaro_winkler_similarity
2 (search_term, 'Rob Lough') jws,
3 score (1), division, search_term
4 from search
5 where contains
6 (search_term,
7 '(Rob accum (!Lough OR ?Lough))
8 and sdata (division = 1)',
9 1) > 0
10 order by jws desc, score (1) desc
11 /
JWS SCORE(1) DIVISION SEARCH_TERM
---------- ---------- ---------- ------------------------------
88 8 1 Rob Lowe
86 8 1 Rob L'owe
54 8 1 Longo
43 8 1 Lorch
4 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2149790340
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 39 | 1638 | 13 (8)| 00:00:01 |
| 1 | SORT ORDER BY | | 39 | 1638 | 13 (8)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| SEARCH | 39 | 1638 | 12 (0)| 00:00:01 |
|* 3 | DOMAIN INDEX | CONTEXT_IDX | | | 4 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("CTXSYS"."CONTAINS"("SEARCH_TERM",'(Rob accum (!Lough OR ?Lough))
and sdata (division = 1)',1)>0)
Note
-----
- dynamic sampling used for this statement (level=2)
SCOTT@orcl_11gR2>
|
|
|
|
|
|
|
|
|
Re: Oracle Text Index using Soundex [message #502144 is a reply to message #502040] |
Mon, 04 April 2011 12:19 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
You might also be interested in the new ndata available in 11.2 demonstrated below. It requires a section with tags. If you do not already have tags in your data, you can easily accomplish this by adding a multi_column_datastore, as in the example below.
SCOTT@orcl_11gR2> create table search
2 (search_term varchar2 (30))
3 /
Table created.
SCOTT@orcl_11gR2> insert all
2 into search values ('Lewis')
3 into search values ('Laske')
4 into search values ('Lowsky')
5 into search values ('Leckie')
6 into search values ('Luce')
7 into search values ('Lock')
8 into search values ('Lacayo')
9 into search values ('Lashay')
10 into search values ('Lakha')
11 into search values ('Laws')
12 into search values ('Lake')
13 into search values ('Lacy')
14 into search values ('Lowe')
15 into search values ('L''owe')
16 into search values ('Lorch')
17 into search values ('Longo')
18 into search values ('Doug')
19 into search values ('Hugh')
20 into search values ('Rob Lowe')
21 into search values ('Rob L''owe')
22 select * from dual
23 /
20 rows created.
SCOTT@orcl_11gR2> begin
2 ctx_ddl.create_preference ('mylex', 'BASIC_LEXER');
3 ctx_ddl.set_attribute
4 ('mylex',
5 'skipjoins',
6 '`?#\[]{}~!@$%^&*()-_=+|;:,"<>./''');
7 ctx_ddl.create_preference ('mywordlist', 'BASIC_WORDLIST');
8 ctx_ddl.create_section_group ('for_ndata', 'basic_section_group');
9 ctx_ddl.add_ndata_section ('for_ndata', 'nd', 'nd');
10 ctx_ddl.create_preference ('mcds', 'multi_column_datastore');
11 ctx_ddl.set_attribute ('mcds', 'columns', 'search_term nd');
12 end;
13 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11gR2> create index context_idx
2 on search (search_term)
3 indextype is ctxsys.context
4 parameters
5 ('lexer mylex
6 wordlist mywordlist
7 section group for_ndata
8 datastore mcds
9 transactional')
10 /
Index created.
SCOTT@orcl_11gR2> select score (1), search_term
2 from search
3 where contains
4 (search_term,
5 'ndata (nd, Rob Lough, noorder, proximity)',
6 1) > 0
7 order by score (1) desc
8 /
SCORE(1) SEARCH_TERM
---------- ------------------------------
47 Rob L'owe
47 Rob Lowe
2 rows selected.
SCOTT@orcl_11gR2>
[Updated on: Mon, 04 April 2011 13:17] Report message to a moderator
|
|
|
|
Goto Forum:
Current Time: Thu Dec 26 23:46:33 CST 2024
|