Home » Server Options » Text & interMedia » CONTAIN function syntax usage
CONTAIN function syntax usage [message #598719] |
Thu, 17 October 2013 01:46 |
|
ranki
Messages: 19 Registered: October 2013
|
Junior Member |
|
|
hello,
I want to know the syntax for using CONTAIN function
1.How to use contain for two values
SELECT score(1),name FROM TABLE
WHERE CONTAINS(name, 'MYNAME', 1 ) > 0
and CONTAINS(last_name, 'MYLASTNAME', 1 ) > 0
ORDER BY SCORE(1) DESC;
2.How to pass cursor value into CONTAIN function.
CONTAINS(v1.name,v2.name, 1 ) > 0 gives me error
DRG-50901: text query parser syntax error on line 1, column 1
3.I also want to know what ,1) means?
Please help me to understand.
|
|
|
|
Re: CONTAIN function syntax usage [message #598782 is a reply to message #598719] |
Thu, 17 October 2013 12:55 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
1. If you are going to use two contains clauses, then you need two context indexes, as shown below.
SCOTT@orcl12c> CREATE TABLE table_name
2 (name VARCHAR2(15),
3 last_name VARCHAR2(15))
4 /
Table created.
SCOTT@orcl12c> INSERT ALL
2 INTO table_name VALUES ('MYNAME', 'MYLASTNAME')
3 INTO table_name VALUES ('Barbara', 'Boehmer')
4 SELECT * FROM DUAL
5 /
2 rows created.
SCOTT@orcl12c> CREATE INDEX name_idx ON table_name (name)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 /
Index created.
SCOTT@orcl12c> CREATE INDEX last_name_idx ON table_name (last_name)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 /
Index created.
SCOTT@orcl12c> SELECT SCORE(1), SCORE(2), name, last_name
2 FROM table_name
3 WHERE CONTAINS (name, 'MYNAME', 1) > 0
4 AND CONTAINS (last_name, 'MYLASTNAME', 2 ) > 0
5 ORDER BY SCORE(1) DESC
6 /
SCORE(1) SCORE(2) NAME LAST_NAME
---------- ---------- --------------- ---------------
4 4 MYNAME MYLASTNAME
1 row selected.
However, if you use a multi_column_datastore and section group, you can use just one contains clause with only one index hit, which is a more efficient query, as shown below.
SCOTT@orcl12c> BEGIN
2 CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
3 CTX_DDL.SET_ATTRIBUTE ('test_mcds', 'COLUMNS', 'name, last_name');
4 END;
5 /
PL/SQL procedure successfully completed.
SCOTT@orcl12c> ALTER TABLE table_name ADD (names VARCHAR2(1))
2 /
Table altered.
SCOTT@orcl12c> CREATE INDEX names_idx ON table_name (names)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 PARAMETERS
4 ('DATASTORE test_mcds
5 SECTION GROUP CTXSYS.AUTO_SECTION_GROUP')
6 /
Index created.
SCOTT@orcl12c> SELECT SCORE(3), name, last_name
2 FROM table_name
3 WHERE CONTAINS
4 (names,
5 'MYNAME WITHIN name AND
6 MYLASTNAME WITHIN last_name',
7 3 ) > 0
8 ORDER BY SCORE(3) DESC
9 /
SCORE(3) NAME LAST_NAME
---------- --------------- ---------------
4 MYNAME MYLASTNAME
1 row selected.
2. The following demonstrates using cursor values in a contains clause.
SCOTT@orcl12c> SET SERVEROUTPUT ON FORMAT WRAPPED
SCOTT@orcl12c> BEGIN
2 DBMS_OUTPUT.PUT_LINE
3 (' SCORE(4) NAME LAST_NAME');
4 DBMS_OUTPUT.PUT_LINE
5 ('---------- --------------- ---------------');
6 FOR v1 IN (SELECT 'MYNAME' name FROM DUAL) LOOP
7 FOR v2 IN (SELECT 'MYLASTNAME' name FROM DUAL) LOOP
8 FOR v3 IN
9 (SELECT SCORE(4), name, last_name
10 FROM table_name
11 WHERE CONTAINS
12 (names,
13 v1.name || ' WITHIN name AND ' ||
14 v2.name || ' WITHIN last_name',
15 4) > 0)
16 LOOP
17 DBMS_OUTPUT.PUT_LINE
18 (LPAD (v3."SCORE(4)", 10) || ' ' ||
19 RPAD (v3.name, 16) || v3.last_name);
20 END LOOP;
21 END LOOP;
22 END LOOP;
23 END;
24 /
SCORE(4) NAME LAST_NAME
---------- --------------- ---------------
4 MYNAME MYLASTNAME
PL/SQL procedure successfully completed.
3. http://docs.oracle.com/cd/E16655_01/text.121/e17747/csql.htm#CCREF0107
|
|
|
Re: CONTAIN function syntax usage [message #598783 is a reply to message #598739] |
Thu, 17 October 2013 13:25 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
ranki wrote on Thu, 17 October 2013 03:43
... Apart from that CONTAIN function fails to fetch if a space is inserted between column value and if any extra alphabet is added to value.
'ABCD' is fetching
'AB CD' not fetching.
'ABBCD' not fetching.
Please help me for understanding this.
Your question is unclear. Are the values that you posted the values in your table's column or the values of the strings that you are searching for? In general, Oracle Text is token-based, meaning that it searches for tokens, which are strings of text characters between break characters, which usually means words between spaces. So, if you search for 'ABCD', it searches for any value that contains the token/word 'ABCD'. If you search for 'AB CD', it searches for any value that contains the token/word 'AB' followed by the token/word 'CD' separated by a space or other break character, such as a comma. If you search for 'ABBCD', then it searches for any value containing the token/word 'ABBCD'. There are features such as NDATA that will allow you to search for close matches with and without spaces and extra characters and such. It helps if you post a more complete example of what you are trying to do. If you are searching for names that are similarly spelled and may contain spaces or be first before last or last before first, then that is what NDATA is designed for. The following is an extension of the previous example using NDATA. Note that if the strings are too short, it may not find one that has an extra character or space. The more characters that match, the more likely it will be in the result set and the higher the score.
SCOTT@orcl12c> INSERT ALL
2 INTO table_name (name, last_name) VALUES ('VXABCDYZ', 'MYLASTNAME')
3 INTO table_name (name, last_name) VALUES ('VXAB CDYZ', 'MYLASTNAME')
4 INTO table_name (name, last_name) VALUES ('VXABBCDYZ', 'MYLASTNAME')
5 SELECT * FROM DUAL
6 /
3 rows created.
SCOTT@orcl12c> BEGIN
2 CTX_DDL.CREATE_SECTION_GROUP ('namegroup', 'BASIC_SECTION_GROUP');
3 CTX_DDL.ADD_NDATA_SECTION ('NAMEGROUP', 'NAME', 'NAME');
4 CTX_DDL.ADD_NDATA_SECTION ('NAMEGROUP', 'LAST_NAME', 'LAST_NAME');
5 END;
6 /
PL/SQL procedure successfully completed.
SCOTT@orcl12c> DROP INDEX names_idx
2 /
Index dropped.
SCOTT@orcl12c> CREATE INDEX names_idx ON table_name (names)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 PARAMETERS
4 ('DATASTORE test_mcds
5 SECTION GROUP namegroup')
6 /
Index created.
SCOTT@orcl12c> SELECT SCORE(5), name, last_name
2 FROM table_name
3 WHERE CONTAINS
4 (names,
5 'NDATA (name, VXABCDYZ) AND
6 NDATA (last_name, MYLASTNAME)',
7 5) > 0
8 ORDER BY SCORE(5) DESC
9 /
SCORE(5) NAME LAST_NAME
---------- --------------- ---------------
100 VXABCDYZ MYLASTNAME
91 VXABBCDYZ MYLASTNAME
91 VXAB CDYZ MYLASTNAME
3 rows selected.
|
|
|
|
|
Re: CONTAIN function syntax usage [message #598898 is a reply to message #598893] |
Fri, 18 October 2013 17:38 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
If you have the following test environment consisting of table, data, multi_column_datastore, section group with ndata sections, and context index:
SCOTT@orcl12c> CREATE TABLE table_name
2 (name VARCHAR2(15),
3 last_name VARCHAR2(15))
4 /
Table created.
SCOTT@orcl12c> INSERT ALL
2 INTO table_name (name, last_name) VALUES ('MYNAME', 'MYLASTNAME')
3 INTO table_name (name, last_name) VALUES ('Barbara', 'Boehmer')
4 INTO table_name (name, last_name) VALUES ('VXABCDYZ', 'MYLASTNAME')
5 INTO table_name (name, last_name) VALUES ('VXAB CDYZ', 'MYLASTNAME')
6 INTO table_name (name, last_name) VALUES ('VXABBCDYZ', 'MYLASTNAME')
7 SELECT * FROM DUAL
8 /
5 rows created.
SCOTT@orcl12c> BEGIN
2 CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
3 CTX_DDL.SET_ATTRIBUTE ('test_mcds', 'COLUMNS', 'name, last_name');
4 CTX_DDL.CREATE_SECTION_GROUP ('namegroup', 'BASIC_SECTION_GROUP');
5 CTX_DDL.ADD_NDATA_SECTION ('NAMEGROUP', 'NAME', 'NAME');
6 CTX_DDL.ADD_NDATA_SECTION ('NAMEGROUP', 'LAST_NAME', 'LAST_NAME');
7 END;
8 /
PL/SQL procedure successfully completed.
SCOTT@orcl12c> ALTER TABLE table_name ADD (names VARCHAR2(1))
2 /
Table altered.
SCOTT@orcl12c> CREATE INDEX names_idx ON table_name (names)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 PARAMETERS
4 ('DATASTORE test_mcds
5 SECTION GROUP namegroup')
6 /
Index created.
Then you can pass cursor values like so:
SCOTT@orcl12c> SET SERVEROUTPUT ON FORMAT WRAPPED
SCOTT@orcl12c> BEGIN
2 DBMS_OUTPUT.PUT_LINE
3 (' SCORE(4) NAME LAST_NAME');
4 DBMS_OUTPUT.PUT_LINE
5 ('---------- --------------- ---------------');
6 FOR v1 IN (SELECT 'VXABCDYZ' name FROM DUAL) LOOP
7 FOR v2 IN (SELECT 'MYLASTNAME' name FROM DUAL) LOOP
8 FOR v3 IN
9 (SELECT SCORE(5), name, last_name
10 FROM table_name
11 WHERE CONTAINS
12 (names,
13 'NDATA (name, ' || v1.name || ') AND ' ||
14 'NDATA (last_name, ' || v2.name || ')',
15 5) > 0)
16 LOOP
17 DBMS_OUTPUT.PUT_LINE
18 (LPAD (v3."SCORE(5)", 10) || ' ' ||
19 RPAD (v3.name, 16) || v3.last_name);
20 END LOOP;
21 END LOOP;
22 END LOOP;
23 END;
24 /
SCORE(4) NAME LAST_NAME
---------- --------------- ---------------
100 VXABCDYZ MYLASTNAME
91 VXAB CDYZ MYLASTNAME
91 VXABBCDYZ MYLASTNAME
PL/SQL procedure successfully completed.
However, as a general rule, any time that you can accomplish the same thing in SQL alone, instead of using PL/SQL it is likely to be more efficient. So, the following SQL that accomplishes the same thing as the PL/SQL above, without looping through cursors, is more efficient.
SCOTT@orcl12c> SELECT SCORE(5), t.name, t.last_name
2 FROM table_name t,
3 (SELECT 'VXABCDYZ' name FROM DUAL) v1,
4 (SELECT 'MYLASTNAME' name FROM DUAL) v2
5 WHERE CONTAINS
6 (names,
7 'NDATA (name, ' || v1.name || ') AND ' ||
8 'NDATA (last_name, ' || v2.name || ')',
9 5) > 0
10 /
SCORE(5) NAME LAST_NAME
---------- --------------- ---------------
100 VXABCDYZ MYLASTNAME
91 VXAB CDYZ MYLASTNAME
91 VXABBCDYZ MYLASTNAME
3 rows selected.
It would probably help if you explained exactly what you are trying to do from a user standpoint, instead of just what you think you need to do from a programmer standpoint. Many times people ask if they can do something the hard way, but if we knew what they were trying to accomplish, we could point them toward an easier way.
|
|
|
|
Re: CONTAIN function syntax usage [message #598921 is a reply to message #598904] |
Sat, 19 October 2013 17:45 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Since you are searching for duplicates already in the data, you will need to use a self-join. I am guessing that none of your data is validated and they can enter anything in any column. So, I would probably concatenate the names into one full_name column and concatenate the address data into one full_address column in the multi_column_datastore, then create ndata sections for each. Even though ndata is designed primarily for names, it think it works well for your address data here as well and allows you to use one index to search for both and provide a combined score. You may find that this is sufficient. However, if you then search using contains and jaro_winkler_similarity and edit_distance_similarity, the optimizer is smart enough to use the domain index first, then apply the other filters to that result set. You can also use ctx_query.count_hits to determine the number of rows that match the criteria, which is the number of accounts that are similar to one another. I have provided a complete example below, in which I have used three different scoring methods and limited the rows by those. You may want to use just part of it. I have displayed only the number of accounts, matching id's and scores, but you could display the name and address values as well if you like. You could also wrap an outer query around it and limit it by the value from ctx_query.count_hits, so you could list just those accounts with at least a certain number of duplicates. I have also provided a simplified query below that does that, providing a simple list of just the number of accounts and account_id, starting with the ones with the most similar accounts, which could be used a sort of a potential ban list. I urge you to be cautious in verifying before banning, as there are many names and addresses that may be similar, but are actually different people. I was surprised to find out how many people have the identical first and last name as I do, some of which even live in the same state. So, be careful.
SCOTT@orcl12c> CREATE TABLE table_name
2 (account_id NUMBER,
3 name VARCHAR2(15),
4 middle VARCHAR2(15),
5 last_name VARCHAR2(15),
6 address1 VARCHAR2(20),
7 address2 VARCHAR2(15),
8 city VARCHAR2(15))
9 /
Table created.
SCOTT@orcl12c> INSERT ALL
2 INTO table_name VALUES
3 (1, 'VXABCDYZ', 'MYMID', 'MYLASTNAME', '500 Oracle Parkway', '', 'Redwood Shores')
4 INTO table_name VALUES
5 (2, 'VXAB CDYZ', 'MYMID', 'MYLASTNAME', '500 Oracle Pkwy.', '', 'Redwood Shores')
6 INTO table_name VALUES
7 (3, 'VXABBCDYZ', 'M.', 'MYLASTNAME', '500 Oracle', '', 'R.S.')
8 INTO table_name VALUES
9 (4, 'B.', 'A.', 'Boehmer', '123 Someplace St.', '', 'Somewhere')
10 INTO table_name VALUES
11 (5, 'Barbara', '', 'Boehmer', '123 Someplace Street', '', 'Somewhere City')
12 SELECT * FROM DUAL
13 /
5 rows created.
SCOTT@orcl12c> BEGIN
2 CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
3 CTX_DDL.SET_ATTRIBUTE
4 ('test_mcds',
5 'COLUMNS',
6 'name || '' '' || middle || '' '' || last_name full_name,
7 address1 || '' '' || address2 || '' '' || city full_address');
8 CTX_DDL.CREATE_SECTION_GROUP ('test_sg', 'BASIC_SECTION_GROUP');
9 CTX_DDL.ADD_NDATA_SECTION ('test_sg', 'full_name', 'full_name');
10 CTX_DDL.ADD_NDATA_SECTION ('test_sg', 'full_address', 'full_address');
11 END;
12 /
PL/SQL procedure successfully completed.
SCOTT@orcl12c> ALTER TABLE table_name ADD (names_and_addresses VARCHAR2(1))
2 /
Table altered.
SCOTT@orcl12c> CREATE INDEX table_name_idx
2 ON table_name (names_and_addresses)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS
5 ('DATASTORE test_mcds
6 SECTION GROUP test_sg')
7 /
Index created.
SCOTT@orcl12c> BREAK ON accounts ON account_id SKIP 1
SCOTT@orcl12c> SELECT CTX_QUERY.COUNT_HITS
2 ('table_name_idx',
3 'NDATA (full_name, ' || t2.name || ' ' || t2.middle || ' ' || t2.last_name || ') AND ' ||
4 'NDATA (full_address, ' || t2.address1 || ' ' || t2.address2 || ' ' || t2.city || ')')
5 AS accounts,
6 t1.account_id,
7 t2.account_id duplicates,
8 SCORE(1) AS score,
9 UTL_MATCH.JARO_WINKLER_SIMILARITY
10 (t1.name || ' ' || t1.middle || ' ' || t1.last_name,
11 t2.name || ' ' || t2.middle || ' ' || t2.last_name)
12 AS jws,
13 UTL_MATCH.EDIT_DISTANCE_SIMILARITY
14 (t1.address1 || ' ' || t1.address2 || ' ' || t1.city,
15 t2.address1 || ' ' || t2.address2 || ' ' || t2.city)
16 AS eds
17 FROM table_name t1, table_name t2
18 WHERE t1.account_id != t2.account_id
19 AND CONTAINS
20 (t1.names_and_addresses,
21 'NDATA (full_name, ' || t2.name || ' ' || t2.middle || ' ' || t2.last_name || ') AND ' ||
22 'NDATA (full_address, ' || t2.address1 || ' ' || t2.address2 || ' ' || t2.city || ')',
23 1) > 30
24 AND UTL_MATCH.JARO_WINKLER_SIMILARITY
25 (t1.name || ' ' || t1.middle || ' ' || t1.last_name,
26 t2.name || ' ' || t2.middle || ' ' || t2.last_name) > 60
27 AND UTL_MATCH.EDIT_DISTANCE_SIMILARITY
28 (t1.address1 || ' ' || t1.address2 || ' ' || t1.city,
29 t2.address1 || ' ' || t2.address2 || ' ' || t2.city) > 40
30 ORDER BY accounts DESC, account_id, score DESC, jws DESC, eds DESC
31 /
ACCOUNTS ACCOUNT_ID DUPLICATES SCORE JWS EDS
---------- ---------- ---------- ---------- ---------- ----------
3 1 2 84 95 89
3 70 90 42
2 1 85 95 89
3 70 86 44
3 2 39 86 44
1 38 90 42
2 4 5 52 69 75
5 4 53 69 75
8 rows selected.
SCOTT@orcl12c> CLEAR BREAKS
SCOTT@orcl12c> SELECT accounts, account_id
2 FROM (SELECT MAX
3 (CTX_QUERY.COUNT_HITS
4 ('table_name_idx',
5 'NDATA (full_name, ' || t2.name || ' ' || t2.middle
6 || ' ' || t2.last_name || ') AND ' ||
7 'NDATA (full_address, ' || t2.address1 || ' '
8 || t2.address2 || ' ' || t2.city || ')'))
9 AS accounts,
10 t1.account_id
11 FROM table_name t1, table_name t2
12 WHERE t1.account_id != t2.account_id
13 AND CONTAINS
14 (t1.names_and_addresses,
15 'NDATA (full_name, ' || t2.name || ' ' || t2.middle
16 || ' ' || t2.last_name || ') AND ' ||
17 'NDATA (full_address, ' || t2.address1 || ' '
18 || t2.address2 || ' ' || t2.city || ')',
19 1) > 30
20 GROUP BY t1.account_id)
21 WHERE accounts >= 2
22 ORDER BY accounts DESC, account_id
23 /
ACCOUNTS ACCOUNT_ID
---------- ----------
3 1
3 2
3 3
2 4
2 5
5 rows selected.
|
|
|
Re: CONTAIN function syntax usage [message #599003 is a reply to message #598921] |
Mon, 21 October 2013 06:24 |
|
ranki
Messages: 19 Registered: October 2013
|
Junior Member |
|
|
Barbara,Thanks a lot for your detailed test case. Actually my data is validated at the front end but to be extra cautious I am asked to ignore spaces and TYPO. And as you have suggested I will be cautious while taking the list.As I am matching full address hopefully there will be not be any mistakes.
Now I am stuck with the INDEX creation . My data is around 20 million and the index is taking very long to create.
space taken till now is more than 40gb and still hanging. Any suggestion for fast index creation please?
Index creation does not come out and checked space is available.Index status shows valid. Tried three times but getting error
ORA-29861: domain index is marked LOADING/FAILED/UNUSABLE if I test update the table although I know index creation was incomplete.
[Updated on: Mon, 21 October 2013 11:47] Report message to a moderator
|
|
|
|
|
|
|
Re: CONTAIN function syntax usage [message #599646 is a reply to message #599639] |
Sun, 27 October 2013 23:55 |
|
ranki
Messages: 19 Registered: October 2013
|
Junior Member |
|
|
Barbara,
I have noticed listed address which has comma,hyphen,() are giving error. If I remove those characters and replace with space it is working fine. slash "/" is working fine.
I have tried to
1.NO 101(3) 5TH CROSS 7TH MAIN
2.25-14-909-1
3.S/O, D.K. ABCD
4.1-77,WXYV
5.1-267-1(1)
I have tried to set attribute. But comma still gives me error.
begin
ctx_ddl.create_preference ('test_lex1', 'basic_lexer');
ctx_ddl.set_attribute ('test_lex1', 'whitespace', '/\|-_+,&''');
end;
[Updated on: Sun, 27 October 2013 23:56] Report message to a moderator
|
|
|
|
Re: CONTAIN function syntax usage [message #599789 is a reply to message #599771] |
Tue, 29 October 2013 00:40 |
|
ranki
Messages: 19 Registered: October 2013
|
Junior Member |
|
|
Sorry for no test case provided. please find the test case
My table has the data given below.
ac_id FULL_NAME ADDRESS
----- --------------------------------------------------
1 MR.AB,CD NO 101 (3) 5TH CROSS 7TH MAIN
2 VXAB CDYZ 25-14-909/B STREET#5
3 VXABBCDYZ S/O, D.K. ABCB
4 ABCD,XYZ FLAT 301 STREET NO 5
5 ABCD,XYZ 1024,RESEARCH BLVD
6 BARBARA FLAT:301 STREET NO: 5
1.Even if name, middle_name, last name are provided some times they enter full name in name column itself with a comma.
2.Address has special characters like brace(),hash#,comma and semicolumn. When I replace all these with white space then
the query
provided by you does not give any error.
SELECT accounts, account_id
FROM (SELECT MAX
(CTX_QUERY.COUNT_HITS
('table_name_idx',
'NDATA (full_name, ' || t2.name || ' ' || t2.middle
|| ' ' || t2.last_name || ') AND ' ||
'NDATA (full_address, ' || t2.address1 || ' '
|| t2.address2 || ' ' || t2.city || ')'))
AS accounts,
t1.account_id
FROM table_name t1, table_name t2
WHERE t1.account_id != t2.account_id
AND CONTAINS
(t1.names_and_addresses,
'NDATA (full_name, ' || t2.name || ' ' || t2.middle
|| ' ' || t2.last_name || ') AND ' ||
'NDATA (full_address, ' || t2.address1 || ' '
|| t2.address2 || ' ' || t2.city || ')',
1) > 30
GROUP BY t1.account_id)
WHERE accounts >= 2
ORDER BY accounts DESC, account_id
[Updated on: Tue, 29 October 2013 02:01] Report message to a moderator
|
|
|
Re: CONTAIN function syntax usage [message #599881 is a reply to message #599789] |
Tue, 29 October 2013 15:52 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
You need to re-read the forum guidelines, especially the part about how to provide a proper test case. Your test case should have included create table and insert statements, so that anyone can run them on their system. You didn't even provide data for your actual columns, just the concatenated full_name and full_address. Also, none of the data your provided is close enough to constitute a duplicate.
The ndata doesn't seem to be compatible with some other things, such as lexers. So, it appears that you need to remove the special characters from the columns before indexing. Also, because you are doing a self-join, using the same column values to search, you need to remove the special characters. Although duplicate data storage is generally undesirable, in this case using shadow columns may be the best workaround. I have provide an example below. I have added shadow columns for the full_name and full_address. I have also provided an update statement to update existing data and a trigger to update new data. I have used the translate function to remove the special characters from these columns. Then I have simplified the rest (preferences, index, and query) to use just the shadow columns.
SCOTT@orcl12c> -- table and existing data:
SCOTT@orcl12c> CREATE TABLE table_name
2 (account_id NUMBER,
3 name VARCHAR2(15),
4 middle VARCHAR2(15),
5 last_name VARCHAR2(15),
6 address1 VARCHAR2(30),
7 address2 VARCHAR2(15),
8 city VARCHAR2(15))
9 /
Table created.
SCOTT@orcl12c> INSERT ALL
2 INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
3 (1, 'VXABCDYZ', 'MYMID', 'MYLASTNAME', '500 Oracle Parkway', '', 'Redwood Shores')
4 INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
5 (2, 'VXAB CDYZ', 'MYMID', 'MYLASTNAME', '500 Oracle Pkwy.', '', 'Redwood Shores')
6 INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
7 (3, 'VXABBCDYZ', 'M.', 'MYLASTNAME', '500 Oracle', '', 'R.S.')
8 INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
9 (4, 'B.', 'A.', 'Boehmer', '123 Someplace St.', '', 'Somewhere')
10 INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
11 (5, 'Barbara', '', 'Boehmer', '123 Someplace Street', '', 'Somewhere City')
12 SELECT * FROM DUAL
13 /
5 rows created.
SCOTT@orcl12c> COMMIT
2 /
Commit complete.
SCOTT@orcl12c> -- add shadow columns:
SCOTT@orcl12c> ALTER TABLE table_name ADD
2 (full_name VARCHAR2(45),
3 full_address VARCHAR2(50))
4 /
Table altered.
SCOTT@orcl12c> -- update existing rows:
SCOTT@orcl12c> UPDATE table_name
2 SET full_name =
3 TRANSLATE
4 (name || ' ' || middle || ' ' || last_name,
5 '.,()-/#:',
6 ' '),
7 full_address =
8 TRANSLATE
9 (address1 || ' ' || address2 || ' ' || city,
10 '.,()-/#:',
11 ' ')
12 /
5 rows updated.
SCOTT@orcl12c> COMMIT
2 /
Commit complete.
SCOTT@orcl12c> -- create trigger to update new rows:
SCOTT@orcl12c> CREATE OR REPLACE TRIGGER table_name_biur
2 BEFORE INSERT OR UPDATE ON table_name
3 FOR EACH ROW
4 BEGIN
5 :NEW.full_name :=
6 TRANSLATE
7 (:NEW.name || ' ' || :NEW.middle || ' ' || :NEW.last_name,
8 '.,()-/#:',
9 ' ');
10 :NEW.full_address :=
11 TRANSLATE
12 (:NEW.address1 || ' ' || :NEW.address2 || ' ' || :NEW.city,
13 '.,()-/#:',
14 ' ');
15 END table_name_biur;
16 /
Trigger created.
SCOTT@orcl12c> SHOW ERRORS
No errors.
SCOTT@orcl12c> -- new data:
SCOTT@orcl12c> INSERT ALL
2 INTO table_name (account_id, name, address1) VALUES (6, 'MR.AB,CD', 'NO 101 (3) 5TH CROSS 7TH MAIN')
3 INTO table_name (account_id, name, address1) VALUES (7, 'VXAB CDYZ', '25-14-909/B STREET#5')
4 INTO table_name (account_id, name, address1) VALUES (8, 'VXABBCDYZ', 'S/O, D.K. ABCB')
5 INTO table_name (account_id, name, address1) VALUES (9, 'ABCD,XYZ', 'FLAT 301 STREET NO 5')
6 INTO table_name (account_id, name, address1) VALUES (10, 'ABCD,XYZ', '1024,RESEARCH BLVD')
7 INTO table_name (account_id, name, address1) VALUES (11, 'BARBARA', 'FLAT:301 STREET NO: 5')
8 INTO table_name (account_id, name, address1) VALUES (12, 'XYZ,ABCD', '1024,RESEARCH BLVD')
9 SELECT * FROM DUAL
10 /
7 rows created.
SCOTT@orcl12c> COMMIT
2 /
Commit complete.
SCOTT@orcl12c> -- preferences, dummy column, and index:
SCOTT@orcl12c> BEGIN
2 CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
3 CTX_DDL.SET_ATTRIBUTE
4 ('test_mcds', 'COLUMNS', 'full_name, full_address');
5 CTX_DDL.CREATE_SECTION_GROUP ('test_sg', 'BASIC_SECTION_GROUP');
6 CTX_DDL.ADD_NDATA_SECTION ('test_sg', 'full_name', 'full_name');
7 CTX_DDL.ADD_NDATA_SECTION ('test_sg', 'full_address', 'full_address');
8 END;
9 /
PL/SQL procedure successfully completed.
SCOTT@orcl12c> ALTER TABLE table_name ADD (names_and_addresses VARCHAR2(1))
2 /
Table altered.
SCOTT@orcl12c> CREATE INDEX table_name_idx
2 ON table_name (names_and_addresses)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS
5 ('DATASTORE test_mcds
6 SECTION GROUP test_sg')
7 /
Index created.
SCOTT@orcl12c> -- query:
SCOTT@orcl12c> SELECT accounts, account_id
2 FROM (SELECT MAX
3 (CTX_QUERY.COUNT_HITS
4 ('table_name_idx',
5 'NDATA (full_name, ' || t2.full_name || ') AND ' ||
6 'NDATA (full_address, ' || t2.full_address || ')'))
7 AS accounts,
8 t1.account_id
9 FROM table_name t1, table_name t2
10 WHERE t1.account_id != t2.account_id
11 AND CONTAINS
12 (t1.names_and_addresses,
13 'NDATA (full_name, ' || t2.full_name || ') AND ' ||
14 'NDATA (full_address, ' || t2.full_address || ')',
15 1) > 30
16 GROUP BY t1.account_id)
17 WHERE accounts >= 2
18 ORDER BY accounts DESC, account_id
19 /
ACCOUNTS ACCOUNT_ID
---------- ----------
3 1
3 2
3 3
2 4
2 5
2 10
2 12
7 rows selected.
|
|
|
Re: CONTAIN function syntax usage [message #599969 is a reply to message #599881] |
Wed, 30 October 2013 10:00 |
|
ranki
Messages: 19 Registered: October 2013
|
Junior Member |
|
|
Thanks barbara I have got very good information about Oracle text with your help. But I am feeling I am stuck at the same point where I have started.MY whole intension was to speed up the process of identifying the customer with multiple accounts.Solution provided by you is working fine with small data,when it comes to large data say 20 million it is again hanging same like when I tried jaro winkler.
I have checked explain plans for both test table and my production table.I have attached.
[Updated on: Wed, 30 October 2013 12:38] Report message to a moderator
|
|
|
Re: CONTAIN function syntax usage [message #599987 is a reply to message #599969] |
Wed, 30 October 2013 13:43 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
In order to check for duplicates it has to do a full table scan, so with that much data, it is going to take a long time, no matter what you do. However, in theory, the following queries might be quicker. Please test them and let us know.
SELECT accounts, account_id
FROM (SELECT CTX_QUERY.COUNT_HITS
('table_name_idx',
'NDATA (full_name, ' || full_name || ') AND ' ||
'NDATA (full_address, ' || full_address || ')')
AS accounts,
account_id
FROM table_name)
WHERE accounts >= 5
ORDER BY accounts DESC, account_id;
SELECT COUNT(*) accounts, t1.account_id
FROM table_name t1, table_name t2
WHERE CONTAINS
(t1.names_and_addresses,
'NDATA (full_name, ' || t2.full_name || ') AND ' ||
'NDATA (full_address, ' || t2.full_address || ')') > 0
GROUP BY t1.account_id
HAVING COUNT(*) >= 5
ORDER BY accounts DESC, account_id;
|
|
|
Re: CONTAIN function syntax usage [message #600031 is a reply to message #599987] |
Thu, 31 October 2013 00:48 |
|
ranki
Messages: 19 Registered: October 2013
|
Junior Member |
|
|
Barbara,
Finally I have concluded that only query which comes out in 20 to 30 sec after parsing one value into this query is fastest of all options.
SELECT COUNT(*) accounts, account_id
FROM table_name
WHERE CONTAINS
(names_and_addresses,
'NDATA (full_name, Barbara Boehmer) AND NDATA (full_address, REDWOOD CITY)',5) > 0
GROUP BY account_id
HAVING COUNT(*) >= 5
I will write a cursor and pass one record and get the required duplicates and insert into another table if customer exceeds the limit. Finally concluded First time finding duplicates is going to take long time .After that I have to include some procedure to identify as and when the customer account is created.
[Updated on: Thu, 31 October 2013 00:48] Report message to a moderator
|
|
|
Goto Forum:
Current Time: Fri Dec 27 01:05:36 CST 2024
|