OraFAQ Forum: Text & interMedia » CONTAIN function syntax usage

Home » Server Options » Text & interMedia » CONTAIN function syntax usage

Show: Today's Messages :: Polls :: Message Navigator
E-mail to friend

CONTAIN function syntax usage [message #598719]

Thu, 17 October 2013 01:46

ranki
Messages: 19
Registered: October 2013

Junior Member

hello,
I want to know the syntax for using CONTAIN function
1.How to use contain for two values
SELECT score(1),name FROM TABLE
WHERE CONTAINS(name, 'MYNAME', 1 ) > 0
and CONTAINS(last_name, 'MYLASTNAME', 1 ) > 0
ORDER BY SCORE(1) DESC;

2.How to pass cursor value into CONTAIN function.
CONTAINS(v1.name,v2.name, 1 ) > 0 gives me error
DRG-50901: text query parser syntax error on line 1, column 1

3.I also want to know what ,1) means?

Please help me to understand.

Report message to a moderator

Re: CONTAIN function syntax usage [message #598739 is a reply to message #598719]

Thu, 17 October 2013 05:43

ranki
Messages: 19
Registered: October 2013

Junior Member

hello,
Thank you for moving my thread to Oracle text. I have gone through Barbara's answers for CONTAIN function.
My first point is cleared. But remaining 2 points are still to be cleared.
Apart from that CONTAIN function fails to fetch if a space is inserted between column value and if any extra alphabet is added to value.
'ABCD' is fetching
'AB CD' not fetching.
'ABBCD' not fetching.

Please help me for understanding this.

Report message to a moderator

Re: CONTAIN function syntax usage [message #598782 is a reply to message #598719]

Thu, 17 October 2013 12:55

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

1. If you are going to use two contains clauses, then you need two context indexes, as shown below.

SCOTT@orcl12c> CREATE TABLE table_name
  2    (name	   VARCHAR2(15),
  3  	last_name  VARCHAR2(15))
  4  /

Table created.

SCOTT@orcl12c> INSERT ALL
  2  INTO table_name VALUES ('MYNAME', 'MYLASTNAME')
  3  INTO table_name VALUES ('Barbara', 'Boehmer')
  4  SELECT * FROM DUAL
  5  /

2 rows created.

SCOTT@orcl12c> CREATE INDEX name_idx ON table_name (name)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  /

Index created.

SCOTT@orcl12c> CREATE INDEX last_name_idx ON table_name (last_name)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  /

Index created.

SCOTT@orcl12c> SELECT SCORE(1), SCORE(2), name, last_name
  2  FROM   table_name
  3  WHERE  CONTAINS (name, 'MYNAME',  1) > 0
  4  AND    CONTAINS (last_name, 'MYLASTNAME', 2 ) > 0
  5  ORDER  BY SCORE(1) DESC
  6  /

  SCORE(1)   SCORE(2) NAME            LAST_NAME
---------- ---------- --------------- ---------------
         4          4 MYNAME          MYLASTNAME

1 row selected.

However, if you use a multi_column_datastore and section group, you can use just one contains clause with only one index hit, which is a more efficient query, as shown below.

SCOTT@orcl12c> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
  3    CTX_DDL.SET_ATTRIBUTE ('test_mcds', 'COLUMNS', 'name, last_name');
  4  END;
  5  /

PL/SQL procedure successfully completed.

SCOTT@orcl12c> ALTER TABLE table_name ADD (names  VARCHAR2(1))
  2  /

Table altered.

SCOTT@orcl12c> CREATE INDEX names_idx ON table_name (names)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  PARAMETERS
  4    ('DATASTORE	test_mcds
  5  	 SECTION GROUP	CTXSYS.AUTO_SECTION_GROUP')
  6  /

Index created.

SCOTT@orcl12c> SELECT SCORE(3), name, last_name
  2  FROM   table_name
  3  WHERE  CONTAINS
  4  	      (names,
  5  	       'MYNAME WITHIN name AND
  6  		MYLASTNAME WITHIN last_name',
  7  	       3 ) > 0
  8  ORDER  BY SCORE(3) DESC
  9  /

  SCORE(3) NAME            LAST_NAME
---------- --------------- ---------------
         4 MYNAME          MYLASTNAME

1 row selected.

2. The following demonstrates using cursor values in a contains clause.

SCOTT@orcl12c> SET SERVEROUTPUT ON FORMAT WRAPPED
SCOTT@orcl12c> BEGIN
  2    DBMS_OUTPUT.PUT_LINE
  3  	 ('  SCORE(4) NAME	      LAST_NAME');
  4    DBMS_OUTPUT.PUT_LINE
  5  	 ('---------- --------------- ---------------');
  6    FOR v1 IN (SELECT 'MYNAME' name FROM DUAL) LOOP
  7  	 FOR v2 IN (SELECT 'MYLASTNAME' name FROM DUAL) LOOP
  8  	   FOR v3 IN
  9  	     (SELECT SCORE(4), name, last_name
 10  	      FROM   table_name
 11  	      WHERE  CONTAINS
 12  		       (names,
 13  			v1.name || ' WITHIN name AND ' ||
 14  			v2.name || ' WITHIN last_name',
 15  			4) > 0)
 16  	   LOOP
 17  	     DBMS_OUTPUT.PUT_LINE
 18  	       (LPAD (v3."SCORE(4)", 10) || ' ' ||
 19  		RPAD (v3.name, 16) || v3.last_name);
 20  	   END LOOP;
 21  	 END LOOP;
 22    END LOOP;
 23  END;
 24  /
  SCORE(4) NAME            LAST_NAME
---------- --------------- ---------------
         4 MYNAME          MYLASTNAME

PL/SQL procedure successfully completed.

3. http://docs.oracle.com/cd/E16655_01/text.121/e17747/csql.htm#CCREF0107

Report message to a moderator

Re: CONTAIN function syntax usage [message #598783 is a reply to message #598739]

Thu, 17 October 2013 13:25

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

ranki wrote on Thu, 17 October 2013 03:43

... Apart from that CONTAIN function fails to fetch if a space is inserted between column value and if any extra alphabet is added to value.
'ABCD' is fetching
'AB CD' not fetching.
'ABBCD' not fetching.

Please help me for understanding this.

Your question is unclear. Are the values that you posted the values in your table's column or the values of the strings that you are searching for? In general, Oracle Text is token-based, meaning that it searches for tokens, which are strings of text characters between break characters, which usually means words between spaces. So, if you search for 'ABCD', it searches for any value that contains the token/word 'ABCD'. If you search for 'AB CD', it searches for any value that contains the token/word 'AB' followed by the token/word 'CD' separated by a space or other break character, such as a comma. If you search for 'ABBCD', then it searches for any value containing the token/word 'ABBCD'. There are features such as NDATA that will allow you to search for close matches with and without spaces and extra characters and such. It helps if you post a more complete example of what you are trying to do. If you are searching for names that are similarly spelled and may contain spaces or be first before last or last before first, then that is what NDATA is designed for. The following is an extension of the previous example using NDATA. Note that if the strings are too short, it may not find one that has an extra character or space. The more characters that match, the more likely it will be in the result set and the higher the score.

SCOTT@orcl12c> INSERT ALL
  2  INTO table_name (name, last_name) VALUES ('VXABCDYZ', 'MYLASTNAME')
  3  INTO table_name (name, last_name) VALUES ('VXAB CDYZ', 'MYLASTNAME')
  4  INTO table_name (name, last_name) VALUES ('VXABBCDYZ', 'MYLASTNAME')
  5  SELECT * FROM DUAL
  6  /

3 rows created.

SCOTT@orcl12c> BEGIN
  2    CTX_DDL.CREATE_SECTION_GROUP ('namegroup', 'BASIC_SECTION_GROUP');
  3    CTX_DDL.ADD_NDATA_SECTION ('NAMEGROUP', 'NAME', 'NAME');
  4    CTX_DDL.ADD_NDATA_SECTION ('NAMEGROUP', 'LAST_NAME', 'LAST_NAME');
  5  END;
  6  /

PL/SQL procedure successfully completed.

SCOTT@orcl12c> DROP INDEX names_idx
  2  /

Index dropped.

SCOTT@orcl12c> CREATE INDEX names_idx ON table_name (names)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  PARAMETERS
  4    ('DATASTORE	test_mcds
  5  	 SECTION GROUP	namegroup')
  6  /

Index created.

SCOTT@orcl12c> SELECT SCORE(5), name, last_name
  2  FROM   table_name
  3  WHERE  CONTAINS
  4  	      (names,
  5  	       'NDATA (name, VXABCDYZ) AND
  6  		NDATA (last_name, MYLASTNAME)',
  7  	       5) > 0
  8  ORDER  BY SCORE(5) DESC
  9  /

  SCORE(5) NAME            LAST_NAME
---------- --------------- ---------------
       100 VXABCDYZ        MYLASTNAME
        91 VXABBCDYZ       MYLASTNAME
        91 VXAB CDYZ       MYLASTNAME

3 rows selected.

Report message to a moderator

Re: CONTAIN function syntax usage [message #598856 is a reply to message #598783]

Fri, 18 October 2013 07:31

ranki
Messages: 19
Registered: October 2013

Junior Member

Barbara,
Thank you so much for your excellent explanation. Ignoring space and type test cases is perfect for my requirement.
I will try and update back.

Report message to a moderator

Re: CONTAIN function syntax usage [message #598893 is a reply to message #598856]

Fri, 18 October 2013 15:16

ranki
Messages: 19
Registered: October 2013

Junior Member

Barbara,
I have tested and it is working fine and also very fast.Can you let me know if cursor value parsing holds good for NDATA also ?Can you provide one example please.
SELECT SCORE(5), name, last_name
2 FROM table_name
3 WHERE CONTAINS
4 (names,
5 'NDATA (name, VXABCDYZ) AND
6 NDATA (last_name, MYLASTNAME)',
7 5) > 0
8 ORDER BY SCORE(5) DESC

Report message to a moderator

Re: CONTAIN function syntax usage [message #598898 is a reply to message #598893]

Fri, 18 October 2013 17:38

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

If you have the following test environment consisting of table, data, multi_column_datastore, section group with ndata sections, and context index:

SCOTT@orcl12c> CREATE TABLE table_name
  2    (name	   VARCHAR2(15),
  3  	last_name  VARCHAR2(15))
  4  /

Table created.

SCOTT@orcl12c> INSERT ALL
  2  INTO table_name (name, last_name) VALUES ('MYNAME', 'MYLASTNAME')
  3  INTO table_name (name, last_name) VALUES ('Barbara', 'Boehmer')
  4  INTO table_name (name, last_name) VALUES ('VXABCDYZ', 'MYLASTNAME')
  5  INTO table_name (name, last_name) VALUES ('VXAB CDYZ', 'MYLASTNAME')
  6  INTO table_name (name, last_name) VALUES ('VXABBCDYZ', 'MYLASTNAME')
  7  SELECT * FROM DUAL
  8  /

5 rows created.

SCOTT@orcl12c> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
  3    CTX_DDL.SET_ATTRIBUTE ('test_mcds', 'COLUMNS', 'name, last_name');
  4    CTX_DDL.CREATE_SECTION_GROUP ('namegroup', 'BASIC_SECTION_GROUP');
  5    CTX_DDL.ADD_NDATA_SECTION ('NAMEGROUP', 'NAME', 'NAME');
  6    CTX_DDL.ADD_NDATA_SECTION ('NAMEGROUP', 'LAST_NAME', 'LAST_NAME');
  7  END;
  8  /

PL/SQL procedure successfully completed.

SCOTT@orcl12c> ALTER TABLE table_name ADD (names  VARCHAR2(1))
  2  /

Table altered.

SCOTT@orcl12c> CREATE INDEX names_idx ON table_name (names)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  PARAMETERS
  4    ('DATASTORE	test_mcds
  5  	 SECTION GROUP	namegroup')
  6  /

Index created.

Then you can pass cursor values like so:

SCOTT@orcl12c> SET SERVEROUTPUT ON FORMAT WRAPPED
SCOTT@orcl12c> BEGIN
  2    DBMS_OUTPUT.PUT_LINE
  3  	 ('  SCORE(4) NAME	      LAST_NAME');
  4    DBMS_OUTPUT.PUT_LINE
  5  	 ('---------- --------------- ---------------');
  6    FOR v1 IN (SELECT 'VXABCDYZ' name FROM DUAL) LOOP
  7  	 FOR v2 IN (SELECT 'MYLASTNAME' name FROM DUAL) LOOP
  8  	   FOR v3 IN
  9  	     (SELECT SCORE(5), name, last_name
 10  	      FROM   table_name
 11  	      WHERE  CONTAINS
 12  		       (names,
 13  			'NDATA (name, ' || v1.name || ') AND ' ||
 14  			'NDATA (last_name, ' || v2.name || ')',
 15  			5) > 0)
 16  	   LOOP
 17  	     DBMS_OUTPUT.PUT_LINE
 18  	       (LPAD (v3."SCORE(5)", 10) || ' ' ||
 19  		RPAD (v3.name, 16) || v3.last_name);
 20  	   END LOOP;
 21  	 END LOOP;
 22    END LOOP;
 23  END;
 24  /
  SCORE(4) NAME            LAST_NAME
---------- --------------- ---------------
       100 VXABCDYZ        MYLASTNAME
        91 VXAB CDYZ       MYLASTNAME
        91 VXABBCDYZ       MYLASTNAME

PL/SQL procedure successfully completed.

However, as a general rule, any time that you can accomplish the same thing in SQL alone, instead of using PL/SQL it is likely to be more efficient. So, the following SQL that accomplishes the same thing as the PL/SQL above, without looping through cursors, is more efficient.

SCOTT@orcl12c> SELECT SCORE(5), t.name, t.last_name
  2  FROM   table_name t,
  3  	    (SELECT 'VXABCDYZ' name FROM DUAL) v1,
  4  	    (SELECT 'MYLASTNAME' name FROM DUAL) v2
  5  WHERE  CONTAINS
  6  	      (names,
  7  	       'NDATA (name, ' || v1.name || ') AND ' ||
  8  	       'NDATA (last_name, ' || v2.name || ')',
  9  	       5) > 0
 10  /

  SCORE(5) NAME            LAST_NAME
---------- --------------- ---------------
       100 VXABCDYZ        MYLASTNAME
        91 VXAB CDYZ       MYLASTNAME
        91 VXABBCDYZ       MYLASTNAME

3 rows selected.

It would probably help if you explained exactly what you are trying to do from a user standpoint, instead of just what you think you need to do from a programmer standpoint. Many times people ask if they can do something the hard way, but if we knew what they were trying to accomplish, we could point them toward an easier way.

Report message to a moderator

Re: CONTAIN function syntax usage [message #598904 is a reply to message #598898]

Sat, 19 October 2013 01:33

ranki
Messages: 19
Registered: October 2013

Junior Member

Thank you Barbara..
I am trying to fetch customer with multiple accounts from a very large table more than 20 millions.
When I use JARO WINKLE or EDIT DISTANCE there are performance issues.
CONTAINS and NDATA seems to be very good to fetch similar records first in a faster way.
My table CUSTOMERS
ACCOUNT ID
NAME
MIDDLE
LAST_NAME
ADDRESS1
ADDRESS2
CITY

Now I am trying to get records of same customer by checking name,middle_name,last_name and then compare address.
If name and address almost matches then I will blacklist this customer if he has more no of accounts.

If required I want to use jaro wrinkle or edit distance from the list I have already taken out using CONTAIN.

Report message to a moderator

Re: CONTAIN function syntax usage [message #598921 is a reply to message #598904]

Sat, 19 October 2013 17:45

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

Since you are searching for duplicates already in the data, you will need to use a self-join. I am guessing that none of your data is validated and they can enter anything in any column. So, I would probably concatenate the names into one full_name column and concatenate the address data into one full_address column in the multi_column_datastore, then create ndata sections for each. Even though ndata is designed primarily for names, it think it works well for your address data here as well and allows you to use one index to search for both and provide a combined score. You may find that this is sufficient. However, if you then search using contains and jaro_winkler_similarity and edit_distance_similarity, the optimizer is smart enough to use the domain index first, then apply the other filters to that result set. You can also use ctx_query.count_hits to determine the number of rows that match the criteria, which is the number of accounts that are similar to one another. I have provided a complete example below, in which I have used three different scoring methods and limited the rows by those. You may want to use just part of it. I have displayed only the number of accounts, matching id's and scores, but you could display the name and address values as well if you like. You could also wrap an outer query around it and limit it by the value from ctx_query.count_hits, so you could list just those accounts with at least a certain number of duplicates. I have also provided a simplified query below that does that, providing a simple list of just the number of accounts and account_id, starting with the ones with the most similar accounts, which could be used a sort of a potential ban list. I urge you to be cautious in verifying before banning, as there are many names and addresses that may be similar, but are actually different people. I was surprised to find out how many people have the identical first and last name as I do, some of which even live in the same state. So, be careful.

SCOTT@orcl12c> CREATE TABLE table_name
  2    (account_id  NUMBER,
  3  	name	    VARCHAR2(15),
  4  	middle	    VARCHAR2(15),
  5  	last_name   VARCHAR2(15),
  6  	address1    VARCHAR2(20),
  7  	address2    VARCHAR2(15),
  8  	city	    VARCHAR2(15))
  9  /

Table created.

SCOTT@orcl12c> INSERT ALL
  2  INTO table_name VALUES
  3    (1, 'VXABCDYZ', 'MYMID', 'MYLASTNAME', '500 Oracle Parkway', '', 'Redwood Shores')
  4  INTO table_name VALUES
  5    (2, 'VXAB CDYZ', 'MYMID', 'MYLASTNAME', '500 Oracle Pkwy.', '', 'Redwood Shores')
  6  INTO table_name VALUES
  7    (3, 'VXABBCDYZ', 'M.', 'MYLASTNAME', '500 Oracle', '', 'R.S.')
  8  INTO table_name VALUES
  9    (4, 'B.', 'A.', 'Boehmer', '123 Someplace St.', '', 'Somewhere')
 10  INTO table_name VALUES
 11    (5, 'Barbara', '', 'Boehmer', '123 Someplace Street', '', 'Somewhere City')
 12  SELECT * FROM DUAL
 13  /

5 rows created.

SCOTT@orcl12c> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
  3    CTX_DDL.SET_ATTRIBUTE
  4  	 ('test_mcds',
  5  	  'COLUMNS',
  6  	  'name || '' '' || middle || '' '' || last_name full_name,
  7  	   address1 || '' '' || address2 || '' '' || city full_address');
  8    CTX_DDL.CREATE_SECTION_GROUP ('test_sg', 'BASIC_SECTION_GROUP');
  9    CTX_DDL.ADD_NDATA_SECTION ('test_sg', 'full_name', 'full_name');
 10    CTX_DDL.ADD_NDATA_SECTION ('test_sg', 'full_address', 'full_address');
 11  END;
 12  /

PL/SQL procedure successfully completed.

SCOTT@orcl12c> ALTER TABLE table_name ADD (names_and_addresses	VARCHAR2(1))
  2  /

Table altered.

SCOTT@orcl12c> CREATE INDEX table_name_idx
  2  ON table_name (names_and_addresses)
  3  INDEXTYPE IS CTXSYS.CONTEXT
  4  PARAMETERS
  5    ('DATASTORE	test_mcds
  6  	 SECTION GROUP	test_sg')
  7  /

Index created.

SCOTT@orcl12c> BREAK ON accounts ON account_id SKIP 1
SCOTT@orcl12c> SELECT CTX_QUERY.COUNT_HITS
  2  	      ('table_name_idx',
  3  	       'NDATA (full_name, ' || t2.name || ' ' || t2.middle || ' ' || t2.last_name || ') AND ' ||
  4  	       'NDATA (full_address, ' || t2.address1 || ' ' || t2.address2 || ' ' || t2.city || ')')
  5  	      AS accounts,
  6  	    t1.account_id,
  7  	    t2.account_id duplicates,
  8  	    SCORE(1) AS score,
  9  	    UTL_MATCH.JARO_WINKLER_SIMILARITY
 10  	      (t1.name || ' ' || t1.middle || ' ' || t1.last_name,
 11  	       t2.name || ' ' || t2.middle || ' ' || t2.last_name)
 12  	      AS jws,
 13  	    UTL_MATCH.EDIT_DISTANCE_SIMILARITY
 14  	      (t1.address1 || ' ' || t1.address2 || ' ' || t1.city,
 15  	       t2.address1 || ' ' || t2.address2 || ' ' || t2.city)
 16  	      AS eds
 17  FROM   table_name t1, table_name t2
 18  WHERE  t1.account_id != t2.account_id
 19  AND    CONTAINS
 20  	      (t1.names_and_addresses,
 21  	       'NDATA (full_name, ' || t2.name || ' ' || t2.middle || ' ' || t2.last_name || ') AND ' ||
 22  	       'NDATA (full_address, ' || t2.address1 || ' ' || t2.address2 || ' ' || t2.city || ')',
 23  	       1) > 30
 24  AND    UTL_MATCH.JARO_WINKLER_SIMILARITY
 25  	      (t1.name || ' ' || t1.middle || ' ' || t1.last_name,
 26  	       t2.name || ' ' || t2.middle || ' ' || t2.last_name) > 60
 27  AND    UTL_MATCH.EDIT_DISTANCE_SIMILARITY
 28  	      (t1.address1 || ' ' || t1.address2 || ' ' || t1.city,
 29  	       t2.address1 || ' ' || t2.address2 || ' ' || t2.city) > 40
 30  ORDER  BY accounts DESC, account_id, score DESC, jws DESC, eds DESC
 31  /

  ACCOUNTS ACCOUNT_ID DUPLICATES      SCORE        JWS        EDS
---------- ---------- ---------- ---------- ---------- ----------
         3          1          2         84         95         89
                               3         70         90         42

                    2          1         85         95         89
                               3         70         86         44

                    3          2         39         86         44
                               1         38         90         42

         2          4          5         52         69         75

                    5          4         53         69         75


8 rows selected.

SCOTT@orcl12c> CLEAR BREAKS
SCOTT@orcl12c> SELECT accounts, account_id
  2  FROM   (SELECT MAX
  3  		      (CTX_QUERY.COUNT_HITS
  4  			('table_name_idx',
  5  			 'NDATA (full_name, ' || t2.name || ' ' || t2.middle
  6  				 || ' ' || t2.last_name || ') AND ' ||
  7  			 'NDATA (full_address, ' || t2.address1 || ' '
  8  				 || t2.address2 || ' ' || t2.city || ')'))
  9  		      AS accounts,
 10  		    t1.account_id
 11  	     FROM   table_name t1, table_name t2
 12  	     WHERE  t1.account_id != t2.account_id
 13  	     AND    CONTAINS
 14  		      (t1.names_and_addresses,
 15  		       'NDATA (full_name, ' || t2.name || ' ' || t2.middle
 16  			       || ' ' || t2.last_name || ') AND ' ||
 17  		       'NDATA (full_address, ' || t2.address1 || ' '
 18  			       || t2.address2 || ' ' || t2.city || ')',
 19  		       1) > 30
 20  	     GROUP  BY t1.account_id)
 21  WHERE  accounts >= 2
 22  ORDER  BY accounts DESC, account_id
 23  /

  ACCOUNTS ACCOUNT_ID
---------- ----------
         3          1
         3          2
         3          3
         2          4
         2          5

5 rows selected.

Report message to a moderator

Re: CONTAIN function syntax usage [message #599003 is a reply to message #598921]

Mon, 21 October 2013 06:24

ranki
Messages: 19
Registered: October 2013

Junior Member

Barbara,Thanks a lot for your detailed test case. Actually my data is validated at the front end but to be extra cautious I am asked to ignore spaces and TYPO. And as you have suggested I will be cautious while taking the list.As I am matching full address hopefully there will be not be any mistakes.
Now I am stuck with the INDEX creation . My data is around 20 million and the index is taking very long to create.
space taken till now is more than 40gb and still hanging. Any suggestion for fast index creation please?

Index creation does not come out and checked space is available.Index status shows valid. Tried three times but getting error
ORA-29861: domain index is marked LOADING/FAILED/UNUSABLE if I test update the table although I know index creation was incomplete.

[Updated on: Mon, 21 October 2013 11:47]

Report message to a moderator

Re: CONTAIN function syntax usage [message #599049 is a reply to message #599003]

Mon, 21 October 2013 16:07

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

The error that you received may just indicate that the index is not through "loading". In other words, it is not done being created, so it can't be used yet. The following link is to a section of the online documentation that has some suggestions about things that you can do to affect the speed of index creation.

http://docs.oracle.com/cd/E16655_01/text.121/e17748/aoptim.htm#i1006756

Report message to a moderator

Re: CONTAIN function syntax usage [message #599110 is a reply to message #599049]

Tue, 22 October 2013 04:51

ranki
Messages: 19
Registered: October 2013

Junior Member

Thank you, I have rebuild the Index and it was created. But creation index three times failed. Rebuild index worked.
Not sure of the reason but my problem is resolved.

Report message to a moderator

Re: CONTAIN function syntax usage [message #599638 is a reply to message #599110]

Sun, 27 October 2013 10:01

ranki
Messages: 19
Registered: October 2013

Junior Member

hi Barbara,
How to handle special characters in the column values ?I am getting DRG-50901: text query parser syntax error on line 1, column 81 error if comma "," or hyphen "-" is there in the address. How to handle this please let me know.

how to ignore special characters in CTXSYS search.

Report message to a moderator

Re: CONTAIN function syntax usage [message #599639 is a reply to message #599638]

Sun, 27 October 2013 13:04

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

Please provide a copy and paste of a run of a complete test case that uses those characters and produces the problem.

Report message to a moderator

Re: CONTAIN function syntax usage [message #599646 is a reply to message #599639]

Sun, 27 October 2013 23:55

ranki
Messages: 19
Registered: October 2013

Junior Member

Barbara,
I have noticed listed address which has comma,hyphen,() are giving error. If I remove those characters and replace with space it is working fine. slash "/" is working fine.

I have tried to
1.NO 101(3) 5TH CROSS 7TH MAIN
2.25-14-909-1
3.S/O, D.K. ABCD
4.1-77,WXYV
5.1-267-1(1)

I have tried to set attribute. But comma still gives me error.
begin
ctx_ddl.create_preference ('test_lex1', 'basic_lexer');
ctx_ddl.set_attribute ('test_lex1', 'whitespace', '/\|-_+,&''');
end;

[Updated on: Sun, 27 October 2013 23:56]

Report message to a moderator

Re: CONTAIN function syntax usage [message #599771 is a reply to message #599646]

Mon, 28 October 2013 15:46

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

As I said before, you need to provide a test case that reproduces the problem. I don't know whether what you have provided is the data from your table or the strings that you are searching for. I don't know what your contains query looks like. I am not going to guess at all the various possibilities of what you may be doing wrong. The longer that you postpone providing a proper test case, the longer you will wait for an answer, especially when we seem to post at different times of day, so it takes a full 24 hours for each question and answer. Please read the forums guidelines for how to post a question in such a manner that provides what we need, in order to provide an answer appropriate to your situation.

Report message to a moderator

Re: CONTAIN function syntax usage [message #599789 is a reply to message #599771]

Tue, 29 October 2013 00:40

ranki
Messages: 19
Registered: October 2013

Junior Member

Sorry for no test case provided. please find the test case
My table has the data given below.

ac_id FULL_NAME ADDRESS
----- --------------------------------------------------
1 MR.AB,CD NO 101 (3) 5TH CROSS 7TH MAIN
2 VXAB CDYZ 25-14-909/B STREET#5
3 VXABBCDYZ S/O, D.K. ABCB
4 ABCD,XYZ FLAT 301 STREET NO 5
5 ABCD,XYZ 1024,RESEARCH BLVD
6 BARBARA FLAT:301 STREET NO: 5

1.Even if name, middle_name, last name are provided some times they enter full name in name column itself with a comma.

2.Address has special characters like brace(),hash#,comma and semicolumn. When I replace all these with white space then
the query
provided by you does not give any error.
SELECT accounts, account_id
FROM (SELECT MAX
(CTX_QUERY.COUNT_HITS
('table_name_idx',
'NDATA (full_name, ' || t2.name || ' ' || t2.middle
|| ' ' || t2.last_name || ') AND ' ||
'NDATA (full_address, ' || t2.address1 || ' '
|| t2.address2 || ' ' || t2.city || ')'))
AS accounts,
t1.account_id
FROM table_name t1, table_name t2
WHERE t1.account_id != t2.account_id
AND CONTAINS
(t1.names_and_addresses,
'NDATA (full_name, ' || t2.name || ' ' || t2.middle
|| ' ' || t2.last_name || ') AND ' ||
'NDATA (full_address, ' || t2.address1 || ' '
|| t2.address2 || ' ' || t2.city || ')',
1) > 30
GROUP BY t1.account_id)
WHERE accounts >= 2
ORDER BY accounts DESC, account_id

[Updated on: Tue, 29 October 2013 02:01]

Report message to a moderator

Re: CONTAIN function syntax usage [message #599881 is a reply to message #599789]

Tue, 29 October 2013 15:52

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

You need to re-read the forum guidelines, especially the part about how to provide a proper test case. Your test case should have included create table and insert statements, so that anyone can run them on their system. You didn't even provide data for your actual columns, just the concatenated full_name and full_address. Also, none of the data your provided is close enough to constitute a duplicate.

The ndata doesn't seem to be compatible with some other things, such as lexers. So, it appears that you need to remove the special characters from the columns before indexing. Also, because you are doing a self-join, using the same column values to search, you need to remove the special characters. Although duplicate data storage is generally undesirable, in this case using shadow columns may be the best workaround. I have provide an example below. I have added shadow columns for the full_name and full_address. I have also provided an update statement to update existing data and a trigger to update new data. I have used the translate function to remove the special characters from these columns. Then I have simplified the rest (preferences, index, and query) to use just the shadow columns.

SCOTT@orcl12c> -- table and existing data:
SCOTT@orcl12c> CREATE TABLE table_name
  2    (account_id  NUMBER,
  3  	name	    VARCHAR2(15),
  4  	middle	    VARCHAR2(15),
  5  	last_name   VARCHAR2(15),
  6  	address1    VARCHAR2(30),
  7  	address2    VARCHAR2(15),
  8  	city	    VARCHAR2(15))
  9  /

Table created.

SCOTT@orcl12c> INSERT ALL
  2  INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
  3    (1, 'VXABCDYZ', 'MYMID', 'MYLASTNAME', '500 Oracle Parkway', '', 'Redwood Shores')
  4  INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
  5    (2, 'VXAB CDYZ', 'MYMID', 'MYLASTNAME', '500 Oracle Pkwy.', '', 'Redwood Shores')
  6  INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
  7    (3, 'VXABBCDYZ', 'M.', 'MYLASTNAME', '500 Oracle', '', 'R.S.')
  8  INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
  9    (4, 'B.', 'A.', 'Boehmer', '123 Someplace St.', '', 'Somewhere')
 10  INTO table_name (account_id, name, middle, last_name, address1, address2, city) VALUES
 11    (5, 'Barbara', '', 'Boehmer', '123 Someplace Street', '', 'Somewhere City')
 12  SELECT * FROM DUAL
 13  /

5 rows created.

SCOTT@orcl12c> COMMIT
  2  /

Commit complete.

SCOTT@orcl12c> -- add shadow columns:
SCOTT@orcl12c> ALTER TABLE table_name ADD
  2    (full_name     VARCHAR2(45),
  3  	full_address  VARCHAR2(50))
  4  /

Table altered.

SCOTT@orcl12c> -- update existing rows:
SCOTT@orcl12c> UPDATE table_name
  2  SET    full_name =
  3  	      TRANSLATE
  4  		(name || ' ' || middle || ' ' || last_name,
  5  		 '.,()-/#:',
  6  		 '	  '),
  7  	    full_address =
  8  	      TRANSLATE
  9  		(address1 || ' ' || address2 || ' ' || city,
 10  		 '.,()-/#:',
 11  		 '	  ')
 12  /

5 rows updated.

SCOTT@orcl12c> COMMIT
  2  /

Commit complete.

SCOTT@orcl12c> -- create trigger to update new rows:
SCOTT@orcl12c> CREATE OR REPLACE TRIGGER table_name_biur
  2    BEFORE INSERT OR UPDATE ON table_name
  3    FOR EACH ROW
  4  BEGIN
  5    :NEW.full_name :=
  6  	 TRANSLATE
  7  	  (:NEW.name || ' ' || :NEW.middle || ' ' || :NEW.last_name,
  8  	   '.,()-/#:',
  9  	   '	    ');
 10    :NEW.full_address :=
 11  	 TRANSLATE
 12  	  (:NEW.address1 || ' ' || :NEW.address2 || ' ' || :NEW.city,
 13  	   '.,()-/#:',
 14  	   '	    ');
 15  END table_name_biur;
 16  /

Trigger created.

SCOTT@orcl12c> SHOW ERRORS
No errors.
SCOTT@orcl12c> -- new data:
SCOTT@orcl12c> INSERT ALL
  2  INTO table_name (account_id, name, address1) VALUES (6, 'MR.AB,CD', 'NO 101 (3) 5TH CROSS 7TH MAIN')
  3  INTO table_name (account_id, name, address1) VALUES (7, 'VXAB CDYZ', '25-14-909/B STREET#5')
  4  INTO table_name (account_id, name, address1) VALUES (8, 'VXABBCDYZ', 'S/O, D.K. ABCB')
  5  INTO table_name (account_id, name, address1) VALUES (9, 'ABCD,XYZ', 'FLAT 301 STREET NO 5')
  6  INTO table_name (account_id, name, address1) VALUES (10, 'ABCD,XYZ', '1024,RESEARCH BLVD')
  7  INTO table_name (account_id, name, address1) VALUES (11, 'BARBARA', 'FLAT:301 STREET NO: 5')
  8  INTO table_name (account_id, name, address1) VALUES (12, 'XYZ,ABCD', '1024,RESEARCH BLVD')
  9  SELECT * FROM DUAL
 10  /

7 rows created.

SCOTT@orcl12c> COMMIT
  2  /

Commit complete.

SCOTT@orcl12c> -- preferences, dummy column, and index:
SCOTT@orcl12c> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
  3    CTX_DDL.SET_ATTRIBUTE
  4  	 ('test_mcds', 'COLUMNS', 'full_name, full_address');
  5    CTX_DDL.CREATE_SECTION_GROUP ('test_sg', 'BASIC_SECTION_GROUP');
  6    CTX_DDL.ADD_NDATA_SECTION ('test_sg', 'full_name', 'full_name');
  7    CTX_DDL.ADD_NDATA_SECTION ('test_sg', 'full_address', 'full_address');
  8  END;
  9  /

PL/SQL procedure successfully completed.

SCOTT@orcl12c> ALTER TABLE table_name ADD (names_and_addresses	VARCHAR2(1))
  2  /

Table altered.

SCOTT@orcl12c> CREATE INDEX table_name_idx
  2  ON table_name (names_and_addresses)
  3  INDEXTYPE IS CTXSYS.CONTEXT
  4  PARAMETERS
  5    ('DATASTORE	test_mcds
  6  	 SECTION GROUP	test_sg')
  7  /

Index created.

SCOTT@orcl12c> -- query:
SCOTT@orcl12c> SELECT accounts, account_id
  2  FROM   (SELECT MAX
  3  		      (CTX_QUERY.COUNT_HITS
  4  			('table_name_idx',
  5  			 'NDATA (full_name, ' || t2.full_name || ') AND ' ||
  6  			 'NDATA (full_address, ' || t2.full_address || ')'))
  7  		      AS accounts,
  8  		    t1.account_id
  9  	     FROM   table_name t1, table_name t2
 10  	     WHERE  t1.account_id != t2.account_id
 11  	     AND    CONTAINS
 12  		      (t1.names_and_addresses,
 13  		       'NDATA (full_name, ' || t2.full_name || ') AND ' ||
 14  		       'NDATA (full_address, ' || t2.full_address || ')',
 15  		       1) > 30
 16  	     GROUP  BY t1.account_id)
 17  WHERE  accounts >= 2
 18  ORDER  BY accounts DESC, account_id
 19  /

  ACCOUNTS ACCOUNT_ID
---------- ----------
         3          1
         3          2
         3          3
         2          4
         2          5
         2         10
         2         12

7 rows selected.

Report message to a moderator

Re: CONTAIN function syntax usage [message #599969 is a reply to message #599881]

Wed, 30 October 2013 10:00

ranki
Messages: 19
Registered: October 2013

Junior Member

Thanks barbara I have got very good information about Oracle text with your help. But I am feeling I am stuck at the same point where I have started.MY whole intension was to speed up the process of identifying the customer with multiple accounts.Solution provided by you is working fine with small data,when it comes to large data say 20 million it is again hanging same like when I tried jaro winkler.
I have checked explain plans for both test table and my production table.I have attached.

Attachment: explain_plan.txt
(Size: 6.38KB, Downloaded 2619 times)

[Updated on: Wed, 30 October 2013 12:38]

Report message to a moderator

Re: CONTAIN function syntax usage [message #599987 is a reply to message #599969]

Wed, 30 October 2013 13:43

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

In order to check for duplicates it has to do a full table scan, so with that much data, it is going to take a long time, no matter what you do. However, in theory, the following queries might be quicker. Please test them and let us know.

SELECT accounts, account_id
FROM   (SELECT CTX_QUERY.COUNT_HITS 
                 ('table_name_idx',
                  'NDATA (full_name, ' || full_name || ') AND ' ||
                  'NDATA (full_address, ' || full_address || ')')
                 AS accounts,
               account_id
        FROM    table_name)
WHERE   accounts >= 5
ORDER   BY accounts DESC, account_id;

SELECT COUNT(*) accounts, t1.account_id
FROM   table_name t1, table_name t2
WHERE  CONTAINS 
         (t1.names_and_addresses,
          'NDATA (full_name, ' || t2.full_name || ') AND ' ||
          'NDATA (full_address, ' || t2.full_address || ')') > 0
GROUP  BY t1.account_id
HAVING COUNT(*) >= 5
ORDER  BY accounts DESC, account_id;

Report message to a moderator

Re: CONTAIN function syntax usage [message #600031 is a reply to message #599987]

Thu, 31 October 2013 00:48

ranki
Messages: 19
Registered: October 2013

Junior Member

Barbara,
Finally I have concluded that only query which comes out in 20 to 30 sec after parsing one value into this query is fastest of all options.

SELECT COUNT(*) accounts, account_id
FROM table_name
WHERE CONTAINS
(names_and_addresses,
'NDATA (full_name, Barbara Boehmer) AND NDATA (full_address, REDWOOD CITY)',5) > 0
GROUP BY account_id
HAVING COUNT(*) >= 5

I will write a cursor and pass one record and get the required duplicates and insert into another table if customer exceeds the limit. Finally concluded First time finding duplicates is going to take long time .After that I have to include some procedure to identify as and when the customer account is created.

[Updated on: Thu, 31 October 2013 00:48]

Report message to a moderator

Previous Topic:	Domain index >stemming concept
Next Topic:	Oracle text related internal procedure taking a lot of time in our Prod. database

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

]

Current Time: Mon Jan 12 00:20:27 CST 2026