Home » Developer & Programmer » Data Integration » Categorising information,
Categorising information, [message #321408] Tue, 20 May 2008 03:30 Go to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
Hey everyone,

Recently I've started learning some pl-SQL and I've been writing some small functions and procedures, and I've run into a problem i cant get my hand around, and i was hoping someone here could help me get back on track.

The problem, in my database I've got a colom containing a lot of different information (mostly its a part of a sentence) as i want to use this information in a report i thought it was nice to get this all categorised.

I have made a function using SOUND-EX and a table containing keywords, so i could match the information with the keywords and when this would match i would know the category. unfortunately this doesn't work and doesn't give the results I'm looking for. I have discussed the Sound-ex with some other people and we concluded the sound-ex wasn't really the best way to go.

My burning Q. is, Does anyone know a good way to get this Random information categorised? It would be awesome if you could get me started on the code, or just be able to give me some id's on how to handle this.

I have attached the Sound-ex code we have used.

Thanks in advance for helping and reading this.
  • Attachment: cat_role.sql
    (Size: 1.42KB, Downloaded 2410 times)

[Updated on: Tue, 20 May 2008 03:34]

Report message to a moderator

Re: Categorising information, [message #321409 is a reply to message #321408] Tue, 20 May 2008 03:32 Go to previous messageGo to next message
Maaher
Messages: 7065
Registered: December 2001
Senior Member
Ok, what do you mean with 'categorising'? If you want to search for text, Oracle...Text is the way to go.

MHE
Re: Categorising information, [message #321412 is a reply to message #321408] Tue, 20 May 2008 03:39 Go to previous messageGo to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
thx for the fast reply awesome,

ive just added the soundex code to my previous post.

The catogorising can be seen as the following, In the current information each record in the database contains random information (this is a textfield and will be manual inserted)
i want to add an extra colom containing a catogory for this information.

for example, if the text would contain the word programmer i want the catagory to be developer, if the text would contain adviser or consultant i want the catogory to be consultant.

i hope this is more clear,

Greetings
Re: Categorising information, [message #321429 is a reply to message #321412] Tue, 20 May 2008 04:32 Go to previous messageGo to next message
Maaher
Messages: 7065
Registered: December 2001
Senior Member
And how would you expect Oracle to match 'advisor' with 'consultant' or 'programmer' with 'developer'?

And I don't think soundex would help. I still think that Oracle Text might help you out.

MHE
Re: Categorising information, [message #321436 is a reply to message #321408] Tue, 20 May 2008 04:54 Go to previous messageGo to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
did you look at the code i have added? my id behind the code was that if the text sounded like a certain keyword it would return the category, but as i said it doesnt work, so iam looking for a new way.

Quote:
I still think that Oracle Text might help you out.


could you be more specific iam not sure what you meen here, ive been looking around and do you meen Oracle ConText? unfortionally iam not to far into the codes and for a part its still more trying code then understanding some parts.
Re: Categorising information, [message #321437 is a reply to message #321408] Tue, 20 May 2008 05:01 Go to previous messageGo to next message
dr.s.raghunathan
Messages: 540
Registered: February 2008
Senior Member

you create a table with the following
you find word occurance in terms of number with words...
sort in desending order place it separate table..
you have new column with varchar2(2000)..
delete all less freqent words and catagorise as others...


now go back to your orinal text file read it upto period (.)
link it with created file for the first find occurance...

some extent it can be categoriesed...

if this is not offering any lead to your problems..
send a data in text format...
we will make an attempt to resolve it

yours
dr.s.raghunathan


Re: Categorising information, [message #321458 is a reply to message #321436] Tue, 20 May 2008 05:48 Go to previous messageGo to next message
Maaher
Messages: 7065
Registered: December 2001
Senior Member
dunthon wrote on Tue, 20 May 2008 11:54
did you look at the code i have added? my id behind the code was that if the text sounded like a certain keyword it would return the category, but as i said it doesnt work, so iam looking for a new way.
Yes, I've seen your code and it's the reason why I was puzzled about the use of SOUNDEX. It's a common misconception: SOUNDEX has nothing to do with SOUND. For the complete SOUNDEX algorithm, I suggest you look at WikPedia but it comes down to this: the first character remains the same. Vowels are disregarded. Consonants are replaced by a numeric value (b becomes 1, etc...). Duplicate digits are removed (33 becomes 3 etc...). Only the first letter and the first 3 digits are returned. So, you will find that both 'Repelsteeltje' and 'Rafelgaren' will return the same SOUNDEX value.

dunthon wrote on Tue, 20 May 2008 11:54
could you be more specific iam not sure what you meen here, ive been looking around and do you meen Oracle ConText? unfortionally iam not to far into the codes and for a part its still more trying code then understanding some parts.
Oracle Text provides a way to search with synonyms. You can even use fuzzy logic. But Oracle Text is not intuitive or easy to implement at first. It will demand some effort from your part.

An alternative could be -as already suggested- to use a table with category/keyword pairs. That way, you can assign categories through a standard join query. But all depends on the size of the keyword set. It will require manual input.

Edit: synonyms and Oracle Text: CTX_THES

Oracle Text

MHE

[Updated on: Tue, 20 May 2008 05:53]

Report message to a moderator

Re: Categorising information, [message #321476 is a reply to message #321408] Tue, 20 May 2008 06:53 Go to previous messageGo to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
iam actually getting more and more confused Sad ive been reading some of the oracle tekst and got the feeling that its not really a smart option to go with as it probably is still to difficult for me.

as i already created a sidetable containing keywords and categories
il try this option this afternoon and see what happens il get back to you Wink

Thanks for the helpfull reactions so far
Re: Categorising information, [message #321570 is a reply to message #321476] Tue, 20 May 2008 14:13 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9101
Registered: November 2002
Location: California, USA
Senior Member
What you are describing is called Document Classification and is part of Oracle Text. Oracle Text used to be called Intermedia, and prior to that it was called Oracle Context. Documents can be classified or categorized based on keywords that you supply or by using a thesaurus. I have provided a simple example below using a sidetable containing categories and keywords as you have described that you already have. You need to use the word OR between each keyword. Bear in mind that documents may fit into multiple categories, so instead of having one category column you should have a separate table to join the documents and categories. You should use primary and foreign keys to make queries faster.

-- If you have data like this:
SCOTT@orcl_11g> CREATE TABLE information
  2    (info_id    NUMBER,
  3  	text	   CLOB,
  4  	CONSTRAINT info_id_pk PRIMARY KEY (info_id))
  5  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO information VALUES (1, 'He is an experienced programmer.')
  3  INTO information VALUES (2, 'I am a developer, not a DBA.')
  4  INTO information VALUES (3, 'We need an advisor.')
  5  INTO information VALUES (4, 'Hiring a consultant can be expensive.')
  6  INTO information VALUES (5, 'Are you a programmer, developer, advisor, or consultant?')
  7  SELECT * FROM DUAL
  8  /

5 rows created.


-- and you have a sideteable of categories for keywords like this:
SCOTT@orcl_11g> CREATE TABLE sidetable
  2    (cat_id	   NUMBER,
  3  	category   VARCHAR2 (30),
  4  	keywords   VARCHAR2 (30),
  5  	CONSTRAINT cat_id_pk PRIMARY KEY (cat_id))
  6  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO sidetable VALUES (1, 'developer', 'programmer OR developer')
  3  INTO sidetable VALUES (2, 'consultant', 'advisor OR consultant')
  4  SELECT * FROM DUAL
  5  /

2 rows created.


-- then you can classify your data by category like this:
SCOTT@orcl_11g> CREATE TABLE info_cat
  2    (info_id NUMBER,
  3  	cat_id	NUMBER,
  4  	CONSTRAINT info_id_fk FOREIGN KEY (info_id) REFERENCES information (info_id),
  5  	CONSTRAINT cat_id_fk  FOREIGN KEY (cat_id)  REFERENCES sidetable (cat_id))
  6  /

Table created.

SCOTT@orcl_11g> CREATE INDEX sidetable_idx ON sidetable (keywords)
  2  INDEXTYPE IS CTXSYS.CTXRULE
  3  /

Index created.

SCOTT@orcl_11g> BEGIN
  2    FOR doc IN
  3  	 (SELECT info_id, text
  4  	  FROM	 information)
  5    LOOP
  6  	 FOR cat IN
  7  	   (SELECT cat_id
  8  	    FROM   sidetable
  9  	    WHERE  MATCHES (keywords, doc.text) > 0)
 10  	 LOOP
 11  	   INSERT INTO info_cat VALUES (doc.info_id, cat.cat_id);
 12  	 END LOOP;
 13    END LOOP;
 14  END;
 15  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> SELECT * FROM info_cat
  2  /

   INFO_ID     CAT_ID
---------- ----------
         1          1
         2          1
         3          2
         4          2
         5          1
         5          2

6 rows selected.


-- so you can query like this:
SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT s.category, i.text
  2  FROM   sidetable s, information i, info_cat c
  3  WHERE  s.cat_id = c.cat_id
  4  AND    c.info_id = i.info_id
  5  ORDER  BY category
  6  /

CATEGORY                       TEXT
------------------------------ ------------------------------
consultant                     We need an advisor.
consultant                     Hiring a consultant can be
                               expensive.

consultant                     Are you a programmer,
                               developer, advisor, or
                               consultant?

developer                      He is an experienced
                               programmer.

developer                      Are you a programmer,
                               developer, advisor, or
                               consultant?

developer                      I am a developer, not a DBA.

6 rows selected.

SCOTT@orcl_11g> 

Re: Categorising information, [message #321728 is a reply to message #321570] Wed, 21 May 2008 03:03 Go to previous messageGo to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
@everyone - Thanks for your reply 's and help so far.

@Barbara Boehmer - this is a really great post and really helpful i believe it is almost everything i need. and its actually way easy-er set up as i thought it would be. Still got a Q. left.

I have send you a private message, so i can add some docs i don't feel comfortable with posting it on forums.
Re: Categorising information, [message #321874 is a reply to message #321728] Wed, 21 May 2008 12:05 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9101
Registered: November 2002
Location: California, USA
Senior Member
It is best to post follow-up in the forum thread, rather than by PM. You don't have to use your actual table names, column names, or data, just enough to simulate the problem. If I understood your PM correctly, the gist of it is that you need only one category per row in your data file and that you would like it to pick the more specific category. For example, if you have the word "Oracle" and the word "consultant" you would want it to pick the category "Oracle consultant" instead of just "consultant". You also want "UNKNOWN" for those that do not match any categories.

There are three basic methods of classification: the one that I already demonstrated, another that requires that you provide it some examples for training, and another that creates the categories for you. I think the method that I provided comes the closest to what you want and the other two might not get you the categories that you want or put things in the right categories and are very complicated to set up. The only problem with the method that I used is that if you just add match_score to it, it only produces 0 for non-matches or 100 for matches. In order to have a full range of scores, you have to use the second method with svm_classifier and cls_train and provide it with examples. It gets really complicated and still might not be too accurate.

I think the best workaround for ranking the matches would be to do so in your sidetable, using the cat_id. So, you would make sure that "Oracle consultant" has a lower cat_id than "consultant". Alternatively, or in addition to, you could rank by the length of the category, figuring that a longer name is likely to be more specific. You would then take the first row after the ranking. In the case of identical rankings, which may or may not occur, it would arbitrarily pick one of the first identically ranked rows. You will need to either convert your table of separate keyword columns to one column, and you can use either OR or AND to specify your requirements or add another column to specify whether the condition is AND or OR and concatenate them in your query. You can even add combinations with parentheses and stemming and such. Stemming allows a search of the singular form, like consultant to find the plural, like consultants. Basically, you can use most of the query features available in a typical CONTAINS search. To achieve the UNKNOWN, you can just use outer joins and NVL.

Please see the revised demonstration below.

SCOTT@orcl_11g> CREATE TABLE information
  2    (info_id    NUMBER,
  3  	text	   CLOB,
  4  	CONSTRAINT info_id_pk PRIMARY KEY (info_id))
  5  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO information VALUES (1, 'Oracle consultants')
  3  INTO information VALUES (2, 'Oracle')
  4  INTO information VALUES (3, 'consultant')
  5  INTO information VALUES (4, 'consultant working for Oracle')
  6  INTO information VALUES (5, 'java developer')
  7  INTO information VALUES (6, 'FoxPro programmers')
  8  INTO information VALUES (7, 'input clerk')
  9  SELECT * FROM DUAL
 10  /

7 rows created.

SCOTT@orcl_11g> CREATE TABLE sidetable
  2    (cat_id	   NUMBER,
  3  	category   VARCHAR2 (30),
  4  	keywords   VARCHAR2 (30),
  5  	CONSTRAINT cat_id_pk PRIMARY KEY (cat_id))
  6  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO sidetable VALUES (1, 'Oracle consultant', 'Oracle AND $consultant')
  3  INTO sidetable VALUES (2, 'Consultant', '$advisor OR $consultant')
  4  INTO sidetable VALUES (3, 'Developer', '$programmer OR $developer')
  5  SELECT * FROM DUAL
  6  /

3 rows created.

SCOTT@orcl_11g> CREATE TABLE info_cat
  2    (info_id NUMBER,
  3  	cat_id	NUMBER,
  4  	CONSTRAINT info_id_fk FOREIGN KEY (info_id) REFERENCES information (info_id),
  5  	CONSTRAINT cat_id_fk  FOREIGN KEY (cat_id)  REFERENCES sidetable (cat_id))
  6  /

Table created.

SCOTT@orcl_11g> CREATE INDEX sidetable_idx ON sidetable (keywords)
  2  INDEXTYPE IS CTXSYS.CTXRULE
  3  /

Index created.

SCOTT@orcl_11g> BEGIN
  2    FOR doc IN
  3  	 (SELECT info_id, text
  4  	  FROM	 information)
  5    LOOP
  6  	 FOR cat IN
  7  	   (SELECT cat_id
  8  	    FROM   (SELECT cat_id
  9  		    FROM   sidetable
 10  		    WHERE  MATCHES (keywords, doc.text) > 0
 11  		    ORDER  BY cat_id, LENGTH (category) DESC)
 12  	    WHERE  ROWNUM = 1)
 13  	 LOOP
 14  	   INSERT INTO info_cat VALUES (doc.info_id, cat.cat_id);
 15  	 END LOOP;
 16    END LOOP;
 17  END;
 18  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> SELECT * FROM info_cat
  2  /

   INFO_ID     CAT_ID
---------- ----------
         1          1
         3          2
         4          1
         5          3
         6          3

SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT NVL (s.category, 'UNKNOWN') AS category, i.text
  2  FROM   information i, info_cat c, sidetable s
  3  WHERE  i.info_id = c.info_id (+)
  4  AND    c.cat_id = s.cat_id (+)
  5  ORDER  BY category
  6  /

CATEGORY                       TEXT
------------------------------ ------------------------------
Consultant                     consultant
Developer                      java developer
Developer                      FoxPro programmers
Oracle consultant              Oracle consultants
Oracle consultant              consultant working for Oracle
UNKNOWN                        Oracle
UNKNOWN                        input clerk

7 rows selected.

SCOTT@orcl_11g> 



Re: Categorising information, [message #321879 is a reply to message #321728] Wed, 21 May 2008 12:40 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9101
Registered: November 2002
Location: California, USA
Senior Member
I though about this a little more. Since you mentioned this is a data warehouse environment, and since you only want one category, it might be better to add the category column as you originally wanted, and skip the info_cat table. You could use a trigger to automatically populate the category when new rows are inserted. Please see the demo below.

SCOTT@orcl_11g> CREATE TABLE information
  2    (info_id    NUMBER,
  3  	text	   CLOB,
  4  	category   VARCHAR2 (30),
  5  	CONSTRAINT info_id_pk PRIMARY KEY (info_id))
  6  /

Table created.

SCOTT@orcl_11g> CREATE TABLE sidetable
  2    (cat_id	   NUMBER,
  3  	category   VARCHAR2 (30),
  4  	keywords   VARCHAR2 (30),
  5  	CONSTRAINT cat_id_pk PRIMARY KEY (cat_id))
  6  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO sidetable VALUES (1, 'Oracle consultant', 'Oracle AND $consultant')
  3  INTO sidetable VALUES (2, 'Consultant', '$advisor OR $consultant')
  4  INTO sidetable VALUES (3, 'Developer', '$programmer OR $developer')
  5  SELECT * FROM DUAL
  6  /

3 rows created.

SCOTT@orcl_11g> CREATE INDEX sidetable_idx ON sidetable (keywords)
  2  INDEXTYPE IS CTXSYS.CTXRULE
  3  /

Index created.

SCOTT@orcl_11g> CREATE OR REPLACE TRIGGER info_air
  2    BEFORE INSERT ON information
  3    FOR EACH ROW
  4  DECLARE
  5    v_category information.category%TYPE;
  6  BEGIN
  7    SELECT category
  8    INTO   v_category
  9    FROM   (SELECT category
 10  	       FROM   sidetable
 11  	       WHERE  MATCHES (keywords, :NEW.text) > 0
 12  	       ORDER  BY cat_id, LENGTH (category) DESC)
 13    WHERE  ROWNUM = 1;
 14    :NEW.category := v_category;
 15  EXCEPTION
 16    WHEN NO_DATA_FOUND THEN
 17  	 :NEW.category := 'UNKNOWN';
 18  END;
 19  /

Trigger created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO information (info_id, text) VALUES (1, 'Oracle consultants')
  3  INTO information (info_id, text) VALUES (2, 'Oracle')
  4  INTO information (info_id, text) VALUES (3, 'consultant')
  5  INTO information (info_id, text) VALUES (4, 'consultant working for Oracle')
  6  INTO information (info_id, text) VALUES (5, 'java developer')
  7  INTO information (info_id, text) VALUES (6, 'FoxPro programmers')
  8  INTO information (info_id, text) VALUES (7, 'input clerk')
  9  SELECT * FROM DUAL
 10  /

7 rows created.

SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT * FROM information
  2  /

   INFO_ID TEXT                           CATEGORY
---------- ------------------------------ ------------------------------
         1 Oracle consultants             Oracle consultant
         2 Oracle                         UNKNOWN
         3 consultant                     Consultant
         4 consultant working for Oracle  Oracle consultant
         5 java developer                 Developer
         6 FoxPro programmers             Developer
         7 input clerk                    UNKNOWN

7 rows selected.

SCOTT@orcl_11g>

Re: Categorising information, [message #321902 is a reply to message #321408] Wed, 21 May 2008 15:20 Go to previous messageGo to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
lots of great posts once again Thanks for your time and energy Smile

Yes its indeed a datawarehouse environment, the last option looks interresting, but i dont get the use of the trigger.
my set up of the datawarehouse consist of two parts A. the cleaning Area (CA) and B. The Datawarehouse (dwh).

Iam getting my data imported from the CA to the DWH by the use of a mapping, wich will be started manually. by the use of a trigger would that meen it should be placed on the dwhfield?

but wouldnt it be easyer to instead of using a trigger to create a function or procedure that will be placed into the mapping? So that wenn the data gets transferd from the CA to the DWH it would first goes into the procedure and then moves on, so the category field also gets filled?

to sum it up. would it be easyer/better? to use the last part of code you posted, but instead of the trigger use a procedure/function like you mentioned before?


I also got a few Q. about the code as iam not a great programmer (yet Wink).

you are using
Quote:
:NEW.text and :NEW.category

What does this :NEW do? is it like a dummy variable?

and ive noticed you added a $ ($consultant) sign to the keywords. what does this do in plSQL??

last one,
Quote:
COLUMN text FORMAT A30 WORD_WRAPPED


what does A30 word_wrapped meen?

Once again i really apreciate youre help, time and energy.
Greetings Dunthon
Re: Categorising information, [message #321921 is a reply to message #321902] Wed, 21 May 2008 17:15 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9101
Registered: November 2002
Location: California, USA
Senior Member
I'm sorry, but I am not familiar with data warehouse mapping. You could certainly use a procedure or function to update the column, instead of a trigger. I have demonstrated an update using a function below. If you placed the trigger on the data warehouse field, then it should automatically populate the category column when rows are inserted into the datawarehouse table. If you do mass inserts, it may be more efficient to do a mass update as well, by using something like the update with the function that I have provided below. What I know about data warehouses is that they are generally denormalized and intended for fast querying, not for frequent inserts, updates, and deletes.

:NEW.column_name is used to reference the new value of a column within a trigger.

The $ is the Oracle Text symbol for stemming. If you search for $consultant, it searches for all words that are based on that stem word, such as the plural consultants, which would not be found if searching for consultant without the $.

A30 WORD_WRAPPED is part of the SQL*Plus COLUMN command. it means that it is to be displayed as characters, not numeric, up to 30 characters long, then wrap around to the next line after 30 characters, but breaking on the spaces between words, not in the middle of a word. I use such things just to make it easier to display in one line on the forums without having to scroll to the right to read it.

Here is another demo with a single update using a function:

SCOTT@orcl_11g> CREATE TABLE information
  2    (info_id    NUMBER,
  3  	text	   CLOB,
  4  	category   VARCHAR2 (30),
  5  	CONSTRAINT info_id_pk PRIMARY KEY (info_id))
  6  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO information (info_id, text) VALUES (1, 'Oracle consultants')
  3  INTO information (info_id, text) VALUES (2, 'Oracle')
  4  INTO information (info_id, text) VALUES (3, 'consultant')
  5  INTO information (info_id, text) VALUES (4, 'consultant working for Oracle')
  6  INTO information (info_id, text) VALUES (5, 'java developer')
  7  INTO information (info_id, text) VALUES (6, 'FoxPro programmers')
  8  INTO information (info_id, text) VALUES (7, 'input clerk')
  9  SELECT * FROM DUAL
 10  /

7 rows created.

SCOTT@orcl_11g> CREATE TABLE sidetable
  2    (cat_id	   NUMBER,
  3  	category   VARCHAR2 (30),
  4  	keywords   VARCHAR2 (30),
  5  	CONSTRAINT cat_id_pk PRIMARY KEY (cat_id))
  6  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO sidetable VALUES (1, 'Oracle consultant', 'Oracle AND $consultant')
  3  INTO sidetable VALUES (2, 'Consultant', '$advisor OR $consultant')
  4  INTO sidetable VALUES (3, 'Developer', '$programmer OR $developer')
  5  SELECT * FROM DUAL
  6  /

3 rows created.

SCOTT@orcl_11g> CREATE INDEX sidetable_idx ON sidetable (keywords)
  2  INDEXTYPE IS CTXSYS.CTXRULE
  3  /

Index created.

SCOTT@orcl_11g> CREATE OR REPLACE FUNCTION get_category
  2    (p_text IN CLOB)
  3    RETURN	  VARCHAR2
  4  AS
  5    v_category information.category%TYPE;
  6  BEGIN
  7    SELECT category
  8    INTO   v_category
  9    FROM   (SELECT category
 10  	       FROM   sidetable
 11  	       WHERE  MATCHES (keywords, p_text) > 0
 12  	       ORDER  BY cat_id, LENGTH (category) DESC)
 13    WHERE  ROWNUM = 1;
 14    RETURN v_category;
 15  EXCEPTION
 16    WHEN NO_DATA_FOUND THEN
 17  	 RETURN 'UNKNOWN';
 18  END get_category;
 19  /

Function created.

SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT * FROM information
  2  /

   INFO_ID TEXT                           CATEGORY
---------- ------------------------------ ------------------------------
         1 Oracle consultants
         2 Oracle
         3 consultant
         4 consultant working for Oracle
         5 java developer
         6 FoxPro programmers
         7 input clerk

7 rows selected.

SCOTT@orcl_11g> UPDATE information
  2  SET    category = get_category (text)
  3  /

7 rows updated.

SCOTT@orcl_11g> SELECT * FROM information
  2  /

   INFO_ID TEXT                           CATEGORY
---------- ------------------------------ ------------------------------
         1 Oracle consultants             Oracle consultant
         2 Oracle                         UNKNOWN
         3 consultant                     Consultant
         4 consultant working for Oracle  Oracle consultant
         5 java developer                 Developer
         6 FoxPro programmers             Developer
         7 input clerk                    UNKNOWN

7 rows selected.

SCOTT@orcl_11g> 


Re: Categorising information, [message #321923 is a reply to message #321902] Wed, 21 May 2008 17:33 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9101
Registered: November 2002
Location: California, USA
Senior Member
I previously moved this thread from SQL and PL/SQL to Oracle Text. I will now move it to Warehouse Builder in the hopes of attracting comments as to whether a trigger or function or procedure or mapping or whatever is most appropriate to populate the category column.
Re: Categorising information, [message #322083 is a reply to message #321408] Thu, 22 May 2008 07:14 Go to previous messageGo to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
gonna try creating it now, lets see what happens.

ive created the sidetable with the following info
Quote:
INSERT ALL
INTO ttl_rolvertaal VALUES (1, 'Business consultant', '$business AND $architect')
INTO ttl_rolvertaal VALUES (2, 'Business consultant', '$business AND $consultant')
INTO ttl_rolvertaal VALUES (3, 'Programmamanager', '$programma AND $manager')
INTO ttl_rolvertaal VALUES (4, 'Projectmanager', '$project AND $manager')
INTO ttl_rolvertaal VALUES (5, 'Projectondersteuning', '$project AND $ondersteuning')
INTO ttl_rolvertaal VALUES (6, 'Projectleider', '$project AND $leider')
INTO ttl_rolvertaal VALUES (7, 'Applicatiebeheerder', '$systeem AND $specialist')
INTO ttl_rolvertaal VALUES (8, 'Technisch applicatiebeheerder', '$technisch AND $beheerder')
INTO ttl_rolvertaal VALUES (9, 'Functioneel applicatiebeheerder', '$functioneel AND $beheerder')
INTO ttl_rolvertaal VALUES (10, 'Ontwikkelaar', '$java AND $specialist')
INTO ttl_rolvertaal VALUES (11, 'Ontwikkelaar', '$java AND $expert')
INTO ttl_rolvertaal VALUES (12, 'Ontwikkelaar', '$.net AND $specialist')
INTO ttl_rolvertaal VALUES (13, 'Ontwikkelaar', '$.net AND $expert')
INTO ttl_rolvertaal VALUES (14, 'DB administrator', '$database AND $administrator')
INTO ttl_rolvertaal VALUES (15, 'DB administrator', '$database AND $specialist')
INTO ttl_rolvertaal VALUES (16, 'Functioneel ontwerper', '$functioneel AND $ontwerper')
INTO ttl_rolvertaal VALUES (17, 'Technische ontwerper', '$technisch AND $ontwerper')
INTO ttl_rolvertaal VALUES (18, 'Consultant', '$adviseur OR $consultant')
INTO ttl_rolvertaal VALUES (19, 'Coördinator', '$coördinator OR $manager')
INTO ttl_rolvertaal VALUES (20, 'Projectmanager', '$projectmanager')
INTO ttl_rolvertaal VALUES (21, 'Projectondersteuning', '$projectondersteuning OR $projectcontrol OR $projectsecretaris OR $projectmedewerker')
INTO ttl_rolvertaal VALUES (22, 'Projectleider', '$projectleider')
INTO ttl_rolvertaal VALUES (23, 'Applicatiebeheerder', '$applicatiebeheerder')
INTO ttl_rolvertaal VALUES (24, 'Tester', '$tester')
INTO ttl_rolvertaal VALUES (25, 'Ontwikkelaar', '$bouwer OR $programmeur OR $ontwikkelaar OR $developer OR $engineer')
INTO ttl_rolvertaal VALUES (26, 'Technische ontwerper', '$architect')
INTO ttl_rolvertaal VALUES (27, 'Ontwerper', '$ontwerper')
INTO ttl_rolvertaal VALUES (28, 'Informatieanalist', '$analist')
INTO ttl_rolvertaal VALUES (29, 'DB administrator', '$DBA OR $databaseadministrator')
SELECT * FROM DUAL
/


i have first jused the AND statements followed by the normal and OR's, i noticed lenght in the function does this meen i also need to order on lenght?
Re: Categorising information, [message #322086 is a reply to message #321408] Thu, 22 May 2008 08:15 Go to previous messageGo to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
oke i have set it up and included it,

So far i created the sidetable, and filled it with the code in my previous post. ive added the index.

And did a few changes to the function.

the new one is
Quote:
CREATE OR REPLACE FUNCTION get_category
(p_text IN CLOB)
RETURN VARCHAR2
AS
v_category ext_aanvraag.aanvraagrol%TYPE;
BEGIN
SELECT rol
INTO v_category
FROM (SELECT rol
FROM ttl_rolvertaal
WHERE MATCHES (sleutelwoord, p_text) > 0
ORDER BY rol_id DESC)
WHERE ROWNUM = 1;
RETURN v_category;
EXCEPTION
WHEN NO_DATA_FOUND THEN
return 'Onbekend';
WHEN OTHERS THEN
return 'Onbekend';
END get_category;
/


I have added a second exeption as for some reason i was getting an error. and this solved it Wink

Also i have removed the lenght part, dont know why exactly but it seems it was crashing on it for some reason.

now i can run the procedure but its not giving the correct results yet, its about 80% accurate.

for example if you have the input: Project manager it should return projectmanager, but it returns coordinator.

also ive noticed if you would search for tester or testers it returns tester, but if you search for projectmanager it returns projectmanager but if you search for projectmanagers it returns onbekend.

Any suggestions? as i said before iam not sure what the length part does. might be helping, not sure. but i might not really got the order correctly in the sidetable
icon14.gif  Re: Categorising information, [message #322089 is a reply to message #321408] Thu, 22 May 2008 08:21 Go to previous messageGo to next message
dunthon
Messages: 10
Registered: May 2008
Junior Member
oke my bad,

i've placed back the lenght part and it already fixed biggest part Very Happy somehow it made my TOAD crash before but now it seems to work fine.

only thing now i can find so far is the not returning projectmanager whenn the input is projectmanagers.

this has no hurry ofcourse, iam gonna try getting it correctly loaded in the mapping now.

edit:
all up and running looks awesome, 99.9% correctly categorised and i kinda doubt if the last 0.1% would be able to work correctly ^^

[Updated on: Thu, 22 May 2008 08:39]

Report message to a moderator

Re: Categorising information, [message #322222 is a reply to message #322089] Thu, 22 May 2008 16:42 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9101
Registered: November 2002
Location: California, USA
Senior Member
Project is a real word and managers is a real word. Projectmanagers is not a real word, therefore Oracle cannot recognize the stem, in order to find words based on the same stem. You will need to add such words to your table in order for them to be recognized. I suggest that you combine any descriptions for the same category, using parentheses, so that there is only one row per category, as in the demonstration below. Please post the values that are causing problems, as they may be simple to resolve. I am going to guess that you may be having problems with things like periods and umlauts. If so, these things can be resolved simply, if you will just post a few examples of data, category description that you want to match, and what results you are getting instead.

SCOTT@orcl_11g> CREATE TABLE ext_aanvraag
  2    (info_id       NUMBER,
  3  	text	      CLOB,
  4  	aanvraagrol   VARCHAR2 (31),
  5  	CONSTRAINT info_id_pk PRIMARY KEY (info_id))
  6  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO ext_aanvraag (info_id, text) VALUES (1, 'Oracle consultants')
  3  INTO ext_aanvraag (info_id, text) VALUES (2, 'Oracle')
  4  INTO ext_aanvraag (info_id, text) VALUES (3, 'consultant')
  5  INTO ext_aanvraag (info_id, text) VALUES (4, 'consultant working for Oracle')
  6  INTO ext_aanvraag (info_id, text) VALUES (5, 'java developer')
  7  INTO ext_aanvraag (info_id, text) VALUES (6, 'FoxPro programmers')
  8  INTO ext_aanvraag (info_id, text) VALUES (7, 'input clerk')
  9  INTO ext_aanvraag (info_id, text) VALUES (8, 'projectmanagers')
 10  INTO ext_aanvraag (info_id, text) VALUES (9, 'project managers')
 11  SELECT * FROM DUAL
 12  /

9 rows created.

SCOTT@orcl_11g> CREATE TABLE ttl_rolvertaal
  2    (rol_id	       NUMBER,
  3  	rol	       VARCHAR2 (31),
  4  	sleutelwoord   VARCHAR2 (200),
  5  	CONSTRAINT     rol_id_pk PRIMARY KEY (rol_id))
  6  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO ttl_rolvertaal VALUES (1, 'Business consultant',
  3    '$business AND ($architect OR $consultant)')
  4  INTO ttl_rolvertaal VALUES (2, 'Programmamanager',
  5    '$programma AND $manager')
  6  INTO ttl_rolvertaal VALUES (3, 'Projectmanager',
  7    '($project AND $manager) OR $projectmanager OR projectmanagers')
  8  INTO ttl_rolvertaal VALUES (4, 'Projectondersteuning',
  9    '($project AND $ondersteuning) OR $projectondersteuning OR $projectcontrol OR $projectsecretaris OR $projectmedewerker')
 10  INTO ttl_rolvertaal VALUES (5, 'Projectleider',
 11    '($project AND $leider) OR $projectleider')
 12  INTO ttl_rolvertaal VALUES (6, 'Applicatiebeheerder',
 13    '($systeem AND $specialist) OR $applicatiebeheerder')
 14  INTO ttl_rolvertaal VALUES (7, 'Technisch applicatiebeheerder',
 15    '$technisch AND $beheerder')
 16  INTO ttl_rolvertaal VALUES (8, 'Functioneel applicatiebeheerder',
 17    '$functioneel AND $beheerder')
 18  INTO ttl_rolvertaal VALUES (9, 'Ontwikkelaar',
 19    '($java AND ($specialist OR $expert OR $net)) OR ($net AND $expert) OR $bouwer OR $programmeur OR $ontwikkelaar OR $developer OR $engineer')
 20  INTO ttl_rolvertaal VALUES (10, 'DB administrator',
 21    '($database AND ($administrator OR $specialist)) OR $DBA OR $databaseadministrator')
 22  INTO ttl_rolvertaal VALUES (11, 'Functioneel ontwerper',
 23    '$functioneel AND $ontwerper')
 24  INTO ttl_rolvertaal VALUES (12, 'Technische ontwerper',
 25    '($technisch AND $ontwerper) OR $architect')
 26  INTO ttl_rolvertaal VALUES (13, 'Consultant',
 27    '$adviseur OR $consultant')
 28  INTO ttl_rolvertaal VALUES (14, 'Coördinator',
 29    '$coördinator OR $manager')
 30  INTO ttl_rolvertaal VALUES (15, 'Tester',
 31    '$tester')
 32  INTO ttl_rolvertaal VALUES (16, 'Ontwerper',
 33    '$ontwerper')
 34  INTO ttl_rolvertaal VALUES (17, 'Informatieanalist',
 35    '$analist')
 36  SELECT * FROM DUAL
 37  /

17 rows created.

SCOTT@orcl_11g> CREATE INDEX ttl_rolvertaal_idx ON ttl_rolvertaal (sleutelwoord)
  2  INDEXTYPE IS CTXSYS.CTXRULE
  3  /

Index created.

SCOTT@orcl_11g> CREATE OR REPLACE FUNCTION get_category
  2    (p_text IN CLOB)
  3    RETURN	  VARCHAR2
  4  AS
  5    v_category ext_aanvraag.aanvraagrol%TYPE;
  6  BEGIN
  7    SELECT rol
  8    INTO   v_category
  9    FROM   (SELECT rol
 10  	       FROM   ttl_rolvertaal
 11  	       WHERE  MATCHES (sleutelwoord, p_text) > 0
 12  	       ORDER  BY rol_id, LENGTH (rol) DESC)
 13    WHERE  ROWNUM = 1;
 14    RETURN v_category;
 15  EXCEPTION
 16    WHEN NO_DATA_FOUND THEN
 17  	 RETURN 'Onbekend';
 18  END get_category;
 19  /

Function created.

SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT * FROM ext_aanvraag
  2  /

   INFO_ID TEXT                           AANVRAAGROL
---------- ------------------------------ -------------------------------
         1 Oracle consultants
         2 Oracle
         3 consultant
         4 consultant working for Oracle
         5 java developer
         6 FoxPro programmers
         7 input clerk
         8 projectmanagers
         9 project managers

9 rows selected.

SCOTT@orcl_11g> UPDATE ext_aanvraag
  2  SET    aanvraagrol = get_category (text)
  3  /

9 rows updated.

SCOTT@orcl_11g> SELECT * FROM ext_aanvraag
  2  /

   INFO_ID TEXT                           AANVRAAGROL
---------- ------------------------------ -------------------------------
         1 Oracle consultants             Consultant
         2 Oracle                         Onbekend
         3 consultant                     Consultant
         4 consultant working for Oracle  Consultant
         5 java developer                 Ontwikkelaar
         6 FoxPro programmers             Onbekend
         7 input clerk                    Onbekend
         8 projectmanagers                Projectmanager
         9 project managers               Projectmanager

9 rows selected.

SCOTT@orcl_11g> 

icon14.gif  Re: Categorising information, [message #322766 is a reply to message #321408] Mon, 26 May 2008 03:24 Go to previous message
dunthon
Messages: 10
Registered: May 2008
Junior Member
Thanks Barbara,

The code is now working thanks to you're help, Thanks for your time and energy on helping me tackle this problem.

The results are working for about >99% so its great, At the moment i will need to spend my time on other things so i wont be able to test the current results intensively.

I might come back on this in a few weeks but i cant promise anything. As i already said i really apriciated youre help so far thank you!!
Previous Topic: can't delete data from extracts table (mapping)
Next Topic: inserting into table through ab initio
Goto Forum:
  


Current Time: Thu Dec 26 18:55:54 CST 2024