Home » Developer & Programmer » Data Integration » Categorising information,
Categorising information, [message #321408] |
Tue, 20 May 2008 03:30 |
dunthon
Messages: 10 Registered: May 2008
|
Junior Member |
|
|
Hey everyone,
Recently I've started learning some pl-SQL and I've been writing some small functions and procedures, and I've run into a problem i cant get my hand around, and i was hoping someone here could help me get back on track.
The problem, in my database I've got a colom containing a lot of different information (mostly its a part of a sentence) as i want to use this information in a report i thought it was nice to get this all categorised.
I have made a function using SOUND-EX and a table containing keywords, so i could match the information with the keywords and when this would match i would know the category. unfortunately this doesn't work and doesn't give the results I'm looking for. I have discussed the Sound-ex with some other people and we concluded the sound-ex wasn't really the best way to go.
My burning Q. is, Does anyone know a good way to get this Random information categorised? It would be awesome if you could get me started on the code, or just be able to give me some id's on how to handle this.
I have attached the Sound-ex code we have used.
Thanks in advance for helping and reading this.
-
Attachment: cat_role.sql
(Size: 1.42KB, Downloaded 2410 times)
[Updated on: Tue, 20 May 2008 03:34] Report message to a moderator
|
|
|
|
Re: Categorising information, [message #321412 is a reply to message #321408] |
Tue, 20 May 2008 03:39 |
dunthon
Messages: 10 Registered: May 2008
|
Junior Member |
|
|
thx for the fast reply awesome,
ive just added the soundex code to my previous post.
The catogorising can be seen as the following, In the current information each record in the database contains random information (this is a textfield and will be manual inserted)
i want to add an extra colom containing a catogory for this information.
for example, if the text would contain the word programmer i want the catagory to be developer, if the text would contain adviser or consultant i want the catogory to be consultant.
i hope this is more clear,
Greetings
|
|
|
|
Re: Categorising information, [message #321436 is a reply to message #321408] |
Tue, 20 May 2008 04:54 |
dunthon
Messages: 10 Registered: May 2008
|
Junior Member |
|
|
did you look at the code i have added? my id behind the code was that if the text sounded like a certain keyword it would return the category, but as i said it doesnt work, so iam looking for a new way.
Quote: | I still think that Oracle Text might help you out.
|
could you be more specific iam not sure what you meen here, ive been looking around and do you meen Oracle ConText? unfortionally iam not to far into the codes and for a part its still more trying code then understanding some parts.
|
|
|
Re: Categorising information, [message #321437 is a reply to message #321408] |
Tue, 20 May 2008 05:01 |
dr.s.raghunathan
Messages: 540 Registered: February 2008
|
Senior Member |
|
|
you create a table with the following
you find word occurance in terms of number with words...
sort in desending order place it separate table..
you have new column with varchar2(2000)..
delete all less freqent words and catagorise as others...
now go back to your orinal text file read it upto period (.)
link it with created file for the first find occurance...
some extent it can be categoriesed...
if this is not offering any lead to your problems..
send a data in text format...
we will make an attempt to resolve it
yours
dr.s.raghunathan
|
|
|
Re: Categorising information, [message #321458 is a reply to message #321436] |
Tue, 20 May 2008 05:48 |
|
Maaher
Messages: 7065 Registered: December 2001
|
Senior Member |
|
|
dunthon wrote on Tue, 20 May 2008 11:54 | did you look at the code i have added? my id behind the code was that if the text sounded like a certain keyword it would return the category, but as i said it doesnt work, so iam looking for a new way.
| Yes, I've seen your code and it's the reason why I was puzzled about the use of SOUNDEX. It's a common misconception: SOUNDEX has nothing to do with SOUND. For the complete SOUNDEX algorithm, I suggest you look at WikPedia but it comes down to this: the first character remains the same. Vowels are disregarded. Consonants are replaced by a numeric value (b becomes 1, etc...). Duplicate digits are removed (33 becomes 3 etc...). Only the first letter and the first 3 digits are returned. So, you will find that both 'Repelsteeltje' and 'Rafelgaren' will return the same SOUNDEX value.
dunthon wrote on Tue, 20 May 2008 11:54 | could you be more specific iam not sure what you meen here, ive been looking around and do you meen Oracle ConText? unfortionally iam not to far into the codes and for a part its still more trying code then understanding some parts.
| Oracle Text provides a way to search with synonyms. You can even use fuzzy logic. But Oracle Text is not intuitive or easy to implement at first. It will demand some effort from your part.
An alternative could be -as already suggested- to use a table with category/keyword pairs. That way, you can assign categories through a standard join query. But all depends on the size of the keyword set. It will require manual input.
Edit: synonyms and Oracle Text: CTX_THES
Oracle Text
MHE
[Updated on: Tue, 20 May 2008 05:53] Report message to a moderator
|
|
|
|
Re: Categorising information, [message #321570 is a reply to message #321476] |
Tue, 20 May 2008 14:13 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
What you are describing is called Document Classification and is part of Oracle Text. Oracle Text used to be called Intermedia, and prior to that it was called Oracle Context. Documents can be classified or categorized based on keywords that you supply or by using a thesaurus. I have provided a simple example below using a sidetable containing categories and keywords as you have described that you already have. You need to use the word OR between each keyword. Bear in mind that documents may fit into multiple categories, so instead of having one category column you should have a separate table to join the documents and categories. You should use primary and foreign keys to make queries faster.
-- If you have data like this:
SCOTT@orcl_11g> CREATE TABLE information
2 (info_id NUMBER,
3 text CLOB,
4 CONSTRAINT info_id_pk PRIMARY KEY (info_id))
5 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO information VALUES (1, 'He is an experienced programmer.')
3 INTO information VALUES (2, 'I am a developer, not a DBA.')
4 INTO information VALUES (3, 'We need an advisor.')
5 INTO information VALUES (4, 'Hiring a consultant can be expensive.')
6 INTO information VALUES (5, 'Are you a programmer, developer, advisor, or consultant?')
7 SELECT * FROM DUAL
8 /
5 rows created.
-- and you have a sideteable of categories for keywords like this:
SCOTT@orcl_11g> CREATE TABLE sidetable
2 (cat_id NUMBER,
3 category VARCHAR2 (30),
4 keywords VARCHAR2 (30),
5 CONSTRAINT cat_id_pk PRIMARY KEY (cat_id))
6 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO sidetable VALUES (1, 'developer', 'programmer OR developer')
3 INTO sidetable VALUES (2, 'consultant', 'advisor OR consultant')
4 SELECT * FROM DUAL
5 /
2 rows created.
-- then you can classify your data by category like this:
SCOTT@orcl_11g> CREATE TABLE info_cat
2 (info_id NUMBER,
3 cat_id NUMBER,
4 CONSTRAINT info_id_fk FOREIGN KEY (info_id) REFERENCES information (info_id),
5 CONSTRAINT cat_id_fk FOREIGN KEY (cat_id) REFERENCES sidetable (cat_id))
6 /
Table created.
SCOTT@orcl_11g> CREATE INDEX sidetable_idx ON sidetable (keywords)
2 INDEXTYPE IS CTXSYS.CTXRULE
3 /
Index created.
SCOTT@orcl_11g> BEGIN
2 FOR doc IN
3 (SELECT info_id, text
4 FROM information)
5 LOOP
6 FOR cat IN
7 (SELECT cat_id
8 FROM sidetable
9 WHERE MATCHES (keywords, doc.text) > 0)
10 LOOP
11 INSERT INTO info_cat VALUES (doc.info_id, cat.cat_id);
12 END LOOP;
13 END LOOP;
14 END;
15 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> SELECT * FROM info_cat
2 /
INFO_ID CAT_ID
---------- ----------
1 1
2 1
3 2
4 2
5 1
5 2
6 rows selected.
-- so you can query like this:
SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT s.category, i.text
2 FROM sidetable s, information i, info_cat c
3 WHERE s.cat_id = c.cat_id
4 AND c.info_id = i.info_id
5 ORDER BY category
6 /
CATEGORY TEXT
------------------------------ ------------------------------
consultant We need an advisor.
consultant Hiring a consultant can be
expensive.
consultant Are you a programmer,
developer, advisor, or
consultant?
developer He is an experienced
programmer.
developer Are you a programmer,
developer, advisor, or
consultant?
developer I am a developer, not a DBA.
6 rows selected.
SCOTT@orcl_11g>
|
|
|
Re: Categorising information, [message #321728 is a reply to message #321570] |
Wed, 21 May 2008 03:03 |
dunthon
Messages: 10 Registered: May 2008
|
Junior Member |
|
|
@everyone - Thanks for your reply 's and help so far.
@Barbara Boehmer - this is a really great post and really helpful i believe it is almost everything i need. and its actually way easy-er set up as i thought it would be. Still got a Q. left.
I have send you a private message, so i can add some docs i don't feel comfortable with posting it on forums.
|
|
|
Re: Categorising information, [message #321874 is a reply to message #321728] |
Wed, 21 May 2008 12:05 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
It is best to post follow-up in the forum thread, rather than by PM. You don't have to use your actual table names, column names, or data, just enough to simulate the problem. If I understood your PM correctly, the gist of it is that you need only one category per row in your data file and that you would like it to pick the more specific category. For example, if you have the word "Oracle" and the word "consultant" you would want it to pick the category "Oracle consultant" instead of just "consultant". You also want "UNKNOWN" for those that do not match any categories.
There are three basic methods of classification: the one that I already demonstrated, another that requires that you provide it some examples for training, and another that creates the categories for you. I think the method that I provided comes the closest to what you want and the other two might not get you the categories that you want or put things in the right categories and are very complicated to set up. The only problem with the method that I used is that if you just add match_score to it, it only produces 0 for non-matches or 100 for matches. In order to have a full range of scores, you have to use the second method with svm_classifier and cls_train and provide it with examples. It gets really complicated and still might not be too accurate.
I think the best workaround for ranking the matches would be to do so in your sidetable, using the cat_id. So, you would make sure that "Oracle consultant" has a lower cat_id than "consultant". Alternatively, or in addition to, you could rank by the length of the category, figuring that a longer name is likely to be more specific. You would then take the first row after the ranking. In the case of identical rankings, which may or may not occur, it would arbitrarily pick one of the first identically ranked rows. You will need to either convert your table of separate keyword columns to one column, and you can use either OR or AND to specify your requirements or add another column to specify whether the condition is AND or OR and concatenate them in your query. You can even add combinations with parentheses and stemming and such. Stemming allows a search of the singular form, like consultant to find the plural, like consultants. Basically, you can use most of the query features available in a typical CONTAINS search. To achieve the UNKNOWN, you can just use outer joins and NVL.
Please see the revised demonstration below.
SCOTT@orcl_11g> CREATE TABLE information
2 (info_id NUMBER,
3 text CLOB,
4 CONSTRAINT info_id_pk PRIMARY KEY (info_id))
5 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO information VALUES (1, 'Oracle consultants')
3 INTO information VALUES (2, 'Oracle')
4 INTO information VALUES (3, 'consultant')
5 INTO information VALUES (4, 'consultant working for Oracle')
6 INTO information VALUES (5, 'java developer')
7 INTO information VALUES (6, 'FoxPro programmers')
8 INTO information VALUES (7, 'input clerk')
9 SELECT * FROM DUAL
10 /
7 rows created.
SCOTT@orcl_11g> CREATE TABLE sidetable
2 (cat_id NUMBER,
3 category VARCHAR2 (30),
4 keywords VARCHAR2 (30),
5 CONSTRAINT cat_id_pk PRIMARY KEY (cat_id))
6 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO sidetable VALUES (1, 'Oracle consultant', 'Oracle AND $consultant')
3 INTO sidetable VALUES (2, 'Consultant', '$advisor OR $consultant')
4 INTO sidetable VALUES (3, 'Developer', '$programmer OR $developer')
5 SELECT * FROM DUAL
6 /
3 rows created.
SCOTT@orcl_11g> CREATE TABLE info_cat
2 (info_id NUMBER,
3 cat_id NUMBER,
4 CONSTRAINT info_id_fk FOREIGN KEY (info_id) REFERENCES information (info_id),
5 CONSTRAINT cat_id_fk FOREIGN KEY (cat_id) REFERENCES sidetable (cat_id))
6 /
Table created.
SCOTT@orcl_11g> CREATE INDEX sidetable_idx ON sidetable (keywords)
2 INDEXTYPE IS CTXSYS.CTXRULE
3 /
Index created.
SCOTT@orcl_11g> BEGIN
2 FOR doc IN
3 (SELECT info_id, text
4 FROM information)
5 LOOP
6 FOR cat IN
7 (SELECT cat_id
8 FROM (SELECT cat_id
9 FROM sidetable
10 WHERE MATCHES (keywords, doc.text) > 0
11 ORDER BY cat_id, LENGTH (category) DESC)
12 WHERE ROWNUM = 1)
13 LOOP
14 INSERT INTO info_cat VALUES (doc.info_id, cat.cat_id);
15 END LOOP;
16 END LOOP;
17 END;
18 /
PL/SQL procedure successfully completed.
SCOTT@orcl_11g> SELECT * FROM info_cat
2 /
INFO_ID CAT_ID
---------- ----------
1 1
3 2
4 1
5 3
6 3
SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT NVL (s.category, 'UNKNOWN') AS category, i.text
2 FROM information i, info_cat c, sidetable s
3 WHERE i.info_id = c.info_id (+)
4 AND c.cat_id = s.cat_id (+)
5 ORDER BY category
6 /
CATEGORY TEXT
------------------------------ ------------------------------
Consultant consultant
Developer java developer
Developer FoxPro programmers
Oracle consultant Oracle consultants
Oracle consultant consultant working for Oracle
UNKNOWN Oracle
UNKNOWN input clerk
7 rows selected.
SCOTT@orcl_11g>
|
|
|
Re: Categorising information, [message #321879 is a reply to message #321728] |
Wed, 21 May 2008 12:40 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
I though about this a little more. Since you mentioned this is a data warehouse environment, and since you only want one category, it might be better to add the category column as you originally wanted, and skip the info_cat table. You could use a trigger to automatically populate the category when new rows are inserted. Please see the demo below.
SCOTT@orcl_11g> CREATE TABLE information
2 (info_id NUMBER,
3 text CLOB,
4 category VARCHAR2 (30),
5 CONSTRAINT info_id_pk PRIMARY KEY (info_id))
6 /
Table created.
SCOTT@orcl_11g> CREATE TABLE sidetable
2 (cat_id NUMBER,
3 category VARCHAR2 (30),
4 keywords VARCHAR2 (30),
5 CONSTRAINT cat_id_pk PRIMARY KEY (cat_id))
6 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO sidetable VALUES (1, 'Oracle consultant', 'Oracle AND $consultant')
3 INTO sidetable VALUES (2, 'Consultant', '$advisor OR $consultant')
4 INTO sidetable VALUES (3, 'Developer', '$programmer OR $developer')
5 SELECT * FROM DUAL
6 /
3 rows created.
SCOTT@orcl_11g> CREATE INDEX sidetable_idx ON sidetable (keywords)
2 INDEXTYPE IS CTXSYS.CTXRULE
3 /
Index created.
SCOTT@orcl_11g> CREATE OR REPLACE TRIGGER info_air
2 BEFORE INSERT ON information
3 FOR EACH ROW
4 DECLARE
5 v_category information.category%TYPE;
6 BEGIN
7 SELECT category
8 INTO v_category
9 FROM (SELECT category
10 FROM sidetable
11 WHERE MATCHES (keywords, :NEW.text) > 0
12 ORDER BY cat_id, LENGTH (category) DESC)
13 WHERE ROWNUM = 1;
14 :NEW.category := v_category;
15 EXCEPTION
16 WHEN NO_DATA_FOUND THEN
17 :NEW.category := 'UNKNOWN';
18 END;
19 /
Trigger created.
SCOTT@orcl_11g> INSERT ALL
2 INTO information (info_id, text) VALUES (1, 'Oracle consultants')
3 INTO information (info_id, text) VALUES (2, 'Oracle')
4 INTO information (info_id, text) VALUES (3, 'consultant')
5 INTO information (info_id, text) VALUES (4, 'consultant working for Oracle')
6 INTO information (info_id, text) VALUES (5, 'java developer')
7 INTO information (info_id, text) VALUES (6, 'FoxPro programmers')
8 INTO information (info_id, text) VALUES (7, 'input clerk')
9 SELECT * FROM DUAL
10 /
7 rows created.
SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT * FROM information
2 /
INFO_ID TEXT CATEGORY
---------- ------------------------------ ------------------------------
1 Oracle consultants Oracle consultant
2 Oracle UNKNOWN
3 consultant Consultant
4 consultant working for Oracle Oracle consultant
5 java developer Developer
6 FoxPro programmers Developer
7 input clerk UNKNOWN
7 rows selected.
SCOTT@orcl_11g>
|
|
|
|
Re: Categorising information, [message #321921 is a reply to message #321902] |
Wed, 21 May 2008 17:15 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
I'm sorry, but I am not familiar with data warehouse mapping. You could certainly use a procedure or function to update the column, instead of a trigger. I have demonstrated an update using a function below. If you placed the trigger on the data warehouse field, then it should automatically populate the category column when rows are inserted into the datawarehouse table. If you do mass inserts, it may be more efficient to do a mass update as well, by using something like the update with the function that I have provided below. What I know about data warehouses is that they are generally denormalized and intended for fast querying, not for frequent inserts, updates, and deletes.
:NEW.column_name is used to reference the new value of a column within a trigger.
The $ is the Oracle Text symbol for stemming. If you search for $consultant, it searches for all words that are based on that stem word, such as the plural consultants, which would not be found if searching for consultant without the $.
A30 WORD_WRAPPED is part of the SQL*Plus COLUMN command. it means that it is to be displayed as characters, not numeric, up to 30 characters long, then wrap around to the next line after 30 characters, but breaking on the spaces between words, not in the middle of a word. I use such things just to make it easier to display in one line on the forums without having to scroll to the right to read it.
Here is another demo with a single update using a function:
SCOTT@orcl_11g> CREATE TABLE information
2 (info_id NUMBER,
3 text CLOB,
4 category VARCHAR2 (30),
5 CONSTRAINT info_id_pk PRIMARY KEY (info_id))
6 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO information (info_id, text) VALUES (1, 'Oracle consultants')
3 INTO information (info_id, text) VALUES (2, 'Oracle')
4 INTO information (info_id, text) VALUES (3, 'consultant')
5 INTO information (info_id, text) VALUES (4, 'consultant working for Oracle')
6 INTO information (info_id, text) VALUES (5, 'java developer')
7 INTO information (info_id, text) VALUES (6, 'FoxPro programmers')
8 INTO information (info_id, text) VALUES (7, 'input clerk')
9 SELECT * FROM DUAL
10 /
7 rows created.
SCOTT@orcl_11g> CREATE TABLE sidetable
2 (cat_id NUMBER,
3 category VARCHAR2 (30),
4 keywords VARCHAR2 (30),
5 CONSTRAINT cat_id_pk PRIMARY KEY (cat_id))
6 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO sidetable VALUES (1, 'Oracle consultant', 'Oracle AND $consultant')
3 INTO sidetable VALUES (2, 'Consultant', '$advisor OR $consultant')
4 INTO sidetable VALUES (3, 'Developer', '$programmer OR $developer')
5 SELECT * FROM DUAL
6 /
3 rows created.
SCOTT@orcl_11g> CREATE INDEX sidetable_idx ON sidetable (keywords)
2 INDEXTYPE IS CTXSYS.CTXRULE
3 /
Index created.
SCOTT@orcl_11g> CREATE OR REPLACE FUNCTION get_category
2 (p_text IN CLOB)
3 RETURN VARCHAR2
4 AS
5 v_category information.category%TYPE;
6 BEGIN
7 SELECT category
8 INTO v_category
9 FROM (SELECT category
10 FROM sidetable
11 WHERE MATCHES (keywords, p_text) > 0
12 ORDER BY cat_id, LENGTH (category) DESC)
13 WHERE ROWNUM = 1;
14 RETURN v_category;
15 EXCEPTION
16 WHEN NO_DATA_FOUND THEN
17 RETURN 'UNKNOWN';
18 END get_category;
19 /
Function created.
SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT * FROM information
2 /
INFO_ID TEXT CATEGORY
---------- ------------------------------ ------------------------------
1 Oracle consultants
2 Oracle
3 consultant
4 consultant working for Oracle
5 java developer
6 FoxPro programmers
7 input clerk
7 rows selected.
SCOTT@orcl_11g> UPDATE information
2 SET category = get_category (text)
3 /
7 rows updated.
SCOTT@orcl_11g> SELECT * FROM information
2 /
INFO_ID TEXT CATEGORY
---------- ------------------------------ ------------------------------
1 Oracle consultants Oracle consultant
2 Oracle UNKNOWN
3 consultant Consultant
4 consultant working for Oracle Oracle consultant
5 java developer Developer
6 FoxPro programmers Developer
7 input clerk UNKNOWN
7 rows selected.
SCOTT@orcl_11g>
|
|
|
|
Re: Categorising information, [message #322083 is a reply to message #321408] |
Thu, 22 May 2008 07:14 |
dunthon
Messages: 10 Registered: May 2008
|
Junior Member |
|
|
gonna try creating it now, lets see what happens.
ive created the sidetable with the following info
Quote: | INSERT ALL
INTO ttl_rolvertaal VALUES (1, 'Business consultant', '$business AND $architect')
INTO ttl_rolvertaal VALUES (2, 'Business consultant', '$business AND $consultant')
INTO ttl_rolvertaal VALUES (3, 'Programmamanager', '$programma AND $manager')
INTO ttl_rolvertaal VALUES (4, 'Projectmanager', '$project AND $manager')
INTO ttl_rolvertaal VALUES (5, 'Projectondersteuning', '$project AND $ondersteuning')
INTO ttl_rolvertaal VALUES (6, 'Projectleider', '$project AND $leider')
INTO ttl_rolvertaal VALUES (7, 'Applicatiebeheerder', '$systeem AND $specialist')
INTO ttl_rolvertaal VALUES (8, 'Technisch applicatiebeheerder', '$technisch AND $beheerder')
INTO ttl_rolvertaal VALUES (9, 'Functioneel applicatiebeheerder', '$functioneel AND $beheerder')
INTO ttl_rolvertaal VALUES (10, 'Ontwikkelaar', '$java AND $specialist')
INTO ttl_rolvertaal VALUES (11, 'Ontwikkelaar', '$java AND $expert')
INTO ttl_rolvertaal VALUES (12, 'Ontwikkelaar', '$.net AND $specialist')
INTO ttl_rolvertaal VALUES (13, 'Ontwikkelaar', '$.net AND $expert')
INTO ttl_rolvertaal VALUES (14, 'DB administrator', '$database AND $administrator')
INTO ttl_rolvertaal VALUES (15, 'DB administrator', '$database AND $specialist')
INTO ttl_rolvertaal VALUES (16, 'Functioneel ontwerper', '$functioneel AND $ontwerper')
INTO ttl_rolvertaal VALUES (17, 'Technische ontwerper', '$technisch AND $ontwerper')
INTO ttl_rolvertaal VALUES (18, 'Consultant', '$adviseur OR $consultant')
INTO ttl_rolvertaal VALUES (19, 'Coördinator', '$coördinator OR $manager')
INTO ttl_rolvertaal VALUES (20, 'Projectmanager', '$projectmanager')
INTO ttl_rolvertaal VALUES (21, 'Projectondersteuning', '$projectondersteuning OR $projectcontrol OR $projectsecretaris OR $projectmedewerker')
INTO ttl_rolvertaal VALUES (22, 'Projectleider', '$projectleider')
INTO ttl_rolvertaal VALUES (23, 'Applicatiebeheerder', '$applicatiebeheerder')
INTO ttl_rolvertaal VALUES (24, 'Tester', '$tester')
INTO ttl_rolvertaal VALUES (25, 'Ontwikkelaar', '$bouwer OR $programmeur OR $ontwikkelaar OR $developer OR $engineer')
INTO ttl_rolvertaal VALUES (26, 'Technische ontwerper', '$architect')
INTO ttl_rolvertaal VALUES (27, 'Ontwerper', '$ontwerper')
INTO ttl_rolvertaal VALUES (28, 'Informatieanalist', '$analist')
INTO ttl_rolvertaal VALUES (29, 'DB administrator', '$DBA OR $databaseadministrator')
SELECT * FROM DUAL
/
|
i have first jused the AND statements followed by the normal and OR's, i noticed lenght in the function does this meen i also need to order on lenght?
|
|
|
Re: Categorising information, [message #322086 is a reply to message #321408] |
Thu, 22 May 2008 08:15 |
dunthon
Messages: 10 Registered: May 2008
|
Junior Member |
|
|
oke i have set it up and included it,
So far i created the sidetable, and filled it with the code in my previous post. ive added the index.
And did a few changes to the function.
the new one is
Quote: | CREATE OR REPLACE FUNCTION get_category
(p_text IN CLOB)
RETURN VARCHAR2
AS
v_category ext_aanvraag.aanvraagrol%TYPE;
BEGIN
SELECT rol
INTO v_category
FROM (SELECT rol
FROM ttl_rolvertaal
WHERE MATCHES (sleutelwoord, p_text) > 0
ORDER BY rol_id DESC)
WHERE ROWNUM = 1;
RETURN v_category;
EXCEPTION
WHEN NO_DATA_FOUND THEN
return 'Onbekend';
WHEN OTHERS THEN
return 'Onbekend';
END get_category;
/
|
I have added a second exeption as for some reason i was getting an error. and this solved it
Also i have removed the lenght part, dont know why exactly but it seems it was crashing on it for some reason.
now i can run the procedure but its not giving the correct results yet, its about 80% accurate.
for example if you have the input: Project manager it should return projectmanager, but it returns coordinator.
also ive noticed if you would search for tester or testers it returns tester, but if you search for projectmanager it returns projectmanager but if you search for projectmanagers it returns onbekend.
Any suggestions? as i said before iam not sure what the length part does. might be helping, not sure. but i might not really got the order correctly in the sidetable
|
|
|
Re: Categorising information, [message #322089 is a reply to message #321408] |
Thu, 22 May 2008 08:21 |
dunthon
Messages: 10 Registered: May 2008
|
Junior Member |
|
|
oke my bad,
i've placed back the lenght part and it already fixed biggest part somehow it made my TOAD crash before but now it seems to work fine.
only thing now i can find so far is the not returning projectmanager whenn the input is projectmanagers.
this has no hurry ofcourse, iam gonna try getting it correctly loaded in the mapping now.
edit:
all up and running looks awesome, 99.9% correctly categorised and i kinda doubt if the last 0.1% would be able to work correctly ^^
[Updated on: Thu, 22 May 2008 08:39] Report message to a moderator
|
|
|
Re: Categorising information, [message #322222 is a reply to message #322089] |
Thu, 22 May 2008 16:42 |
|
Barbara Boehmer
Messages: 9101 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Project is a real word and managers is a real word. Projectmanagers is not a real word, therefore Oracle cannot recognize the stem, in order to find words based on the same stem. You will need to add such words to your table in order for them to be recognized. I suggest that you combine any descriptions for the same category, using parentheses, so that there is only one row per category, as in the demonstration below. Please post the values that are causing problems, as they may be simple to resolve. I am going to guess that you may be having problems with things like periods and umlauts. If so, these things can be resolved simply, if you will just post a few examples of data, category description that you want to match, and what results you are getting instead.
SCOTT@orcl_11g> CREATE TABLE ext_aanvraag
2 (info_id NUMBER,
3 text CLOB,
4 aanvraagrol VARCHAR2 (31),
5 CONSTRAINT info_id_pk PRIMARY KEY (info_id))
6 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO ext_aanvraag (info_id, text) VALUES (1, 'Oracle consultants')
3 INTO ext_aanvraag (info_id, text) VALUES (2, 'Oracle')
4 INTO ext_aanvraag (info_id, text) VALUES (3, 'consultant')
5 INTO ext_aanvraag (info_id, text) VALUES (4, 'consultant working for Oracle')
6 INTO ext_aanvraag (info_id, text) VALUES (5, 'java developer')
7 INTO ext_aanvraag (info_id, text) VALUES (6, 'FoxPro programmers')
8 INTO ext_aanvraag (info_id, text) VALUES (7, 'input clerk')
9 INTO ext_aanvraag (info_id, text) VALUES (8, 'projectmanagers')
10 INTO ext_aanvraag (info_id, text) VALUES (9, 'project managers')
11 SELECT * FROM DUAL
12 /
9 rows created.
SCOTT@orcl_11g> CREATE TABLE ttl_rolvertaal
2 (rol_id NUMBER,
3 rol VARCHAR2 (31),
4 sleutelwoord VARCHAR2 (200),
5 CONSTRAINT rol_id_pk PRIMARY KEY (rol_id))
6 /
Table created.
SCOTT@orcl_11g> INSERT ALL
2 INTO ttl_rolvertaal VALUES (1, 'Business consultant',
3 '$business AND ($architect OR $consultant)')
4 INTO ttl_rolvertaal VALUES (2, 'Programmamanager',
5 '$programma AND $manager')
6 INTO ttl_rolvertaal VALUES (3, 'Projectmanager',
7 '($project AND $manager) OR $projectmanager OR projectmanagers')
8 INTO ttl_rolvertaal VALUES (4, 'Projectondersteuning',
9 '($project AND $ondersteuning) OR $projectondersteuning OR $projectcontrol OR $projectsecretaris OR $projectmedewerker')
10 INTO ttl_rolvertaal VALUES (5, 'Projectleider',
11 '($project AND $leider) OR $projectleider')
12 INTO ttl_rolvertaal VALUES (6, 'Applicatiebeheerder',
13 '($systeem AND $specialist) OR $applicatiebeheerder')
14 INTO ttl_rolvertaal VALUES (7, 'Technisch applicatiebeheerder',
15 '$technisch AND $beheerder')
16 INTO ttl_rolvertaal VALUES (8, 'Functioneel applicatiebeheerder',
17 '$functioneel AND $beheerder')
18 INTO ttl_rolvertaal VALUES (9, 'Ontwikkelaar',
19 '($java AND ($specialist OR $expert OR $net)) OR ($net AND $expert) OR $bouwer OR $programmeur OR $ontwikkelaar OR $developer OR $engineer')
20 INTO ttl_rolvertaal VALUES (10, 'DB administrator',
21 '($database AND ($administrator OR $specialist)) OR $DBA OR $databaseadministrator')
22 INTO ttl_rolvertaal VALUES (11, 'Functioneel ontwerper',
23 '$functioneel AND $ontwerper')
24 INTO ttl_rolvertaal VALUES (12, 'Technische ontwerper',
25 '($technisch AND $ontwerper) OR $architect')
26 INTO ttl_rolvertaal VALUES (13, 'Consultant',
27 '$adviseur OR $consultant')
28 INTO ttl_rolvertaal VALUES (14, 'Coördinator',
29 '$coördinator OR $manager')
30 INTO ttl_rolvertaal VALUES (15, 'Tester',
31 '$tester')
32 INTO ttl_rolvertaal VALUES (16, 'Ontwerper',
33 '$ontwerper')
34 INTO ttl_rolvertaal VALUES (17, 'Informatieanalist',
35 '$analist')
36 SELECT * FROM DUAL
37 /
17 rows created.
SCOTT@orcl_11g> CREATE INDEX ttl_rolvertaal_idx ON ttl_rolvertaal (sleutelwoord)
2 INDEXTYPE IS CTXSYS.CTXRULE
3 /
Index created.
SCOTT@orcl_11g> CREATE OR REPLACE FUNCTION get_category
2 (p_text IN CLOB)
3 RETURN VARCHAR2
4 AS
5 v_category ext_aanvraag.aanvraagrol%TYPE;
6 BEGIN
7 SELECT rol
8 INTO v_category
9 FROM (SELECT rol
10 FROM ttl_rolvertaal
11 WHERE MATCHES (sleutelwoord, p_text) > 0
12 ORDER BY rol_id, LENGTH (rol) DESC)
13 WHERE ROWNUM = 1;
14 RETURN v_category;
15 EXCEPTION
16 WHEN NO_DATA_FOUND THEN
17 RETURN 'Onbekend';
18 END get_category;
19 /
Function created.
SCOTT@orcl_11g> COLUMN text FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT * FROM ext_aanvraag
2 /
INFO_ID TEXT AANVRAAGROL
---------- ------------------------------ -------------------------------
1 Oracle consultants
2 Oracle
3 consultant
4 consultant working for Oracle
5 java developer
6 FoxPro programmers
7 input clerk
8 projectmanagers
9 project managers
9 rows selected.
SCOTT@orcl_11g> UPDATE ext_aanvraag
2 SET aanvraagrol = get_category (text)
3 /
9 rows updated.
SCOTT@orcl_11g> SELECT * FROM ext_aanvraag
2 /
INFO_ID TEXT AANVRAAGROL
---------- ------------------------------ -------------------------------
1 Oracle consultants Consultant
2 Oracle Onbekend
3 consultant Consultant
4 consultant working for Oracle Consultant
5 java developer Ontwikkelaar
6 FoxPro programmers Onbekend
7 input clerk Onbekend
8 projectmanagers Projectmanager
9 project managers Projectmanager
9 rows selected.
SCOTT@orcl_11g>
|
|
|
Re: Categorising information, [message #322766 is a reply to message #321408] |
Mon, 26 May 2008 03:24 |
dunthon
Messages: 10 Registered: May 2008
|
Junior Member |
|
|
Thanks Barbara,
The code is now working thanks to you're help, Thanks for your time and energy on helping me tackle this problem.
The results are working for about >99% so its great, At the moment i will need to spend my time on other things so i wont be able to test the current results intensively.
I might come back on this in a few weeks but i cant promise anything. As i already said i really apriciated youre help so far thank you!!
|
|
|
Goto Forum:
Current Time: Thu Dec 26 18:55:54 CST 2024
|