Looking for a better access method (Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit) [message #354847]
Tue, 21 October 2008 10:12 | annagel
I have been working on a domain index that performs fuzzy matching of random strings. I know the context index can be used to accomplish something similar to what I am doing, and I am employing the context index for certain data types, but its matching ability on data that has no language structure, such as serial numbers, leaves something to be desired.
The index performs ngram matching, which basically entails breaking the string to be indexed into fixed-length pieces (ngrams), storing those ngrams away, and then computing similarity based on how many ngrams two strings have in common. To this I add a weight based on the frequency of occurrence of each ngram in the data set.
Example set-up for the index follows:
--the data table to be indexed
CREATE TABLE lkups AS
SELECT SUBSTR(DBMS_RANDOM.STRING ('x', 17), 1, 17) val
FROM DUAL
CONNECT BY ROWNUM <= 1000;
--index table 1 contains the row-ids and ngrams
CREATE TABLE ngi (ngram, row_id, CONSTRAINT ngipk PRIMARY KEY (ngram, row_id)) AS
SELECT DISTINCT SUBSTR (SUBSTR (dat.val, cnt.rwnum - 1, DECODE (rwnum, 1, 2, 3)), 1, 3) ngram,
dat.ROWID row_id
FROM lkups dat,
(SELECT ROWNUM rwnum
FROM DUAL
CONNECT BY ROWNUM <= 17) cnt;
--index table 2 contains ngram counts for use in weighting
CREATE TABLE ngn (ngram, ngram_count, CONSTRAINT ngnpk PRIMARY KEY(ngram)) AS
SELECT ngram,
COUNT (*) ngram_count
FROM ngi
GROUP BY ngram;
A score for a given match is computed by summing 1/ngram_count for each matched ngram and then dividing that value by the theoretical maximum score for the query, where a count of 1 is used when an ngram does not exist in the index.
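For anyone following along, the breakdown and scoring described above can be sketched in plain Python (function and variable names here are illustrative, not part of the actual index code; the ngram split approximates the SUBSTR logic in the setup SQL):

```python
from collections import Counter

def ngrams(s):
    """Approximation of the SUBSTR logic above: a leading bigram, then
    overlapping 3-character pieces (the trailing piece may be shorter)."""
    return {s[:2]} | {s[i:i + 3] for i in range(len(s) - 1)}

def build_counts(values):
    """ngram -> number of indexed values containing it (the NGN table)."""
    counts = Counter()
    for v in values:
        counts.update(ngrams(v))
    return counts

def score(query, candidate, counts):
    """Sum 1/ngram_count over shared ngrams, divided by the theoretical
    maximum for the query (count 1 for ngrams missing from the index)."""
    q = ngrams(query)
    matched = sum(1.0 / counts[g] for g in q & ngrams(candidate))
    max_score = sum(1.0 / counts.get(g, 1) for g in q)
    return matched / max_score
```

An exact self-match scores 1.0, since every one of the query's ngrams is present in the index with its stored count.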
My current access method follows:
SELECT i.row_id,
SUM (1 / n.ngram_count) / mx.mx_score scr
FROM ngn n,
ngi i,
(SELECT DISTINCT SUBSTR (:cmpval, ROWNUM - 1, DECODE (ROWNUM, 1, 2, 3))
ngram
FROM DUAL
CONNECT BY ROWNUM <= LENGTH (:cmpval)) lk,
(SELECT SUM (1 / NVL (ngram_count, 1)) mx_score
FROM (SELECT DISTINCT SUBSTR (:cmpval,
ROWNUM - 1,
DECODE (ROWNUM, 1, 2, 3)) ngram
FROM DUAL
CONNECT BY ROWNUM <= LENGTH (:cmpval)) lk,
ngn n
WHERE n.ngram(+) = lk.ngram) mx
WHERE lk.ngram = n.ngram
AND i.ngram = n.ngram
GROUP BY i.row_id, mx.mx_score
HAVING SUM (1 / n.ngram_count) / mx.mx_score >= :sc;
This works well, but as the indexed set grows, performance starts to degrade because we look up a lot of information that we throw out before it is returned. The problem, as I see it, is that the filtering by score happens in the HAVING clause; as a result, searching for a high-probability match has to filter out large quantities of rows that share only one or two very common ngrams.
My thought on a way around this problem was to figure out all possible ngram combinations that have a sufficient score to meet my threshold, and then go look for those combinations.
The following query accomplishes that, poorly:
SELECT ng1.row_id,
MAX (mtch.scr) scr
FROM (SELECT LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',') ng1,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 2), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng2,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 3), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng3,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 4), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng4,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 5), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng5,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 6), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng6,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 7), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng7,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 8), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng8,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 9), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng9,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 10), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng10,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 11), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng11,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 12), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng12,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 13), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng13,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 14), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng14,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 15), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng15,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 16), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng16,
NVL (LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 17), ','),
LTRIM (REGEXP_SUBSTR (comb, ',[^,]+', 1, 1), ',')) ng17,
( NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
1),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
2),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
3),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
4),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
5),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
6),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
7),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
8),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
9),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
10),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
11),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
12),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
13),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
14),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
15),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
16),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
17),
',')),
0))
/ mx.mx_score scr
FROM (SELECT SYS_CONNECT_BY_PATH (ngram, ',') comb,
SYS_CONNECT_BY_PATH (ngram_count, ',') scr
FROM (SELECT rws.ngram,
ng.ngram_count,
ROW_NUMBER () OVER (ORDER BY ng.ngram_count ASC)
rw_num
FROM (SELECT DISTINCT SUBSTR (:cmpval,
ROWNUM - 1,
DECODE (ROWNUM,
1, 2,
3)) ngram
FROM DUAL
CONNECT BY ROWNUM <= LENGTH (:cmpval)) rws,
ngn ng
WHERE ng.ngram = rws.ngram)
CONNECT BY NOCYCLE rw_num > PRIOR rw_num) dat,
(SELECT SUM (1 / NVL (ngram_count, 1)) mx_score
FROM (SELECT DISTINCT SUBSTR (:cmpval,
ROWNUM - 1,
DECODE (ROWNUM, 1, 2, 3))
ngram
FROM DUAL
CONNECT BY ROWNUM <= LENGTH (:cmpval)) lk,
ngn n
WHERE n.ngram(+) = lk.ngram) mx
WHERE ( NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
1),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
2),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
3),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
4),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
5),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
6),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
7),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
8),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
9),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
10),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
11),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
12),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
13),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
14),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
15),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
16),
',')),
0)
+ NVL ( 1
/ TO_NUMBER (LTRIM (REGEXP_SUBSTR (scr,
',[^,]+',
1,
17),
',')),
0))
/ mx.mx_score >= :sc) mtch,
ngi ng1,
ngi ng2,
ngi ng3,
ngi ng4,
ngi ng5,
ngi ng6,
ngi ng7,
ngi ng8,
ngi ng9,
ngi ng10,
ngi ng11,
ngi ng12,
ngi ng13,
ngi ng14,
ngi ng15,
ngi ng16,
ngi ng17
WHERE ng1.row_id = ng2.row_id
AND ng1.row_id = ng3.row_id
AND ng1.row_id = ng4.row_id
AND ng1.row_id = ng5.row_id
AND ng1.row_id = ng6.row_id
AND ng1.row_id = ng7.row_id
AND ng1.row_id = ng8.row_id
AND ng1.row_id = ng9.row_id
AND ng1.row_id = ng10.row_id
AND ng1.row_id = ng11.row_id
AND ng1.row_id = ng12.row_id
AND ng1.row_id = ng13.row_id
AND ng1.row_id = ng14.row_id
AND ng1.row_id = ng15.row_id
AND ng1.row_id = ng16.row_id
AND ng1.row_id = ng17.row_id
AND mtch.ng1 = ng1.ngram
AND mtch.ng2 = ng2.ngram
AND mtch.ng3 = ng3.ngram
AND mtch.ng4 = ng4.ngram
AND mtch.ng5 = ng5.ngram
AND mtch.ng6 = ng6.ngram
AND mtch.ng7 = ng7.ngram
AND mtch.ng8 = ng8.ngram
AND mtch.ng9 = ng9.ngram
AND mtch.ng10 = ng10.ngram
AND mtch.ng11 = ng11.ngram
AND mtch.ng12 = ng12.ngram
AND mtch.ng13 = ng13.ngram
AND mtch.ng14 = ng14.ngram
AND mtch.ng15 = ng15.ngram
AND mtch.ng16 = ng16.ngram
AND mtch.ng17 = ng17.ngram
GROUP BY ng1.row_id;
This performs very poorly. Multiple accesses of the same table are one problem; worse is the fact that I get far too many combinations which meet the threshold, so any benefit gained by not looking up data below the threshold is more than lost by the cost of generating the combinations, and by the fact that the combinations lead to multiple look-ups of the same "good" indexed value, which then have to be filtered out with a DISTINCT (or, in this case, a GROUP BY) to get the score.
The return time on the first algorithm is not horrible (about 5 seconds on an index of 5 million values), but it is doing "unnecessary" work, by which I mean it retrieves data that will never make it back to the user. Can anyone think of a different access path or storage method which might lessen or eliminate this "unnecessary" overhead?
Any suggestions are appreciated
Thanks,
Andrew
Re: Looking for a better access method [message #354999 is a reply to message #354894]
Wed, 22 October 2008 03:11 | rleishman (Melbourne, Australia)
Fascinating problem. I love this stuff.
Rather than doing this with SQL, it might pay to do it iteratively in PL/SQL.
You could take the :cmpval as input and quickly split it up into ngrams, then look up those ngrams in NGN to get the ngram_count of each (i.e. the number of "appearances" of each in the NGI index) and ORDER BY ngram_count ascending. At the same time, record the theoretical maximum score (TMAX) as the sum of 1/ngram_count.
Starting with the least "popular" ngram and then iterating, you could:
- Subtract 1/ngram_count from TMAX
- Fetch all of the matching rows from NGI and record a score for them in an associative array. Add 1/ngram_count to the score for each ROWID selected.
- Track the current "high score" (HIGH) as you go. This is the best score so far of any rowid.
- If score for any rowid slips below HIGH - TMAX, delete it from the associative array (or don't insert it in the first place).
- Stop when the associative array is empty. The winner is the rowid you just deleted. Watch out for dead-heats though.
The advantage here is that you could find your winner before you get to the really common ngrams that chew up IO.
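As a sketch, the outline above might look like this in Python (a dict standing in for the PL/SQL associative array; all names and data structures are illustrative, not from the thread):

```python
def best_match(query_gram_counts, postings):
    """Iterative pruning per the outline above: walk the query's ngrams
    rarest-first, scoring candidates and dropping any that can no longer
    catch the current leader.
    postings: ngram -> iterable of row ids (the NGI table);
    query_gram_counts: list of (ngram, ngram_count) pairs (the NGN table)."""
    order = sorted(query_gram_counts, key=lambda gc: gc[1])
    tmax = sum(1.0 / c for _, c in order)   # score still attainable
    scores, high = {}, 0.0
    for gram, cnt in order:
        w = 1.0 / cnt
        tmax -= w                           # this ngram is now consumed
        for rid in postings.get(gram, ()):
            scores[rid] = scores.get(rid, 0.0) + w
            high = max(high, scores[rid])
        # prune candidates that cannot reach the leader even if they
        # match every remaining ngram (watch out for ties at the end)
        scores = {r: s for r, s in scores.items() if s + tmax >= high}
    return max(scores.items(), key=lambda rs: rs[1]) if scores else None
```

The payoff is the same as described above: the winner can often be settled before the very common, IO-heavy ngrams are ever touched.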
Ross Leishman
Re: Looking for a better access method [message #355054 is a reply to message #354999]
Wed, 22 October 2008 07:13 | annagel
Kevin, I am familiar with the Text indexes (I mentioned them as Context indexes in my post) and will be using this functionality for matching address, name, and description type data, but the exact nature of the fuzzy matching they employ is proprietary and not disclosed by Oracle. The feature will return matches, but they are usually not the best matches in the result set, presumably because the algorithm is built to fuzzy-match English-language data. This is further supported by the Text documentation's enumeration of the specific languages supported by the fuzzy match algorithm.
Ross,
Very cool idea. As a rule I tend to try for the pure SQL solution when I can, but perhaps this is a situation where the PL/SQL solution is the way to go. I am looking for a way to grab all matches which meet a certain threshold, not just the "most" similar, but the concept still applies. The "normal" invocation is going to look for a match with a very high match percentage, and because of the weighting method and the way the data is really distributed (i.e. this is not REALLY random data, it just looks like it), if 2 or 3 of the most highly rated ngrams are missing, a match is not possible... which, in thinking about it, might lead me back to a modified version of my original pure SQL solution where I narrow my suspect ROW_ID pool by identifying these "required" ngrams first and only going more fine-grained after that.
Going to play with this idea a bit; I will post back my findings. Thanks for the help: this is exactly what I needed, a fresh mind on a problem I have been looking at too long to find a good solution to.
Andrew
Re: Looking for a better access method [message #355137 is a reply to message #354847]
Wed, 22 October 2008 16:07 | coleing
I could almost guarantee that your query would run a heck of a lot quicker if you could do it without the REGEXP functions.
They are horrendously slow in Oracle with large volumes.
Anyway, I presume you have already looked at UTL_MATCH? (Also slow, but worth a mention.)
Re: Looking for a better access method [message #355142 is a reply to message #355139]
Wed, 22 October 2008 19:39 | annagel
In regard to the regexp stuff: I have heard it has issues in 10g, but that code was a proof of concept more than anything, and the concept was slow in pretty much every way imaginable. The regexp calls made it worse, I am sure, but it was far too slow before they got thrown into the mix... it was a failed example of an idea I tossed around; my original query gets far better performance.
As for UTL_MATCH, I am aware of it, but never really considered it for this application, as any use of it would require a complete scan of the source data set. In the app this is all for, I will be scanning 4 data sets with between 1 and 5 million values each; the volume is just too high for anything in UTL_MATCH to be usable.
I spent some time this morning tossing around ideas I got after reading Ross's reply, and I got a performance boost of about 60% by restricting my set of look-up IDs: I rank my ngrams by weight and set a cut-off point, in effect saying that if none of these ngrams are in the result, it is impossible to meet the threshold. I am starting to think this is still wasteful, though, because I keep looking up ngram info after I "know" (not really, but could know if I were doing it procedurally) that I have either surpassed the threshold for sure or that surpassing it has become impossible.
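That cut-off idea can be made concrete with a small helper (hypothetical, just to illustrate the arithmetic): rank the query's ngrams rarest-first by weight, and find the shortest prefix such that a row matching none of those ngrams tops out below the threshold, so every viable candidate must contain at least one of them.

```python
def required_prefix_size(query_gram_counts, threshold):
    """How many of the highest-weight (rarest) query ngrams form a
    'required' set: a row sharing none of them cannot reach
    threshold * max_score, so any match must contain at least one.
    query_gram_counts: (ngram, ngram_count) pairs for the query."""
    weights = sorted((1.0 / c for _, c in query_gram_counts), reverse=True)
    max_score = sum(weights)
    rest, need = max_score, 0
    while need < len(weights) and rest >= threshold * max_score:
        rest -= weights[need]               # exclude the next-rarest ngram
        need += 1
    return need
```

In the SQL version this would correspond to joining NGI only on those `need` rarest ngrams first to collect candidate ROWIDs, and scoring just the survivors.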
Unfortunately I got pulled off this (which is all long-term work) this afternoon to get something out quickly, so my next attempt will be a more PL/SQL-based approach: I start with the same set I was using in SQL, then process it in loops, dropping ROWIDs as they either meet the threshold and can be returned, or as meeting the threshold becomes impossible. I am hoping to get an even bigger improvement with that method, but I will have to wait and see in the morning.
Thanks for everyone's input; hope I have some good news tomorrow...