Home » RDBMS Server » Performance Tuning » highly skewed data distribution
highly skewed data distribution [message #134574] |
Thu, 25 August 2005 10:52 |
sebastianR
Messages: 33 Registered: August 2005
|
Member |
|
|
Hi,
what shall I do if an index range scan is performing poor because of highly skewed data distribution? Do I have highly skewed data distribution at all?
I am searching in a column named KEY_VAL values like:
-'1%'
-'12%'
-'123%'
-'1234%'
At '123%' the CBO switches from full table scan to use my index which is the absolut right decision.
I've selected the count of '1%', '2%', '3%', ... ,'9%', 'A%', 'Z%' with the following result:
KEY_VAL | count(KEY_VAL)
==================
'1%' 61 000
'2%' 48 000
'3%' 38 000
'4%' 12 000
'5%' 19 000
'6%' 18 000
'7%' 17 000
'8%' 23 000
'9%' 25 000
'A%' 11 000
...
...
'Z%' 27 000
==================
total rows 870 000 (640 000 distinct KEY_VALs)
Is this list evidence for skewed or non skewed data distribution?
I have made a histogram like this for the respective index and table:
exec dbms_stats.gather_table_stats(ownname => 'myscheme', tabname => 'table1', method_opt => 'FOR COLUMNS KEY_VAL SIZE 250');
exec dbms_stats.gather_index_stats(ownname => 'myscheme', indname => 'IDX_OBJ_KEY_1');
In the table there are 640887 DISTINCT values of KEY_VAL where there are 870 000 rows in total.
And now there is a TestCase which totally makes me wonder:
Searching '123%' returned 593 rows in 12 seconds.
But searching 'AA0%' returns 996 rows in only 4 seconds!
After intense testing 'AA0%' remains the exception to be that fast. But I don't understand why?
1. question:
Am I right putting the histograms bucket size as high as possible, due I have many distinct values?
Ad1) Is there a way to set the bucket size even higher than 254?
Ad2) Am I wrong using a histogram in this case at all?
2. question:
Why does 'AA0%' return that fast (4 seconds), where '123%' (and nearly all other tested cases) need at least 12 seconds?
AA0% returns more rows!!!, therefore I thought it would be slower? There seems to be no relation (at least no simple relation) between the number of rows returned and the time needed to execute?
3. question
Is it that the data is skewed, and the case 'AA0%' is just luckily placed in the index so that the search in the index is faster than searching for '123%'?
4. question
how is the index search performed? it is a b-tree (with depth blevel = 2, clustering factor 600 000); so the maximum search time would be calculated log N, right?
Did I get it right that when searching the index, it'll start by looking at the very middle position of the index and deciding if the value it is looking for, rests in the lower half or rests in the upper half, picking the correct half and doing the same there; getting the middle value and deciding to go to lower half or upper half; until it reaches the value? binary search (binäre suche)?
Am I totally wrong?
5. question
how is the index sorted? Alphabetically?
I really appreciate every comment or suggestion and I thank you for your time,
SebastianR
|
|
|
Re: highly skewed data distribution [message #134576 is a reply to message #134574] |
Thu, 25 August 2005 11:03 |
sebastianR
Messages: 33 Registered: August 2005
|
Member |
|
|
by the way, i don't index the field KEY_VAL directly,
the index looks like:
CREATE INDEX IDX_OBJ_KEY_1 ON FDMT_OBJ_KEY
("FDMT"."ISDATENULL"("END_DATE"), UPPER("KEY_VAL"))
LOGGING
TABLESPACE FDMT_IDX
PCTFREE 10
INITRANS 2
MAXTRANS 255
STORAGE (
INITIAL 64K
MINEXTENTS 1
MAXEXTENTS 2147483645
PCTINCREASE 0
BUFFER_POOL DEFAULT
)
where isdatenull is just checking if the supplied DATE IS NULL OR NOT;
the selectivity of END_DATE: 18 / 870 000
the selectivity of UPPER(KEY_VAL): 640 000
So I think the index will first filter every rows where the date is not null and not > SYSDATE, and then filter using KEY_VAL.
Here is the used SQL Query:
SELECT DISTINCT T1.OBJ_ID, 4141
FROM TABLE1 T1
WHERE T1.end_date>sysdate
UNION SELECT DISTINCT T1.OBJ_ID, 4141
FROM TABLE1 T1
WHERE UPPER(T1.KEY_VAL) LIKE UPPER('123%')
AND isdatenull(T1.END_DATE) = 1
UNION SELECT DISTINCT T2.KEY_ID, 4141
FROM TABLE2 T2
WHERE UPPER(T2.PUID) LIKE UPPER('123%')
|
|
|
Re: highly skewed data distribution [message #134586 is a reply to message #134574] |
Thu, 25 August 2005 12:01 |
smartin
Messages: 1803 Registered: March 2005 Location: Jacksonville, Florida
|
Senior Member |
|
|
Ok, I like the thought you have given this, and the information you have provided. Couple thoughts in random order:
The response time isn't determined necessarily by the number of rows returned, but where those blocks are located. It is the number of blocks accessed, and the method (single or multiblock) of that access.
One thing you can try is to sort your data in different ways, and see if your attempts using '123%' vs 'AA0%' have more closely the same times. The ways to try are a) by date then key value within date b) by key value then date within key value c) some other column d) random order.
But, your situation is complicated by the fact that you are probably using an index skip scan (it would have been helpful for you to post your explain plan also), instead of a normal range scan or unique scan. A skip scan treats the index as if it were several different indexes, one for each of the different leading column values in the index.
So, you'll want to test what happens if you reverse the order of your index columns (try it both ways). Also you can probably use index key compression with some success here.
For more on access paths (how data is read), you should read the performance tuning guide that comes with 10g (even if you aren't on 10g). Specifically part IV.
I like to make my histogram bucket size as big as possible, which I think is in the 250ish range (it is documented, but I never remember it, so I just use 250). I also always like to gather histograms, regardless of whether I think the data is skewed or not (and regardless of whether it actually is skewed). The more info for the CBO the better.
As far as whether your key_val column is skewed, it doesn't look like it is skewed that much, with 600,000 unique values out of 800,000 total values, but we don't have enough information to know. Your count() query wasn't so much showing skewness on the key_val column itself, because you were only showing the leading character of key_val, not a count of the occurances of the whole value. But don't worry with it, just let oracle gather histograms for all columns when you get stats.
As far as stats go, I like to use the cascade option of gather_table_stats to get both index and table stats at the same time, rather than doing a separate call to gather_index_stats. gather_table_stats(owner,table,cascade=>true,method_opt=>'for all columns size 250'). You can use degree and estimate_percent params as needed.
And why do you have the isdatenull function thing in there to begin with, and why is it indexed? Btree indexes (which are sorted alphabetically) do not store null values. Use that to your advantage and think about what that means. Also, why can't you just use where date_col is null to find out if it is null?
|
|
|
Re: highly skewed data distribution [message #134699 is a reply to message #134574] |
Fri, 26 August 2005 05:02 |
sebastianR
Messages: 33 Registered: August 2005
|
Member |
|
|
First of all, I dumped the isDateNull function because it didn't improve performance at all.
I used it because i have two clauses regarding the field END_DATE:
1) END_DATE IS NULL
OR
2) END_DATE > SYSDATE
I used the isdatenull function so I could make an index on UPPER(KEY_VAL) (that is the field I am searching on) and END_DATE, because I believed in performance improvement using a concatenated index. The OR condition was messing up that concept, therefore I selected the same table twice joining them using UNION ALL so at one table I selected all rows which matched END_DATE>SYSDATE and on the other I could use my concatenated index on isdatenull(END_DATE), UPPER(KEY_VAL).
Please feel free to judge this approach, I'd appreciate your thoughts on this in general.
Anyway, I was wrong, a simple index on upper(KEY_VAL) is equally good.
Therefore, please allow me to repost my actual select-query:
(SELECT DISTINCT T1.OBJ_ID,1228
FROM TABLE1 T1
WHERE UPPER(T1.KEY_VAL) LIKE UPPER('123%')'
AND (T1.END_DATE > SYSDATE OR T1.END_DATE IS NULL)
UNION SELECT DISTINCT T2.KEY_ID, 1228
FROM TABLE2 T2
WHERE UPPER(T2.PUID) LIKE UPPER('123%'));
I am not sure if I understood what you meant by 'sort your data in different ways'. Do you speak of the possibility of poor clustering?
Do you want me to physically reorder my data, on different columns, or did you just mean I should change the order of my where-clauses?
I did the latter with no noticable change, but I've read this interesting white paper about indexing: download.oracle.com/oowsf2004/1272_wp.pdf
In there, the author points out 3 reasons for poor indexing, or non-choosen indexes by the CBO:
1. Selectivity
2. Uneven Data distribution
3. Poor Clustering
I am pretty sure now, that Selectivity and Data distribution are good, so I am looking now if maybe my clustering is bad.
And I think that might be it, because when sorted ASC on the primary key KEY_ID (which is not used when selecting) the values in the KEY_VAL fields are differing from row to row.
As this is not an exact determination for bad clustering I will post my index information:
table info:
num_rows 870 000
blocks 6254
num_rows: 871439
distinct_keys_: 640884
avg leaf blocks per key: 1
avg data blocks per key: 1
clustering factor: 605687 <--- ISN'T THIS VEEEERY HIGH???
As I read here (http://www.blacksheepnetworks.com/security/resources/oracle/faqdbapf.htm):
USER_INDEXES.CLUSTERING_FACTOR - This defines how ordered the rows are in the index. If CLUSTERING_FACTOR approaches the number of blocks in the table, the rows are ordered. If it approaches the number of rows in the table, the rows are randomly ordered. In such a case, it is unlikely that index entries in the same leaf block will point to rows in the same data blocks.
Okay, so 605687 nearly is 640884: bad clustering
What possibilities do I have to make this better? Please post any ideas and/or suggestions.
Thank you in advance,
SebastianR
|
|
|
Re: highly skewed data distribution [message #134702 is a reply to message #134574] |
Fri, 26 August 2005 05:05 |
sebastianR
Messages: 33 Registered: August 2005
|
Member |
|
|
For for the sake of completeness here the explain plan for the posted select query:
Operation Object Name Rows Bytes Cost Object Node In/Out PStart PStop
SELECT STATEMENT Optimizer Mode=CHOOSE 493 372
SORT UNIQUE 493 7 K 372
UNION-ALL
TABLE ACCESS BY INDEX ROWID TABLE1.TABLE1_OBJ_KEY 493 7 K 364
INDEX RANGE SCAN TABLE1.SR_OBJ_KEY_5 518 4
TABLE ACCESS BY INDEX ROWID TABLE1.TABLE1_ASSET 1 17 3
INDEX RANGE SCAN TABLE1.IX_TABLE1_ASSET_2 1 2
|
|
|
Re: highly skewed data distribution [message #135016 is a reply to message #134574] |
Mon, 29 August 2005 08:21 |
sebastianR
Messages: 33 Registered: August 2005
|
Member |
|
|
I've now created a temporary Table with the exact same data like the one I am working with, except I added a
clause to it, so the rows are now in the order I like for the specific select-query.
I also created the same indexes and gathered statistics for the table and indexes.
Now, if I look on the index statistics, I get the following:
TABLE_NAME BLEVEL AVG_DATA_BLOCKS_PER_KEY AVG_LEAF_BLOCKS_PER_KEY CLUSTERING_FACTOR
SR_TEST_OBJ_KEY SR_TEST_IDX_1 24 6 448
SR_TEST_OBJ_KEY SR_TEST_IDX_2 1 1 9261
SR_TEST_OBJ_KEY SR_TEST_IDX_3 1 1 641089
where the first index is on end_date, the second on UPPER(KEY_VAL) and the third on REVERSE(UPPER(KEY_VAL)).
Ignoring the first index (on end_date), the second one greatly improved on the clustering_factor while the third one remained the same (which is absolutly clear to me).
Testing the search with closing wildcard ('123%'), there is no significant difference in execution time.
So I guess the clustering_factor is not a too critical factor at all.
SebastianR
|
|
|
Goto Forum:
Current Time: Sat Jan 18 17:03:06 CST 2025
|