Options for querying on the skewed data [message #542739] |
Fri, 10 February 2012 00:39 |
|
orapratap
Messages: 134 Registered: November 2011 Location: Canada
|
Senior Member |
|
|
Hello
We have a table with huge data which is skewed on a 'status' column
The 'status' column has 6 distinct values with 1 particular value occupying 80-85% records
In the batch process we query the data on the status and process the retrieved records
My senior is insisting on partitioning which I see not much feasible considering cost implications just for a part of functionality
I have following options in mind, could you please suggest on it?
See there are 6 status 'A','B','C','D','E','F'
with 'A' occupying 80% records
'B' to 'F' occupies 2% till 14% records in the table(approx)
1)
Create a conditional index on status (using case) to have records with all statuses except 'A'
Then create If-ELSE structure
IF input parameter is 'A'
select /*+ FULL Parallel(t) */ * from t where status='A';
ELSE
Select /*+ INDEX (t conditional_index) */ * from t where status in ('B','C');
END IF;
I want to create conditional index here for 2 reasons
1] since it will have values for status except 'A' this nullify the chance that this index will be picked up when status='A' will be queried
Thus making the performance worst (status ='A' is for 80% records) - The IF-ELSE is additional protection
2] Less impact on the DMLS as the index will not be on status='A' which contribute to large chunk of records
2)
Populate a dummy table which would contain rowid and status
Since the business closes at 21:00 and batch process starts at 21:30
Between these times periods refresh the dummy table every day using merge (to catch business transactions during the day)
Now during the batch process retrieve records from the main table using the rowids in the dummy table depending on the input status value
3)
Create index on status
Make sure hard coded status values are used in the database procedures
Gather stats with the histograms
And leave it to the Optimizer to choose the best possible path
Thanks and Regards
OraPratap
|
|
|
Re: Options for querying on the skewed data [message #542778 is a reply to message #542739] |
Fri, 10 February 2012 03:10 |
John Watson
Messages: 8962 Registered: January 2010 Location: Global Village
|
Senior Member |
|
|
I'll have a tentative try at this.
Your index first. You say that the most selective value would retrieve 2% of the rows. It is quite possible that even for 2%, a scan is more efficient. This would depend on many criteria (row size, block size, key size, clustering factor,...) but I think it likely that unless you do something else an index is not going to help in the majority of queries. The "something else" might be to sort the data on the key; convert to an IOT (possibly with an overflow segment); add extra columns to the keys. None of these need an extra licence. And you might want to consider a bitmap index instead, much easier.
Now partitioning. Presumably you mean list partitioning? Given the distribution it is hard to see how range or hash would help. This would seem to be a good option, partition pruning would surely help. Consider the queries and think about whether you can sub-partition; how to partition your indexes; the impact on manageability.
You'll have to test and test and test, to prove that the any change does give predictable, measurable, improvement.
Good luck.
|
|
|
|
Re: Options for querying on the skewed data [message #542932 is a reply to message #542819] |
Fri, 10 February 2012 19:42 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
If you are accessing data on status alone then partitioning is definitely the way to go. Even if it is faster to scan 2% of the table via an index rather than a full table scan, you are going to pollute the buffer cache and get rid of cached blocks that have a much better likelihood of being reused. ie. You *might* make your query a *little* faster, but you'll slow down the rest of the system.
Bitmap indexes are not magically faster than b-tree indexes for high-cardinality keys. You need to *combine* many bitmap index searches in a single query to get any benefit. And then there are the locking issues.
What are your objections to two list partitions: one for the skewed value, the other for DEFAULT. You add a PARTITION clause to your create-table and GLOBAL clause to your create-index scripts, populate the table, and you're done. No further maintenance or overheads required.
Ross Leishman
|
|
|