Best way to partition my table [message #330307] Sun, 29 June 2008 05:04
kjcamann@gmail.com
Messages: 2
Registered: June 2008
Junior Member
Hi everyone.

I'm working on a very big data warehouse containing scientific information. One of the tables holds definitions for sets of numeric data, and will probably be extremely large (possibly billions of rows in the future). I've thought a lot about a partitioning strategy but I am not sure which to go with. This is a pretty tough (and long) problem, so thanks in advance if you read to the end!

The two relevant columns in the sets table are:

element_id (the primary key)
set_id (an indexed field, not unique, shows set membership of the element)

There are two operations that should be fast: selecting an element using its element_id, and selecting all elements with the same set_id. Nothing more complicated needs to happen (no ranges, other operators, or anything like that).
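
To make this concrete, here is a rough sketch of the table (the table name, column types, and the payload column are just placeholders) along with the two lookups that need to be fast:

    CREATE TABLE element_sets (
        element_id  NUMBER        NOT NULL,
        set_id      NUMBER        NOT NULL,
        reading     BINARY_DOUBLE,             -- stand-in for the numeric payload
        CONSTRAINT element_sets_pk PRIMARY KEY (element_id)
    );

    CREATE INDEX element_sets_set_idx ON element_sets (set_id);

    -- the two access paths:
    SELECT * FROM element_sets WHERE element_id = :eid;  -- one element by primary key
    SELECT * FROM element_sets WHERE set_id = :sid;      -- all elements of one set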

Clearly the table should be partitioned on the set_id, to ensure all elements in the set are in the same partition. This way partitions not containing the target set_id are pruned right away.

But I am having trouble deciding what type of partition to use. I really dislike the idea of range, since it does not make sense for something like an ID number. Hash is nice, because it ensures all data from the same set is stored together. But if you have a huge number of elements in the same set (and this can happen), one of the partitions will be abnormally large compared to the others and the distribution will not be as even as promised. I think hash is really more for primary keys or indexed columns with "more uniqueness" than a set_id has.

The way to fix this seems to be List-Hash partitioning. First partition the set_id by lists, then the element_id by hash. Most lists will have only 1 hash sub-partition, but they could have more to break up a single abnormally large set. Unfortunately there is no such thing as a hash-hash partition. I had hoped multicolumn hashes would do the trick, but they don't seem to from my experiments with the query planner.
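
For illustration, here is roughly what I mean, assuming a version that supports composite list-hash partitioning (11g); the set_id values and partition names are made up:

    CREATE TABLE element_sets (
        element_id NUMBER NOT NULL,
        set_id     NUMBER NOT NULL
    )
    PARTITION BY LIST (set_id)
    SUBPARTITION BY HASH (element_id)
    (
        PARTITION p_small VALUES (1, 2, 3, 4) SUBPARTITIONS 1,  -- many small sets together
        PARTITION p_huge  VALUES (12)         SUBPARTITIONS 8   -- one abnormally large set, broken up
    );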

But this means that I would have to specify a list of all set_ids per partition. There might be a lot of those (hundreds, perhaps thousands?). That sounds like a pretty unwieldy VALUES clause. I honestly don't care how hard it is to manage; all I care about is how slow it will go. Does Oracle have a problem with list partitions that have huge numbers of VALUES in them? The reason I don't care much about maintenance is that no matter what I do, I will need to implement a lot of custom maintenance routines by hand anyway because there is so much data. So I don't care how ugly it is; I'll be making custom GUIs to deal with as much of the mess as I can. Unless there is a better way, of course.

Thanks in advance, and thanks for reading
Ken
Re: Best way to partition my table [message #330318 is a reply to message #330307] Sun, 29 June 2008 07:22
Kriptas
Messages: 118
Registered: June 2008
Senior Member
I think you need to use list partitioning:
one set_id per partition (if there are roughly >1000K element_ids in one element set).
If your queries often need to select a certain element set, Oracle will go directly to that partition and select all the elements from it. You can use about 64K partitions...
If the number of rows per element_id is larger than the number of rows per element set, then you can partition by element_id instead.
On the non-partitioned column, use a global bitmap index...
Maybe use an index-organized table with partitions (see the sketch below).
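
Something like this, for example (the names and partition count here are only an illustration):

    CREATE TABLE element_sets (
        set_id     NUMBER NOT NULL,
        element_id NUMBER NOT NULL,
        CONSTRAINT element_sets_pk PRIMARY KEY (set_id, element_id)
    )
    ORGANIZATION INDEX
    PARTITION BY HASH (set_id) PARTITIONS 64;  -- rows of one set stored together in the index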
Re: Best way to partition my table [message #330347 is a reply to message #330307] Sun, 29 June 2008 16:24
coleing
Messages: 213
Registered: February 2008
Senior Member
Just a thought.

Why do you think partitioning the data will speed up your queries? What was your driver for deciding on partitioning? Have you seen some evidence that would make partitioning a better option?

Sounds like both types of query are going to be standard b-tree index lookups, which means you are going to want a global element_id index over your partitions if you want to maintain performance on the element_id-only transactions (if you partition by set_id).



Re: Best way to partition my table [message #330353 is a reply to message #330347] Sun, 29 June 2008 18:14
rleishman
Messages: 3728
Registered: October 2005
Location: Melbourne, Australia
Senior Member
Unless a single SET_ID is going to occupy >10% of a single partition, partitioning will not be faster than a simple b-tree index lookup. However if you were to partition on some other column, the SET_ID index would need to be global.

The most important thing you haven't told us is the cardinality of the SET_ID.
- How many rows per SET_ID are there?
- How much does it vary?
- Do all rows for a SET_ID arrive at the same time? Or are they inserted with long intervals in-between?

If there are 20+ rows per SET_ID, you could really benefit from clustering (ensuring that rows with the same SET_ID reside in the same database block), i.e. 20 rows spread over 20 blocks will perform 20x the IO of 20 rows clustered in 1 block.

You can achieve this with CLUSTERS (either a hash or index cluster) or by periodically rebuilding partitions in sorted order (can you tolerate this overhead?).
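
For example, a hash cluster along these lines (the SIZE and HASHKEYS numbers are placeholders that would need proper sizing for your data):

    CREATE CLUSTER element_set_cluster (set_id NUMBER)
        SIZE 8192 HASHKEYS 100000;  -- one block per hash key; 100K distinct set_ids assumed

    CREATE TABLE element_sets (
        element_id NUMBER NOT NULL,
        set_id     NUMBER NOT NULL,
        CONSTRAINT element_sets_pk PRIMARY KEY (element_id)
    )
    CLUSTER element_set_cluster (set_id);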

If all rows for a SET_ID are inserted at the same time, then you may get a high degree of clustering by default.

Ross Leishman
Re: Best way to partition my table [message #330355 is a reply to message #330307] Sun, 29 June 2008 19:23
kjcamann@gmail.com
Messages: 2
Registered: June 2008
Junior Member
Thank you for responding everyone!

Some questions:

"The most important thing you haven't told us is the cardinality of the SET_ID.
- How many rows per SET_ID are there?
- How much does it vary?
- Do all rows for a SET_ID arrive at the same time? Or are they inserted with long intervals in-between?"

Sorry, I should have said this explicitly because that is the most important thing. I think if my answer to this question was better I wouldn't be having such a problem.

- Rows per set_id: A set may have as few as 10 rows or as many as 10-20 million. I also don't want to totally rule out the possibility of one being even bigger, but that should be rare.

- Variance of set size: To be honest, I was really hoping I could come up with a solution that would work "OK" no matter how I answered this question, so that I could avoid answering it. It would take a long time to do the research to figure this out beforehand. But if I have to, then I have to.

This is why I asked whether Oracle would "choke" on a list partition with hundreds or thousands of VALUES entries (for the small sets). That way I could handle the big sets and the little ones, even if the little ones are handled in an ugly way. Partitioning by hash would be fine for the little ones, but not the big ones, because it would create abnormally large partitions. I wanted to find a strategy that would work for both.

What I wanted to do was this: declare that a partition must never have more than X rows in it. If you have 100 sets of size (X / 100), then they all go in one partition and your VALUES list for that partition has 100 entries. If a bigger set has size X, it gets a partition to itself. If another set has size (100 * X), then it gets its own list partition AND it gets 100 hash sub-partitions. Then everything is assured not to exceed some maximum size X, to avoid having any partitions that are so big they become hard for the database to manage effectively. The problem is that the bigger X is, the more small sets get stuck together in the same list. Then you have very big VALUES clauses, and I wonder if that makes trouble for Oracle.

Of course, you could just have more lists and let the small sets have relatively small partitions to themselves instead of packing them together like that. But there is a hard limit of 1 million partitions, so this is not good, scalability-wise. You "waste" a lot of potential space on smaller partitions, because of the opportunity cost of not making a larger one when there are a fixed number.

- All the rows originally arrive at the same time, but there may be some modifications after long intervals. There probably would not be a lot of modifications, because at that point you'd just make a new one and do a bulk insert.

"Why do you think partitoning the data will speed up your queries? WHat was your driver for deciding on partitioning? Have you seen some evidence that would make partitioning a better option?"

Well, as you guys can probably tell I'm no expert, so maybe I don't understand this. But I had thought that this was true. Suppose I have a lot of elements in set 12. If I don't use partitioning, they could be placed anywhere. If the table is very large, partitioning would allow Oracle to immediately know where the contiguous space containing all set 12 elements is (partition pruning), and this should lead to better performance. In this scenario, I can also have a local index for the set_id instead of a global one; isn't the performance of scanning a smaller index better? I tried it several different ways with the query planner, and it seemed to do better when I could use partition pruning.

Also, from what I understood, having extremely large unpartitioned indexes and tables is just a "bad idea" because the database cannot handle them well. Oracle is meant to work with objects of a certain size, and going well beyond that is a bad idea. Is that not correct? I had thought that was the whole point of partitioning.
Re: Best way to partition my table [message #330456 is a reply to message #330355] Mon, 30 June 2008 03:55
rleishman
Messages: 3728
Registered: October 2005
Location: Melbourne, Australia
Senior Member
Okay, let's clear up some misconceptions:

Say you partition your table and place 1000 SET_IDs of about the same size in one partition. If you scan for 1 set_id, a Full Partition Scan will read 1000x more data than it needs to; that's inefficient - you would be better off with an index.

For this model to work at all, a partition could contain no more than 100 set_ids of about the same size. Even then, the benefits would be marginal; you would need no more than 10 set_ids per partition to make the Full Scan option work.

If you reject the Full Scan option for most purposes, that leaves an index scan. Is it faster to scan a smaller partitioned index? Not really - at least not enough to make a difference.

But that doesn't really matter. You can still partition your index on SET_ID and your table on something else. This is a GLOBALLY PARTITIONED index. Locally Partitioned indexes have some advantages, but don't rule out Global indexes if they make sense.

Here's what I'd do (sketched in DDL below):
  • List partition the table on SET_ID
  • Create one partition per value for any SET_ID with more than 1M rows
  • Put EVERY other SET_ID in a DEFAULT partition (ie. you don't have to list the values)
  • Create a locally partitioned b-tree index on SET_ID. This will be useless for the non-default partitions, but Oracle will KNOW it is useless and use a full-partition-scan. But you will need to watch out for bind-variables.
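
In DDL, that looks roughly like this (the partition names and the specific large SET_ID values are invented for illustration):

    CREATE TABLE element_sets (
        element_id NUMBER NOT NULL,
        set_id     NUMBER NOT NULL
    )
    PARTITION BY LIST (set_id)
    (
        PARTITION p_set_12  VALUES (12),      -- a known set with more than 1M rows
        PARTITION p_set_507 VALUES (507),     -- another very large set
        PARTITION p_other   VALUES (DEFAULT)  -- every remaining SET_ID, no list to maintain
    );

    -- Locally partitioned b-tree on SET_ID: used for the DEFAULT partition,
    -- ignored (in favour of a full partition scan) for the single-value ones.
    CREATE INDEX element_sets_set_idx ON element_sets (set_id) LOCAL;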

This will give you b-tree indexed access to the smaller SET_IDs, and fast multi-block read access to the larger SET_IDs.

Also, since all rows are inserted together, you will get a good clustering effect in the DEFAULT partition, making the indexed read more efficient.

Don't list huge numbers of SET_IDs in partition clauses. It will only make Oracle parse your queries more slowly. It will have a similar effect to adding an enormous IN (list) to your SQL.

Ross Leishman
Re: Best way to partition my table [message #330688 is a reply to message #330307] Mon, 30 June 2008 16:32
coleing
Messages: 213
Registered: February 2008
Senior Member
Ross's suggestion here is a good one. I would look into it further to see if it fits your needs.

Don't forget, using this method you will still need a global index on element_id to perform your element_id-only scans.



Re: Best way to partition my table [message #330706 is a reply to message #330688] Mon, 30 June 2008 23:34
rleishman
Messages: 3728
Registered: October 2005
Location: Melbourne, Australia
Senior Member
Good point. Given your large volumes, you should consider hash partitioning that global index.
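
Something along these lines, for example (the index name and partition count are placeholders):

    CREATE UNIQUE INDEX element_sets_pk_idx ON element_sets (element_id)
        GLOBAL PARTITION BY HASH (element_id) PARTITIONS 16;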