Home » RDBMS Server » Performance Tuning » Bloom filter joins
Bloom filter joins [message #608516] Thu, 20 February 2014 07:15 Go to next message
Kevin Meade
Messages: 2103
Registered: December 1999
Location: Connecticut USA
Senior Member
PRE-Join Filtering with Bloom Filters on EXADATA

A bloom filter is an older idea (around 1970) I would guess invented by some person named BLOOM (Burton Bloom it turns out (I checked (thanks internet))). Sticking with ideas, the idea of this thing called a bloom filter is: given a set of values and a specific value you have in hand, can you tell if the value in hand exists in your set and do so in a fast and cheap way? What makes a bloom filter useful is that it is an imperfect strategy which by accepting some amount of controlled imperfection, allows us to make a very cheap and very fast check which might return very useful information.

Here are the primary characteristics of a bloom filter.

• The time needed to check a list to see if it contains a specific value does not depend upon the number of entries in the list.  This means we can have some truly large lists and still be able to get an answer real quick.  FAST.

• It is very small in size compared to the amount of data it covers.  Again this allows for very large lists without large resources needed to be able to check that list.  CHEAP.

• A bloom filter is never wrong when it says a value is NOT in a list, so this is a certainty that can be relied upon.  USEFUL.

• But a bloom filter cannot always tell that a value is NOT in a list and so can return a false positive.  How often this happens is controllable and is a tradeoff between FAST AND CHEAP vs. FREQUENCE OF NOT KNOWING.  IMPERFECT.


So how can a structure that has these behavioral characteristics be useful? Oracle uses them everywhere. But think joins. A join is essentially a pairing of lists. As the first table in a hash join is hashed in memory, a bloom filter list is constructed using its hash join keys. The database server can pass this bloom filter list to EXADATA for its use when reading the second table in the hash join. As EXADATA reads rows from the second table in the join it does a bloom filter check of the list made from the first table. If the bloom filter check says the join key from the second table's row is not in the bloom filter list made from the first table's join keys, then it is certain that the row from the second table does not join to any row in the first table and it can be thrown away. Such rows do not make the trip from disk to database server with the corresponding savings accrued.

It turns out that bloom filters can be made so fast and so cheap that it would be almost moronic not to use them even in situations where the expectation is that very few "throw the row away" determinations will be made. Tossing out even a few rows can be enough to justify using a bloom filter.

So how do you exploit it? You don't really. It just happens and it is not like you can change your code in some way to make more of less use of them. But you can't really talk about SMARTSCAN and not mention it because it is such a cool mechanism.

One thing to keep in mind here is that a pre-join bloom filter is not doing any joins. It is only checking if a row can be excluded from a join because it has no match. Failing to exclude the row only means that the row in sent back to the database server so that the true join can be done there. This is why it is called a pre-join filter.

Kevin
Re: Tuning query without where clause [message #608542 is a reply to message #608516] Thu, 20 February 2014 14:02 Go to previous messageGo to next message
John Watson
Messages: 8960
Registered: January 2010
Location: Global Village
Senior Member
The mathematics behind Bloom filters is interesting, and I suspect that Oracle does not always get it right. I have had interesting times trying to tune Exadata installations to get Bloom joins. I think the problem is that the CBO is not really Exadata aware. The decision to use offload processing is made by the SQL execution engine (not by the CBO) for each execution.

Things to think about:
First, you must use parallel processing: you will never get a Bloom join for a serial query. Second, there are two hints: the documented hint is /*+ px_join_filter(emp dept) */ but there is another undocumented hint, /*+ use_hash(emp dept) bloom join */ that sometimes seems to work better. Third, no matter what you do with hints, Oracle will sometimes refuse to do a Bloom join.

I would be very interested to know other people's experience with this.
Re: Tuning query without where clause [message #608543 is a reply to message #608542] Thu, 20 February 2014 14:04 Go to previous messageGo to next message
John Watson
Messages: 8960
Registered: January 2010
Location: Global Village
Senior Member
I've split this topic off, it was a bit of a thread hijack.
Re: Tuning query without where clause [message #608607 is a reply to message #608542] Fri, 21 February 2014 05:56 Go to previous messageGo to next message
msol25
Messages: 396
Registered: June 2011
Senior Member
Kevin Sir,

Nice Info for Bloom Filter........ Smile
Re: Tuning query without where clause [message #608669 is a reply to message #608607] Fri, 21 February 2014 20:16 Go to previous message
Kevin Meade
Messages: 2103
Registered: December 1999
Location: Connecticut USA
Senior Member
Thanks. But John points out some good things. My understanding of Bloom Filters comes mostly from EXADATA and research online. I am kind of like today's scientists. They don't really understand the true nature of things, only what they can observe indirectly. Like finding exoplanets. You can't really see them; you calculate a star wobble using some really fancy math and figure: "Oh there must be a planet near by". John knows the internals. His knowledge comes with being an Oracle Certified Specialist and Oracle Certified Master and Oracle Certified Expert multiple times over.

I talk about them in hopes of introducing the "Common Man" to them, from a black box perspective.

1. you can't really change what they do or when they get used.

2. they are a pre-join filter thing, not an actual join.  There is considerable confusion here based on how the BF is described in several texts and articles.  These loose statements suggest that EXADATA SMARTSCAN is doing some kind of join on the disk drives.  But in fact this is not at all true.  The BF is only removing rows it knows have no match to the other half of the join.  The join still occurs on the database server in EXADATA.


These specifics became important to me in understanding what SMARTSCAN was and how I could exploit and just as importantly how I could not exploit it. So I figure it will mean something to others as well.

Kevin
Previous Topic: Tuning query without where clause
Next Topic: AWR top events:: enq: TX - row lock contention
Goto Forum:
  


Current Time: Thu Nov 21 11:19:53 CST 2024