Re: Functions and Relations

From: vc <boston103_at_hotmail.com>
Date: 21 Nov 2006 11:46:04 -0800
Message-ID: <1164138363.954387.10480_at_f16g2000cwb.googlegroups.com>


Aloha Kakuikanu wrote:
> Sampo Syreeni wrote:
> > Suppose you want an equijoin f(x)=y where x and y are in different
> > relations. Most optimizers throw away any statistics on x because of the
> > function application or use some approximate statistic derived from x.
> > With high cardinality on both attributes, the plan tree might become
> > something like hash_join(f(x), y).
>
> This is a great example of how optimizer can leverage new access path.
> But first let's refine it a little bit.
>
> Hash join is similar to nested loop in a sense that it creates a
> temporary join index structure. (OK, this index is technically a hash
> table). Hash join is an ugly artifact of physical world where we have
> random io inferior to sequential io.

I think multiblock vs. single block IO for an index access path for NLs is also an important factor, maybe more important than IO randomness depending on a specific IO/RAID system of course.

> The implementation progresses to
> eliminating this gap completely: with solid state disc technology
> nested loops is never slower than hash join.

With solid state disks, reading large portions of memory vs. smaller chunks may still make hash joins more performant(performance complexity  O(n) for NL may be higher than O(m+n) for HJ where m and n are relation cardinalities).

>Therefore, from
> theoretical perspective we can focus on Nested loops join solely.
>
> Now returning to the main discussion. The query in question:
>
> select x,y from T
> where x=y and y=f(x)
>
> The new access path is to evaluate
>
> 'x=y' /\ 'y=f(x)`
>
> somehow, and if this relation is finite and small, then we can join T
> via indexed nested loops!

How do you suggest to materialize 'f(x)' as a relation and why do you think the join will be more performant than just calculating 'f(x)' per each row ?

>
> > If the function is treated as a relation, the machinery is already there
> > in the form of table statistics. In this case they just need to be
> > filled in at query compilation or dynamic sampling time.
>
> One more reason, indeed. We can sample a function as a normal relation
> and derive statistics.

Oracle statistics for example provide food for the cost based optimizer which is mainly interested in IO. Well, versions 9i and above take into account CPU as well, but I believe IO is still the optimizer staple. I am not sure how representing a function as a relation will improve the optimizer bahavior. Could you please explain ?

> Even if there is no new access paths, we'll
> still benefit from more accurate statistics.

How exactly ? Received on Tue Nov 21 2006 - 20:46:04 CET

Original text of this message