Re: In an RDBMS, what does "Data" mean?

From: Anthony W. Youngman <wol_at_thewolery.demon.co.uk>
Date: Sun, 20 Jun 2004 00:27:56 +0100
Message-ID: <4AY57rd8vM1AFw1w_at_thewolery.demon.co.uk>


In message <M-6dnRspMqlopEndRVn-hQ_at_comcast.com>, Laconic2 <laconic2_at_comcast.net> writes
>
>"Tony" <andrewst_at_onetel.net.uk> wrote in message
>news:c0e3f26e.0406190351.dcb2cdc_at_posting.google.com...
>> "Anthony W. Youngman" <wol_at_thewolery.demon.co.uk> wrote in message
>news:<L4l1l0HM7y0AFwZd_at_thewolery.demon.co.uk>...
>> > Was it Knuth that said "premature optimisation is the worst evil"?
>>
>> I don't know, but YOU are the one who wants to optimise prematurely,
>> i.e. while designing the database. The relationalists prefer to
>> optimise at the last possible moment, i.e. when we know what the query
>> is.
>
>Not only when we know the query, but also when we know, approximately, the
>data volumes.
>
>Given that access strategy costs are non linear, this is an important input
>to optimization.
>
Okay. Let's try to explain. It's statistics, so it's totally outside relational theory (statistics is fuzzy logic, after all :-)

Let's assume we've got an invoice file, with a thousand invoices. Each invoice has ten detail lines.

Your app gets one detail line from the table. The database grabs another line as a pre-fetch ...

What's the chance that the next thing your app asks for is either another line, or the invoice header? Quite high. If it's another line, what's the chance that the one it asks for is the one the db has prefetched. Easy - it's one in ten thousand. But if, by chance or design, it happens to belong to the same invoice as the first line, the chances have improved to one in ten or twenty - a massive improvement.

Look at your apps. How many data fetches are "random" (ie depend on external input), and how many are of "related data", ie given that you have one record already, you're retrieving data that shares a "foreign key" relationship with the data you already have. I'd guess that, typically, for every one of the first type you have a hundred, maybe more, of the second.

So, by actively clustering related data together, you could massively improve database performance. In other words, you'd be copying what Pick achieves by default.

Let's phrase it in a different way. Those invoice detail lines can be considered as a bag of lists. By default, relational will treat them as a random set. It has a one-in-ten-thousand chance of guessing the next one correctly, unless it has a clever optimisation engine or the DBA gives it a hint.

Pick by default knows it's a bag of lists. If the app really does ask for a random "next line", Pick's worst case is the same as relational. But if, as is normally the case, the next line comes from the same invoice, Pick's DEFAULT chance of getting it right is the same as the chance of two consecutive lines coming from the same invoice - ie pretty high.

So Pick's worst case is equal to relational's worst case - in the highly UNlikely scenario of random data access. But in the normal case, of accessing rows that are somehow linked, relational relies on AI or the DBA "hinting" to the database. Pick just does it that way "by default".

And that's why ALL the anecdotes I've ever seen say that Pick outperfoms relational by a huge margin.

-- 
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
HEX wondered how much he should tell the Wizards. He felt it would not be a
good idea to burden them with too much input. Hex always thought of his reports
as Lies-to-People.
The Science of Discworld : (c) Terry Pratchett 1999
Received on Sun Jun 20 2004 - 01:27:56 CEST

Original text of this message