Re: Need expert help... a challenging query

From: <joe_celko_at_my-deja.com>
Date: Wed, 15 Dec 1999 22:13:24 GMT
Message-ID: <8393pv$n8t$1@nnrp1.deja.com>

>> Quick question... how can I make the query "find all records with
(foo > x), sort the results by 'bar', and give me only the top 10" as efficient as possible, given that there are millions of records with (foo > x)? <<

This is not a quick question -- it is a question that makes SQL teachers cry. There is no sorting in SQL; it is a non-procedural, set- oriented language.

Then you never brother to define "top 10" in a meaningful way. If I have nine people, do I declare "no contest" and quit? How do you want to handle ties?

>> Do I just build indices for 'foo' and 'bar' and hope for the best? <<

Did you ever read the language specifications? There are no indexes in SQL and never have been. That is an implementation choice and the VLDB products do not even store tables as contigous storage. You are still thinking you have a file system circa 1960, probably tape or simple disk files. Besides you are a programmer and cannot do a thing about indexes anyway.

>> My concern there is that the database might use the 'foo' index to
do a very fast retrieval of everything with foo > x, but then waste a bunch of time sorting the entire result set via some n*log(n) sorting routine (possibly using the 'bar' index, but still in n*log(n) time) and then throw away the bottom 99.999% of the results. Am I correct in assuming this is the default algorithm the database would use? <<

Yes, an index will help to retrieve data faster in most cases where teh subset is smaller than a threshhold percentage. Who cares? You cannot change the optimizer -- it's a non-procedural language.

>> If not, then that is the default algorithm (or its time complexity,
at least); and if o, then how do I override it with something that won't waste so much time sorting results that will be discarded? <<

You cannot change the optimizer -- it's a non-procedural language.

>> Any help would be much appreciated... <<

Get a good book on basic relational concepts or use a report write and procedural code for reports and procedures. Now having lambasted you for free instead of at my usual insanely high consulting rates, try this suggestion. I have a whole chapter in JOE CELKO'S SQL FOR SMARTIES (second edition, Morgan-Kaufmann, 1999) on this sort of query.

The best answer given a subquery to establish a subset based on a count. The idea is take each salary and build a group of other salaries which are less than or equal to it. The groups with three or fewer rows is what we want to see.

This query gives a columnar answer, and the query can be extended to other numbers by changing the constant in the HAVING clause.

SELECT DISTINCT salary
FROM Personnel AS P1
WHERE :n >= (SELECT COUNT(*) - 1 -- control parameter

               FROM Personnel AS P2
              WHERE P1.salary < P2.salary)

An equivalent version can also be done with a GROUP BY clause, but you need to use a COUNT(DISTINCT x) operator to handle duplicates. Here is that code.

SELECT P1.salary
FROM Personnel AS P1, Personnel AS P2 WHERE P1.salary <= P2.salary
GROUP BY P1.salary
HAVING COUNT(DISTINCT P2.salary) <= :n; -- control parameter

The performance of these two approaches will vary with each SQL implementation and with the available indexing. It is probably best to experiment with both.

--CELKO-- Sent via Deja.com http://www.deja.com/
Before you buy. Received on Wed Dec 15 1999 - 16:13:24 CST