Re: Help with sql query speed. Explain plan included

From: Maxim Demenko <mdemenko_at_gmail.com>
Date: Wed, 21 Mar 2007 20:42:04 +0100
Message-ID: <46018A8C.9050009@gmail.com>

Charles Hooper schrieb:
> On Mar 18, 11:11 pm, "fitzjarr..._at_cox.net" <fitzjarr..._at_cox.net>
> wrote:

> Your results
> may vary on other platforms.
>
> Charles Hooper
> PC Support Specialist
> K&M Machine-Fabricating, Inc.
>

Some thoughts (Charles, sorry, i snipped your results, because my posting came out too to long, but i'll reference sometimes execution plans which you have posted):
I think, the even
distribution of data is rather untypical for real life ( i.e. 3 mio residents, each of them answered approximately 90 questions). Maybe i am wrong and the distribution in OP case is indeed more even, but i expect it rather to be normal ( something like binomial - sorry for possibly wrong terms, statistic is not my strength). I.e. i would expect (for 9 buckets for example)
1 resident ansered 1 question
8 answered 2 questions
28 answered 3 questions
56 answered 4 questions
70 answered 5 questions
56 answered 6 questions
28 answered 7 questions
8 answered 8 questions
1 answered 9 question
where 1,8,28 ... are binomial coefficients like N!/(N-n)!*n! . Total were 2**(9-1) customers which answered questionnaire. Well, for this example i took 20 buckets ( gives me 524288 customers, within each bucket 5 questions, results in 2621440 customers which is nearly the same as by OP). To generate test data i prepared a driving table with precalculated bucket sizes.

create table driving_table as
with t as (

select rownum rn,

     ntile(20) over(order by level) - 1 asc_factorial_src,
     ntile(20) over(order by level desc)  -1 desc_factorial_src

from dual connect by level <=100),
t1 as (

select rn,

     power(2,(sum(log(2,decode(asc_factorial_src,0,1,asc_factorial_src)))
     over(partition by mod(rn,5)) -
     sum(log(2,decode(asc_factorial_src,0,1,asc_factorial_src)))
     over(partition by mod(rn,5) order by asc_factorial_src) -
     sum(log(2,decode(desc_factorial_src,0,1,desc_factorial_src)))
     over(partition by mod(rn,5) order by asc_factorial_src desc))) binom

from t),
t2 as (

select rn, sum(binom) over(order by rn) high_resident_id from t1)
select rn,

nvl(lag(high_resident_id) over(order by rn),0) + 1 low_resident_id, high_resident_id
from t2

Then, test data itself could be generated with following query:

insert into t(residentid,questionnumber,answer) with t as (

select rownum question#
from dual connect by level <=100),
t1 as (

select rownum res_id
from dual connect by level <=2621440), row_source as(

   select res_id,
   question#,
   row_number() over(partition by res_id order by dbms_random.value) question_id,

trunc(dbms_random.value(1,81)) answer_id from t,t1)
select res_id,question_id,answer_id
from row_source r,

driving_table t
where res_id between low_resident_id and high_resident_id and question# <= rn

(Actually, i splitted this insert into 3 chunks to get rid of temp/undo space). This will give exactly test data as i looked for - about 2.6 Mio residents answered normally distributed amount of random questions in range of 1-100 with random answers in range 1-80, in total about 130 Mio records (regarding flaws in my logic, corrections are always appreciated)

Now, how would be the data accessed.
To get set of pairs question/answer for a reference resident (1486674) it is undoubtedly most optimal to do an index range scan. In total there can be atmost 80*100 possible permutations query/answer. Some of them are probably more frequent, some less ( i.e. q2/a34 occurs by 25% of residents, q93/a77 only by 0.2%%). If reference resident has all pairs from seldom distribution, then index range scan may be benefitial for filter operation of outer query, if they are however frequent, then many index lookups may be more worse, than FTS. Also, it would be benefitially, if the data could be filtered easy by pairs q/a, but in the same place index access (which is not cheap) would be eliminated, somewhat like

q/a	res_id
1/1	2,23,45,643,67
17/29	3,45,55,68,35555,2323677

and so on...

So, one could get altogether all data of interest tied together accessing them over one keyed lookup per q/a key. I think, the most suitable data structure for that would be an index cluster. Here, what i tried:

create cluster t1_cluster(
questionnumber number,
answer number)
index;

create table t1(
residentid number,
questionnumber number,
answer number)
cluster t1_cluster(questionnumber,answer);

insert into t1 select * from t order by questionnumber,answer;

create index t1_idx on t1(residentid) nologging;

begin
dbms_stats.gather_schema_stats(user,estimate_percent=>5,cascade=>true); end;
/

Here i cheated a little, to fill optimally the cluster. If the original table would be ordered by questionnumber,answer, then clustering factor of corresponding index would be much better and indexed access to heap table would be much cheaper. However, such ordering is not possible to maintaince because data will be incoming in probably some other order,i could set SIZE for cluster nearly same as blocksize and don't care about ordering by insert - in this case data in the cluster would still tied together within one block, so i'll need a negligible more io's to get them. At the last, the size of the cluster is significantly (about 27%) smaller compared to the original table, because the key (question/answer pair) is stored onyl once per block. Here are results:

SQL> select tx.residentid

2 from t1 t0,t1 tx
3 where t0.residentid!=tx.residentid

   4     and  t0.questionnumber=tx.questionnumber
   5     and t0.answer=tx.answer
   6     and t0.residentid=1486674
   7      group by tx.residentid
   8       having count(*)>5

9 /

RESIDENTID

    2042580
    1564161
    1172778
    2539150
    1785664
    2588591
     462103
    2419963

8 Zeilen ausgewahlt.

Abgelaufen: 00:00:01.53

(to get timings, i have reduced the number of common pairs to 5 simply to get some matches, with 10 it would be a bit more difficult)

SQL> explain plan for

   2 select tx.residentid
   3 from t1 t0,t1 tx
   4 where t0.residentid!=tx.residentid

   5     and  t0.questionnumber=tx.questionnumber
   6     and t0.answer=tx.answer
   7     and t0.residentid=1486674
   8      group by tx.residentid
   9       having count(*)>10

10 /

EXPLAIN PLAN ausgefuhrt.

Abgelaufen: 00:00:00.05
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT

Plan hash value: 2982241237

| Id  | Operation                      | Name           | Rows  | Bytes
|TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                | 42549 |
914K|       |  3646   (2)| 00:00:44 |
|*  1 |  FILTER                        |                |       |
|       |            |          |
|   2 |   HASH GROUP BY                |                | 42549 |
914K|    27M|  3646   (2)| 00:00:44 |
|   3 |    NESTED LOOPS                |                |   850K|
17M|       |  1434   (1)| 00:00:18 |
|   4 |     TABLE ACCESS BY INDEX ROWID| T1             |    51 |   561
|       |    56   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | T1_IDX         |    51 |
|       |     4   (0)| 00:00:01 |
|*  6 |     TABLE ACCESS CLUSTER       | T1             | 16550 |
177K|       |    27   (0)| 00:00:01 |
|*  7 |      INDEX UNIQUE SCAN         | T1_CLUSTER_IDX |     1 |
|       |     0   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

    1 - filter(COUNT(*)>10)
    5 - access("T0"."RESIDENTID"=1486674)
    6 - filter("T0"."RESIDENTID"<>"TX"."RESIDENTID" AND

"TX"."RESIDENTID"<>1486674)

7 - access("T0"."QUESTIONNUMBER"="TX"."QUESTIONNUMBER" AND "T0"."ANSWER"="TX"."ANSWER") Best regards

Maxim Received on Wed Mar 21 2007 - 14:42:04 CDT