Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
![]() |
![]() |
Home -> Community -> Usenet -> c.d.o.server -> Re: Help with sql query speed. Explain plan included
Charles Hooper schrieb:
> On Mar 18, 11:11 pm, "fitzjarr..._at_cox.net" <fitzjarr..._at_cox.net>
> wrote:
> Your results
> may vary on other platforms.
>
> Charles Hooper
> PC Support Specialist
> K&M Machine-Fabricating, Inc.
>
Some thoughts (Charles, sorry, i snipped your results, because my
posting came out too to long, but i'll reference sometimes execution
plans which you have posted):
I think, the even
distribution of data is rather untypical for real life ( i.e. 3 mio
residents, each of them answered approximately 90 questions). Maybe i am
wrong and the distribution in OP case is indeed more even, but i expect
it rather to be normal ( something like binomial - sorry for possibly
wrong terms, statistic is not my strength). I.e. i would expect
(for 9 buckets for example)
1 resident ansered 1 question
8 answered 2 questions
28 answered 3 questions
56 answered 4 questions
70 answered 5 questions
56 answered 6 questions
28 answered 7 questions
8 answered 8 questions
1 answered 9 question
where 1,8,28 ... are binomial coefficients like N!/(N-n)!*n! . Total
were 2**(9-1) customers which answered questionnaire. Well, for this
example i took 20 buckets ( gives me 524288 customers, within each
bucket 5 questions, results in 2621440 customers which is nearly the
same as by OP). To generate test data i prepared a driving table with
precalculated bucket sizes.
create table driving_table as
with t as (
select rownum rn,
ntile(20) over(order by level) - 1 asc_factorial_src, ntile(20) over(order by level desc) -1 desc_factorial_srcfrom dual connect by level <=100),
select rn,
power(2,(sum(log(2,decode(asc_factorial_src,0,1,asc_factorial_src))) over(partition by mod(rn,5)) - sum(log(2,decode(asc_factorial_src,0,1,asc_factorial_src))) over(partition by mod(rn,5) order by asc_factorial_src) - sum(log(2,decode(desc_factorial_src,0,1,desc_factorial_src))) over(partition by mod(rn,5) order by asc_factorial_src desc))) binomfrom t),
select rn, sum(binom) over(order by rn) high_resident_id
from t1)
select rn,
nvl(lag(high_resident_id) over(order by rn),0) + 1 low_resident_id,
high_resident_id
from t2
Then, test data itself could be generated with following query:
insert into t(residentid,questionnumber,answer) with t as (
select rownum question#
from dual connect by level <=100),
t1 as (
select rownum res_id
from dual connect by level <=2621440),
row_source as(
select res_id,
question#,
row_number() over(partition by res_id order by dbms_random.value)
question_id,
trunc(dbms_random.value(1,81)) answer_id
from t,t1)
select res_id,question_id,answer_id
from row_source r,
driving_table t
where res_id between low_resident_id and high_resident_id
and question# <= rn
(Actually, i splitted this insert into 3 chunks to get rid of temp/undo space). This will give exactly test data as i looked for - about 2.6 Mio residents answered normally distributed amount of random questions in range of 1-100 with random answers in range 1-80, in total about 130 Mio records (regarding flaws in my logic, corrections are always appreciated)
Now, how would be the data accessed.
To get set of pairs question/answer for a reference resident (1486674)
it is undoubtedly most optimal to do an index range scan. In total there
can be atmost 80*100 possible permutations query/answer. Some of them
are probably more frequent, some less ( i.e. q2/a34 occurs by 25% of
residents, q93/a77 only by 0.2%%). If reference resident has all pairs
from seldom distribution, then index range scan may be benefitial for
filter operation of outer query, if they are however frequent, then many
index lookups may be more worse, than FTS. Also, it would be
benefitially, if the data could be filtered easy by pairs q/a, but in
the same place index access (which is not cheap) would be eliminated,
somewhat like
q/a res_id 1/1 2,23,45,643,67 17/29 3,45,55,68,35555,2323677
So, one could get altogether all data of interest tied together accessing them over one keyed lookup per q/a key. I think, the most suitable data structure for that would be an index cluster. Here, what i tried:
create cluster t1_cluster(
questionnumber number,
answer number)
index;
create table t1(
residentid number,
questionnumber number,
answer number)
cluster t1_cluster(questionnumber,answer);
insert into t1 select * from t order by questionnumber,answer;
create index t1_idx on t1(residentid) nologging;
begin
dbms_stats.gather_schema_stats(user,estimate_percent=>5,cascade=>true);
end;
/
Here i cheated a little, to fill optimally the cluster. If the original table would be ordered by questionnumber,answer, then clustering factor of corresponding index would be much better and indexed access to heap table would be much cheaper. However, such ordering is not possible to maintaince because data will be incoming in probably some other order,i could set SIZE for cluster nearly same as blocksize and don't care about ordering by insert - in this case data in the cluster would still tied together within one block, so i'll need a negligible more io's to get them. At the last, the size of the cluster is significantly (about 27%) smaller compared to the original table, because the key (question/answer pair) is stored onyl once per block. Here are results:
SQL> select tx.residentid
2 from t1 t0,t1 tx
3 where t0.residentid!=tx.residentid
4 and t0.questionnumber=tx.questionnumber 5 and t0.answer=tx.answer 6 and t0.residentid=1486674 7 group by tx.residentid 8 having count(*)>5
RESIDENTID
2042580
1564161
1172778
2539150
1785664
2588591
462103
2419963
8 Zeilen ausgewahlt.
Abgelaufen: 00:00:01.53
(to get timings, i have reduced the number of common pairs to 5 simply to get some matches, with 10 it would be a bit more difficult)
SQL> explain plan for
2 select tx.residentid
3 from t1 t0,t1 tx
4 where t0.residentid!=tx.residentid
5 and t0.questionnumber=tx.questionnumber 6 and t0.answer=tx.answer 7 and t0.residentid=1486674 8 group by tx.residentid 9 having count(*)>10
EXPLAIN PLAN ausgefuhrt.
Abgelaufen: 00:00:00.05
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | --------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 42549 | 914K| | 3646 (2)| 00:00:44 | |* 1 | FILTER | | | | | | | | 2 | HASH GROUP BY | | 42549 | 914K| 27M| 3646 (2)| 00:00:44 | | 3 | NESTED LOOPS | | 850K| 17M| | 1434 (1)| 00:00:18 | | 4 | TABLE ACCESS BY INDEX ROWID| T1 | 51 | 561 | | 56 (0)| 00:00:01 | |* 5 | INDEX RANGE SCAN | T1_IDX | 51 | | | 4 (0)| 00:00:01 | |* 6 | TABLE ACCESS CLUSTER | T1 | 16550 | 177K| | 27 (0)| 00:00:01 | |* 7 | INDEX UNIQUE SCAN | T1_CLUSTER_IDX | 1 | | | 0 (0)| 00:00:01 | ---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
1 - filter(COUNT(*)>10) 5 - access("T0"."RESIDENTID"=1486674) 6 - filter("T0"."RESIDENTID"<>"TX"."RESIDENTID" AND"TX"."RESIDENTID"<>1486674)
7 - access("T0"."QUESTIONNUMBER"="TX"."QUESTIONNUMBER" AND "T0"."ANSWER"="TX"."ANSWER") Best regards
Maxim Received on Wed Mar 21 2007 - 14:42:04 CDT
![]() |
![]() |