Re: Help with sql query speed. Explain plan included

From: Charles Hooper <hooperc2000_at_yahoo.com>
Date: 15 Mar 2007 10:53:37 -0700
Message-ID: <1173981217.888095.239950@n76g2000hsh.googlegroups.com>

On Mar 15, 9:14 am, JL19..._at_HOTMAIL.CO.UK wrote:
> Hello,
>
> I have a query about how to improve the speed of our sql and am asking
> for advice.
>
> My real system is complicated and relates to a call center, automated
> options chosen, and time on line using an automated telephone system.
> Rather than having to describe all the complicated bits, I have made
> up a situation using census data instead because we can all understand
> it.
>
> I have itemized the points of the made up situation. Here they are.
>
> 1. The data is census information.
> 2. There are 3 million people in the country.
> 3. Each person has to fill out a census form.
> 4. There are 80 questions on each census form.
> 5. Each question on the census form has an answer that is between 1
> and 100.
> 6. The country residents do not have to answer every question.
> 7. If they do not answer a question, a row in the table is not
> created.
>
> The information is stored in an Oracle database with the following
> structure.
>
> residentID questionNumber answer
> 1 1 10
> 1 2 45
> 1 3 67
> 1 4 55
> 2 1 15
> 2 2 56
> 2 3 12
>
> etc
>
> We need a query that that reads "give us the resident ID of any person
> that has answered more than 10 questions the same as the resident with
> ID number 1486674"
>
> Here is our sql.
>
> select tx.residentid
> from test1 t0,test1 tx
> where t0.residentid!=tx.residentid
> and t0.questionnumber=tx.questionnumber
> and t0.answer=tx.answer
> and t0.residentid=1486674
> group by tx.residentid
> having count(*)>10;
>
> The problem with this query is that it only returns 177 rows yet takes
> "Elapsed: 00:01:10.75" which is about 70 seconds.
>
> The only thing that is important to us is query speed. If the data
> needs restructuring to get speed then this is OK. But firstly is there
> a better SQL to this without restructuring of data.
>
> The plan statistics are calculated on 100% of rows in all tables with
> cascade at 2am every morning. They are up to date.
>
> Here is my plan. Index idxe has been created with "create unique index
> idxe on test1(residentid,questionnumber,answer)". Just in case it
> matters, there are no other table indexes but can easily add them if
> it improves performance.
>
> We use 10.2 on a Solaris V245.
>
> Thank you for helping in advance.
> John
>
> --------------------------------------------------------------------------------------
> | Id | Operation | Name | Rows | Bytes |TempSpc| Cost
> (%CPU)| Time |
> --------------------------------------------------------------------------------------
> | 0 | SELECT STATEMENT | | 33322 | 715K| | 26691
> (7)| 00:05:21 |
> |* 1 | FILTER | | | |
> | | |
> | 2 | HASH GROUP BY | | 33322 | 715K| 21M| 26691
> (7)| 00:05:21 |
> |* 3 | HASH JOIN | | 666K| 13M| | 24919
> (7)| 00:05:00 |
> |* 4 | INDEX RANGE SCAN | IDXE | 22 | 242 | | 3
> (0)| 00:00:01 |
> |* 5 | TABLE ACCESS FULL| TEST1 | 37M| 398M| | 24428
> (6)| 00:04:54 |
>
> PLAN_TABLE_OUTPUT
> ---------------------------------------------------
> Predicate Information (identified by operation id):
> ---------------------------------------------------
>
> 1 - filter(COUNT(*)>10)
> 3 - access("T0"."QUESTIONNUMBER"="TX"."QUESTIONNUMBER" AND
> "T0"."ANSWER"="TX"."ANSWER")
> filter("T0"."RESIDENTID"<>"TX"."RESIDENTID")
> 4 - access("T0"."RESIDENTID"=1486674)
> 5 - filter("TX"."RESIDENT"<>1486674)
>
> 22 rows selected.

I agree with much of what David Fitzjarrell stated regarding the source of the problem and how possibly to correct the problem. You are running Oracle 10g R2, which is performing a hash group by, so there is no sort required by the GROUP BY. David mentioned the SORT_AREA_SIZE parameter - while there is no sort indicated by the explain plan, the HASH_AREA_SIZE by default is set to twice the SORT_AREA_SIZE, so the parameter does have some relevance to your case. The index on the RESIDENTID column is preventing two full tablescans, but still allows one to occur, which results in 22 rows being hash joined to 37,000,000 rows (average of 11 bytes per row). The resulting hash join returns 666,000 rows and requires 13MB, and during the GROUP BY operation 21MB is spilled to the temporary tablespace to return 33,322 rows.

Most of the time is likely spent in returning the 37,000,000 rows (likely from disk) and for performing the multi-pass hash join and hash group by. You can set event 10104 generate a trace file to help determine the effective HASH_AREA_SIZE needed to prevent the hashing operations from spilling to disk. Your primary key index on the table is likely RESIDENTID and QUESTIONNUMBER, and it appears that you created another index on the RESIDENTID column. I believe that someone else suggested a composite index on the QUESTIONNUMBER and ANSWER columns - that appears to be a very good idea to reduce the 37,000,000 rows that are entering the hash join (only 2% of the rows would be passed on to the hash join).

Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc. Received on Thu Mar 15 2007 - 12:53:37 CDT