Improving a "DISTINCT" query on a column with skewed data [message #65778] |
Wed, 22 December 2004 08:36 |
Shane Kaszyca
Messages: 16 Registered: May 2003
|
Junior Member |
|
|
I am trying to perform a query for distinct values on a column in one of my tables. There are about 16000 rows in the table and about 30 distinct values total. The data distribution appears at the bottom of this message. The query is performing a full table scan. I have tried putting an index on the column but that doesn't seem to work. Basically, I am doing a SELECT DISTINCT value FROM table. Any idea how I can prevent a Full Table Scan? It's costing me 4 seconds - which I can't afford. The data is skewed to one of the values - It appears about 84 percent of the time. Regards,
Shane Kaszyca
-value- -count-
value 1 38
value 2 113
value 3 70
value 4 436
value 5 172
value 6 48
value 7 1
value 8 1
value 9 117
value 10 1
value 11 129
value 12 13728
value 13 75
value 14 70
value 15 21
value 16 109
value 17 348
value 18 71
value 19 44
value 20 158
value 21 119
value 22 1
value 23 69
value 24 130
value 25 58
value 26 194
value 27 130
value 28 4
value 29 1
value 30 50
|
|
|
Re: Improving a "DISTINCT" query on a column with skewed data [message #65779 is a reply to message #65778] |
Wed, 22 December 2004 12:15 |
Art Metzer
Messages: 2480 Registered: December 2002
|
Senior Member |
|
|
----------------------------------------------------------------------
SQL> SELECT description
2 , qty
3 , LPAD(TO_CHAR(RATIO_TO_REPORT(qty)
4 OVER () * 100
5 , 'fm90.00')
6 , 5) "%AGE"
7 FROM (SELECT description
8 , COUNT(*) qty
9 FROM t
10 GROUP BY description)
11 ORDER BY TO_NUMBER(SUBSTR(description,INSTR(description,' ') + 1))
12 /
DESCRIPTION QTY %AGE
--------------- ---------- -----
value 1 38 0.23
value 2 113 0.69
value 3 36 0.22
value 4 436 2.67
value 5 172 1.05
value 6 48 0.29
value 7 1 0.01
value 8 1 0.01
value 9 117 0.72
value 10 2 0.01
value 11 129 0.79
value 12 13728 84.18
value 13 75 0.46
value 14 34 0.21
value 15 21 0.13
value 16 109 0.67
value 17 348 2.13
value 18 71 0.44
value 19 44 0.27
value 20 158 0.97
value 21 119 0.73
value 22 1 0.01
value 23 69 0.42
value 24 66 0.40
value 25 58 0.36
value 26 194 1.19
value 27 64 0.39
value 28 4 0.02
value 29 1 0.01
value 30 50 0.31
30 rows selected.
SQL>
----------------------------------------------------------------------
Doing this DISTINCT should not take four seconds. In my testing, I had no indexes, and I was able to get a response in about 60 milliseconds.
That said, I don't see how you can do a DISTINCT without doing a full table scan. I mean, think about it: if you're not doing a FTS on a DISTINCT operation, then the row(s) that you hadn't table-scanned might have held that additional, unique value (and you would get back a wrong answer). And full-scanning an index wouldn't give that much performance boost over full-scanning the table, especially if the table is skinny.
I ran tkprof on three approaches I could think of to this problem. Those results are listed below.********************************************************************************
<font color=red>SELECT DISTINCT description
FROM
t</font>
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.06 0.06 43 46 12 30
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.06 0.06 43 46 12 30
Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 217
Rows Row Source Operation
------- ---------------------------------------------------
30 SORT UNIQUE
16307 TABLE ACCESS FULL T
********************************************************************************
********************************************************************************
<font color=#663366>SELECT description
FROM
(SELECT description, COUNT(*) FROM t GROUP BY description)</font>
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.06 0.06 0 46 12 30
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.06 0.06 0 46 12 30
Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 217
Rows Row Source Operation
------- ---------------------------------------------------
30 SORT GROUP BY
16307 TABLE ACCESS FULL T
********************************************************************************
********************************************************************************
<font color=blue>SELECT description
FROM
(
SELECT description
, ROW_NUMBER()
OVER (PARTITION BY description
ORDER BY NULL) r
FROM t
)
WHERE r = 1</font>
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.37 0.37 66 46 18 30
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.37 0.37 66 46 18 30
Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 217
Rows Row Source Operation
------- ---------------------------------------------------
30 VIEW
16307 WINDOW SORT
16307 TABLE ACCESS FULL T
******************************************************************************** As you can see, the first two approaches perform similarly in 0.06 seconds, but the third one is a throwaway.
What kind of tkprof results do you get using your data?
And remember, full table scans are not always evil, indexes are not always good.
|
|
|
|