Re: How to get a distinct count of result set of multople table joins?

From: Charles Hooper <hooperc2000_at_yahoo.com>
Date: Wed, 27 Feb 2008 19:03:11 -0800 (PST)
Message-ID: <a5cb644b-5935-4272-8d57-d57a446b3c39@i7g2000prf.googlegroups.com>


On Feb 27, 5:43 pm, "Chris ( Val )" <chris..._at_gmail.com> wrote:

> On Feb 28, 3:27 am, Charles Hooper <hooperc2..._at_yahoo.com> wrote:
> > I suspect that Oracle is taking a long time to remove the duplicate
> > rows due to a combination of the number of rows, the number of columns
> > (you are specifying to retrieve all columns from the tables), and the
> > amount of memory available for sorting unique (or hash unique) the
> > rows to produce a unique list of rows.  Are you able to better define
> > the columns that are interest, and eliminate any columns that are
> > common between the various tables.  
>
> Unfortunatley I did not write the SQL, and I don't understand the
> data requirements enough to manipulate it. I can ask for it to be
> changed, but that can take ages to happen :)
>
> > I have found in some cases on
> > Oracle 10.2.0.2 that listing all columns in a GROUP BY allows Oracle
> > to retrieve a distinct list of rows more quickly than by using the
> > DISTINCT syntax.
> >
> > In the second of your examples, I do not see where you are using the
> > UNION syntax - are you trying to do something different in that
> > example?  
>
> Sorry, I meant that I tried to use the same query as I just posted to
> Mark,
> but with the COUNT statement added as in my first post.
>
> I am having trouble getting the right count returned.
> >
> > Explain plan, or better yet a DBMS Xplan showing the predicted and
> > actual results would be helpful, as requested by Ed and Mark.  If the
> > explain plan shows nothing useful, turn on a 10046 trace at level 8
> > and see what is happening behind the scenes.
>
> Even though this is a DEV database, I am not allowed to go to those
> extreams.
>
> My job is to automate a process of extracting data, of which this SQL
> was handed
> to me to use in my script.
>
> Thanks for your help.
>
> --
> Chris

Looking at the plan, 13 tables were full table scanned producing an estimated 19,222,632 bytes in 24,271 rows, using 12 hash joins for each half of the UNION when the duplicates are removed. You indicate that Oracle is actually returning 700,000 rows before the unique count is applied. This seems to imply that Oracle is actually dealing with roughly 554,400,000 bytes for each half of the UNION when the duplicates are removed. Your SQL statement is likely making significant use of the TEMP tablespace. You may be able to improve performance by increasing the memory available for sorting and hash joins. Assuming that WORKAREA_SIZE_POLICY is set to AUTO, you might be able to improve performance a little by having the DBA increase the PGA_AGGREGATE_TARGET. To see if the above estimates are close, you might try running DBMS_XPLAN for the query. See the following link for an example:   http://jonathanlewis.wordpress.com/2006/11/09/dbms_xplan-in-10g/

I think that I now understand what you originally posted. The two queries should be returning the same number of rows before the count, since the UNION syntax removes duplicate rows automatically. Note the:
  "SORT (UNIQUE)" in your execution plan.

Here is a test case that demonstates the performance improvement suggestions that I made earlier:
CREATE TABLE T1 (

  C1 VARCHAR2(30),
  C2 VARCHAR2(30),
  C3 VARCHAR2(30));

CREATE TABLE T2 (

  C1 VARCHAR2(30),
  C2 VARCHAR2(30),
  C3 VARCHAR2(30));

CREATE TABLE T3 (

  C1 VARCHAR2(30),
  C2 VARCHAR2(30),
  C3 VARCHAR2(30));

INSERT INTO
  T1
SELECT

  TO_CHAR(SIN(ROWNUM*3.1415/180)*10000,'0000000000'),
  TO_CHAR(COS(ROWNUM*3.1415/180)*10000,'0000000000'),
  TO_CHAR(TAN(ROWNUM*3.1415/180)*10000,'0000000000')
FROM
  DUAL
CONNECT BY
  LEVEL<=400000;

INSERT INTO
  T2
SELECT

  TO_CHAR(SIN(ROWNUM*4*3.1415/180)*10000,'0000000000'),
  TO_CHAR(COS(ROWNUM*4*3.1415/180)*10000,'0000000000'),
  TO_CHAR(TAN(ROWNUM*4*3.1415/180)*10000,'0000000000')
FROM
  DUAL
CONNECT BY
  LEVEL<=100000;

INSERT INTO
  T3
SELECT

  TO_CHAR(SIN(ROWNUM*7.2*3.1415/180)*10000,'0000000000'),
  TO_CHAR(COS(ROWNUM*7.2*3.1415/180)*10000,'0000000000'),
  TO_CHAR(TAN(ROWNUM*7.2*3.1415/180)*10000,'0000000000')
FROM
  DUAL
CONNECT BY
  LEVEL<=70000;

COMMIT; EXEC
DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE); EXEC
DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',CASCADE=>TRUE); EXEC
DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T3',CASCADE=>TRUE); The first test is similar to your first query: SELECT
  COUNT(*)
FROM
(SELECT

  *
FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+)
UNION
SELECT
  *
FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+));   COUNT(*)


   2994198

The DBMS Xplan (watch for wordwrapping):


| Id  | Operation                  | Name | Starts | E-Rows | A-Rows
| A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp|
|   1 |  SORT AGGREGATE            |      |      1 |      1 |      1 |
00:03:16.27 |    5392 |  73199 |  73199 |       |       |
|         |
|   2 |   VIEW                     |      |      1 |   2806K|   2994K|
00:03:09.16 |    5392 |  73199 |  73199 |       |       |
|         |
|   3 |    SORT UNIQUE             |      |      1 |   2806K|   2994K|
00:03:03.17 |    5392 |  73199 |  73199 |   269M|  8322K|   37M

(1)| 572K|
| 4 | UNION-ALL | | 1 | | 7089K| 00:01:32.59 | 5342 | 0 | 0 | | | | | |* 5 | HASH JOIN RIGHT OUTER | | 1 | 1403K| 3544K| 00:00:28.78 | 2671 | 0 | 0 | 1102K| 1102K| 178K
(0)| |
| 6 | TABLE ACCESS FULL | T2 | 1 | 1 | 0 | 00:00:00.01 | 3 | 0 | 0 | | | | | |* 7 | HASH JOIN RIGHT OUTER| | 1 | 1403K| 3544K| 00:00:07.51 | 2668 | 0 | 0 | 4768K| 2062K| 7035K
(0)| |
| 8 | TABLE ACCESS FULL | T3 | 1 | 70000 | 70000 | 00:00:00.07 | 402 | 0 | 0 | | | | | | 9 | TABLE ACCESS FULL | T1 | 1 | 400K| 400K| 00:00:00.40 | 2266 | 0 | 0 | | | | | |* 10 | HASH JOIN RIGHT OUTER | | 1 | 1403K| 3544K| 00:00:28.78 | 2671 | 0 | 0 | 1102K| 1102K| 207K
(0)| |
| 11 | TABLE ACCESS FULL | T2 | 1 | 1 | 0 | 00:00:00.01 | 3 | 0 | 0 | | | | | |* 12 | HASH JOIN RIGHT OUTER| | 1 | 1403K| 3544K| 00:00:07.51 | 2668 | 0 | 0 | 4768K| 2062K| 7049K
(0)| |
| 13 | TABLE ACCESS FULL | T3 | 1 | 70000 | 70000 | 00:00:00.07 | 402 | 0 | 0 | | | | | | 14 | TABLE ACCESS FULL | T1 | 1 | 400K| 400K| 00:00:00.40 | 2266 | 0 | 0 | | | | | ----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):


   5 - access("T1"."C1"="T2"."C1")
   7 - access("T1"."C1"="T3"."C1")
  10 - access("T1"."C1"="T2"."C1")
  12 - access("T1"."C1"="T3"."C1")

The above took 3 minutes and 16 seconds on Oracle 11.1.0.6

Sample 2 returns the same result, just significantly faster: SELECT
  COUNT(*)
FROM
(SELECT DISTINCT

  *
FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+));   COUNT(*)


   2994198


| Id  | Operation                 | Name | Starts | E-Rows | A-Rows
| A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp|
|   1 |  SORT AGGREGATE           |      |      1 |      1 |      1 |
00:01:18.14 |    2671 |  40365 |  40365 |       |       |
|         |
|   2 |   VIEW                    |      |      1 |   1403K|   2994K|
00:01:11.67 |    2671 |  40365 |  40365 |       |       |
|         |
|   3 |    HASH UNIQUE            |      |      1 |   1403K|   2994K|
00:01:05.68 |    2671 |  40365 |  40365 |   351M|    15M|   46M

(1)| 337K|
|* 4 | HASH JOIN RIGHT OUTER | | 1 | 1403K| 3544K| 00:00:28.78 | 2671 | 0 | 0 | 1102K| 1102K| 181K
(0)| |
| 5 | TABLE ACCESS FULL | T2 | 1 | 1 | 0 | 00:00:00.01 | 3 | 0 | 0 | | | | | |* 6 | HASH JOIN RIGHT OUTER| | 1 | 1403K| 3544K| 00:00:07.51 | 2668 | 0 | 0 | 4768K| 2062K| 7049K
(0)| |
| 7 | TABLE ACCESS FULL | T3 | 1 | 70000 | 70000 | 00:00:00.07 | 402 | 0 | 0 | | | | | | 8 | TABLE ACCESS FULL | T1 | 1 | 400K| 400K| 00:00:00.40 | 2266 | 0 | 0 | | | | | ---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):


   4 - access("T1"."C1"="T2"."C1")
   6 - access("T1"."C1"="T3"."C1")

The third example is one of the performance enhancements that I suggested, using GROUP BY, rather than DISTINCT: SELECT
  COUNT(*)
FROM
(SELECT

  T1.C1,
  T1.C2,
  T1.C3,
  T2.C1,
  T2.C2,
  T2.C3,
  T3.C1,
  T3.C2,
  T3.C3

FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+)
GROUP BY
  T1.C1,
  T1.C2,
  T1.C3,
  T2.C1,
  T2.C2,
  T2.C3,
  T3.C1,
  T3.C2,
  T3.C3);

  COUNT(*)


   2994198


| Id  | Operation                 | Name      | Starts | E-Rows | A-
Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-
Mem | Used-Tmp|
|   1 |  SORT AGGREGATE           |           |      1 |      1 |
1 |00:01:12.51 |    2671 |  36146 |  36146 |       |       |
|         |
|   2 |   VIEW                    | VM_NWVW_0 |      1 |   1403K|
2994K|00:01:05.85 |    2671 |  36146 |  36146 |       |
|          |         |
|   3 |    HASH GROUP BY          |           |      1 |   1403K|
2994K|00:00:59.86 |    2671 |  36146 |  36146 |   339M|    15M|   40M

(1)| 292K|
|* 4 | HASH JOIN RIGHT OUTER | | 1 | 1403K| 3544K|00:00:28.78 | 2671 | 0 | 0 | 1102K| 1102K| 155K
(0)| |
| 5 | TABLE ACCESS FULL | T2 | 1 | 1 | 0 |00:00:00.01 | 3 | 0 | 0 | | | | | |* 6 | HASH JOIN RIGHT OUTER| | 1 | 1403K| 3544K|00:00:07.51 | 2668 | 0 | 0 | 4768K| 2062K| 7049K
(0)| |
| 7 | TABLE ACCESS FULL | T3 | 1 | 70000 | 70000 |00:00:00.07 | 402 | 0 | 0 | | | | | | 8 | TABLE ACCESS FULL | T1 | 1 | 400K| 400K|00:00:00.40 | 2266 | 0 | 0 | | | | | --------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):


   4 - access("T1"."C1"="T2"."C1")
   6 - access("T1"."C1"="T3"."C1")

The above completed 6 seconds faster (1 minute, 12 seconds total).

Repeating the previous test run, eliminating unnecessary duplicate columns:
SELECT
  COUNT(*)
FROM
(SELECT

  T1.C1,
  T1.C2,
  T1.C3,
  T2.C2,
  T2.C3,
  T3.C2,
  T3.C3

FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+)
GROUP BY
  T1.C1,
  T1.C2,
  T1.C3,
  T2.C2,
  T2.C3,
  T3.C2,
  T3.C3);

  COUNT(*)


   2994198


| Id  | Operation                 | Name      | Starts | E-Rows | A-
Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-
Mem | Used-Tmp|
|   1 |  SORT AGGREGATE           |           |      1 |      1 |
1 |00:01:07.49 |    2671 |  28427 |  28427 |       |       |
|         |
|   2 |   VIEW                    | VM_NWVW_0 |      1 |   1403K|
2994K|00:01:01.46 |    2671 |  28427 |  28427 |       |
|          |         |
|   3 |    HASH GROUP BY          |           |      1 |   1403K|
2994K|00:00:52.48 |    2671 |  28427 |  28427 |   281M|    15M|   40M

(1)| 230K|
|* 4 | HASH JOIN RIGHT OUTER | | 1 | 1403K| 3544K|00:00:28.78 | 2671 | 0 | 0 | 1102K| 1102K| 178K
(0)| |
| 5 | TABLE ACCESS FULL | T2 | 1 | 1 | 0 |00:00:00.01 | 3 | 0 | 0 | | | | | |* 6 | HASH JOIN RIGHT OUTER| | 1 | 1403K| 3544K|00:00:07.51 | 2668 | 0 | 0 | 4768K| 2062K| 7049K
(0)| |
| 7 | TABLE ACCESS FULL | T3 | 1 | 70000 | 70000 |00:00:00.07 | 402 | 0 | 0 | | | | | | 8 | TABLE ACCESS FULL | T1 | 1 | 400K| 400K|00:00:00.40 | 2266 | 0 | 0 | | | | | --------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):


   4 - access("T1"."C1"="T2"."C1")
   6 - access("T1"."C1"="T3"."C1")

The above saved another 5 seconds (1 minute, 7 seconds total)

Trying one more trick:
SELECT
  COUNT (DISTINCT T1.C1||T1.C2||T1.C3||T2.C2||T2.C3||T3.C2||T3.C3) FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+); COUNT(DISTINCT
-------------_

       2994198



| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used- |

| 1 | SORT GROUP BY | | 1 | 1 | 1 | 00:01:01.20 | 2699 | 26462 | 26462 | 195M| 5093K| 37M
(1)| 207K|
|*  2 |   HASH JOIN RIGHT OUTER |      |      1 |   1403K|   3544K|
00:00:28.78 |    2671 |      0 |      0 |  1102K|  1102K|  185K

(0)| |
| 3 | TABLE ACCESS FULL | T2 | 1 | 1 | 0 | 00:00:00.01 | 3 | 0 | 0 | | | | | |* 4 | HASH JOIN RIGHT OUTER| | 1 | 1403K| 3544K| 00:00:07.51 | 2668 | 0 | 0 | 4768K| 2062K| 7039K
(0)| |
| 5 | TABLE ACCESS FULL | T3 | 1 | 70000 | 70000 | 00:00:00.07 | 402 | 0 | 0 | | | | | | 6 | TABLE ACCESS FULL | T1 | 1 | 400K| 400K| 00:00:00.40 | 2266 | 0 | 0 | | | | | -------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):


   2 - access("T1"."C1"="T2"."C1")
   4 - access("T1"."C1"="T3"."C1")

The above saved another 5 seconds (1 minute, 1 second).

Note that in all cases, the same number of rows were returned.

Charles Hooper
IT Manager/Oracle DBA
K&M Machine-Fabricating, Inc. Received on Wed Feb 27 2008 - 21:03:11 CST

Original text of this message