Re: GROUP BY

From: Jurij Modic <jurij.modic_at_mf.sigov.mail.si>
Date: 1997/09/01
Message-ID: <340b255f.8137849@www.sigov.si>#1/1

On 31 Aug 1997 22:06:14 GMT, "Mario" <mario.simic_at_in2.tel.hr> wrote:

>TurkBear <jvgreco_at_nospamprimenet.com> wrote
>> Tansel Ozkan <tansel_at_openix.com> wrote:
>> >
>> >I have a huge table with 10,000,000 records. And we are frequently
>> >running a query with a GROUP BY statement. Would an index on the grouped
>> >columns improve the performance? The SQL statement is :
>> >
>> >select mar_status, apr_status , count(*)
>> >from master_table
>> >group by mar_status, apr_status;
>> >
>> >If only subset of the records are needed by specifying a where clause as
>> >shown in the SQL statement below and one of the filters is the grouped
>> >column, would that make a difference?
>> >
>> >select mar_status, apr_status , count(*)
>> >from master_table
>> >where mar_status in ('1','2')
>> >group by mar_status, apr_status;
>>
>> I would index the group by columns and also one to be used in the
>> count statement, i.e. count(indexed_col) - this should be more
>> effecient.
>
>No, it wouldn't. In both cases, with or without indexes, full table scan
>will be performed.

Not quite true. If you slightly modify the query you can force it to use index. All you have to do is to add a dummy WHERE condition, which will reference an indexed column and evaluate to true for every row, e.g.

create index i_test on master_table(mar_status, apr_status);

select mar_status, apr_status , count(*) from master_table
WHERE MAR_STATUS < 'x' /* something to select all rows */ group by mar_status, apr_status;

In this case the query will be executed by scaning only I_TEST index without even accessing MASTER_TABLE. Depending on the amount of data stored in other columns the performance can be substantialy better. Note hovewer that rows with NULL in MAR_STATUS will not be included in the result so if you want to grup records also by null values in MAR_STATUS you must not use this method.

>If only subset of records is needed then you should index mar_status
>column and try to avoid using "in" in where clause. Depending on your
>data in mar_status column you should try
>
>select mar_status, apr_status , count(*)
>from master_table where mar_status < '2'
>group by mar_status, apr_status;
>
>or
>
>select mar_status, apr_status , count(*)
>from master_table where mar_status between '1' and '2'
>group by mar_status, apr_status;

Again, if you can use either of this two queries (depending on your mar_status values) you should create concatenated index on MASTER_TABLE(MAR_STATUS, APR_STATUS) so that the query can retrive all its data only from index, avoiding all "table_access_by_rowid". Performance can be dramaticaly improved.

>or
>
>select '1', apr_status , count(*)
>from master_table where mar_status = '1'
>group by mar_status, apr_status
>union
>select '2', apr_status , count(*)
>from master_table where mar_status = '2'
>group by mar_status, apr_status;

This one is IMHO probably even less efficient than the original one with the IN('1','2') in the WHERE clause

>Mario Simic

Regards,

Jurij Modic                             Republic of Slovenia
jurij.modic_at_mf.sigov.mail.si		Ministry of Finance
============================================================

The above opinions are mine and do not represent any official standpoints of my employer Received on Mon Sep 01 1997 - 00:00:00 CDT