Re: Fixing Performance issue with less selective columns
Date: Thu, 26 Aug 2021 14:34:44 +0100
Message-ID: <CAGtsp8mkgOJmCL-VHmqJinfKLBpouOtgZMbx1gw8Mt--Xmmddg_at_mail.gmail.com>
Looking back at earlier posts I see I was remembering the wrong statistic - the 340K was a different column - sorry about that. But the corrected figure makes it even more sensible to choose an index on trim(m_txt).
I only asked about the previous query because I was curious about whether this was a continuation of the same investigation. It wasn't a question about "parallel run", and this query wasn't running parallel anyway. (The comment about a serial tablescan EFFECTIVELY running parallel is generic to all Exadata systems because a single serial session has to send read requests to all storage cells to do a tablescan.)
Your comments regarding (dc_code, trim(m_txt)) are ESSENTIALLY correct - though with further investigation you may decide that you could do even better. It all depends on the most frequently run queries in your application and whether you can do anything to rewrite them to take best advantage of any indexes you create.
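As a minimal sketch - with t1 standing in for your table name, which you haven't posted - the index under discussion would look something like:

    create index t1_dc_trim_ix
        on t1 (dc_code, trim(m_txt))
        compress 1;   -- prefix compression on dc_code, worth testing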
Regards
Jonathan Lewis
On Thu, 26 Aug 2021 at 13:38, Lok P <loknath.73_at_gmail.com> wrote:
> Thank you, Jonathan.
>
> I ran a manual SELECT query for "count(distinct trim(m_txt))" and it
> gave me ~41.7 million, and without the TRIM function it comes to ~41.8
> million distinct values. So it seems it would have good selectivity to
> help the query.
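>
> For reference, the checks were along these lines (using t1 as a
> placeholder for the actual table name):
>
>     select count(distinct trim(m_txt)) from t1;   -- ~41.7 million
>     select count(distinct m_txt)       from t1;   -- ~41.8 million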
>
> *"The index (even if it were just trim(m_txt) wouldn't help you with your
> substr(trim(m_txt),2,20) predicate unless your access path could use an
> index fast full scan (or possibly an index full scan). However if there
> were a couple of leading columns with equality it might be sufficient to
> make the optimizer do an index range scan on the leading columns followed
> by a filter at the index on that predicate."*
>
> I am not able to fully understand your above point. We do have other
> equality predicates, but they have very poor selectivity, so would you
> suggest creating a separate FBI on *substr(trim(m_txt),2,20)* to cater to
> the other queries?
> OR
> Are you suggesting a single composite index, e.g. on (DC_CODE,
> TRIM(M_TXT)) (considering DC_CODE is used as an equality predicate in
> both queries), where the optimizer can still do an index range scan on
> DC_CODE followed by a filter on *substr(trim(m_txt),2,20)*? But then
> won't the filter on *substr(trim(m_txt),2,20)* be a bottleneck, as this
> expression is not exactly in the index? Or are you saying it can still
> use the same index without visiting the table blocks?
>
> Regarding your question on the parallel run, *"Did you find this query
> as a result of looking at the node 1 load that prompted your question
> about a query on node 2 running 2 or 3 times as slowly?"*: I had just
> run it, without checking whether parallel_force_local was set to TRUE. I
> will check it again, if your point is to see how the query behaves with
> parallel hints when the parallel slaves are restricted to one node only.
>
>
> On Thu, Aug 26, 2021 at 4:33 PM Jonathan Lewis <jlewisoracle_at_gmail.com>
> wrote:
>
>>
>> Did you find this query as a result of looking at the node 1 load that
>> prompted your question about a query on node 2 running 2 or 3 times as
>> slowly?
>>
>> Parallelism would be a bad idea - on (a full rack) Exadata a tablescan is
>> effectively parallel 14 anyway because all the cell servers will work
>> concurrently to do the tablescan. Added to which you've reported the time
>> as a couple of seconds, and Oracle once published a guideline that
>> there's no point in using parallelism for queries that take less than 6
>> seconds to complete (because of the startup and coordination overheads).
>>
>> The SQL Monitor report shows you doing 43,000 read requests to scan 45GB
>> of data - and even with 99.9% offload that's a very large number of
>> requests and a huge load if you want to do it a large number of times,
>> concurrently.
>>
>> You report roughly 330K distinct values of M_TXT out of 111M rows -
>> that's an average of 333 rows per value. In the worst case that would be
>> 333 real physical read requests to the table - which should perform just as
>> well as 43,000 optimised requests but (more significantly) shouldn't
>> overload your system when several users try to do it concurrently. So, as a
>> first step you should do a select count(distinct trim(m_txt)) to see if the
>> trim() changes the estimates significantly, and if it still looks good then
>> index on trim(m_txt).
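>>
>> As a sketch, with t1 standing in for the table name:
>>
>>     select count(distinct trim(m_txt)) from t1;
>>
>>     create index t1_trim_ix on t1 (trim(m_txt));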
>>
>> If you want to reduce contention on index access you could globally hash
>> partition the index on trim(m_txt) - I'd have to check that you could do
>> that in 11.2.0.4, so you've got a little testing to do. However, you might
>> want to start the index with one or two of the other columns with small
>> numbers of distinct values and create the index with compression, if that
>> reliably reduces the number of visits to the table by a factor of 2 to
>> 20.
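>>
>> The bit I'd have to check is whether the hash partitioning clause will
>> accept the expression at all - something like this, where t1 and the
>> partition count are placeholders:
>>
>>     create index t1_trim_hx
>>         on t1 (trim(m_txt))
>>         global partition by hash (trim(m_txt)) partitions 16;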
>>
>> The index (even if it were just on trim(m_txt)) wouldn't help you with your
>> substr(trim(m_txt),2,20) predicate unless your access path could use an
>> index fast full scan (or possibly an index full scan). However, if there
>> were a couple of leading columns with equality predicates it might be
>> sufficient to make the optimizer do an index range scan on the leading
>> columns followed by a filter at the index on that predicate.
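>>
>> In other words, with an index on (dc_code, trim(m_txt)), a query shaped
>> like the following (t1 is a placeholder) could do a range scan on dc_code
>> and apply the substr() predicate to the index entries before visiting the
>> table:
>>
>>     select  *
>>     from    t1
>>     where   dc_code = :b1
>>     and     substr(trim(m_txt),2,20) = :b2;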
>>
>>
>> Regards
>> Jonathan Lewis
>>
>> On Wed, 25 Aug 2021 at 20:26, Lok P <loknath.73_at_gmail.com> wrote:
>>
>>> Thank you, Andy. As I stated earlier, this query executes in a few
>>> seconds; it's the high number of executions (thousands+) for different
>>> binds that adds up to the overall long run time of the process, and most
>>> of the runs return zero rows. I tried putting the query in a loop for
>>> different binds and executed it with a parallel hint (degree 4) and
>>> without, and it seems to take longer with the parallel hint. That may be
>>> because of the overhead of starting and coordinating the additional
>>> parallel slaves.
>>>
>>> *"Is :b4:= 'A' representative of your typical query here?" *apology for
>>> the confusion, actually i was trying to camouflage the actual bind values
>>> and that made it confusing. So it's actually coming as 'D' i.e. having
>>> ~104million matching rows in that table. So indexing that won't help here
>>> too.
>>>
>>> The column M_TXT is a VARCHAR2(100 BYTE) column holding string values,
>>> many of which have leading/trailing spaces, so the TRIM function seems
>>> necessary. This filter alone reduces the result set to zero rows for
>>> this query, and the column has the highest NUM_DISTINCT too, so it seems
>>> a good candidate for an index here. With regards to substr(:B8,.50), I
>>> will confirm and try to get it removed, as it doesn't seem to serve any
>>> purpose but is somehow there in this legacy code. We should be able to
>>> simply remove that SUBSTR function without any issue. But irrespective
>>> of that SUBSTR function on the right hand side of the predicate, are you
>>> pointing towards creating a function-based index on trim(M_TXT) on the
>>> left hand side to help this query? Actually, I see in many other places
>>> this column is used in the predicate section like
>>> *substr(trim(m_txt),2,20)*=decode(col1, 'YYY', substr('XXXXX',1,20)).
>>> So is it possible to have just one index to cater to both scenarios, or
>>> can we tweak the query in some way to utilize the same index for all the
>>> queries on this string column?
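>>>
>>> To illustrate my question, the two predicate shapes against a single
>>> index on (DC_CODE, TRIM(M_TXT)) would be (t1 is a placeholder for the
>>> actual table name):
>>>
>>>     select * from t1 where dc_code = :b1 and trim(m_txt) = :b2;
>>>     select * from t1 where dc_code = :b1 and substr(trim(m_txt),2,20) = :b3;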
>>>
>>>
>>> On Thu, Aug 26, 2021 at 12:12 AM Andy Sayer <andysayer_at_gmail.com> wrote:
>>>
>>>> Hi Lok,
>>>>
>>>> " If there exists any other way to make this query faster without
>>>> creating any new index that would really be helpful."
>>>> You can use parallelism and have the scan completed by more processes
>>>> at once.
>>>>
>>>> Is :b4:= 'A' representative of your typical query here? The results
>>>> you've shared suggest this will return 0 rows, however there aren't a lot
>>>> of distinct values for DC_CODE and you don't have any statistics. If the
>>>> argument is frequently used then an index and a frequency histogram on this
>>>> column would be lucrative. It's a bit surprising you have so many
>>>> histograms on this table but this column's statistics didn't appear.
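>>>>
>>>> A sketch of the statistics call I have in mind, with t1 as a
>>>> placeholder for the real table name:
>>>>
>>>>     begin
>>>>         dbms_stats.gather_table_stats(
>>>>             ownname    => user,
>>>>             tabname    => 'T1',
>>>>             method_opt => 'for columns dc_code size 254',  -- frequency histogram
>>>>             cascade    => false
>>>>         );
>>>>     end;
>>>>     /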
>>>>
>>>> The filter on M_TXT would also provide decent selectivity, though I
>>>> would question whether the trim is really needed. The substr 0.5 is a
>>>> huge red flag.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
--
http://www.freelists.org/webpage/oracle-l

Received on Thu Aug 26 2021 - 15:34:44 CEST