Re: Alerting early for minimum impact

From: yudhi s <learnerdatabase99_at_gmail.com>
Date: Wed, 18 Sep 2024 09:35:46 +0530
Message-ID: <CAEzWdqcmLf397be=6RrZyPgCHTrnV03p99GLbZKfC5pSd9njYA_at_mail.gmail.com>



Thank you Jonathan and Kelly.

We have alerting in place in regards to "elapsed_time" of the queries from v$sql and that is a measurement on which we can rely upon to understand the query run time. And in a system which is latency sensitive we can have alerting when the elapsed_time goes beyond a few minutes vs in other system we may have it when the query runs a longer time.

But here in this scenario there are multiple databases hosted in the same box and they may be mixed (few are latency sensitive vs others are throughput oriented etc), so how and what value of these metrics we should be alerted on? I am unable to understand how to make these metrics converted to useful alerting so as to understand , we really have issues and it's impacting the whole cluster now, as these seem to be a bit subjective? Can you please explain a bit more on this?

On Wed, Sep 18, 2024 at 2:42 AM Kellyn Pot'Vin-Gorman <dbakevlar_at_gmail.com> wrote:

> Agree with Jonathan on monitoring for these metrics that you can simply
> turn the final query into a metric extension in OEM to alert you when it
> surpasses it.
>
>
>
>
> *Kellyn Gorman*
> DBAKevlar Blog <http://dbakevlar.com>
> about.me/dbakevlar
>
>
>
> On Tue, Sep 17, 2024 at 14:01 Jonathan Lewis <jlewisoracle_at_gmail.com>
> wrote:
>
>> I haven't thought this one through, but could you do something with
>> v$sql, checking the related columns, possibly:
>>
>> IO_CELL_OFFLOAD_ELIGIBLE_BYTES
>> IO_INTERCONNECT_BYTES
>> IO_CELL_UNCOMPRESSED_BYTES
>> IO_CELL_OFFLOAD_RETURNED_BYTES
>>
>>
>>
>>
>> Regards
>> Jonathan Lewis
>>
>>
>>
>> On Tue, 17 Sept 2024 at 06:38, yudhi s <learnerdatabase99_at_gmail.com>
>> wrote:
>>
>>> Hi ,
>>>
>>> We are using Oracle exadata servers 23.1 image and the database version
>>> is 19c.
>>>
>>> We have just encountered a scenario in which we have one select query
>>> which was doing a full table scan(cell smart scan) and suddenly as part of
>>> a bulk migration effort that query started running Avg ~4000 times per hour
>>> from initial(avg ~20 runs per hour usual pattern) and this has caused the
>>> cell server cpu spike to ~85% from normal ~30% CPU and the disk IO response
>>> (mainly the write response) increased from <.5ms to double i.e >1.5 ms and
>>> it impacted one of the other database which was latency sensitive one
>>> hosted in that same box.
>>>
>>> It took us a long time to get to the sql because it was not showing as
>>> any of the TOP contributors in that database layer but it did impact the
>>> storage cell servers IO and CPU significantly. The response time of that
>>> sql just increased from ~5-10 seconds earlier to ~20 seconds during the
>>> issue period , so we were unable to find this as one of the top sql in that
>>> database. And this query was not stuck but finishing up.
>>>
>>> And also we were having 3 to 4 other databases hosted in that exadata
>>> cluster. And when we got complaints from a different database which was a
>>> latency sensitive one , we spent a good amount of time digging into the top
>>> sqls from that database whereas it was a different database query which has
>>> caused this issue.
>>>
>>> So I wanted to understand , in such a scenario in which one of the
>>> queries impacting the storage cell IO/CPU may not be one showing up in the
>>> database layer. So what alerting can be done to get to it as soon as
>>> possible. and if anything is possible in the OEM metrics itself which we
>>> can configure to catch these kinds of issues beforehand? Also I don't see
>>> an easy way to find the top queries consuming storage cell CPU or IO across
>>> all the databases hosted in that box. Is there any option possible from
>>> awr/dba_hist* views?
>>>
>>> Regards
>>> Yudhi
>>>
>>

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Sep 18 2024 - 06:05:46 CEST

Original text of this message