Re: Alerting early for minimum impact

From: Pap <oracle.developer35_at_gmail.com>
Date: Wed, 18 Sep 2024 01:31:10 +0530
Message-ID: <CAEjw_fi_AZh7Js+7og7idK-Si7Uqp2uGMnH7opyudhMpx9=tRw_at_mail.gmail.com>



We have encountered a similar issue in which a single adhoc query(doing full scan on a big partitioned table) executing from multiple sessions at same time causes bottlenecks across the whole exadata cluster impacting all the databases in that cluster. And it's because that query may be running in a single thread and in a single session in the DB layer but can utilize multiple storage cells in parallel to give its full cpu/io power to serve that query.

We tried putting alerting based on IO response time , but that endup giving a lot of false alerts and yes we did not find any good way to see the top queries based on Cell server IO or CPU from the dba_hist views. Maybe others having similar experiences can throw some light here, on right alerting to cater to such issues.

On Tue, Sep 17, 2024 at 11:08 AM yudhi s <learnerdatabase99_at_gmail.com> wrote:

> Hi ,
>
> We are using Oracle exadata servers 23.1 image and the database version is
> 19c.
>
> We have just encountered a scenario in which we have one select query
> which was doing a full table scan(cell smart scan) and suddenly as part of
> a bulk migration effort that query started running Avg ~4000 times per hour
> from initial(avg ~20 runs per hour usual pattern) and this has caused the
> cell server cpu spike to ~85% from normal ~30% CPU and the disk IO response
> (mainly the write response) increased from <.5ms to double i.e >1.5 ms and
> it impacted one of the other database which was latency sensitive one
> hosted in that same box.
>
> It took us a long time to get to the sql because it was not showing as any
> of the TOP contributors in that database layer but it did impact the
> storage cell servers IO and CPU significantly. The response time of that
> sql just increased from ~5-10 seconds earlier to ~20 seconds during the
> issue period , so we were unable to find this as one of the top sql in that
> database. And this query was not stuck but finishing up.
>
> And also we were having 3 to 4 other databases hosted in that exadata
> cluster. And when we got complaints from a different database which was a
> latency sensitive one , we spent a good amount of time digging into the top
> sqls from that database whereas it was a different database query which has
> caused this issue.
>
> So I wanted to understand , in such a scenario in which one of the queries
> impacting the storage cell IO/CPU may not be one showing up in the database
> layer. So what alerting can be done to get to it as soon as possible. and
> if anything is possible in the OEM metrics itself which we can configure to
> catch these kinds of issues beforehand? Also I don't see an easy way to
> find the top queries consuming storage cell CPU or IO across all the
> databases hosted in that box. Is there any option possible from
> awr/dba_hist* views?
>
> Regards
> Yudhi
>

--
http://www.freelists.org/webpage/oracle-l
Received on Tue Sep 17 2024 - 22:01:10 CEST

Original text of this message