Alerting early for minimum impact
Date: Tue, 17 Sep 2024 11:07:11 +0530
Message-ID: <CAEzWdqfkj6cMmdmZOC_jkpZLsxDK2jf5SrrtgB+zPEe_vcDP-Q_at_mail.gmail.com>
Hi ,
We are using Oracle exadata servers 23.1 image and the database version is
19c.
We have just encountered a scenario in which we have one select query which
was doing a full table scan(cell smart scan) and suddenly as part of a bulk
migration effort that query started running Avg ~4000 times per hour from
initial(avg ~20 runs per hour usual pattern) and this has caused the cell
server cpu spike to ~85% from normal ~30% CPU and the disk IO response
(mainly the write response) increased from <.5ms to double i.e >1.5 ms and
it impacted one of the other database which was latency sensitive one
hosted in that same box.
It took us a long time to get to the sql because it was not showing as any
of the TOP contributors in that database layer but it did impact the
storage cell servers IO and CPU significantly. The response time of that
sql just increased from ~5-10 seconds earlier to ~20 seconds during the
issue period , so we were unable to find this as one of the top sql in that
database. And this query was not stuck but finishing up.
And also we were having 3 to 4 other databases hosted in that exadata
cluster. And when we got complaints from a different database which was a
latency sensitive one , we spent a good amount of time digging into the top
sqls from that database whereas it was a different database query which has
caused this issue.
So I wanted to understand , in such a scenario in which one of the queries
Regards
Yudhi
-- http://www.freelists.org/webpage/oracle-lReceived on Tue Sep 17 2024 - 07:37:11 CEST