Re: Alerting early for minimum impact
Date: Wed, 18 Sep 2024 10:17:04 +0100
Message-ID: <CAGtsp8mjNG31RKH7ivPOmRKQhhbYMSKN=3VCQVe+jXQmjFrs5Q_at_mail.gmail.com>
>>> So I wanted to understand , in such a scenario in which one of the
queries impacting the storage cell IO/CPU may not be one showing up in the
database layer. >>> So what alerting can be done to get to it as soon as
possible
You want to know when something is putting a heavy load on the cell servers. In particular you want to know when some other database is loading the cell server database so much that the response times on your database are being affected. If you want to do this from your database I've suggested that you query your v$sql to report statements which have recently been doing a lot of Cell I/O. As with elapsed time, the meaning of "heavy I/O load" is subjective and database dependent. Obviously you will need all the databases accessing that set of cell servers to alert on their own cell loading so that when your database alerts on a time problem you can see if another database is showing a cell loading alert.
If you want to do something about a little more central you can examine
v$cell_thread_history which is a bit like ASH for cell servers and reports
across all databases running on the same cells as the database you're
querying from. It holds about 10 minutes of recent data, sampled once per
second, and pulls this data on demand from the cells. As with elapsed_time
etc. on v$sql there are useful numbers available, but the decision on how
to decide what constitutes a threat is up to your knowledge of the
systems. Tanel Poder wrote something about this view a long time ago, and
I don't think you'll find anything useful about it in any of the Oracle
reference manuals. (See
https://tanelpoder.com/2013/05/23/vcell_thread_history-ash-for-exadata-storage-cells/
)
.
Regards
Jonathan Lewis
On Wed, 18 Sept 2024 at 05:06, yudhi s <learnerdatabase99_at_gmail.com> wrote:
> Thank you Jonathan and Kelly.
>
> We have alerting in place in regards to "elapsed_time" of the queries from
> v$sql and that is a measurement on which we can rely upon to understand the
> query run time. And in a system which is latency sensitive we can have
> alerting when the elapsed_time goes beyond a few minutes vs in other system
> we may have it when the query runs a longer time.
>
> But here in this scenario there are multiple databases hosted in the same
> box and they may be mixed (few are latency sensitive vs others are
> throughput oriented etc), so how and what value of these metrics we should
> be alerted on? I am unable to understand how to make these metrics
> converted to useful alerting so as to understand , we really have issues
> and it's impacting the whole cluster now, as these seem to be a bit
> subjective? Can you please explain a bit more on this?
>
>>
>>>>
-- http://www.freelists.org/webpage/oracle-lReceived on Wed Sep 18 2024 - 11:17:04 CEST