Re: Monitoring emails

From: Mark Brinsmead <mark.brinsmead_at_gmail.com>
Date: Fri, 28 Aug 2015 14:47:32 -0600
Message-ID: <CAAaXtLC_Dxm=4yXmyEVdGBLUyh6Cd-TzRSsc_UBzNbH1dHwRUA_at_mail.gmail.com>



For starters here, I will do as others have in this thread and assume that "email" is equivalent to "page" for the purpose of this discussion.

I have generally found that good preventive maintenance is the best way to minimize pages. If your "non-critical" checks are designed well, they will predict conditions that may eventually lead to a more critical (pager) event, allowing you to deal with them more-or-less at your convenience, within regular working hours.

If you find yourself being paged because a tablespace has run out of space or an index has reached maxextents (does that *ever* happen any more?), the fault is almost certainly your own, for having failed to predict the problem days earlier and fix it *before* a runtime incident could occur.
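
To make the "predict it days earlier" idea concrete, here is a minimal sketch of a non-critical check that extrapolates tablespace growth. It assumes you already collect daily (date, used MB) samples somehow; the function names, the 14-day and 1-day thresholds, and the plain least-squares fit are illustrative assumptions of mine, not anything a particular monitoring tool provides.

```python
from datetime import date

def days_until_full(samples, capacity_mb):
    """Estimate days until a tablespace fills, using a least-squares line.

    samples: list of (date, used_mb) pairs, oldest first (hypothetical input).
    Returns None when there is too little data or usage is not growing.
    """
    if len(samples) < 2:
        return None
    first_day = samples[0][0]
    xs = [(day - first_day).days for day, _ in samples]
    ys = [used for _, used in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None  # all samples taken on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope <= 0:
        return None  # flat or shrinking: nothing to predict
    return max(0.0, (capacity_mb - ys[-1]) / slope)

def classify(days_left, warn_days=14, page_days=1):
    """Map a forecast to a severity; both thresholds are assumed values."""
    if days_left is None or days_left > warn_days:
        return "ok"
    if days_left <= page_days:
        return "critical"
    return "noncritical"

history = [(date(2015, 8, 1), 700), (date(2015, 8, 2), 710), (date(2015, 8, 3), 720)]
forecast = days_until_full(history, capacity_mb=1000)
print(forecast, classify(forecast))
```

With a longer warning window, the same forecast shows up as a non-critical item well before it could ever become a page.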

The other half of minimizing pages, of course, is to ensure that your non-critical health checks do not page. There is little point in having health checks with multiple levels of criticality if all alerts are delivered by the same mechanism. For the non-critical tests, a once-daily report of all (unresolved) issues is probably what you want -- it's even better if this automatically becomes some sort of checklist or to-do list for the junior members of the team.
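
A rough sketch of that split, assuming just two severities: critical alerts go straight to a pager callback, while non-critical ones accumulate in a de-duplicated backlog that is rendered once a day as a checklist. The Alert and Router names, the check names, and the pager callback are hypothetical stand-ins for whatever your monitoring and paging tools actually expose.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class Alert:
    check: str      # name of the health check (hypothetical)
    target: str     # database or host the check ran against
    severity: str   # "critical" or "noncritical"
    message: str

@dataclass
class Router:
    pager: Callable[[Alert], None]  # assumed hook into the paging system
    backlog: Dict[Tuple[str, str], Alert] = field(default_factory=dict)

    def handle(self, alert: Alert) -> None:
        if alert.severity == "critical":
            self.pager(alert)  # only critical alerts interrupt anyone
        else:
            # Keyed by (check, target), so a repeated alert overwrites
            # the earlier one instead of piling up.
            self.backlog[(alert.check, alert.target)] = alert

    def resolve(self, check: str, target: str) -> None:
        self.backlog.pop((check, target), None)

    def daily_report(self) -> str:
        """Render unresolved non-critical issues as a plain-text checklist."""
        items = sorted(self.backlog.values(), key=lambda a: (a.target, a.check))
        return "\n".join(f"[ ] {a.target}: {a.check} - {a.message}" for a in items)

pages = []
router = Router(pager=pages.append)
router.handle(Alert("listener_down", "prod1", "critical", "listener not responding"))
router.handle(Alert("tablespace_forecast", "prod1", "noncritical", "USERS full in ~12 days"))
router.handle(Alert("tablespace_forecast", "prod1", "noncritical", "USERS full in ~12 days"))
print(router.daily_report())
```

Because the backlog holds only unresolved items, the daily report shrinks as the team works through it, which is what makes it usable as a to-do list.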

Of course, simply stratifying your tests into "critical" and "noncritical" is not enough. You need to carefully craft your non-critical tests with the intent of identifying and resolving problems *before* they become critical.

At the end of the day, you probably end up doing almost as much work -- but you'll be doing less of it at 3AM with the CIO breathing down your neck.

Once you get the paging under control, you can then look at practices (like better storage management or segment management, or...) to automate or eliminate the most common tasks arising from your non-critical monitoring. This can also include things like identifying unstable infrastructure elements (flaky storage, unreliable backup servers, etc.) and giving yourself the kind of "solid infrastructure" Jeremy refers to. [I recall a very concrete example of this -- I once had a client with severe and widespread performance and stability issues, almost all of which were attributed to using NFS storage over 10 Mbit Ethernet. Strangely, it took *years* to get them to upgrade, but once they moved the network to Gigabit, my workload for that customer dropped by about 99% -- both in terms of paging events and day-to-day maintenance.]

DBAs will probably never run out of work. But the more "silly stuff" you can automate or eliminate, the more of your time you can spend delivering the really high-value services you (and probably everybody else) prefer you to be doing.

On Fri, Aug 28, 2015 at 2:03 PM, Jeremy Schneider <jeremy.schneider_at_ardentperf.com> wrote:

> i head up much of this in our dba group - and i think we're pretty
> brutal when it comes to trimming down the pages that actually come out
> to our DBA phones/pagers. we've had weeks go by that we don't get a
> single page and then i'm relieved when i get a page and see that
> everything is still working fine. :) this is also because we have a
> great team and a solid infrastructure with infrequent major problems.
>
> my DBA team is not global, so we generally work north american
> business hours. when my phone beeps, i usually go look at it even if
> i'm in the middle of dinner with the kids. i value my own evenings
> and weekends - and i value the time and attention of other DBAs that i
> work with. so we don't want our phones to beep for anything that
> could have waited until the next morning when someone gets to the office
> and checks their email.
>
> obviously we get paged if an important system is unavailable. we do
> have some "non-production" systems which would need off-hours
> attention if they are unavailable - of course this is really worked
> out with the business. but we've ruthlessly trimmed down the noise and
> our management goes to bat for us when everybody wants their stuff to
> be critical. honestly, over the past few years, i can't think of many
> issues we had where the business really needed a DBA to interrupt
> their dinner or weekend to look at it immediately. there were a few,
> but not many. just today i added a custom OEM metric on percentage of
> processes used, because we became aware of an issue where something
> could exhaust the processes on a database -- and we will now get a
> page if process usage goes above a certain threshold. but we've
> already taken some other steps to address the problem and i don't expect
> many pages - if any.
>
> now that was all just about pages. going back to the original
> question about automated emails, it's another subject entirely. i like
> to get emails and i have lots of server-side filters that move them
> into folders that my email client doesn't even look at until i click
> there. our SAs don't trim down their alerts like we do - so i get a
> decent amount of traffic from their monitoring system. but i like
> that. the key here is that it's all informational, and on the DBA team
> we don't expect ourselves to read them - just what we're interested
> in. people can setup filters to get rid of stuff they don't care
> about. i don't usually check my email when i'm not at work, so it
> doesn't bother me to get extra noise emails.
>
> so far today i've got 4 pages (which also come as emails), 32
> backup-related emails (there was a minor issue) and 33 miscellaneous
> emails from monitoring systems that i actually watch - that is, i look
> in the email folder occasionally and skim the subject lines and mark
> them as read. of course i would give attention to anything that
> really needs it, but all the critical stuff comes to our phones
> anyway.
>
> so far today i've got 192 "junk" emails from various other monitoring
> systems which i don't watch at all - on rare occasions i'll dig into
> one of those folders to look for something specific but otherwise i
> completely ignore it.
>
> -Jeremy
>
> --
> http://about.me/jeremy_schneider
>
>
> On Fri, Aug 28, 2015 at 3:22 PM, Mayen Shah <mshah_at_travelclick.com> wrote:
> > Your main goal should be to identify and act upon critical issues in your
> > environment. Of course there will be informational alerts/emails.
> >
> >
> >
> > Imagine few hundred alerts (minimum 1 minute per alert) * n number of
> DBAs.
> > Is it really productive? And among all the noise likelihood of missing
> > critical alerts are very high. One can argue that he/she ignore or do not
> > act upon x% of alerts. I am of the opinion that if you ignore any alert,
> it
> > is not worth alerting on.
> >
> >
> >
> > I have worked in environment where we will categorize alerts into
> > informative, warning, critical and emergency. Setup rules so emails are
> > organized and emergency and critical alerts are not missed.
> >
> >
> >
> > Thanks
> >
> > Mayen
> >
> >
> >
> > From: oracle-l-bounce_at_freelists.org [mailto:
> oracle-l-bounce_at_freelists.org]
> > On Behalf Of Alfredo Abate
> > Sent: Friday, August 28, 2015 3:02 PM
> > To: veeeraman_at_gmail.com
> > Cc: ORACLE-L
> > Subject: Re: Monitoring emails
> >
> >
> >
> > Ram,
> >
> >
> >
> > The important question is how many are Critical alerts vs Warning
> alerts? I
> > would think if you are getting 100s of Critical alerts there is a
> problem.
> > :) I can see getting Warning alerts frequently but perhaps you filter
> those
> > to go to different folder, etc that can be reviewed a few times per day.
> >
> >
> >
> > For us we get Warning alerts throughout the day (maybe 25 - 100) and
> > Critical alerts very few if any per day. It all depends what is
> important
> > to you and the team managing the databases. This is where trying to
> find a
> > good balance between being proactive towards preventing something and
> > general "noise" can become an art as much as it is a science.
> >
> >
> >
> >
> >
> > Alfredo
> >
> >
> >
> > On Fri, Aug 28, 2015 at 1:10 PM, Ram Raman <veeeraman_at_gmail.com> wrote:
> >
> > List,
> >
> >
> >
> > How many automated emails do listers get from all the databases that are
> > being monitored on a daily basis?
> >
> >
> >
> > We get a few hundred emails a day (<100 DBs), but some new members here
> feel
> > that is too many and want us to cut down on that. I personally feel that
> > most of the messages are relevant to us.
> >
> >
> >
> > Thanks
> >
> > Ram.
> >
> > --
> >
> >
> >
> >
> >
> >
> --
> http://www.freelists.org/webpage/oracle-l
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Aug 28 2015 - 22:47:32 CEST
