Re: oracle 9 on solaris 8 - exception reporting

From: kerravon <kerravon_at_w3.to>
Date: Thu, 24 Apr 2008 16:16:37 -0700 (PDT)
Message-ID: <eb31e832-feb9-4ffa-a182-538b6af3b3d7@m1g2000pre.googlegroups.com>


On Apr 25, 4:34 am, joel garry <joel-ga..._at_home.com> wrote:
> On Apr 23, 3:53 pm, kerravon <kerra..._at_w3.to> wrote:
>
> Sybrand gave the correct response to find what a particular problem
> really is.

Thanks Sybrand and Joel for your excellent answers.

> You can also perhaps get some clues from various tools in OEM,
> particularly the lock and wait screens, and some clues from views like
> (select view_name from dba_views where view_name like '%WAIT%' or
> view_name like '%LOCK%';)

Ok. I have some options to suggest now.

> Many things can cause a large variance in a particular queries'
> response time, and they are often something outside of that particular
> sessions control.  Some possibilities are things like:
>
> Something actually overloads your I/O, such as what happens in a
> misconfigured or overloaded system doing a log switch - when you can't
> finish switching logs quickly, the db stops until it can finish the
> switch.  This can also happen for various reasons with a SAN,
> including some resilvering operations, and sometimes people don't even
> think about what traffic is going through their controllers or
> network.

The network shouldn't be able to intefere with the database access as far as I am aware.

> Is there any pattern to the slowdowns (yes, I understand
> trying to determine that would be why you'd ask about the 1 second
> thing, but have you noticed anything)?  

No, we haven't noticed anything. All we see is the system fail because it can't handle such long response times from Oracle.

> Anything in the alert log?

I'll ask.

> The session wants something that something else has.  This can be due
> to multiuser issues, such as everyone wants to update the same block
> in memory.  It can be due to locking.  Much of this is usually
> application design or programming issues.

I'm actually from the application side and been tasked with stopping the application from falling over. The application is designed such that
a single task is inserting and updating to these tables, and it's the same query being run every time, just the data changes.

> Sometimes there are simple answers, such as your buffers are getting
> thrashed by certain processes, and separating some things into their
> own buffers can make all of them more orderly.  Google v$bh for
> that.

Ok.

> A common cause of your type of problem is a batch process that
> periodically updates transaction tables.  At an extreme, they can pile
> up on each other.

Yes, we're not aware of anything like that.

> It may be worth a gander at an explain plan - could the plan be way
> off for certain uses of the code?

The way it works 99.9999% of the time appears to be perfectly fine, and there shouldn't be any exceptions.

> There are bugs and misfeatures, that's why it is usually a good reason
> to state the exact version (like 9.2.0.4) and the exact OS and
> hardware you are using.  

Oracle 9.2.0.8. SunOS 5.8. Hardware I'm not sure about. I think it's
a Sun UltraSparc. I'll find out.

> Sometimes everyone just goes "oh, you should
> do such-and-such because of this reason."  Sometimes there are myths,
> too, so show proper scepticism of those kinds of claims.  If it is
> entirely on the Solaris side, Oracle may not know much about that.  

Sure. But what I was hoping for was Oracle saying "spent 14 seconds waiting for the OS to respond to xyz call" whenever it encounters such a delay.

> Do
> you see anything like swapping or some high priority process taking
> over all the cpu's?

Unfortunately we don't have any monitoring in place that gives an answer
at the 30-60 second timeframe we need. Basically the problem comes fairly much at random, and after 60 seconds it's all over (ie the system
dies). It has taken about 2 months to get permission, but very soon we should have a script in place that executes "top" and an "iorate" program that I wrote that shows MB/sec of activity over a few seconds (via procfs). However, I doubt either of them are going to show anything,
based on my experience of the system.

> Statspacks, traces, knowledge of how your app works and
> instrumentation you put in the code are tools you use to point your
> quest in the proper direction.  There are a few methodologies to
> choose from, a lot of people like method-r.  There are just too many
> things happening to generically spit out a reason a query takes more
> than an arbitrary time - there may be more than one reason, it may be
> normal, it may be stupid...

At this point, simple acknowledgement from Oracle that it held a transaction for more than 1 second would be a good start. Otherwise the Oracle side can claim that the application must have a "sleep(15)" in there that I don't know about that occasionally gets executed.

We do a series of 4 inserts and updates and just have a total time for the lot. We would have to put in some monitoring immediately before and after each SQL statement and even then it wouldn't be definitive, because the OS may have swapped the task out (due to a bug in the OS) at that point, and Oracle never received the transaction.

> So, even if the granularity of the statspack is large, there may be a
> clue in the waits there.  The trace file may tell you you are waiting
> on disk I/O or some such.  You still need to figure out what is going
> on, because you are still smarter than the db.

Sure.

> Sometimes a problem is obscure, like something causing a procedure or
> everything to hard parse.  Are you seeing anything else besides this
> one problem?

Yes. There is one other thing. We had an error the other day in that a daily job that drops an old section/segment (forgot the correct word)
of a table caused an insert to fail with a "doesn't exist" error or something
like that. There is no explanation for this. We can code around it in
the application to retry if an insert comes up with such an error, but I would prefer to get an explanation from the Oracle software as to why
that happened, and if the software doesn't come up with a suitable answer, to report it to Oracle (the company).

Oh, and one more thing on top of that. The other day we had a 64-second response time to the insert/update. The system must have been extremely close to falling over with such a delay, but it didn't. This time we did know of something unusual happening. We were running some SQL to do some updates of the same table (to fix data). However, the SQL that was being run shouldn't have been updating the same rows that the normal update/insert did. Although the difference was only about 2 minutes. ie the normal thing writes current data, and we were updating rows 7-2 minutes old (ie we did an update on a 5 minute period, a couple of minutes after the end of that 5 minute period). So I'm wondering if there's some sort of bug in Oracle that is causing locking of an entire table instead of just the rows in question. Again, I need Oracle to give an explanation as to why it took so long, rather than second-guessing what happened.

BFN. Paul. Received on Thu Apr 24 2008 - 18:16:37 CDT

Original text of this message