Re: Performance issue - high Free buff wait
Date: Thu, 14 Jan 2021 22:47:43 +0530
Message-ID: <CAEjw_fj_GkRsetQevfbhcGdTfHH2kwQsFrE4fzzmCUD54W_3hw_at_mail.gmail.com>
I sent a response but that bounced back. Sending it with attachment.
On Thu, Jan 14, 2021 at 2:08 AM Laurentiu Oprea <laurentiu.oprea06_at_gmail.com> wrote:
> From the AWR and ASH wait chains I would assume your system is I/O
> bound. You can simply divide DB time by elapsed time, which gives about
> 58 average active sessions. With 64 cores, that means you are getting
> close to the limit from a CPU perspective. You didn't include the host
> CPU numbers for the issue period; if your CPU is saturated, that might
> be a cause as well.
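>
> (Roughly, the arithmetic behind that figure, taken from the DB time in
> the attached AWR: 874.29 minutes of DB time over a ~15 minute interval
> is 874.29 / 15 ≈ 58 average active sessions, which is close to the 64
> cores you have.)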
>
> It looks like the DB writer cannot write fast enough to clear the
> buffer cache, and LGWR cannot keep up with the commit rate. I would ask
> the dev team to check the commit strategy; committing less frequently
> should improve your situation.
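>
> Purely as an illustration of "committing less frequently" (a sketch
> with made-up table names, not your actual code):
>
> -- commit every 1000 rows instead of after every row
> DECLARE
>   l_cnt PLS_INTEGER := 0;
> BEGIN
>   FOR r IN (SELECT id, val FROM staging_rows) LOOP          -- hypothetical source
>     INSERT INTO target_table (id, val) VALUES (r.id, r.val); -- hypothetical target
>     l_cnt := l_cnt + 1;
>     IF MOD(l_cnt, 1000) = 0 THEN
>       COMMIT;          -- batch commit
>     END IF;
>   END LOOP;
>   COMMIT;              -- final commit for the remainder
> END;
> /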
>
> Another thing you can check is the buffer cache size. If you have
> available memory you can try to increase it in small increments and
> measure the effect (either via wait events or any metrics the app guys
> have). I personally would not change the number of DB writer processes.
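>
> If it helps, one way to sanity-check the buffer cache sizing before
> growing it (a sketch; assumes the cache advisory is enabled):
>
> -- estimated physical reads at different cache sizes
> SELECT size_for_estimate AS cache_mb,
>        size_factor,
>        estd_physical_reads
> FROM   v$db_cache_advice
> WHERE  name = 'DEFAULT'
> AND    block_size = (SELECT value FROM v$parameter
>                      WHERE  name = 'db_block_size')
> ORDER  BY size_for_estimate;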
>
> Last, and most important, try to understand from a business perspective
> what is happening in that interval and why you have this spike in data
> volumes. Understanding this will actually help you make the best
> decisions.
>
> On the technical details I might be wrong, as I have somewhat lost
> touch with situations like this; the systems I administer are very
> powerful Exadata machines and can be resource bound only if dev is
> doing something very wrong.
>
>
> On Wed, 13 Jan 2021 at 21:53, Pap <oracle.developer35_at_gmail.com>
> wrote:
>
>> Thank You so much.
>>
>> Attached are the DASH results. I am not sure how to interpret the
>> WAIT_CHAIN column, but it shows the top ~338 sessions as being ON CPU
>> only, contributing ~29% of the activity, followed by free buffer waits
>> at ~17% of the overall activity.
>>
>> I want to understand which figure led you to the conclusion that *"you
>> have 64 cores most probably you reached almost peak CPU usage during
>> that period"?*
>>
>> Does it mean that we are actually overloading the system during that
>> ~15-minute window, and that the contention and slowness are seen
>> because of the current ~64-CPU capacity? And would the only option
>> then be to ask the application team to reduce the maximum number of
>> connections to the database at any point in time, or to reduce the max
>> session limit on the DB side?
>>
>> Regarding the rollbacks, I need to check whether those are real
>> rollbacks or whether some keep-alive query from the application is
>> logging them.
>>
>> Thanks And Regards
>> Pap
>>
>>
>> On Wed, Jan 13, 2021 at 11:53 PM Laurentiu Oprea <
>> laurentiu.oprea06_at_gmail.com> wrote:
>>
>>> Hello Pap,
>>>
>>> It is not just those wait events; there is clearly a significant
>>> difference in load between the two periods you mentioned.
>>>
>>> -> the moment you open the Excel file you can see DB time of 237.29
>>> (mins) vs 874.29 (mins). Given that you have 64 cores, you most
>>> probably reached almost peak CPU usage during that period, and you may
>>> have been in a position where processes competed for time on CPU.
>>>
>>> Going down:
>>> -> db block changes 9720600 vs 53870404: it looks like transactions do
>>> more work (more than 5 times more)
>>> -> redo size 2448278000 vs 11624470400: correlated with the metric
>>> above, you generated about 5 times more redo
>>> -> user rollbacks / user commits: both look high; check with the dev
>>> guys why you commit so often and why you have the rollbacks (do you
>>> have any errors in the alert log, do they have any timeouts at app
>>> level?) - a rough query to trend these per snapshot is sketched below
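>>>
>>> Something along these lines (a sketch only; add dbid/instance filters
>>> as needed for your environment):
>>>
>>> -- per-snapshot deltas of user commits and user rollbacks
>>> SELECT sn.begin_interval_time,
>>>        st.stat_name,
>>>        st.value - LAG(st.value)
>>>          OVER (PARTITION BY st.stat_name, st.instance_number
>>>                ORDER BY sn.snap_id) AS delta
>>> FROM   dba_hist_sysstat st
>>> JOIN   dba_hist_snapshot sn
>>>        ON  sn.snap_id         = st.snap_id
>>>        AND sn.dbid            = st.dbid
>>>        AND sn.instance_number = st.instance_number
>>> WHERE  st.stat_name IN ('user commits', 'user rollbacks')
>>> ORDER  BY sn.snap_id, st.stat_name;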
>>>
>>> While you might be CPU bound during this peak period, you might be
>>> storage bound as well. It is possible for the whole system to be
>>> slowed down by LGWR.
>>> You can validate this using Tanel's wait chains script:
>>>
>>> https://github.com/tanelpoder/tpt-oracle/blob/master/ash/dash_wait_chains.sql
>>>
>>> You can use it like this:
>>> @dash_wait_chains event2 1=1 "TIMESTAMP'2021-01-12 20:00:00'" "TIMESTAMP'2021-01-12 20:15:00'"
>>> and post the output.
>>>
>>> Good luck.
>>>
>>>
>>> On Wed, 13 Jan 2021 at 16:05, Pap <oracle.developer35_at_gmail.com>
>>> wrote:
>>>
>>>> Not sure why, but the first email I sent on this question bounced
>>>> back from oracle-l, so I am trying again.
>>>>
>>>> Hello experts,
>>>>
>>>> We have a database on Oracle version 11.2.0.4. We are getting
>>>> complaints of slowness in one of our jobs, mainly for a ~15-minute
>>>> window on a daily basis. There is no plan change or any undo reads
>>>> for the underlying SQLs, and there is no single SQL we can single out
>>>> for that job duration; the job consists of many small queries
>>>> (selects and inserts running many thousands of times in that
>>>> duration). But when comparing the AWR for that ~15-minute period vs
>>>> another ~15 minutes just before it, we see a few odd waits: a
>>>> combination of Configuration (free buffer waits) followed by
>>>> Concurrency (buffer busy waits, index contention, etc.) foreground
>>>> waits. We also see comparatively more sessions during this ~15-minute
>>>> window than at normal times.
>>>>
>>>> In the AWR, under the "IOStat by Function summary" section, the DB
>>>> writer average response time is logged as ~103 ms, versus ~35 ms in
>>>> the normal period. Checking v$iostat_file, we see ASYNCH_IO set to
>>>> OFF for both "data file" and "temp file", which in many databases I
>>>> have seen set to ON. Also, in dba_hist_active_sess_history the top
>>>> wait events are as shown below. I have attached an Excel file with
>>>> the relevant AWR sections for the issue period and the normal period
>>>> in two different tabs.
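>>>>
>>>> (For reference, a rough sketch of the kind of checks used for the
>>>> async I/O status - not necessarily the exact statements I ran:)
>>>>
>>>> -- per-file async I/O status
>>>> SELECT filetype_name, asynch_io, COUNT(*)
>>>> FROM   v$iostat_file
>>>> GROUP  BY filetype_name, asynch_io;
>>>>
>>>> -- related init parameters (SQL*Plus)
>>>> SHOW PARAMETER disk_asynch_io
>>>> SHOW PARAMETER filesystemio_options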
>>>>
>>>> 1) So, is it correct that turning ASYNCH_IO "ON" for the data/temp
>>>> files would address all of these issues, since they all seem to be
>>>> triggered by slow DB writer performance?
>>>>
>>>> Or
>>>>
>>>> 2) Should we ask the application team to reduce the total number of
>>>> sessions submitting work to the database at that point (maybe by
>>>> decreasing the max connection limit), to reduce contention?
>>>>
>>>> *Below are the wait counts from dba_hist_active_sess_history for that
>>>> ~15-minute window:*
>>>>
>>>> The top two (log file sync and db file async I/O submit) are not
>>>> associated with application users/sessions.
>>>>
>>>> EVENT                      COUNT(*)
>>>> log file sync                   260
>>>> db file async I/O submit        188
>>>> free buffer waits                72
>>>>                                  66
>>>>                                  65
>>>>                                  63
>>>>                                  62
>>>> db file sequential read          56
>>>> write complete waits             54
>>>>                                  54
>>>> db file sequential read          51
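>>>>
>>>> (The counts above came from a simple grouping query along these lines
>>>> - a sketch; the exact time window and filters may have differed:)
>>>>
>>>> SELECT event, COUNT(*)
>>>> FROM   dba_hist_active_sess_history
>>>> WHERE  sample_time BETWEEN TIMESTAMP '2021-01-12 20:00:00'
>>>>                        AND TIMESTAMP '2021-01-12 20:15:00'
>>>> GROUP  BY event
>>>> ORDER  BY COUNT(*) DESC;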
>>>>
>>>> Regards
>>>>
>>>> Pap
>>>>
>>>>
>>>>
>>>>
--
http://www.freelists.org/webpage/oracle-l

Received on Thu Jan 14 2021 - 18:17:43 CET
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet attachment: Host_CPU_And_Avg_Active_Session.xlsx