Re: Performance issue - high Free buff wait
Date: Sun, 17 Jan 2021 16:45:56 +0530
Message-ID: <CAEjw_fjPUF4aDipGNx38f5FRv3GQ=F26zR_4Hcquy8e1hi09Hg_at_mail.gmail.com>
Basically I have two doubts here,
As it seems for 15minutes period we are pushing the System beyond its
capacity. So considering we have 64 core CPU here , should we only allow
max 64 active session at any given point from application? And then
considering we have 4 app server we need to set max session limit to 16
from each app server?
Another doubt I had was if the dB writer response time of 100ms during
that issue period is the cause or the result of slowness? If it's the cause
can asynch IO setup for data file help improve situation here? Or by just
making row by row operations converted to batch will be enough ?
Regards
Pap
On Thu, 14 Jan 2021, 10:47 pm Pap, <oracle.developer35_at_gmail.com> wrote:
> Thank You very much.
>
> I sent a response but that bounced back. Sending it with attachment.
>
> The host CPU section for the issue period is as I have attached in the
> sheet . I am not seeing though if it's 100% utilized during those
> ~15minutes.
>
> From the application team we got to know we have ~4 app servers and each
> having a ~40 max session limit which means they can submit a total Max ~160
> session at any point in time. The number of sessions in the AWR showing
> ~1000 during the issue period vs ~950 during normal period. But as below,
> the sum of Avg active sessions is showing ~60 during issue period vs ~15
> during normal period. Does that mean we are saturating our ~64 cores. And
> thus we should rather ask the dev team to reduce the max session ,limit of
> the app server to 64 cores/4= ~16 Per app server. Is this the correct
> analogy? Note- Avg active session for both the period also i have mentioned
> in the attached sheet.
>
> Regarding ROLLBACKS, i see during the issue period its 16k vs the normal
> period its 15k but the transaction rollback is 7K vs 3k. And "rollback
> changes undo record applied" is ~21k vs ~6k. As i had doubt on the keep
> alive type rollback may be impacting , so i see "select 1 from dual" query
> in the AWR report for that duration and its showing ~845k execution in
> issue period vs ~736k in normal period. So from these it appears to be a
> significant amount was because of the keep alive type query rollback.
>
> "user commits" is ~329k during the issue period vs ~217K during normal
> period. But I do see few of the single row insert/updates happening
> ~1.2million times during issue period vs ~67K times during normal period.
> And from these figures it seems commits may not be happening per each row ,
> but definitely the row by row insert/update should be made batch rather row
> by row. But how will this batching logically help to make DB writer/log
> writer response faster and improve the current situation?
>
> As i mentioned earlier, we have db writer response time logged in awr as
> ~100ms+ during isseu period vs ~35ms during issue period. So is it an
> acceptable range? And whether making the "ASYNCH_IO" OFF for data file and
> temp file is a recommended setting and it will help us getting the db
> writer response faster? or is there any downside to this?
>
> Regarding increasing the buffer cache, I see the db_cache_size is set as
> 6GB in v$parameter but I see in AWR it's showing ~14GB, so not sure from
> where it's fetching that?
>
> Regards
>
> Pap
>
> On Thu, Jan 14, 2021 at 2:08 AM Laurentiu Oprea <
> laurentiu.oprea06_at_gmail.com> wrote:
>
>> From AWR and ash wait chains I might assume that your system is I/O
>> bound. You can simply divide DB_time/elapsed = 58. You have 64 cores this
>> means you are getting close from CPU perspective. you didn't include the
>> host cpu numbers for the issue period. If your CPU is saturated this might
>> be a cause as well.
>>
>> Looks like DB writer cannot write fast enough to clear the buffer cache
>> and LGWR cannot keep up with the commit rate.
>> I`ll ask dev to check the commit strategy, committing less frequently
>> should improve your situation.
>>
>> Another place where you can check is buffer_cache size. If you have
>> available memory you can try to increase it in small increments and try to
>> measure the success rate (either wait events or any metrics app guys have).
>> I`ll personally not change number writer processes.
>>
>> Last, but the most important thing, try to understand from a business
>> perspective what is happening in that interval, why you have this spike in
>> volumes of data. Understanding this will actually help you make the best
>> decisions.
>>
>> For technical details I might be wrong as I kind of lost experience in
>> situations like this, the systems I administer are very powerful exadata
>> machines and can be resource bound only if dev is doing something very
>> wrong.
>>
>>
>> În mie., 13 ian. 2021 la 21:53, Pap <oracle.developer35_at_gmail.com> a
>> scris:
>>
>>> Thank You so much.
>>>
>>> Attached is the DASH results. I am not sure how I should interpret the
>>> WAIT_CHAIN column. But it shows Top ~338 sessions showing to be on ON CPU
>>> only and contributing to ~29% activity followed by free buffer waits which
>>> is ~17% of the overall activity.
>>>
>>> Want to understand, from what figure you came to know this one *"you
>>> have 64 cores most probably you reached almost peak CPU usage during that
>>> period "?*
>>>
>>> Does it mean that we are actually overloading the system during those
>>> ~15minutes duration and thus the contention and slowness is observed
>>> considering current ~64 cpu capacity and thus only option here should be to
>>> ask the application team to reduce the max connection at any point in time
>>> to the database or say reducing the max session limit at DB side?
>>>
>>> Regarding the rollbacks , i need to check , if those are real
>>> "rollbacks" or some keep alive query from application is logging those.
>>>
>>> Thanks And Regards
>>> Pap
>>>
>>>
>>> On Wed, Jan 13, 2021 at 11:53 PM Laurentiu Oprea <
>>> laurentiu.oprea06_at_gmail.com> wrote:
>>>
>>>> Hello Pap,
>>>>
>>>> Is not just those wait events, there is clearly a significant
>>>> difference in load between the 2 periods you mentioned.
>>>>
>>>> -> the moment you open the excel you can see DB time 237.29 (mins)
>>>> vs 874.29 (mins)
>>>> Based on the fact that you have 64 cores most probably you reached
>>>> almost peak CPU usage during that period maybe you was in the position
>>>> where processes competed to get time on cpu.
>>>>
>>>> going down
>>>> -> db block changes 9720600 vs 53870404 looks like transactions do more
>>>> work ( more than 5 times more)
>>>> -> redo size 2448278000 vs 11624470400 correlated with above metric you
>>>> generated 5 times more redo info
>>>> -> user rollbacks / user commits -> both look high , you can check with
>>>> dev guys why you commit so often and why you have the rollbacks (do you
>>>> have any errors in alertlog, to they have any timeout at app level)
>>>>
>>>> As you might be cpu bound during this peak period you might be storage
>>>> bound as well. It is possible for the whole system to be slowed down by
>>>> LGWR.
>>>> You can validate this using Tanel`s chain waits:
>>>>
>>>> https://github.com/tanelpoder/tpt-oracle/blob/master/ash/dash_wait_chains.sql
>>>>
>>>> you can use it like this:
>>>> _at_dash_wait_chains event2 1=1 "TIMESTAMP'2021-01-12 20:00:00'"
>>>> "TIMESTAMP'2021-01-12 20:15:00'"
>>>> and post output
>>>>
>>>> Good luck.
>>>>
>>>>
>>>> În mie., 13 ian. 2021 la 16:05, Pap <oracle.developer35_at_gmail.com> a
>>>> scris:
>>>>
>>>>> Not sure why , but the first email I dropped on this question was
>>>>> bounced back from oracle-l, so trying again.
>>>>>
>>>>> Hello experts,
>>>>>
>>>>> We have a database with Oracle version 11.2.0.4. We are getting
>>>>> complaints of slowness in one of our jobs mainly for ~15minutes duration on
>>>>> a daily basis. And there is no plan change or any Undo reads for the
>>>>> underlying sqls and also there is not one sql which we can singled out for
>>>>> that job duration and that job consists of many small queries(selects,
>>>>> inserts running many thousand times in that duration). But while
>>>>> seeing/comparing the AWR for that ~15minutes period VS another ~15minutes
>>>>> just before that, we are seeing a few odd waits , they are a combination of
>>>>> Configuration(free buffer waits) followed by concurrency(buffer busy, index
>>>>> contention etc) foreground waits. We do see comparatively higher sessions
>>>>> during this ~15minutes window as compared to normal time.
>>>>>
>>>>> In the AWR under section "IOStat by Function summary" the DB writer
>>>>> avg response time is logged as ~103 ms vs in normal period it stays ~35ms.
>>>>> And then checking v$iostat_file , we see it has ASYNCH_IO set as OFF for
>>>>> both "data file" and "temp file" which I have seen in many databases
>>>>> normally set to ON. Also in dba_hist_active_sess_history the top waits
>>>>> events are showing as below. I have attached excel with specific sections
>>>>> of AWR during the issue period and normal period in two different tabs.
>>>>>
>>>>> 1)So , is it correct that by turning this ASYNCH_IO "ON" for
>>>>> data/temp file , will cater all these issues because these all seem to be
>>>>> triggering from the slow DB writer performance?
>>>>>
>>>>> Or
>>>>>
>>>>> 2)Should we ask the application team to reduce the total number of
>>>>> sessions(maybe by decreasing the max connection limit) which are submitting
>>>>> to the Database at that point to reduce contention?
>>>>>
>>>>> *Below is count of waits from Dba_hist_active_sess_history for that
>>>>> ~15minutes window:-*
>>>>>
>>>>> Top two(log file sync and db file async I/O submit) are not associated
>>>>> with application user/session.
>>>>>
>>>>> EVENT COUNT(*)
>>>>> log file sync 260
>>>>> db file async I/O submit 188
>>>>> free buffer waits 72
>>>>> 66
>>>>> 65
>>>>> 63
>>>>> 62
>>>>> db file sequential read 56
>>>>> write complete waits 54
>>>>> 54
>>>>> db file sequential read 51
>>>>>
>>>>> Regards
>>>>>
>>>>> Pap
>>>>>
>>>>>
>>>>>
>>>>>
-- http://www.freelists.org/webpage/oracle-lReceived on Sun Jan 17 2021 - 12:15:56 CET