RE: Controlling load

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Fri, 24 Feb 2023 06:45:07 -0500
Message-ID: <0a8501d94845$71c61310$55523930$_at_rsiz.com>



Another thought, since you know the source of the flood is “prog_batch”, is to have prog_batch query the current number of sessions, submit a maximum of <your decided limit of sessions> - <current number of sessions> when that result is positive, go to sleep for a while (a few seconds? your average batch_prog emitted job run time?), and repeat, reverting to your current sleep time or invocation method when the queue of jobs not yet submitted reaches zero.  

Starting sessions is relatively expensive. Avoiding piling on load often results in all the jobs being complete sooner even though some start later. And if I understand you your GOAL is to minimize the chances that jobs submitted by “prog_batch” will slow down the results of interactive users.  

Whether this approach or the valid suggestion of Ilmar achieves your goal depends a lot on whether the load of the bursty job submission causes the visible delays or ongoing resource consumption of the jobs submitted by “prog_batch.” IF memory serves, resource control doesn’t begin until the session is established, so if the delays being noticed are caused by the burst of logons resource control could make that worse. On the other hand if it IS the resource consumption of the running jobs that is the problem figuring out how to implement Ilmar’s suggestion is a superior gate to hoping that a count of sessions limit will be enough.  

From what you have described though, it seems to me that “prog_batch” is implemented as a denial of service attack against yourself, slamming the database with session initiation overhead.  

mwf  

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of yudhi s Sent: Thursday, February 23, 2023 3:31 PM To: Ilmar Kerm
Cc: Oracle-L Freelists
Subject: Re: Controlling load  

"Maybe the attribute you are looking for is CLIENT_OS_USER? This is the "osuser" column in v$session"  

 The osuser is also exactly the same for both the batch and online program. As I mentioned in this thread earlier, the difference is only in the first part of the Program before "_at_". Not sure if by any means I would be able to create a separate consumer group based on that?  

For batch the program in v$session showing up as --> "batch_prog_at_vm000123 (TNS V1-V3)" and for online --> "online_prog_at_vm000123 (TNS V1-V3)"  

On Wed, Feb 22, 2023 at 6:24 PM Ilmar Kerm <ilmar.kerm_at_gmail.com> wrote:

There are multiple attributes for mapping sessions to consumer groups:

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_RESOURCE_MANAGER.html#GUID-7B9B9B1D-53E3-4BF4-8BE6-858256702877

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_RESOURCE_MANAGER.html#GUID-8ACB30CB-3E37-41C2-801C-4976A6CFA904  

Maybe the attribute you are looking for is CLIENT_OS_USER? This is the "osuser" column in v$session  

Also don't forget the mapping priorities:

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_RESOURCE_MANAGER.html#GUID-32AA7BD1-5884-4C25-A67A-22F48983851B  

And privileges, the database users must have privileges to switch to the consumer groups:

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_RESOURCE_MANAGER_PRIVS.html#GUID-C3584C41-A8C5-4DBE-847D-DA990B1F8CD1    

On Wed, Feb 22, 2023 at 2:32 PM yudhi s <learnerdatabase99_at_gmail.com> wrote:

Thank you Ilmar.  

Going through the example which is in the doc - 848666.1, and its making the consumer group based on the DB username. In our case, these two programs are using exactly same osuser, Username, schemaname, service_name. And the client_id,action is populated as null for both of these. The only difference is program column i.e. for batch it looks something like "batch_prog_at_vm000123 (TNS V1-V3)" and for online "online_prog_at_vm000123 (TNS V1-V3)", and ~40+ different "VM******" machines from which these are coming from. So wondering, would we still be able to create different resource consumer group , just based on this substring of program column?  

Additionally, to get the estimate of the CAP/Number of session which we should put for the BATCH programs "resource consumer group". How we should do that? We have total 48*2=96 cores per node. I tried getting the TOTAL AAS aginst those two programs side by side as below. If we eliminate the peak/issue period , the avg active sessions per minute is staying <10 for both the cases. Should we keep it as 40 as the limit for prog_batch?  

https://gist.github.com/databasetech0073/dde4804a030814c93fc2111616cd2407  

Regards

Yudhi    

On Wed, 22 Feb, 2023, 3:12 pm Ilmar Kerm, <ilmar.kerm_at_gmail.com> wrote:

You could try:

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_RESOURCE_MANAGER.html#GUID-8EC6C735-338D-46D4-B346-AD16D0622B30

active_sess_pool_p1 - Specifies maximum number of sessions that can currently have an active call

queueing_p1 - Specified time (in seconds) after which a call in the inactive session queue (waiting for execution) will time out. Default is NULL, which means unlimited.  

Example in: How to Limit max number of concurrently active sessions with Resource Manager (Doc ID 848666.1)  

On Wed, Feb 22, 2023 at 10:42 AM yudhi s <learnerdatabase99_at_gmail.com> wrote:

Thank you for the response.  

 As i mentioned we have 48 core CPU with 2 socket each for a node, so Total ~96 core per node. And the OEM also its showing showing the red line bar on AAS as 96 as Max limit. But during three different issue period we see the below AAS shoot ups above 96 and the wait distribution,

 Basically its combination of wait event from classes like concurrency, configuration and others but CPU part is minimum.  

  1. AAS shoot up to - 180, out of which 7% CPU and 34% (buffer busy waits), ~20% Log file switch checkpoint incomplete, 16% enq:tx allocate ITL entry, 12% Others.
  2. AAS shoot up to - 515, out of which 4% CPU and 46% (library cache lock), ~17% library cache mutex, 12% enq:tx allocate ITL entry entry, 12% Others.
  3. AAS shoot up to - 120, out of which 9% CPU and 30% (buffer busy waits), ~13% Log file switch checkpoint incomplete, 31% enq:tx allocate ITL entry entry, 8% BCT buffer space.

But i think you are correct putting CAP on CPU using DB resource manager may not going to help us in above situation as cpu portion is less than 10%. However, i was thinking if any possible way to put cap on the AAS or the number of session those were spawning from the program prog_batch as because they are the one causing the chaos by submitting lot many session simultaneously. So is that possible with the DB resource manager configuration or by tweaking the existing services in anyway?  

Regards

Yudhi  

On Wed, 22 Feb, 2023, 12:54 pm Ilmar Kerm, <ilmar.kerm_at_gmail.com> wrote:

Resource manager can help you if you are seeing resource shortage (CPU shortage) during these periods and CPU shortage is causing the Concurrency waits.

If you are not seeing CPU shortage, then you need traditional database troubleshooting methods and resource manager will not help you.

For example if you are behind in 19c patching, then for example before 19.15 there was bug 32898678, 33541865 (fixed in 19.17) and multiple others that caused block change tracking (BCT) to freeze the database or just high BCT buffer space waits.  

On Wed, Feb 22, 2023 at 6:08 AM yudhi s <learnerdatabase99_at_gmail.com> wrote:

Thank you Andy.  

I have never used the DB resource consumer group in our databases though. But going through the doc, it seems it can be set for many number of input parameters and also on many resources. Yet to test it , however In most cases I see it's set on amount of cpu. So in our case will it be good idea to determine the cpu percentage and define the plan or we can do it on number of average active sessions which will be again hardcoded value only.  

 We need to set it for multiple input program values in our case. If you can pass on reference to some sample doc with its exact code implementation, that would be great.  

On Wed, 22 Feb, 2023, 2:11 am Andy Sayer, <andysayer_at_gmail.com> wrote:

You can set up resource manager to map different programs to different resource consumer groups. You’ll need to make sure you order the priority of mapping to make sure program is checked before service name (if that’s not the default). It’s all in dbms_resource_manager  

Thanks,

Andy  

On Tue, Feb 21, 2023 at 12:19 PM, yudhi s <learnerdatabase99_at_gmail.com> wrote:

Hello Listers,

On a normal day, we see sessions/AAS per ~1 minute interval as ~50 in this database. and they are mainly from two specific 'programs' (say e.g. prog_batch, prog_online). But during certain periods of the day we saw a large number of sessions with programs as 'prog_batch' (i.e. with AAS>~500) flooding the database and we are seeing large concurrency waits, bct buffer space waits all over the database. And its impacting critical sessions of other programs i.e. prog_online. So considering 'prog_batch' is meant for batch kind of processes and it's okay for those sessions to queue-up and run a few minutes longer, but at the same time we can't afford to impact the sessions from prog_online, as those are latency sensitive online users. But we found both of these programs are pointing to the same database services and running on the same node-1. So , apart from controlling the number of sessions/connections from the application end, can we do some easy fixes with regards to the service config level, which will better control the incoming load? Or say putting some cap on the incoming sessions from prog_batch?

This is a 19C database with 2 node RAC and its an Exadata machine. Each node is a 48 core, 2 socket. Both the programs were using the same service and having preferred nodes as node-1.  

Regards

Yudhi  

--

Ilmar Kerm  

--

Ilmar Kerm  

--

Ilmar Kerm

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Feb 24 2023 - 12:45:07 CET

Original text of this message