Re: DFS file handle / CI enq.
Date: Fri, 6 Nov 2020 12:01:25 -0500
Message-ID: <CAK5zhLLBb+CR1At6+C7VN5H=uyimuL6PEO2JsW1fukYzO8bk+Q_at_mail.gmail.com>
Priit,
There were also a few HangAnalyze dumps generated automatically. They
didn't seem to add any additional information to this. The v$waitchain is
tougher to get. When this problem arises, it happens in the middle of the
night (EST), and typically on-call is pinged and restarts the instances. We
are trying to get a few things run before the restarts, but in this state
it might (won't know until we have tried) take too long.
Henry
On Fri, Nov 6, 2020 at 9:08 AM Priit Piipuu <priit.piipuu_at_gmail.com> wrote:
> Hi,
>
> Did you check wait chain signatures from v$wait_chains or hanganalyze
> dumps? These include background processes being signalled.
>
> On Thu, 5 Nov 2020 at 23:32, Henry Poras <henry.poras_at_gmail.com> wrote:
>
>> I’m getting myself tied in knots trying to debug a sporadic, yet
>> bothersome issue. The system slows/hangs with lots of ‘DFS lock handle’
>> waits. My understanding (lots of which comes from Riyaj’s article
>> https://orainternals.wordpress.com/2011/11/08/troubleshooting-dfs-lock-handle-waits/)
>> is that this wait event waits for the lock handle of a *global* lock
>> request. There are a number of different global resources it might be
>> trying to obtain (determined from p1 parameter). In our case we are waiting
>> for a CI enqueue (id1=10, id2=5), trying to obtain it in Exclusive mode. A
>> CI enqueue, a Cross Instance call, is “used to invoke specific actions
>> in background processes”. So, if our waiting process needs a background
>> process (local or remote) to do work for it, it uses a CI enqueue (the
>> background process holds the enqueue until the work is done. Our process
>> knows the work is done when its DFS wait no longer waits for the CI
>> resource).
>>
>> We can also find out more about the CI resource from id1 and id2. This
>> data is fairly limited and this is where I start to lose it. In our case
>> id1=10, which the docs show is ‘Create Remote parallel query Server’,
>> and id2 = 5 which is ‘Used to queue up interested clients’. The meaning of
>> id2=5 makes it sound like I am queued up trying to do some real work with
>> another enqueue with another id2 value. (I’ll post other id2 values in a
>> second so you can see what I mean.) What is odd, is its behavior is very
>> different from other enqueue waits with which I am familiar. Typically,
>> three double linked lists (owners, waiters, converters queue) hang off from
>> the requested resource. For the CI enqueue it looks like we hop from
>> resource to resource, from id2 to id2. So the process holding the CI
>> enqueue (id1=10, id2=5) is not the final blocker. We might actually want
>> the CI enq in id2=1, which is held by another process.
>>
>> [Some values for id2
>>
>> 1 used to pass in parameters
>> 2 used to invoke the function in backgroud process
>> 3 used to indicate the foreground has not returned
>> 4 mounted excl, use to allocate mechanism
>> 5 used to queue up interested clients]
>>
>>
>>
>> To summarize:
>>
>> · A local process, waiting on ‘DFS lock handle’, needs a
>> background process to do work on local and/or remote nodes (e.g. a Query
>> Coordinator needs slave parallel processes)
>>
>> · The background process is pinged (more on this to follow)
>> telling it to grab the CI enqueue with id1=10 (Create Remote parallel
>> query Server), id2=5 (Used to queue up interested clients). This tells a
>> specific background process what work needs to be done. Additional
>> information (including *lock pointer, resource pointer, and resource
>> name)* can also be sent via the message (ping) to the resource’s value
>> block.
>>
>> · While the background process works, it holds the CI enqueue for
>> which the (QC) process waits on the ‘DFS lock handle’ wait. It waits until
>> the background process completes its work and releases the lock. This is
>> how our process (QC in this case) knows the necessary work is complete and
>> it can continue.
>>
>> My understanding is if a server process wants a lock on a CI resource, it
>> will contact a local GES process (e.g. LMD0) which will then message the
>> same process (LMD0) on the resource’s master node. The master will then
>> contact the owner and requester.
>>
>> Messaging between the GES processes is done using Asynchronous System
>> Traps (AST), which act like an interrupt to the process. These can be
>> either asynchronous (AAST Acquisition AST) or blocking (BAST Blocking AST).
>>
>> *Some more detail*. Oracle automatically produced some ProcessState
>> Dumps (I’m not sure what triggered it or what processes were dumped).
>> Looking at some of these, we see that the ‘ges process’ is a child of the
>> process, not of the session. Also, the ‘ges process’ has two ‘ges enqueues’
>> children. Excerpts from the queues shown for the ‘ges process’ and the two
>> ‘ges enqueue’s are as follows:
>>
>>
>>
>> ges process:
>>
>> AST_Q:
>>
>> PENDING_Q:
>>
>> lp 0x1644b35440 gl KJUSERNL rl KJUSEREX rp 0x17bb7a82e0
>> [0xa][0x1],[CI][ext 0x0,0x0]
>>
>> *master 3 pid 29614* bast 0 rseq 1 mseq 0
>>
>> history FREE > REF_RES > LOC_AST > FREE > REF_RES > GR2CVT
>>
>> convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
>>
>> GRANTED_Q:
>>
>> lp 0x1644b39f08 gl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext
>> 0x0,0x0]
>>
>> *master 1 pid 29614* bast 0 rseq 19 mseq 0
>>
>> history LOC_AST_FG > CLOSE > FREE > REF_RES > GR2CVT > LOC_AST_FG
>>
>> open opt KJUSERPROCESS_OWNED
>>
>>
>>
>> ges enqueues:
>>
>> GRANTED_Q :
>>
>> lp 0x1663ca0418 gl KJUSERPR rp 0x17bb7a82e0 [0xa][0x1],[CI][ext
>> 0x0,0x0]
>>
>> master 3 pid 3889 bast 0 rseq 1 mseq 0
>>
>> history LOC_AST > LOC_AST > LOC_AST > LOC_AST > LOC_AST > LOC_AST
>>
>> open opt KJUSERPROCESS_OWNED
>>
>> CONVERT_Q:
>>
>> lp 0x1644b35440 gl KJUSERNL rl KJUSEREX rp 0x17bb7a82e0
>> [0xa][0x1],[CI][ext 0x0,0x0]
>>
>> master 3 pid 29614 bast 0 rseq 1 mseq 0
>>
>> history FREE > REF_RES > LOC_AST > FREE > REF_RES > GR2CVT
>>
>> convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
>>
>> ----------------------------------------
>>
>>
>>
>> ges enqueues:
>>
>> GRANTED_Q :
>>
>> lp 0x1644b39f08 gl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext
>> 0x0,0x0]
>>
>> master 1 pid 29614 bast 0 rseq 19 mseq 0
>>
>> history LOC_AST_FG > CLOSE > FREE > REF_RES > GR2CVT > LOC_AST_FG
>>
>> open opt KJUSERPROCESS_OWNED
>>
>> CONVERT_Q:
>>
>> lp 0x1644b3c900 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490
>> [0xa][0x5],[CI][ext 0x0,0x0]
>>
>> master 1 pid 38233 bast 0 rseq 19 mseq 0
>>
>> history REF_RES > LOC_AST > CLOSE > FREE > REF_RES > GR2CVT
>>
>> convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
>>
>> lp 0x1644efbb60 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490
>> [0xa][0x5],[CI][ext 0x0,0x0]
>>
>> master 1 pid 2898 bast 0 rseq 19 mseq 0
>>
>> history REF_RES > LOC_AST > CLOSE > FREE > REF_RES > GR2CVT
>>
>> convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
>>
>>
>>
>> So it looks like there is an additional queue (pending_q) and we are
>> trying to switch from one resource (id2=5) to another (id2=1). We are also
>> going from a pid on node 1 to the same pid number on node 3. But if pid is
>> ospid (as it seems to be from the State Dump header), how does that work?
>> OS pids are fairly arbitrary. Also, is there any way to get more data about
>> the apparent blocker (node 3; pid 3889) without a System State Dump?
>>
>>
>>
>> One other question. Some of the ges enqueues convert_q entries have a
>> slightly different format. Instead of pid, it is owner, and bast 1. This
>> looks like it is BAST (message) related, but I’m not sure.
>>
>> lp 0x12542ff100 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490
>> [0xa][0x5],[CI][ext 0x0,0x0]
>>
>> master 1 *owner 3 bast 1* rseq 76 mseq 0x30004
>>
>> history MSGSENT > GR2CVT > REM_AST > REM_BAST > MSGSENT > GR2CVT
>>
>> convert opt KJUSERGETVALUE
>>
>>
>>
>> Finally, the number of ‘DFS lock handle’ waits is growing faster than I
>> can explain. We go from 10’s to hundreds of thousands within an hour. No
>> sql statements execute at anywhere near that rate.
>>
>>
>>
>> So now I have a few problems. It seems like the issue is most likely
>> either the background process at the head of the queues, or an issue with
>> the inter-process messaging (AST, BAST).
>>
>> The major places I am getting stuck are:
>>
>> 1. What is at the head of the queue? From the ProcessStateDump it
>> looks like node 3; pid 3889.
>>
>> a. Is there a way to confirm this?
>>
>> b. Since this is an ospid how can I track it? (do I need a
>> SystemStateDump?)
>>
>> c. I assume it is a background process at the front of the line.
>> How can I tell why it is stuck? What it is doing?
>>
>> 2. Could the issue be messaging (AST, BAST) and not process
>> related? How do I check?
>>
>> 3. How do I get any of this information? (using a SystemState
>> Dump? Is there another way?)
>>
>> Any suggestions as to what is going on and how to best continue to
>> investigate this?
>>
>>
>>
>> Thanks.
>>
>> Henry
>>
>
-- http://www.freelists.org/webpage/oracle-lReceived on Fri Nov 06 2020 - 18:01:25 CET