DFS lock handle / CI enq.

From: Henry Poras <henry.poras_at_gmail.com>
Date: Thu, 5 Nov 2020 17:29:37 -0500
Message-ID: <CAK5zhLKTNaSUWU=yd-Q5V4OUfh=gDxnJJ9ZTtBa8_GwZ8Z=Frg_at_mail.gmail.com>



I’m getting myself tied in knots trying to debug a sporadic yet bothersome issue. The system slows/hangs with lots of ‘DFS lock handle’ waits. My understanding (much of which comes from Riyaj’s article https://orainternals.wordpress.com/2011/11/08/troubleshooting-dfs-lock-handle-waits/) is that this wait event is a wait for the lock handle of a *global* lock request. There are a number of different global resources it might be trying to obtain (determined from the p1 parameter). In our case we are waiting for a CI enqueue (id1=10, id2=5), requested in Exclusive mode. A CI enqueue, a Cross Instance call, is “used to invoke specific actions in background processes”. So if our waiting process needs a background process (local or remote) to do work for it, it uses a CI enqueue: the background process holds the enqueue until the work is done, and our process knows the work is done when its DFS wait on the CI resource ends.
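(For reference, this is roughly how I’ve been decoding the waiters; the p1 decode is the usual type/mode trick from Riyaj’s post, so treat it as a sketch rather than anything authoritative.)

    -- Decode current 'DFS lock handle' waiters: p1 encodes lock type and mode,
    -- p2 = id1, p3 = id2 (10 and 5 here for our CI enqueue).
    select inst_id, sid, sql_id,
           chr(bitand(p1, -16777216) / 16777215) ||
           chr(bitand(p1, 16711680) / 65535)   as lock_type,
           bitand(p1, 65535)                   as requested_mode,
           p2                                  as id1,
           p3                                  as id2,
           seconds_in_wait
    from   gv$session
    where  event = 'DFS lock handle';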

We can also find out more about the CI resource from id1 and id2. This data is fairly limited, and this is where I start to lose it. In our case id1=10, which the docs show as ‘Create Remote parallel query Server’, and id2=5, which is ‘Used to queue up interested clients’. The meaning of id2=5 makes it sound like I am queued up waiting to do some real work under another id2 value of the same enqueue. (I’ll list the other id2 values in a second so you can see what I mean.) What is odd is that its behavior is very different from other enqueue waits I am familiar with. Typically, three doubly linked lists (the owner, waiter, and converter queues) hang off the requested resource. For the CI enqueue it looks like we hop from resource to resource, from id2 to id2. So the process holding the CI enqueue (id1=10, id2=5) is not the final blocker; we might actually want the CI enqueue with id2=1, which is held by another process. (There is a query sketch just after the id2 list showing how I’ve been looking at both resources.)

[Some values for id2:

1  used to pass in parameters
2  used to invoke the function in background process
3  used to indicate the foreground has not returned
4  mounted excl, use to allocate mechanism
5  used to queue up interested clients]
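While it is happening, I’ve been looking at both resources through GV$GES_ENQUEUE with something like the query below (just a sketch; which_queue is documented as 0 = null queue, 1 = granted queue, 2 = convert queue, as I read it):

    -- Everyone on the granted/convert queues of the two CI resources:
    -- [0xa][0x5],[CI] (queue up interested clients) and [0xa][0x1],[CI] (pass in parameters).
    select inst_id, pid, owner_node, which_queue,
           grant_level, request_level, state,
           resource_name1, resource_name2
    from   gv$ges_enqueue
    where  resource_name1 like '[0xa][0x5],[CI]%'
       or  resource_name1 like '[0xa][0x1],[CI]%'
    order  by resource_name1, which_queue;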

To summarize:

· A local process, waiting on ‘DFS lock handle’, needs a background process to do work on local and/or remote nodes (e.g. a Query Coordinator needs slave parallel processes)

· The background process is pinged (more on this to follow), telling it to grab the CI enqueue with id1=10 (Create Remote parallel query Server), id2=5 (Used to queue up interested clients). This tells a specific background process what work needs to be done. Additional information (including *lock pointer, resource pointer, and resource name*) can also be sent via the message (ping) to the resource’s value block.

· While the background process works, it holds the CI enqueue, and the requesting (QC) process sits on the ‘DFS lock handle’ wait until the background process completes its work and releases the lock. This is how our process (the QC in this case) knows the necessary work is complete and it can continue. (A query sketch for watching the blocker/blocked relationship follows this list.)
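The sketch I’ve been using to watch who is blocking whom on the CI resources (GV$GES_BLOCKING_ENQUEUE is, as far as I understand it, just the blocked/blocking subset of GV$GES_ENQUEUE):

    -- CI locks that are currently blocking someone or being blocked.
    select inst_id, pid, owner_node, which_queue,
           grant_level, request_level, blocker, blocked,
           resource_name1
    from   gv$ges_blocking_enqueue
    where  resource_name1 like '%,[CI]%'
    order  by resource_name1, blocker desc;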

My understanding is that if a server process wants a lock on a CI resource, it contacts a local GES process (e.g. LMD0), which then messages the same process (LMD0) on the resource’s master node. The master then contacts the owner and the requester.
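If I’ve read the docs right, the master node for a given CI resource should be visible in GV$GES_RESOURCE; a rough sketch (I believe the node numbering is 0-based, i.e. node 0 = instance 1, but I’d be glad to be corrected):

    -- Which node masters the [0xa][0x5],[CI] resource, and whether anything
    -- is sitting on its grant/convert queues.
    select inst_id, resource_name, master_node, on_grant_q, on_convert_q
    from   gv$ges_resource
    where  resource_name like '[0xa][0x5],[CI]%';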

Messaging between the GES processes is done using Asynchronous System Traps (ASTs), which act like an interrupt to the receiving process. These come in two flavors: acquisition ASTs (AAST) and blocking ASTs (BAST).

*Some more detail*. Oracle automatically produced some ProcessState dumps (I’m not sure what triggered them or which processes were dumped). Looking at some of these, we see that the ‘ges process’ is a child of the process, not of the session. Also, the ‘ges process’ has two ‘ges enqueues’ children. Excerpts from the queues shown for the ‘ges process’ and the two ‘ges enqueue’ sections are as follows:

ges process:

    AST_Q:
    PENDING_Q:
      lp 0x1644b35440 gl KJUSERNL rl KJUSEREX rp 0x17bb7a82e0 [0xa][0x1],[CI][ext 0x0,0x0]

      *master 3 pid 29614* bast 0 rseq 1 mseq 0

      history FREE > REF_RES > LOC_AST > FREE > REF_RES > GR2CVT

      convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK

    GRANTED_Q:
      lp 0x1644b39f08 gl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext 0x0,0x0]

      *master 1 pid 29614* bast 0 rseq 19 mseq 0

      history LOC_AST_FG > CLOSE > FREE > REF_RES > GR2CVT > LOC_AST_FG

      open opt KJUSERPROCESS_OWNED

ges enqueues:

      GRANTED_Q:
        lp 0x1663ca0418 gl KJUSERPR rp 0x17bb7a82e0 [0xa][0x1],[CI][ext 0x0,0x0]

        master 3 pid 3889 bast 0 rseq 1 mseq 0

        history LOC_AST > LOC_AST > LOC_AST > LOC_AST > LOC_AST > LOC_AST

        open opt KJUSERPROCESS_OWNED

      CONVERT_Q:
        lp 0x1644b35440 gl KJUSERNL rl KJUSEREX rp 0x17bb7a82e0 [0xa][0x1],[CI][ext 0x0,0x0]

        master 3 pid 29614 bast 0 rseq 1 mseq 0

        history FREE > REF_RES > LOC_AST > FREE > REF_RES > GR2CVT

        convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK


ges enqueues:

      GRANTED_Q:
        lp 0x1644b39f08 gl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext 0x0,0x0]

        master 1 pid 29614 bast 0 rseq 19 mseq 0

        history LOC_AST_FG > CLOSE > FREE > REF_RES > GR2CVT > LOC_AST_FG

        open opt KJUSERPROCESS_OWNED

      CONVERT_Q:
        lp 0x1644b3c900 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext 0x0,0x0]

        master 1 pid 38233 bast 0 rseq 19 mseq 0

        history REF_RES > LOC_AST > CLOSE > FREE > REF_RES > GR2CVT

        convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK

        lp 0x1644efbb60 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext 0x0,0x0]

        master 1 pid 2898 bast 0 rseq 19 mseq 0

        history REF_RES > LOC_AST > CLOSE > FREE > REF_RES > GR2CVT

        convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK

So it looks like there is an additional queue (PENDING_Q), and we are trying to switch from one resource (id2=5) to another (id2=1). We are also going from a pid on node 1 to the same pid number on node 3. But if pid is an ospid (as it seems to be from the State Dump header), how does that work? OS pids are fairly arbitrary. Also, is there any way to get more data about the apparent blocker (node 3, pid 3889) without a System State dump?
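What I was planning to try next, before resorting to a full SystemState dump, is simply mapping that ospid back to a process/session on node 3, assuming the pid in the dump really is the OS pid; a sketch:

    -- Map the suspected blocker (ospid 3889 on instance 3) to a background
    -- process name and, if it has one, its session.
    select p.inst_id, p.spid, p.pid, p.pname, p.program,
           s.sid, s.serial#, s.event, s.sql_id
    from   gv$process p
           left join gv$session s
                  on s.inst_id = p.inst_id
                 and s.paddr   = p.addr
    where  p.inst_id = 3
    and    p.spid    = '3889';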

One other question. Some of the ges enqueue CONVERT_Q entries have a slightly different format: instead of ‘pid’ they show ‘owner’, and bast is 1. This looks like it is BAST (message) related, but I’m not sure.

      lp 0x12542ff100 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext 0x0,0x0]

        master 1 *owner 3 bast 1* rseq 76 mseq 0x30004

        history MSGSENT > GR2CVT > REM_AST > REM_BAST > MSGSENT > GR2CVT

        convert opt KJUSERGETVALUE

Finally, the number of ‘DFS lock handle’ waits is growing faster than I can explain. We go from tens to hundreds of thousands within an hour. No SQL statements execute at anywhere near that rate.
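To put numbers on that, I’ve been snapshotting GV$SYSTEM_EVENT a couple of minutes apart and diffing TOTAL_WAITS by hand; nothing fancier than this sketch:

    -- Run twice, a known interval apart, and diff total_waits per instance.
    select inst_id, event, total_waits,
           round(time_waited_micro / 1e6) as time_waited_secs
    from   gv$system_event
    where  event = 'DFS lock handle'
    order  by inst_id;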

So now I have a few problems. It seems like the issue is most likely either the background process at the head of the queues or something wrong with the inter-process messaging (AST/BAST).

The major places I am getting stuck are:

  1. What is at the head of the queue? From the ProcessState dump it looks like node 3, pid 3889.
  2. Is there a way to confirm this?
  3. Since this is an ospid, how can I track it? (Do I need a SystemState dump?)
  4. I assume it is a background process at the front of the line. How can I tell why it is stuck, and what it is doing?
  5. Could the issue be messaging (AST, BAST) rather than process related? How do I check?
  6. How do I get any of this information? (Using a SystemState dump? Is there another way? My current plan is sketched just below this list.)
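(For question 6, the plan I’m currently leaning towards is a cross-instance hanganalyze plus a targeted processstate dump of the suspect, and only then a global systemstate if that isn’t enough. These are just the standard oradebug incantations; levels open to debate.)

    $ sqlplus / as sysdba

    SQL> oradebug setmypid
    SQL> oradebug unlimit

    -- cross-instance hang analysis first; much cheaper than a systemstate
    SQL> oradebug -g all hanganalyze 3

    -- targeted dump of the suspected blocker (run on node 3; 3889 = its ospid)
    SQL> oradebug setospid 3889
    SQL> oradebug dump processstate 10

    -- only if still needed: global systemstate (258 = level 2 + short stacks)
    SQL> oradebug -g all dump systemstate 258

    SQL> oradebug tracefile_name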

Any suggestions as to what is going on and how to best continue to investigate this?

Thanks.

Henry

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Nov 05 2020 - 23:29:37 CET
