Chains 'stuck in the past'
Date: Fri, 12 Jul 2019 23:20:25 +0100
Message-ID: <CAP0kZ-0DavbDR+d8_OYjXFfH_9bEAEuC8CHv+z9W7-D8pbScSg_at_mail.gmail.com>
All, I am hoping someone can assist me in troubleshooting our CJQ coordinator background process.
Multitenant environment with ~200 PDBs on Exadata Oracle RAC (3 x node) - 12.1.0.2 EE with job_queue_processes = 100.
An issue cropped up a few months ago whereby arbitrary jobs (Chains) across
PDB's would cease starting/running.
To be clear, the job (chains) are NOT just being delayed. They do not
eventually execute after minutes, hours, days, weeks.
The issue is cropping up for random job chains across different pluggable
at completely varying times; we have chains that run every 10 minutes,
every hour, daily, weekly etc; and many different PDB's will run the same
chain code, but the problem will only show up in a subset of (Varying)
PDBs. Its scarily inconsistent thus far.
The symptoms are that a job chain will run successfully for a period of time (Possibly weeks) but eventually, at completely inconsistent timeframes, without any change to the chain and for random PDB's, the chain will complete a run and reschedule for the 'next_run_date'. i.e. state=SCHEDULED. The job chain however then never again invokes so becomes "stuck in the past" so to speak, with a run date which has been bypassed.
For an example, the below hourly chain was scheduled for 03:30 on July 10th
but just never kicked off. This is what we observe...
SQL> list
1* select owner, state, enabled, job_name, job_weight, job_priority,
last_start_date, next_run_date, job_type, job_style from dba_scheduler_jobs
SQL> /
OWNER STATE ENABL JOB_NAME JOB_WEIGHT JOB_PRIORITY LAST_START_DATE NEXT_RUN_DATE JOB_TYPE JOB_STYLE ------------ --------------- ----- ------------------------------ ---------- ------------ ---------------------------------------- ------------------ ----------------- APP1 SCHEDULED TRUE JUSTATESTCHAIN 1 3 10-JUL-19 02.30 AM PST8PDT 10-JUL-19 03.30 AM PST8PDT CHAIN REGULAR
What I know...
It is happening for completely different chains, with varying frequencies
and with multiples of programs or very little programs. No discernible
pattern here.
When a chain completes (But has the problem), there is nothing still
'running' from the perspective of
dba_scheduler_running_chains/dba_scheduler_running_jobs and it sets the
next_run_date as expected so the last run of the chain looks to have fully
completed.
Recreating the chain, program, steps and jobs (With the very same creation
code) causes the chain to get created from scratch and it will run again
for a period of time, and many iterations. This is what I have been doing
over the past few months.
Once the job chains get "stuck in the past" - they never eventually run - I
have left chains sit in this state for over 60 days and they do not
eventually get picked up.
This leads me to believe that it is unlikely to be a resource capacity
issue that is causing this, i.e. with job_queue_processes or slaves not
getting an opportunity to pick the job up and initiate it, or I assume
eventually, they would get picked up and run.
Manipulating job chain weight or priority does not appear to have an
impact.
I have turned on event tracing via event 27402 but nothing has stood out in
terms of the trace file.
'Normal' regular type jobs are not being impacted notably, just job chains!
Within a PDB, when a "stuck in the past" chain occurs, other jobs and
chains continue to invoke and run as usual in the PDB.
Here is the interesting bit though, for a chain that gets stuck like this... If I force that job chain to run right now, i.e. exec dbms_scheduler.run_job('JUSTATESTCHAIN');, then the chain runs but what is quite noticeable is that the REQ_START_DATE of that run from within DBA_SCHEDULER_JOB_RUN_DETAILS coincides to the originally defined run date.
i.e. for the JUSTATESTCHAIN job above, it would show REQ_START_DATE as being "10-JUL-19 03.30 AM". So I wonder, if the metadata of the job looks to be recording the real attempted submission request time of the job chain to run at a point in time, but that the CJQ is just not picking it up or allowing it to go?
This is annoying me with months and have an SR open and ongoing, little
assistance coming through though.
Would anyone have any insight into examining how or tracing how the CJQ
picks up from its queue of jobs to invoke?
Any and all help appreciated.
Regards,
Ruan
-- http://www.freelists.org/webpage/oracle-lReceived on Sat Jul 13 2019 - 00:20:25 CEST