Gaurav Verma
RMAN puzzle: database reincarnation is not in sync with catalog
A very puzzling situation..
Something very puzzling happened with the production database of our customer. Although the client had done a complete, successful backup of PROD in the morning, a subsequent backup attempt of PROD in the afternoon gave this error:
Starting backup at 04172009
released channel: ch00
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 04/17/2009 13:25:45
RMAN-03014: implicit resync of recovery catalog failed
RMAN-06004: ORACLE error from recovery catalog database:
RMAN-20011: target database incarnation is not current in recovery catalog
To counter this, the database was unregistered and re-registered so that the database incarnation and the catalog would come back in sync. Although this carried the risk of invalidating the previous PROD backups, it appeared to solve the problem, and the PROD full backup then completed successfully.
However, the very next day, the problem re-appeared! We asked the client whether any changes had been made from their side; the firm answer was "No". This added to the mystery of the problem.
First impressions..
What could have caused the same error to re-appear all of a sudden? We tried running the RMAN CROSSCHECK commands, but they all failed with the same error. A LIST INCARNATION OF DATABASE gave no output at all. It was quite puzzling.
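For reference, the commands being attempted at this stage were of this form (a sketch; the crosscheck shown is for backup sets only):
RMAN> CROSSCHECK BACKUP;
RMAN> LIST INCARNATION OF DATABASE;
Both ran into the RMAN-20011 error stack shown above, and the LIST command simply returned nothing.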
After some research, we came across Note 412113.1 on http://metalink.oracle.com, which pointed towards the possibility of some non-prod instance getting registered with RMAN catalog with the same DBID as production’s.
In addition, Notes 1076158.6 and 1061206.6 were also *probably* relevant, but the relevance was not striking.
On second thoughts..
It came to light that the client had configured the netbackup agent on another server that same day, and that server hosted a database which had last served as the Disaster Recovery database. So there might have been a conflict arising from this. Crazily enough, the other database was also called PROD!
This is the output of DBIDs and incarnations in the RMAN catalog. There is only one DBID registered for PROD, but the question was: which PROD instance was it referring to?
SQL> select name,DBID,RESETLOGS_TIME from rc_database order by 1;
NAME DBID RESETLOG
---- ---------- --------
PROD 4290406901 03272008
RCAT 356098101 03012006
SQL> select db_key,DBID,name,current_incarnation from rc_database_incarnation order by 1;
DB_KEY DBID NAME CUR
---------- ---------- -------- ---
23531 356098101 RCAT YES
623638 4290406901 PROD NO
623638 4290406901 PROD YES
623638 4290406901 PROD NO
So, in reality, what had happened was that the incarnation of the DR PROD instance had been registered in the RMAN catalog. To undo this, we reset the current database incarnation in the catalog (RMAN's RESET DATABASE TO INCARNATION command) and then did a manual resync of the catalog using the RMAN RESYNC CATALOG command.
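A minimal sketch of the sequence (the incarnation key below is a placeholder; use the key reported by LIST INCARNATION for the incarnation you want to make current):
RMAN> LIST INCARNATION OF DATABASE;
RMAN> RESET DATABASE TO INCARNATION <incarnation_key>;
RMAN> RESYNC CATALOG;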
SQL> select db_key,dbid,name,CURRENT_INCARNATION from rc_database_incarnation order by 1;
DB_KEY DBID NAME CUR
---------- ---------- -------- ---
23531 356098101 RCAT YES
623638 4290406901 PROD NO
623638 4290406901 PROD NO
623638 4290406901 PROD YES
We did not re-register any instance in the catalog. After this, the LIST INCARNATION OF DATABASE command started working fine, and we could list the existing backups and other details. A new full online backup of PROD was started and completed successfully.
As an added precaution, the instance name on the other server was changed from "PROD" to "DRTEST".
10g optimizer case study: Runtime Execution issues with View merging
Preface
After the 10g upgrade of our client's 11i Applications system, we were a little gung-ho, but the self-congratulation was short-lived: we discovered that a large number of custom concurrent programs were taking longer to run.
Aswath Rao, who had solved many performance issues before with the HOTSOS tool, handled it smartly. In this article, we talk about how he approached the problem, and what else could have been looked into.
Essentially, after some SQL plan analysis, he found that the merging of some views was causing a poorly executing plan to be selected (a plan that probably looked good to the 10g optimizer, since the stats were recent). He finally got around it with a couple of time-tested hints that prevent view merging. It took quite a bit of trial and testing, but avoiding the merging worked.
First thoughts..
When I first heard about it, my thoughts went to an initialization parameter called optimizer_secure_view_merging, and when I searched on Google, I came across http://oracledb.wordpress.com/2007/04/10/execution-plans-differents-with-different-users/, which talks about something similar. Now, interestingly, if you run the bde_chk_cbo.sql script (obtainable from Metalink), it says that this parameter should be set to FALSE, and Note 216205.1 (Mandatory parameters for 11i Apps) corroborates the same. It was set to its expected value of FALSE (the setting at which, per the documentation, Oracle does not use view merging or predicate move-around). Still, view merging was definitely happening.
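For reference, the check itself is a one-liner:
SQL> show parameter optimizer_secure_view_merging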
Where's the bottleneck?
There was a custom concurrent program called XXMRPSTKTRANS whose runtime had suddenly changed (we had the history of previous concurrent requests). An SQL of the format INSERT INTO XXCI.XXMRP_STOCK_TRANSFER_RPT_ITEMS <followed by a select query> was taking approximately 97% of the total concurrent request's runtime, and the query used 967 million logical I/Os.
Let's look at the Hotsos profiling output:
Let's look deeper..
From the SQL traces, it could be seen that this query was dragging us down:
SELECT PLANS.COMPILE_DESIGNATOR,
ITEMS.ORGANIZATION_ID,
ITEMS.ORGANIZATION_CODE,
ITEMS.PLANNER_CODE,
ITEMS.INVENTORY_ITEM_ID,
ITEMS.DESCRIPTION, ITEMS.FULL_LEAD_TIME, MSS.SAFETY_STOCK_QUANTITY, NULL
FROM MRP_PLANS_SC_V PLANS,
MRP_SYSTEM_ITEMS ITEMS,
MTL_CATEGORY_SETS CATSETS,
MTL_CATEGORIES CAT,
MTL_ITEM_CATEGORIES ITEMC,
(SELECT INVENTORY_ITEM_ID, ORGANIZATION_ID,
SAFETY_STOCK_QUANTITY,EFFECTIVITY_DATE
FROM MTL_SAFETY_STOCKS MSS1
WHERE EFFECTIVITY_DATE = (SELECT MAX(EFFECTIVITY_DATE)
FROM MTL_SAFETY_STOCKS
WHERE INVENTORY_ITEM_ID = MSS1.INVENTORY_ITEM_ID
AND ORGANIZATION_ID = MSS1.ORGANIZATION_ID
GROUP BY INVENTORY_ITEM_ID, ORGANIZATION_ID ) ) MSS
WHERE PLANS.COMPILE_DESIGNATOR = ITEMS.COMPILE_DESIGNATOR
AND PLANS.PLANNED_ORGANIZATION = ITEMS.ORGANIZATION_ID
AND CATSETS.CATEGORY_SET_ID = ITEMC.CATEGORY_SET_ID
AND CATSETS.STRUCTURE_ID = CAT.STRUCTURE_ID
AND CAT.CATEGORY_ID = ITEMC.CATEGORY_ID
AND ITEMC.INVENTORY_ITEM_ID = ITEMS.INVENTORY_ITEM_ID
AND ITEMC.ORGANIZATION_ID = ITEMS.ORGANIZATION_ID
AND ITEMS.INVENTORY_ITEM_ID = MSS.INVENTORY_ITEM_ID(+)
AND ITEMS.ORGANIZATION_ID = MSS.ORGANIZATION_ID(+)
AND CATSETS.CATEGORY_SET_NAME = :B8
AND ( (:B7 IS NOT NULL AND ITEMS.PLANNER_CODE = :B7 ) OR (:B7 IS NULL) )
AND PLANS.COMPILE_DESIGNATOR = :B6
AND ((:B5 IS NOT NULL AND LENGTH(:B5 ) < 4 AND ITEMS.ORGANIZATION_CODE = :B5 )
OR (:B5 IS NOT NULL AND LENGTH(:B5 ) > 3 AND ITEMS.ORGANIZATION_CODE IN
(SELECT LOOKUP_CODE FROM
FND_LOOKUP_VALUES WHERE
LOOKUP_TYPE = :B5 AND
LANGUAGE = 'US' ) )
OR (:B5 IS NULL ) )
AND ((:B4 IS NOT NULL AND CAT.SEGMENT1 = :B4 )OR (:B4 IS NULL))
AND ((:B3 IS NOT NULL AND CAT.SEGMENT2 = :B3 )OR (:B3 IS NULL))
AND ((:B2 IS NOT NULL AND CAT.SEGMENT3 = :B2 )OR (:B2 IS NULL))
AND ((:B1 IS NOT NULL AND CAT.SEGMENT4 = :B1 ) OR (:B1 IS NULL));
Playing with plans – Worst, Better, Best..
Having determined the SQL on which the tuning effort must be focused, let's dive into the different SQL plans observed (with and without hints) and the accompanying logical I/O stats.
WITHOUT HINTS - 35 MILLION CONSISTENT GETS (Not good)
Let's talk about the out-of-the-box plan (no hints) first. Obviously, there is a lot of detail here and it is difficult to absorb at a single glance, but the key thing to note is that the cost is actually very low (103), so on paper this looks like a great plan. Sure, there are a couple of full table scans, but those tables are really small. The caveat with this plan is that its real-time performance is really bad due to heavy index scans, which in turn contribute to excessive CPU time.
ORDERED HINT - 15 MILLION CONSISTENT GETS
It got a little better with the ORDERED hint, although the plan cost went up a bit:
NO_MERGE(CATSETS) NO_MERGE(CAT) NO_MERGE (PLANS) - 26952 CONSISTENT GETS
Better than the ORDERED hint was the NO_MERGE hint on each of the views, which brought the consistent gets down to ~27,000, all the way from 35 million! Whoa, that was a drastic change. Note that the cost of this plan is 380, compared to 1634 for the ORDERED plan. Interesting observation.
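Presumably the change amounted to little more than adding the hint specification at the top of the statement, along these lines (only the first few lines are shown; the rest of the query is unchanged from the listing above):
SELECT /*+ NO_MERGE(CATSETS) NO_MERGE(CAT) NO_MERGE(PLANS) */
       PLANS.COMPILE_DESIGNATOR,
       ITEMS.ORGANIZATION_ID,
       ...
FROM MRP_PLANS_SC_V PLANS,
     MRP_SYSTEM_ITEMS ITEMS,
     MTL_CATEGORY_SETS CATSETS,
     ...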
Moral of the story?
Well, the NO_MERGE hint definitely made things better when used on all the views and inline views in the query. The 10g optimizer was definitely having a challenge with this part. Also, while the compile-time plan showed a lower cost, the runtime performance was far worse.
I am also forced to think that if we had played with the optimizer_index_cost_adj parameter a bit at the session level, maybe we could have made some indexes look more expensive and hence pushed them down the optimizer's priority list. But since this idea did not get tested, it is just a thought at this point (it would not have been a sustainable solution anyway, but I have seen it be effective at the session level before, and it can be tried in unwieldy situations).
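Had we tried it, the session-level experiment would have been as simple as this (the value 200 is purely illustrative; the default is 100, and larger values make index access paths look costlier to the optimizer):
SQL> alter session set optimizer_index_cost_adj = 200;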
In retrospect, note that the MRP_PLANS_SC_V internally refers to three more views! This is probably what makes this sql plan a little more special.
SQL> select REFERENCED_NAME,REFERENCED_TYPE,REFERENCED_LINK_NAME,DEPENDENCY_TYPE from dba_dependencies where name like 'MRP_PLANS_SC_V' and REFERENCED_TYPE like 'VIEW';
REFERENCED_NAME REFERENCED_TYPE REFERENCED_LINK_NAME DEPENDENCY_TYPE
------------------------- ------------------------- ------------------------- -------------------------
MFG_LOOKUPS VIEW HARD
MRP_DESIGNATORS_VIEW VIEW HARD
MRP_ORGANIZATIONS_V VIEW HARD
A digressive (?) discussion..
As this wasn't the only SQL tuning issue the customer encountered, it might well point to an innate issue with the 10g optimizer workload statistics stored in the sys.aux_stats$ table. Although the customer had gathered a representative workload sample of system stats using DBMS_STATS.GATHER_SYSTEM_STATS('START') and DBMS_STATS.GATHER_SYSTEM_STATS('STOP'), the 10g optimizer did not seem to be churning out plans that executed well.
An interesting fact is that the customer's production machine has 32 CPUs, 192G of RAM and a really fast I/O sub-system, so maybe the workload parameters estimated through the APIs were overly optimistic? Maybe during the sampling the optimizer API said: wow, this is cool, I can really do a lot. It sounds like a possible cause, but one would argue that the Oracle API should be smart enough to account for it. Interestingly, the values for CPUSPEEDNW (the parameter used without workload simulation) and CPUSPEED are not very far from each other.
Here is how the stats look right now:
SQL> select sname, pname, pval1 from sys.aux_stats$;
SNAME PNAME PVAL1
------------- ---------- ----------
SYSSTATS_INFO STATUS
SYSSTATS_INFO DSTART
SYSSTATS_INFO DSTOP
SYSSTATS_INFO FLAGS 0
SYSSTATS_MAIN CPUSPEEDNW 983.218164
SYSSTATS_MAIN IOSEEKTIM 10
SYSSTATS_MAIN IOTFRSPEED 4096
SYSSTATS_MAIN SREADTIM 3.866
SYSSTATS_MAIN MREADTIM 5.203
SYSSTATS_MAIN CPUSPEED 972
SYSSTATS_MAIN MBRC 7
SYSSTATS_MAIN MAXTHR 79357952
SYSSTATS_MAIN SLAVETHR 3072
13 rows selected.
One should be aware that these parameters can be set manually using the DBMS_STATS.SET_SYSTEM_STATS API. While you do find some general information on the web, unfortunately not a lot of specific case study material is available on how manually changing each parameter affects optimizer behavior. I think this is also where most of us fail to understand the true potential of the 10g optimizer, both before and after the upgrade.
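For illustration only, overriding individual workload statistics looks like this (the pname values are the ones listed in sys.aux_stats$ above; the numbers here are made up):
BEGIN
  DBMS_STATS.SET_SYSTEM_STATS(pname => 'CPUSPEED', pvalue => 972);
  DBMS_STATS.SET_SYSTEM_STATS(pname => 'MBRC',     pvalue => 16);
END;
/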
I sincerely wish that someone authoritative from the product or server technologies support team comes up with informative material on them.
The Cluster said: "I forbid you to migrate SYSTEM tablespace to local"
Introducing Nitin Ramesh
In this article, I would like to introduce Nitin Ramesh, an associate consultant with Oracle GSD India, Bangalore. This is the first time I am having another Oracle employee and co-worker publish articles on this blog, and I am excited. We solve many problems every day for our ~35 customers on a 24x7 basis, but not many of them make it into articles to help others in need. If only we did this on a regular basis, a lot of problems could be solved by a mere "google".
Coming back to Nitin, he is a pretty creative person and likes to solve intriguing problems. I haven't seen many people do that with sincerity. It is hoped that with this publication, he will come forth with more interesting articles and end up having his own blog.
Credit: The following article has been written by him and edited later on for presentation purposes.
Preface
In this article, we will talk about a scenario in which we were required to migrate the SYSTEM tablespace to local extent management. What did you say? Piece of cake? We thought so too!
The process seemed simple enough to be tried out first in the test environment. But it was not all that simple: the first attempt ended with an error.
SQL> EXECUTE DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL('SYSTEM');
BEGIN DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL('SYSTEM'); END;
*
ERROR at line 1:
ORA-06550: line 1, column 7:
PLS-00201: identifier 'DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL' must be
declared
ORA-06550: line 1, column 7:
PL/SQL: Statement ignored
The issue was suspected to be the database not being in the proper mount mode to run the command, so it was rerun after starting the database with the following options:
SQL> startup mount restrict exclusive
and
SQL> startup restrict
But, the results were not positive.
The next suspicion was on the package itself. So that angle had to be cleared...
SQL> select object_name, object_type, status from dba_objects where object_name = 'DBMS_SPACE_ADMIN';
OBJECT_NAME OBJECT_TYPE STATUS
-------------------- ------------------- -------
DBMS_SPACE_ADMIN PACKAGE VALID
DBMS_SPACE_ADMIN PACKAGE BODY VALID
DBMS_SPACE_ADMIN SYNONYM VALID
Hmm, no problem there either.
Maybe the problem was with invoker rights, so we tried granting explicit execute privilege to SYS, but we got the same (lack of) results:
SQL> grant execute any procedure to sys;
What else can we check?
The issue was retested after setting the event '201 trace name errorstack, level 3' at the session level. This was tried with both startup options, "startup restrict" and "startup mount restrict exclusive", but that did not seem to help.
SQL> show parameter event;
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
event string 10298 trace name context forever, level 32
SQL> EXECUTE DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL('SYSTEM');
BEGIN DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL('SYSTEM'); END;
*
ERROR at line 1:
ORA-10643: Database should be mounted in restricted mode and Exclusive mode
ORA-06512: at "SYS.DBMS_SPACE_ADMIN", line 227
ORA-06512: at line 1
How about checking if all the DB options in DBA_REGISTRY were valid?
SQL> select comp_name, version, status from dba_registry;
COMP_NAME VERSION STATUS
-------------------------------------- ------------------------------ ---------------------------------
Oracle Enterprise Manager 10.2.0.3.0 VALID
Oracle XML Database 10.2.0.3.0 VALID
Oracle Text 10.2.0.3.0 VALID
Spatial 10.2.0.3.0 VALID
Oracle interMedia 10.2.0.3.0 VALID
Oracle Database Catalog Views 10.2.0.3.0 VALID
Oracle Database Packages and Types 10.2.0.3.0 VALID
JServer JAVA Virtual Machine 10.2.0.3.0 VALID
Oracle Database Java Packages 10.2.0.3.0 VALID
Oracle XDK 10.2.0.3.0 VALID
Oracle Real Application Clusters 10.2.0.3.0 VALID
All right, no problems there. How about if this were a compatibility issue?
SQL> show parameter compatible;
NAME TYPE VALUE
------------------------------------ ----------- -----------
compatible string 10.2.0
How about any freak triggers during startup:
SQL> select * from dba_triggers where TRIGGERING_EVENT='STARTUP';
no rows selected
Enter Oracle Support
Since we had checked all we could, it was time to get an opinion from Product Support. The matter was referred to Kevin Cook from the Advanced Resolution (AR) team. He suggested retrying with the STARTUP UPGRADE option, but before doing so, we tried setting the parameter o7_dictionary_accessibility to FALSE. The result was still not what we wanted. Finally, the STARTUP UPGRADE option had to be used to make it work, but with a different twist to it.
When nothing works, fall back to the simplest case. Oh, did we forget to mention that this was a RAC instance? The catch was that the cluster had to be disabled and the UNDO tablespace of the other instance had to be dropped.
How was it done?
All the tablespaces were made read only, other than the SYSTEM, TEMP, SYSAUX and UNDO tablespaces.
SQL> ALTER TABLESPACE <tablespace_name> READ ONLY;
Let's verify it:
SQL> select t.tablespace_name, EXTENT_MANAGEMENT, ENABLED
from dba_tablespaces t, v$datafile d, dba_data_files f
where t.tablespace_name = f.tablespace_name
and f.file_id = d.file#;
We then determined the undo tablespace used by the current instance:
SQL> sho parameter UNDO_TABLESPACE;
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
undo_tablespace string UNDO1
Since UNDO tablespaces cannot be placed in READ ONLY mode, the undo tablespaces belonging to the other instances have to be dropped before converting the SYSTEM tablespace and re-created afterwards. So, we dropped the UNDO tablespace used by the other instance (UNDO2).
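A sketch of that step (the datafile path and size are hypothetical):
SQL> DROP TABLESPACE UNDO2 INCLUDING CONTENTS AND DATAFILES;
and, once the SYSTEM migration below is complete:
SQL> CREATE UNDO TABLESPACE UNDO2 DATAFILE '/ORACLE/data/prod/undo02.dbf' SIZE 2000M;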
Now, the database was started with the STARTUP UPGRADE option, BUT the cluster_database parameter was set to false in this temp init.ora file:
SQL> startup upgrade pfile='initDUAT1.ora_tmp_oct17';
ORACLE instance started.
Total System Global Area 1.0754E+10 bytes
Fixed Size 2174464 bytes
Variable Size 2889634304 bytes
Database Buffers 7851737088 bytes
Redo Buffers 10649600 bytes
Database mounted.
Database opened.
Take the SYSAUX tablespace offline..
SQL> ALTER TABLESPACE sysaux OFFLINE;
Conversion starts…
Execute the DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL procedure, specifying the SYSTEM tablespace:
SQL> EXECUTE DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL('SYSTEM');
PL/SQL procedure successfully completed.
Verify your change by querying DBA_TABLESPACES:
SQL> SELECT tablespace_name, extent_management FROM dba_tablespaces WHERE tablespace_name = 'SYSTEM';
TABLESPACE_NAME EXTENT_MAN
------------------------------ -------------------
SYSTEM LOCAL
Once all tablespaces are locally managed within a database, the SYS.FET$ table does not contain any rows:
SQL> select * from sys.fet$;
no rows selected
Issue the following command to disable RESTRICTED mode:
SQL> ALTER SYSTEM DISABLE RESTRICTED SESSION;
System altered.
Issue the following command for each tablespace you placed in READ ONLY mode to return it to READ WRITE:
SQL> ALTER TABLESPACE <tablespace_name> READ WRITE;
Make the SYSAUX tablespace online:
SQL> ALTER TABLESPACE sysaux ONLINE;
Conclusion
A simple exercise turned out to be a learning experience, because the API apparently has a dependency on, or conflict with, some internal RAC views. Once RAC was disabled and the UNDO tablespace of the other RAC instance was dropped, the migration API completed successfully. The moral of the story is: when in doubt, use single instance mode.
Deleting statistics and/or dropping indexes on global temporary tables can help too
Post Upgrade Performance Pangs..
After a 10g upgrade of production 11.5.9 Oracle Applications instances, one of our clients was seeing some performance problems related to pricing, particularly with a custom concurrent program called XXONT_PROGRESS_NOTIFIED.
This triggered the onsite senior DBA, Brian Fane, into action. While watching the currently executing SQLs of this job, this statement was seen to be executing fairly frequently:
SELECT list_header_id
FROM qp_preq_qual_tmp
WHERE list_header_id = :b4
AND line_index = :b3
AND list_line_id = -1
AND pricing_status_code = :b2
AND pricing_phase_id = :b1
AND ROWNUM = 1;
In fact, based on one AWR report he saw, it was about 19% of the gets and 5% of the total elapsed time between 0800 and 1300 hrs. It ran nearly a million times. The execution plan looked like this:
CHILD_NUMBER QUERY
0 SELECT STATEMENT [ALL_ROWS] Cost=1
0 COUNT STOPKEY
0 TABLE ACCESS BY INDEX ROWID QP_PREQ_QUAL_TMP_T Cost=0 Card=1 Bytes=82
0 INDEX SKIP SCAN XXQP_QP_PREQ_QUAL_TMP_T_N1 Cost=0 Card=1 Bytes=
In DEV, the query was not using XXQP_QP_PREQ_QUAL_TMP_T_N1, but instead was using QP_PREQ_QUAL_TMP_N1.
Taking another approach, in a BCV copy of PROD, he tried removing the statistics and also updating them: removing them caused a full table scan, while updating them reverted to the PROD-like behavior of using XXQP_QP_PREQ_QUAL_TMP_T_N1.
So the variant in the situation was the custom index, and the question was whether it was needed at all. He decided to look a little further into this and found that there were currently 11 queries in the SQL area referring to this index, yet somehow the custom index wasn't really helping. This SQL was useful for finding the current SQLs using the custom index:
SELECT DISTINCT old_hash_value
FROM v$sql_plan
JOIN v$sqltext_with_newlines
using (sql_id)
JOIN v$sql
using (sql_id)
WHERE object_name = 'XXQP_QP_PREQ_QUAL_TMP_T_N1'
AND object_owner = 'QP';
So, was this index being used in 9i at all? Was there another way to confirm this for PROD? Luckily, by conscious choice, we had taken a backup of the 9i perfstat schema, which also contained the historical SQL plans used by the system (this happens with snap_level >= 7). Brian proceeded to import the perfstat data into the BCV copy and ran this query to determine the SQL plans that had used the custom index XXQP_QP_PREQ_QUAL_TMP_T_N1:
SELECT /*+ PARALLEL(ssp, 8) */
DISTINCT plan_hash_value
FROM perfstat.stats$sql_plan ssp
WHERE object_owner = 'QP'
AND object_name = 'XXQP_QP_PREQ_QUAL_TMP_T_N1';
This query showed that there were no queries in at least the last 30 days of PROD's 9i life in which this custom index was used. Simply speaking, this index was not being used in the 9i PROD instance.
Brian then worked with a developer to get a script that invokes this piece of code, for testing purposes. He thought of dropping the index and letting the CBO either do a full table scan or use another index (after gathering statistics).
Testing Results..
Based on the code Brian got, and by doing two things -- deleting any statistics that existed on the QP_PREQ_QUAL_TMP_T table and dropping the custom index XXQP_QP_PREQ_QUAL_TMP_T_N1 -- he did see an improvement in pricing performance. Here is a summary of the pre-change performance:
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 841 0.13 0.12 0 0 0 0
Execute 36094 8.19 8.79 285 371446 118861 36701
Fetch 30680 23.21 24.20 21307 3083457 0 41171
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 67615 31.53 33.12 21592 3454903 118861 77872
After making the changes:
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1018 0.18 0.24 0 0 0 0
Execute 36273 8.66 9.38 31 369674 106981 36701
Fetch 30891 12.19 12.47 0 1464382 0 41352
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 68182 21.03 22.10 31 1834056 106981 78053
The new performance shows gets being about half of the pre-change value, and elapsed time is about two thirds of the previous performance.
Here's the code for the deletion of statistics and dropping the index:
BEGIN
dbms_stats.delete_table_stats(ownname => 'QP', tabname => 'QP_PREQ_QUAL_TMP_T', cascade_indexes => true);
END;
/
DROP INDEX QP.XXQP_QP_PREQ_QUAL_TMP_T_N1;
To back out:
CREATE INDEX QP.XXQP_QP_PREQ_QUAL_TMP_T_N1 ON
QP.QP_PREQ_QUAL_TMP_T
(UNIQUE_KEY, PRICING_STATUS_CODE, PRICING_PHASE_ID);
BEGIN
fnd_stats.gather_table_stats(ownname => 'QP', tabname => 'QP_PREQ_QUAL_TMP_T', CASCADE => FALSE);
fnd_stats.gather_index_stats(ownname => 'QP', indname => 'XXQP_QP_PREQ_QUAL_TMP_T_N1');
END;
/
One question could be: why did we not create a 10g SQL profile (the successor to 9i's stored outlines) to fix the SQL at hand? The caveat there is that you would need a SQL profile for every possible SQL that might use QP.QP_PREQ_QUAL_TMP_T, which is not an easy proposition.
Even with statistics gathered on all related objects, the CBO may not choose the best plan for a statement with a really high number of executions. On another note, it pays to keep a history of SQL plans from working scenarios. To nudge the CBO towards the optimal plan, it may be necessary to drop some indexes and thereby force a less costly plan (over a million executions here).
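For context, creating a 10g SQL profile for that one statement would go roughly along these lines (a sketch; the sql_id, task name and profile name are hypothetical), which illustrates why this only pins down a single statement at a time:
DECLARE
  l_task VARCHAR2(64);
BEGIN
  -- create and run a tuning task for one specific cursor
  l_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(sql_id => '<sql_id>', task_name => 'qp_tmp_tune_1');
  DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => l_task);
  -- if the advisor recommends a profile, accept it
  DBMS_SQLTUNE.ACCEPT_SQL_PROFILE(task_name => l_task, name => 'qp_tmp_profile_1');
END;
/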
Note: This article is being produced with the explicit permission of Brian Fane, and is aimed at sharing tips in the Oracle world for other people who might be in the same situation.
Out Of Time? - Use The Shotgun or Reboot..
Preface
Last weekend was pretty exciting. We were upgrading our customer's production 11.5.9 Oracle Applications instance to 10.2.0.3 and part of the effort was applying the ATG Family pack H.RUP6 patch. We were getting some really weird errors while doing the production run, something we had never seen before in the pre-production rounds (we did 6 of them to be exact). The possibility of rolling back seemed quite real.
Essentially, the AD.I.6 patchset failed while running autoconfig implicitly (the patch completed, but autoconfig was failing). Interestingly, Autoconfig was running fine on the admin/concurrent tier, but not working on the web/forms tier. Just to add to the details, the APPL_TOPs for the middle tiers (shared across multiple tiers) and Admin/concurrent tier were different.
This article talks about how we came over this issue. It is hoped it will help someone else in the future too.
A Race against time..
It was a perplexing situation. We had budgeted/negotiated 36 hrs for the entire downtime (Sun JRE Vista patching, HRMS family pack K3 and the 10g upgrade), and time was slipping by. We couldn't afford to lose more of it.
Truncated Class File?
The error we were getting on the web/forms tier was as follows:
middle_tier_1:web_prod> ./adautocfg.sh
Enter the APPS user password:
The log file for this session is located at:
/ORACLE/apps/prod/admin/prod_middle_tier_1/log/11151046/adconfig.log
AutoConfig is configuring the Applications environment...
AutoConfig will consider the custom templates if present.
Using APPL_TOP location : /ORACLE/apps/prod
Classpath : /usr/java/j2sdk1.4.2_07/jre/lib/rt.jar:/usr/java/j2sdk1.4.2_07/lib/dt.jar:/usr/java/j2sdk1.4.2_07/lib/tools.jar:/ORACLE/apps/prod/common/java/appsborg2.zip:/ORA
CLE/apps/prod/common/java
Exception in thread "main" java.lang.ClassFormatError: oracle/apps/ad/autoconfig/InstantiateFile
(Truncated class file)
at java.lang.ClassLoader.defineClass0(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:539)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
at oracle.apps.ad.tools.configuration.Customizer.getProdTopDrivers(Customizer.java:380)
at oracle.apps.ad.tools.configuration.Customizer.getAllDrivers(Customizer.java:358)
at oracle.apps.ad.tools.configuration.VersionConflictListGenerator.getAllConflicts(VersionConflictListGenerator.java:170)
at oracle.apps.ad.tools.configuration.VersionConflictListGenerator.main(VersionConflictListGenerator.java:426)
ERROR: Version Conflicts utility failed.
Terminate.
All right, so a class file is truncated. The very first thing we thought of was that regenerating the jar files through the adadmin utility might fix it. But it did not help.
Another thought was that a "Version Conflicts utility failed" message typically means something is pointing to incorrect code, the most common reason being the JDK version. Someone pointed out that the JDK we were using was 1.4.2_07, whereas 1.3.1 was certified with 11.5.9. But since our AD level was pretty high, we didn't think this was relevant; the application had been working fine with JDK 1.4.2_07 before.
Yet another person thought of removing the entry for the Web tier from the TNS topology model itself. But, as can be seen here, even that idea did not work.
$ perl $AD_TOP/bin/adgentns.pl appspass=XXXXXX contextfile=$APPL_TOP/admin/prod_middle_tier_1.xml
##########################################################################
Generate Tns Names
##########################################################################
Logfile: /ORACLE/apps/prod/admin/prod_middle_tier_1/log/11151856/NetServiceHandler.log
Classpath :
/usr/java/j2sdk1.4.2_07/jre/lib/rt.jar:/usr/java/j2sdk1.4.2_07/lib/dt.jar:/usr/java/j2sdk1.4.2_07/lib/tools.jar:/ORACLE/apps/prod/common/java/appsborg2.zip:/ORACLE/apps/prod/common/java
Updating s_tnsmode to 'generateTNS'
UpdateContext exited with status: 0
Exception in thread "main" java.lang.ClassFormatError: oracle/apps/ad/tools/configuration/NetServiceHandler (Truncated class file)
at java.lang.ClassLoader.defineClass(ClassLoader.java:539)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
at java.lang.ClassLoader.loadClass(ClassLoader.java:235
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
Error generating tnsnames.ora from the database, temperory tnsnames.ora will be generated using
templates
Instantiating Tools tnsnames.ora
Exception in thread "main" java.lang.ClassFormatError: oracle/apps/ad/autoconfig/InstantiateFile
(Truncated class file)
at java.lang.ClassLoader.defineClass0(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:539)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
Error in instantiating tools tnsnames.ora:
Exception in thread "main" java.lang.ClassFormatError: oracle/apps/ad/autoconfig/InstantiateFile (Truncated class file)
at java.lang.ClassLoader.defineClass0(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:539)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
Error in instantiating web tnsnames.ora:
adgentns.pl exiting with status 256
ERRORCODE = 256 ERRORCODE_END
Well, so far, no luck. Then we realized that the context file is also uploaded into the database by OAM functionality (remember the OAM screen through which you can edit the autoconfig tokens?). We had thoroughly checked the context file of the middle tier, but maybe there was corruption in the database copy? We came across some Metalink notes (e.g. Note 463895.1, Note 352916.1) which talked about purging all uploaded context files (kept in applsys.FND_OAM_CONTEXT_FILES). So potentially, we could back up the applsys.FND_OAM_CONTEXT_FILES table, delete the entry corresponding to middle_tier_1 and then run autoconfig on middle_tier_1 to reload the data. This was a realistic possibility.
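A sketch of what that clean-up would have looked like (we never actually ran it, and the PATH predicate used to single out the middle_tier_1 entry is only illustrative):
SQL> create table applsys.fnd_oam_context_files_bak as select * from applsys.fnd_oam_context_files;
SQL> delete from applsys.fnd_oam_context_files where path like '%prod_middle_tier_1.xml%';
SQL> commit;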
Wait a minute .. Can we take the shotgun approach?
OK, so while we were thinking over all these options, it occurred to me that we had 5 middle tiers, so why not try running autoconfig on one of the other middle tiers? Most surprisingly, autoconfig ran fine on another middle tier!!
Hmph, all of them were sharing the same $JAVA_TOP, so what could make the other middle tiers behave differently? Maybe a difference in an autoconfig token or an environment variable? But even after comparing all the entries in the XML file and the environment variables, there was no perceptible difference. Strange, very strange.
This was a good validation point for us, as we could conclude that there was something wrong with middle_tier_1, but not with the others. So, we could simply continue patching with another middle tier and hope to not get similar issues in the future. middle_tier_1 could be taken out of the load balancer, dealt with later and when fixed, put back in the load balancer. In the interim, it sounded like a workable strategy, so we took it. This is what saved us from spinning our wheels on this issue.
So what was it, really?
We continued patching and successfully migrated the production instance to 10g (Yippee!).
One thing did strike me though much later: the versions of the InstantiateFile.class were different in $JAVA_TOP and $AD_TOP when we were getting this error. For example, this was the situation when we were having this error:
middle_tier_1:web_prod> adident Header $JAVA_TOP/oracle/apps/ad/autoconfig/InstantiateFile.class
/ORACLE/apps/prod/common/java/oracle/apps/ad/autoconfig/InstantiateFile.class:
$Header InstantiateFile.java 115.212 2007/07/10 11:20:16 schagant ship $
middle_tier_1:web_prod> adident Header $AD_TOP/java/oracle/apps/ad/autoconfig/InstantiateFile.class
/ORACLE/apps/prod/ad/11.5.0/java/oracle/apps/ad/autoconfig/InstantiateFile.class:
$Header InstantiateFile.java 115.203 2006/11/01 08:05:36 subhroy ship $
Later on, after the outage, when I checked the versions again, this is what I saw:
middle_tier_1:web_prod> adident Header $JAVA_TOP/oracle/apps/ad/autoconfig/InstantiateFile.class
/ORACLE/apps/prod/common/java/oracle/apps/ad/autoconfig/InstantiateFile.class:
$Header InstantiateFile.java 115.212 2007/07/10 11:20:16 schagant ship $
middle_tier_1:web_prod> adident Header $AD_TOP/java/oracle/apps/ad/autoconfig/InstantiateFile.class
/ORACLE/apps/prod/ad/11.5.0/java/oracle/apps/ad/autoconfig/InstantiateFile.class:
$Header InstantiateFile.java 115.212 2006/11/01 08:05:36 subhroy ship $
I am now forced to think that maybe this was a contributor to the issue at hand. Had we known that this .class file had multiple copies across the APPL_TOP and JAVA_TOP, we could have simply run adadmin with the Maintain files -> copy files to destination option to sync up the duplicate copies. But then, why didn't the other middle tiers complain? The APPL_TOP and JAVA_TOP were shared amongst all of them. So maybe it wasn't the real problem.
Later on, when we rebooted the middle_tier_1 box (RHAS3 32-bit Linux), autoconfig ran successfully! The idea of rebooting was based on Note 556107.1, "Java.Lang.Classformaterror: oracle/apps/ibe/store/StoreCurrency (Truncated class file)". If the reboot was really what resolved the issue, it must have been some corrupted shared module in memory which got cleared out.
Come to think of it, when nothing seems to work on Windows, we do reboot it.
In Retrospect..
So, I don't have all the answers, but a little lateral thinking saved us and enabled us to proceed. Sometimes the shotgun approach does work. It pays to have multiple middle tiers in the architecture, so that there is no single point of failure. Also, sometimes rebooting can resolve some really weird errors.
I'm reminded of a phrase in Hindi -- अकल बड़ी या भैंस ? (loose translation: sometimes brute force is superior to elaborate reasoning). You can be the judge.
The CIO asked: How long has my production database been down?
And I had no answer for him. I couldn't blame him; CIOs want to know this kind of information. Better still, he would have liked a pie chart depicting it.
I wish..
Well, for once, it would have been nice if Oracle 9i or 10g kept historical startup and shutdown information in a v$ or some dba_* table. A simple query off that table would have got me the answer.
Anyways, it set me thinking. There are always other ways to get the same information. Statspack or AWR was one possibility, but I was not sure whether they really keep detailed historical information about when the instance was shut down or started up -- they sure are aware if there was an instance restart between two snaps.
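Incidentally, in 10g the AWR repository does keep instance startup times (though not shutdown times) in DBA_HIST_DATABASE_INSTANCE, so a quick check along these lines could complement the alert log approach below:
SQL> select instance_name, startup_time from dba_hist_database_instance order by startup_time;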
An Aha moment..
But wait, the database alert log has information about each startup and shutdown! So if we could mine the alert log for the right regular expressions, and then compute the time difference between timestamps, it could be done.
This method would not give the overall downtime for the production system, including the downtime of the middle tiers or the Apache web server, although the same idea could probably be extended to those services; in this article, the scope is just the database. There is an auxiliary script (get_epoch.sh) supplied here that is useful in this quest.
Auxiliary script: get_epoch.sh
Also available for download here.
#!/usr/bin/perl
#
# The format of input date is: Thu Jun 5 21:15:48 2008
#
# NOTE: The format of `date` command is: Thu Jun 5 21:15:48 EDT 2008
# -- it has the timezone in the output as well
# BUT this script does not assume that since the timestamps in
# alert log dont have the timezone in it
#
# This script heavily uses this function to convert a UTC timestamp
# into seconds after 1 jan 1970:
# timelocal($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst);
use Time::Local;
my $wday = $ARGV[0];
my $month = $ARGV[1];
# convert the month shortname into 0-11 number
if ( $month eq "Jan" ) { $mon = 0 }
elsif ( $month eq "Feb" ) { $mon = 1 }
elsif ( $month eq "Mar" ) { $mon = 2 }
elsif ( $month eq "Apr" ) { $mon = 3 }
elsif ( $month eq "May" ) { $mon = 4 }
elsif ( $month eq "Jun" ) { $mon = 5 }
elsif ( $month eq "Jul" ) { $mon = 6 }
elsif ( $month eq "Aug" ) { $mon = 7 }
elsif ( $month eq "Sep" ) { $mon = 8 }
elsif ( $month eq "Oct" ) { $mon = 9 }
elsif ( $month eq "Nov" ) { $mon = 10 }
elsif ( $month eq "Dec" ) { $mon = 11 };
my $mday = $ARGV[2];
# initialize time variable and split hours (24 hr format), minutes, seconds into an array
my $time = $ARGV[3];
@time = split /:/, $time;
# if the timezone is left out of the input, the position of year becomes 5th in ARGV
my $year = $ARGV[4];
#######################################################################################
# I found that by excluding $wday, the seconds results (EPOCH) is more
# accurate, so $wday parameter has been omitted from the call to
# timelocal() function.
#######################################################################################
$epoch= timelocal($time[2], $time[1], $time[0], $mday, $mon, $year);
print "$epoch\n";
The main script..
Due to formatting issues, the main script is available for download here.
Sample Usage and output..
I realized that it would probably make more sense to have an optional cutoff date to calculate the downtime from, so that was added in version 2 of the script. Version 1, which calculates the downtime from the first database startup time, is uploaded here.
sandbox:sandbox> ./calculate_downtime.sh
Correct Usage: calculate_downtime.sh alertlogfilepath [cutoff_date in format Sat Jun 7 08:49:34 2008]
sandbox:sandbox> ./calculate_downtime.sh $DATA_DIR/admin/bdump/alert*.log Fri Mar 28 15:20:59 2008
Cutoff date is : Fri Mar 28 15:20:59 2008
Shutdown times:
Timestamp -- epoch (seconds)
Wed Jan 9 17:53:08 2008 - 1199919188
Wed Jan 16 12:05:09 2008 - 1200503109
Fri Jan 18 11:19:42 2008 - 1200673182
Thu Jan 24 17:34:15 2008 - 1201214055
Fri Feb 15 09:00:44 2008 - 1203084044
Wed Feb 20 16:50:14 2008 - 1203544214
Wed Mar 12 12:43:26 2008 - 1205340206
Fri Mar 28 15:21:59 2008 - 1206732119
Thu Apr 3 11:03:52 2008 - 1207235032
Thu Apr 3 11:10:20 2008 - 1207235420
Thu Apr 3 11:15:44 2008 - 1207235744
Thu Apr 3 11:22:38 2008 - 1207236158
Thu Apr 3 11:27:36 2008 - 1207236456
Thu Apr 3 11:34:35 2008 - 1207236875
Thu Apr 3 11:41:36 2008 - 1207237296
Mon May 12 14:17:13 2008 - 1210616233
Thu Jun 5 10:36:58 2008 - 1212676618
shutdown_counter=17
Startup times:
Timestamp -- epoch (seconds)
Wed Jan 9 17:50:42 2008 -- 1199919042
Thu Jan 10 09:43:18 2008 -- 1199976198
Thu Jan 17 12:00:03 2008 -- 1200589203
Fri Jan 18 11:26:13 2008 -- 1200673573
Wed Jan 30 12:19:21 2008 -- 1201713561
Tue Feb 19 22:57:38 2008 -- 1203479858
Wed Mar 12 12:39:03 2008 -- 1205339943
Mon Mar 24 13:44:20 2008 -- 1206380660
Thu Apr 3 11:00:33 2008 -- 1207234833
Thu Apr 3 11:07:12 2008 -- 1207235232
Thu Apr 3 11:14:01 2008 -- 1207235641
Thu Apr 3 11:20:54 2008 -- 1207236054
Thu Apr 3 11:25:25 2008 -- 1207236325
Thu Apr 3 11:31:53 2008 -- 1207236713
Thu Apr 3 11:40:18 2008 -- 1207237218
Tue Apr 29 16:50:49 2008 -- 1209502249
Mon Jun 2 14:20:38 2008 -- 1212430838
Thu Jun 5 10:38:39 2008 -- 1212676719
startup_counter=18
As per the alert log, The instance is currently up
Here are the downtime windows ...
Wed Jan 9 17:50:42 2008 -- Wed Jan 9 17:53:08 2008
Thu Jan 10 09:43:18 2008 -- Wed Jan 16 12:05:09 2008
Thu Jan 17 12:00:03 2008 -- Fri Jan 18 11:19:42 2008
Fri Jan 18 11:26:13 2008 -- Thu Jan 24 17:34:15 2008
Wed Jan 30 12:19:21 2008 -- Fri Feb 15 09:00:44 2008
Tue Feb 19 22:57:38 2008 -- Wed Feb 20 16:50:14 2008
Wed Mar 12 12:39:03 2008 -- Wed Mar 12 12:43:26 2008
Mon Mar 24 13:44:20 2008 -- Fri Mar 28 15:21:59 2008
Thu Apr 3 11:00:33 2008 -- Thu Apr 3 11:03:52 2008
Thu Apr 3 11:07:12 2008 -- Thu Apr 3 11:10:20 2008
Thu Apr 3 11:14:01 2008 -- Thu Apr 3 11:15:44 2008
Thu Apr 3 11:20:54 2008 -- Thu Apr 3 11:22:38 2008
Thu Apr 3 11:25:25 2008 -- Thu Apr 3 11:27:36 2008
Thu Apr 3 11:31:53 2008 -- Thu Apr 3 11:34:35 2008
Thu Apr 3 11:40:18 2008 -- Thu Apr 3 11:41:36 2008
Tue Apr 29 16:50:49 2008 -- Mon May 12 14:17:13 2008
Mon Jun 2 14:20:38 2008 -- Thu Jun 5 10:36:58 2008
Thu Jun 5 10:38:39 2008 --
Downtime 1 : Wed Jan 9 17:53:08 2008 (1199919188) --> Thu Jan 10 09:43:18 2008 (1199976198) = 57010 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Wed Jan 9 17:53:08 2008 - so not accruing
Running Cumulative downtime = 0 seconds
Downtime 2 : Wed Jan 16 12:05:09 2008 (1200503109) --> Thu Jan 17 12:00:03 2008 (1200589203) = 86094 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Wed Jan 16 12:05:09 2008 - so not accruing
Running Cumulative downtime = 0 seconds
Downtime 3 : Fri Jan 18 11:19:42 2008 (1200673182) --> Fri Jan 18 11:26:13 2008 (1200673573) = 391 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Fri Jan 18 11:19:42 2008 - so not accruing
Running Cumulative downtime = 0 seconds
Downtime 4 : Thu Jan 24 17:34:15 2008 (1201214055) --> Wed Jan 30 12:19:21 2008 (1201713561) = 499506 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Thu Jan 24 17:34:15 2008 - so not accruing
Running Cumulative downtime = 0 seconds
Downtime 5 : Fri Feb 15 09:00:44 2008 (1203084044) --> Tue Feb 19 22:57:38 2008 (1203479858) = 395814 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Fri Feb 15 09:00:44 2008 - so not accruing
Running Cumulative downtime = 0 seconds
Downtime 6 : Wed Feb 20 16:50:14 2008 (1203544214) --> Wed Mar 12 12:39:03 2008 (1205339943) = 1795729 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Wed Feb 20 16:50:14 2008 - so not accruing
Running Cumulative downtime = 0 seconds
Downtime 7 : Wed Mar 12 12:43:26 2008 (1205340206) --> Mon Mar 24 13:44:20 2008 (1206380660) = 1040454 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Wed Mar 12 12:43:26 2008 - so not accruing
Running Cumulative downtime = 0 seconds
Downtime 8 : Fri Mar 28 15:21:59 2008 (1206732119) --> Thu Apr 3 11:00:33 2008 (1207234833) = 502714 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Fri Mar 28 15:21:59 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 502714 seconds
Downtime 9 : Thu Apr 3 11:03:52 2008 (1207235032) --> Thu Apr 3 11:07:12 2008 (1207235232) = 200 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr 3 11:03:52 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 502914 seconds
Downtime 10 : Thu Apr 3 11:10:20 2008 (1207235420) --> Thu Apr 3 11:14:01 2008 (1207235641) = 221 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr 3 11:10:20 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 503135 seconds
Downtime 11 : Thu Apr 3 11:15:44 2008 (1207235744) --> Thu Apr 3 11:20:54 2008 (1207236054) = 310 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr 3 11:15:44 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 503445 seconds
Downtime 12 : Thu Apr 3 11:22:38 2008 (1207236158) --> Thu Apr 3 11:25:25 2008 (1207236325) = 167 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr 3 11:22:38 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 503612 seconds
Downtime 13 : Thu Apr 3 11:27:36 2008 (1207236456) --> Thu Apr 3 11:31:53 2008 (1207236713) = 257 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr 3 11:27:36 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 503869 seconds
Downtime 14 : Thu Apr 3 11:34:35 2008 (1207236875) --> Thu Apr 3 11:40:18 2008 (1207237218) = 343 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr 3 11:34:35 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 504212 seconds
Downtime 15 : Thu Apr 3 11:41:36 2008 (1207237296) --> Tue Apr 29 16:50:49 2008 (1209502249) = 2264953 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr 3 11:41:36 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 2769165 seconds
Downtime 16 : Mon May 12 14:17:13 2008 (1210616233) --> Mon Jun 2 14:20:38 2008 (1212430838) = 1814605 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Mon May 12 14:17:13 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 4583770 seconds
Downtime 17 : Thu Jun 5 10:36:58 2008 (1212676618) --> Thu Jun 5 10:38:39 2008 (1212676719) = 101 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Jun 5 10:36:58 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 4583871 seconds
Calculating lifetime of instance as per criteria specified..
Starting time being used = Fri Mar 28 15:20:59 2008 -- 1206732059
Ending time epoch = Sat Jun 7 11:49:49 2008 -- 1212853789
Total lifetime in seconds = 6121730
Beginning Fri Mar 28 15:20:59 2008, The instance was down 74 % of the time
Application of this data..
Now, this data could be put into some kind of dashboard for CIO meetings. It would give them an approximate idea of how long their databases (and the dependent middle tier or admin tier services) remain down due to maintenance. Sure, this method cannot distinguish between unplanned and planned maintenance, but it's probably a good start.
Yet another External XML parse error: for BATCH CLOSE processing of IBYSCHEDULER module .. this time
While I have written a previous article on the XML parsing error for online iPayment transactions in Oracle Applications 11i ("For the want of XML parsing, iPayment was lost; for the want of not being able to take payment, business was lost."), the customer had always suffered from another issue, with the batch close processing carried out by the IBYSCHEDULER module (iPayment Scheduler).
Unknown to me, another person from the customer's production support had been working diligently with Oracle Support and development to have this addressed. This is the story of Patrick Baker, who gets the credit for having it resolved, over a period of one long year. Interestingly, the solution was to use another copy of the xmlparserv2 archive file on the concurrent manager tier.
The version of Oracle Applications was 11.5.10.2 (as per FND & ATG) and 11.5.9 for some other products.
The problem and error message
Almost every night, the customer would get this error in the BATCH CLOSE processing program (the details of the servers and domain name have been blurred out to protect the data integrity of the customer):
+---------------------------------------------------------------------------+
iPayment: Version : 11.5.0 - Development
Copyright (c) 1979, 1999, Oracle Corporation. All rights reserved.
IBYSCHEDULER module: iPayment Scheduler
+---------------------------------------------------------------------------+
Current system time is 28-MAR-2007 00:16:23
+---------------------------------------------------------------------------+
Processing BATCH CLOSE operations ..
empty batch for account (payee id=087295,account id=XXXXXXX Corp:087295:944599:XXXXXX:944599:CORPORAT,batch id=3343)
exception occured for (payee id=087295,account id=XXXXXXXX Corp:087295:944599:XXXXXXX:944599:CORPORAT,batch id=3344) External XML parse error. Document passed to iPayment by external application http://ipayment.xxxxxxxx.com:8000/servlet/oramipp_ptk generated XML parse error Start of root element expected. .
The Stack Trace is -
oracle.apps.iby.exception.PSException: External XML parse error. Document passed to iPayment by external application http://payment.xxxxxxxx.com:8000/servlet/oramipp_ptk generated XML parse error Start of root element expected. .
at oracle.apps.iby.util.bpsUtil.raiseException(bpsUtil.java:159)
at oracle.apps.iby.net.XMLMessenger.deliverDoc(XMLMessenger.java:138)
at oracle.apps.iby.payment.proc.BatchCCPayment.closeBatch(BatchCCPayment.java:1147)
at oracle.apps.iby.scheduler.SchedBatchClose.schedPmt(SchedBatchClose.java:124)
at oracle.apps.iby.scheduler.Scheduler.doProcess(Scheduler.java:260)
at oracle.apps.iby.scheduler.Scheduler.init(Scheduler.java:297)
at oracle.apps.iby.scheduler.SchedInitiator.runProgram(SchedInitiator.java:200)
at oracle.apps.fnd.cp.request.Run.main(Run.java:161)
Finished processing BATCH CLOSE
+---------------------------------------------------------------------------+
Start of log messages from FND_FILE
+---------------------------------------------------------------------------+
+---------------------------------------------------------------------------+
End of log messages from FND_FILE
+---------------------------------------------------------------------------+
Successfully resubmitted concurrent program IBYSCHEDULER with request ID 32185967 to start at 29-MAR-2007 00:15:56 (ROUTINE=AFPSRS)
+---------------------------------------------------------------------------+
Executing request completion options...
Finished executing request completion options.
+---------------------------------------------------------------------------+
Concurrent request completed
Current system time is 28-MAR-2007 00:22:09
+---------------------------------------------------------------------------+
Let's think a bit..
OK, let's try to make some sense out of it. Since this error was being received in the output of a concurrent request, the xmlparser class file being used was obviously the one on the concurrent manager tier, and NOT the one on the iPayment tier (which is used by the online transactions).
A different solution..
On the dedicated iPayment tier, the same error message had been resolved by referencing $JAVA_TOP/xmlparserv2.zip in $IAS_ORACLE_HOME/Apache/Jserv/etc/jserv.properties, but in this case the solution was to use $IAS_ORACLE_HOME/xdk/lib/xmlparserv2.jar instead, in different files:
To implement the solution, open the file $APPL_TOP/admin/adovars.env on the concurrent manager tier and, in the values of the CLASSPATH and AF_CLASSPATH variables, add $IAS_ORACLE_HOME/xdk/lib/xmlparserv2.jar: before $JAVA_TOP/appsborg2.zip:
Note: this must be done in both CLASSPATH and AF_CLASSPATH. If you cut and paste, make sure the content of each variable stays on ONE SINGLE LINE, otherwise the value will get corrupted.
To prevent the entries in $APPL_TOP/admin/adovars.env from being over-written, you can either add this at the end of the file between the # BEGIN Customizations and # END Customizations tags, or create your own custom autoconfig template file for adovars.env with the changes.
E.g.
# Begin customizations
....
CLASSPATH=......:$IAS_ORACLE_HOME/xdk/lib/xmlparserv2.jar:$JAVA_TOP/appsborg2.zip:....
....
....
AF_CLASSPATH=......:$IAS_ORACLE_HOME/xdk/lib/xmlparserv2.jar:$JAVA_TOP/appsborg2.zip:....
....
# End customizations
After this, the concurrent managers need to be bounced, and then the "IBYSCHEDULER module: iPayment Scheduler" program gives the expected results.
A new learning..
From this experience, it now seems possible to make $IAS_ORACLE_HOME/xdk/lib/xmlparserv2.jar work for the parsing needs of the online transactions on the dedicated iPayment tier too.
For that, you need to put the following entry in $IAS_ORACLE_HOME/Apache/Jserv/etc/jserv.properties, between the # BEGIN customizations and # END customizations tags (at the END of the file):
# BEGIN customizations
...
...
wrapper.classpath=/ORACLE/qa/9iAS/xdk/lib/xmlparserv2.jar
...
...
# END customizations
Vrroom! Go (Con)Figure : Speeding up the model preloading in configurator...
What was triggered by a human mistake in a cloned instance at a customer environment led to a new discovery by the onsite senior DBA, Brian Fane.
Someone from the customer released a sysadmin custom job that caused the configurator servlet JVMs in Production to get bounced! This was an unexpected situation, and while the users sat twiddling their thumbs as the configurator models were pre-loading, Brian started digging into the currently executing SQL of the DB sessions doing the preloading.
To give an idea of the patch levels, the CZ patchset installed was 11i.CZ.I(2690432) and the FND & ATG patchsets were 11i.FND.H(3262159) and 11i.ATG_PF.H.5(5473858) respectively.
He happened to note that the maximum time seemed to be taken by the following sql:
SELECT cz_lce_load_specs.lce_header_id,
cz_lce_texts.lce_text,
cz_lce_headers.component_id
FROM apps.cz_lce_load_specs,
apps.cz_lce_headers,
apps.cz_lce_texts
WHERE cz_lce_load_specs.attachment_comp_id = :1
AND cz_lce_load_specs.net_type = :2
AND cz_lce_texts.lce_header_id = cz_lce_load_specs.lce_header_id
AND cz_lce_texts.lce_header_id = cz_lce_headers.lce_header_id
AND cz_lce_headers.deleted_flag = :3
AND cz_lce_load_specs.deleted_flag = :4
ORDER BY cz_lce_texts.seq_nbr;
When he ran it in a BCV copy of PROD (a replica of PROD as of that morning, refreshed daily), it did ~37,400 gets per execution, and almost all of them were physical reads. Most of this came from a full table scan on CZ_LCE_LOAD_SPECS, which is right around 37,150 blocks. Query time was between 1 and 2 seconds (TOAD doesn't get any more precise when dealing with values > 1 second).
His Solution...
So well, he thought, why not add an index to cz_lce_load_specs that may speed up the query:
CREATE INDEX cz.blf_test ON cz.cz_lce_load_specs
(attachment_comp_id, net_type, deleted_flag)
COMPUTE STATISTICS;
This reduced the gets per execution from ~37,400 to ~20, and the execution time dropped to ~10 ms.
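As a rough way of confirming the improvement afterwards, the gets per execution can be re-checked from v$sql; the sql_text filter below is just illustrative:
SELECT executions, buffer_gets, disk_reads,
       ROUND(buffer_gets / GREATEST(executions, 1)) gets_per_exec
  FROM v$sql
 WHERE sql_text LIKE 'SELECT cz_lce_load_specs.lce_header_id%';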
Some more data points... Let's try some testing in DEV and see how this index performs in the wild :)
DECLARE
CURSOR c_driver IS
SELECT attachment_comp_id, net_type
FROM apps.cz_lce_load_specs
WHERE deleted_flag = '0'
AND ROWNUM < 100;
counter NUMBER := 0;
v4 varchar2(10) := '0';
v1 number;
v2 varchar2(2000);
v3 number;
BEGIN
FOR r_driver IN c_driver LOOP
BEGIN
SELECT cz_lce_load_specs.lce_header_id,
cz_lce_texts.lce_text,
cz_lce_headers.component_id
INTO v1, v2, v3
FROM apps.cz_lce_load_specs,
apps.cz_lce_headers,
apps.cz_lce_texts
WHERE cz_lce_load_specs.attachment_comp_id = r_driver.attachment_comp_id
AND cz_lce_load_specs.net_type = r_driver.net_type
AND cz_lce_texts.lce_header_id = cz_lce_load_specs.lce_header_id
AND cz_lce_texts.lce_header_id = cz_lce_headers.lce_header_id
AND cz_lce_headers.deleted_flag = v4
AND cz_lce_load_specs.deleted_flag = v4
ORDER BY cz_lce_texts.seq_nbr;
EXCEPTION
WHEN others THEN NULL;
END;
END LOOP;
END;
/
Observation data without the index creation on cz_lce_load_specs:
1) First passive configurator model loading (preloading):
Web response time | Buffer gets | Disk reads | Executions | Sql time for execution | Buffer gets/execution
2:14.6            | 2,403,145   | 35,551     | 68         | 110.45                 | 35,340
2:02.0            | 2,367,735   | 0          | 67         | 94.66                  | 35,340
2:03.3            | 2,367,735   | 0          | 68         | 94.7                   | 35,340
2) The models are already preloaded now and they are being launched:
Web response time | Buffer gets | Disk reads | Executions
0:05.6            | 0           | 0          | 0
0:06.4            | 0           | 0          | 0
0:05.5            | 0           | 0          | 0
3) Launch a configurator model, change an attribute of a selection for an item or sub-item and save the configuration (do not submit an order yet):
Web response time | Buffer gets | Disk reads | Executions | Sql time for execution | Buffer gets/execution
0:30.1            | 600,259     | 3          | 17         | 23.76                  | 35,309
0:05.9            | 0           | 0          | 0          | -                      | -
0:04.3            | 0           | 0          | 0          | -                      | -
Observation data WITH the index creation on cz_lce_load_specs:
1) First passive load of the configurator model (preloading):
Web response time | Buffer gets | Disk reads | Executions | Sql time for execution | Buffer gets/execution
0:22.7            | 3,240       | 3          | 67         | 0.03                   | 48
0:23.1            | 3,141       | 0          | 67         | 0.04                   | 47
0:26.1            | 3,141       | 0          | 67         | 0.02                   | 47
2) First WebUI launch, after the preloading with the index:
Web response time | Buffer gets | Disk reads | Executions
0:04.3            | 0           | 0          | 0
0:04.4            | 0           | 0          | 0
0:04.8            | 0           | 0          | 0
3) Launch a configurator model, change an attribute of a selection for an item or sub-item and save the configuration (do not submit an order yet):
Web response time | Buffer gets | Disk reads | Executions | Sql time for execution | Buffer gets/execution
0:05.1            | 209         | 0          | 17         | 0.01                   | 12
0:04.8            | 0           | 0          | 0          | -                      | -
0:04.0            | 0           | 0          | 0          | -                      | -
Real time benefits.. To illustrate this point better after putting it in production, here is a quick comparison of the preload timings of some configurator models that were grouped into 2 configurator servlet JVMs:
Before the index:
GROUP29
Max preloading time was 857 secs taken by M91R
Min preloading time was 852 secs taken by M91
Avg preloading time was 0:14:13 ( 853 seconds ) secs/model over a total of 6 models
GROUP30
Max preloading time was 1684 secs taken by TDXSR-CG
Min preloading time was 1302 secs taken by TDXSCSEAT
Avg preloading time was 0:25:56 ( 1556 seconds ) secs/model over a total of 13 models
After the index:
GROUP29
Max preloading time was 54 secs taken by M91R
Min preloading time was 53 secs taken by M91-C
Avg preloading time was 0:0:53 ( 53 seconds ) secs/model over a total of 6 models
GROUP30
Max preloading time was 96 secs taken by TDXSC2-CG
Min preloading time was 82 secs taken by TDXSR-HD
Avg preloading time was 0:1:30 ( 90 seconds ) secs/model over a total of 14 models
Conclusion..Although this index could have been provided by Oracle out of the box, it was a lucky discovery that speeds up the preloading time for configurator models. With the new index, it is now possible for the client to bounce the configurator JVMs anytime and have the 10-20 most popular models preloaded within 2-3 mins, as compared to 10-15 mins before, which is really something.
Interestingly, the same columns got indexed in the 11.5.10 release, as can be seen on http://etrm.oracle.com:
Note: This article is being produced with the explicit permission of Brian Fane, and is aimed at sharing tips in the Oracle world for other people who might be in the same situation.
The right combination of options to trace configurator servlet..
A long while back, we had faced a weird situation with configurator and desperately needed to trace it to find out what was happening behind the scenes.
Despite consulting many a Metalink note, and however hard I tried, the tracing would simply not happen. The right sequence of characters and options is quite tricky and not very well documented (not surprising, as Configurator is an acquired product).
After many attempts, the tracing did work, and this article is more of a reminder to myself and any other people out there who want to trace the configurator servlet. Be reminded, though, that this tracing is more like FRD log tracing (for Oracle Forms) and is NOT sql tracing.
Having said so, here are the magical options that you need to add to $IAS_ORACLE_HOME/Apache/Jserv/etc/zone.properties file:
zone.properties
==========
....
....
servlets.default.initArgs=cz.activemodel=|/tl=@$TRACEDIR/ciotracesupport.lce| /gt=$TRACEDIR/ciotracesupport.java|/nolp|/nodp|/noatp|
....
....
where $TRACEDIR is the directory in which the trace (.lce) file will be produced. Also, please note that the /gt= option may appear wrapped onto the next line for readability; in reality, it needs to be on the same line as the /tl= option.
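For example, with a hypothetical trace directory of /tmp/cztrace, the whole entry would read as one single line:
servlets.default.initArgs=cz.activemodel=|/tl=@/tmp/cztrace/ciotracesupport.lce|/gt=/tmp/cztrace/ciotracesupport.java|/nolp|/nodp|/noatp|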
If anyone of you reading this article has a better idea than this, please let me know through the comment mechanism.
So you thought you upgraded your OLAP database to 10g..
Well, so did we. In fact, the dba_registry view showed the version of OLAP option as 10.2.0.3. Here, see it for yourself:
1 select comp_name , version, status from dba_registry
2* where comp_name like '%Analytic%'
SQL> /
COMP_NAME VERSION STATUS
-------------------------- ------------ -------
OLAP Analytic Workspace 10.2.0.3.0 VALID
Seems pretty convincing, does it not? One would think so.
An unexpected error..But, a month later, when the customer tried to do an archive and restore of a demand plan, they received the following error:
Demand Planning: Version : 11.5.0
Copyright (c) 1979, 1999, Oracle Corporation. All rights
reserved.
+---------------------------------------------------------------------------+
Current system time is 30-MAY-2008 08:24:19
+---------------------------------------------------------------------------+
**Starts**30-MAY-2008 08:24:19**Ends**30-MAY-2008 08:24:21
ORA-35071: EIF file ARCH4027.eif cannot be imported because
analytic workspace ODPCODE has not been upgraded to version 10.0.0.0.
Start of log messages from FND_FILE
+---------------------------------------------------------------------------+
Error in restore process: see batch log for details
ORA-35071: EIF file ARCH4027.eif cannot be imported because
analytic workspace ODPCODE has not been upgraded to version 10.0.0.0.
+---------------------------------------------------------------------------+
End of log messages from FND_FILE
+---------------------------------------------------------------------------+
Executing request completion options...
Finished executing request completion options.
+---------------------------------------------------------------------------+
Exceptions posted by this request:
Concurrent Request for "Restore demand plan" has
completed with error.
Metalink, our friend in need..So we turned to our friend, Metalink, and came across Note 390004.1, which describes the case where Analytic Workspaces have been imported into a different database release but are still at the old version.
Eh! So we tried to confirm the same, and found that there were indeed 4 analytic workspaces still at version 9.1, including the ODPCODE AW:
SQL> select aw_name , aw_version
from all_aws
where aw_version like '9%';
AW_NAME AW_V
------------------------------ ----
ZPBANNOT 9.1
ZPBDATA 9.1
MSD4029 9.1
ODPCODE 9.1
Strangely, the other AWs were all 10.2:
SQL> select aw_name , aw_version
from all_aws
where aw_version like '10%';
AW_NAME AW_V
------------------------------ ----
EXPRESS 10.2
AWMD 10.2
AWCREATE 10.2
AWCREATE10G 10.2
AWXML 10.2
AWREPORT 10.2
XWDEVKIT 10.2
MMSD4027 10.2
MSD4027 10.2
MMMSD4027 10.2
PMMSD4027A0 10.2
PMMSD4027A21 10.2
PMMSD4027A22 10.2
PMMSD4027A26 10.2
PMMSD4027A32 10.2
PMMSD4027A33 10.2
PMMSD4027A55 10.2
PMMSD4027A57 10.2
PMMSD4027A58 10.2
MSD4027A58 10.2
PMMSD4027A78 10.2
PMMSD4027A79 10.2
PMMSD4027A80 10.2
PMMSD4027A86 10.2
PMMSD4027A87 10.2
PMMSD4027A88 10.2
PMMSD4027A93 10.2
PMMSD4027A95 10.2
PMMSD4029A 10.2
PMMSD4029A5 10.2
PMMSD4029A3 10.2
PMMSD4029A1 10.2
MMMSD4029 10.2
MMSD4029 10.2
ZPBCODE 10.2
35 rows selected.
SQL>
Where did our OLAP upgrade procedure go wrong..So the question before us was: how did our production upgrade procedure, the one we thought was so meticulously planned and executed, go wrong?
For upgrading the ODP workspaces, we had followed Note 418200.1 - ODP: Steps When Upgrading the RDBMS Version Used With ODP. We had even got a clarification from the author of that note about what steps were needed for the ODP workspaces upgrade.
Unfortunately, the following note does not have any direct reference to upgrading OLAP workspaces:
Note 362203.1 - Interoperability Notes: Oracle Applications Release 11i with Oracle 10g Release 2 (10.2.0)
We had found the following Notes on OLAP upgrade/exporting/clone:
472215.1 How to Export AWs or a Plan in Demand Planning
418200.1 ODP Steps When Upgrading the RDBMS Version Used With ODP
412308.1 How to Check the Version for ODP 11.5.10?
339110.1 How To Clone the OLAP Portion of ODP 11.5.10
Note 418200.1 is the most relevant for the upgrade activity, but unfortunately, even that does not have any reference to checking the aw_version column of each workspace in the dba_aws view and using the dbms_aw.convert API for the workspaces that did not get upgraded for some reason.
So that is definitely one thing that should be part of that note. We will be sending a note to the author to include these steps in Note 418200.1.
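As a reference, here is a minimal sketch of that missing step - checking all_aws and converting whatever is still at a 9.x version. It assumes you are connected as the owner of those workspaces, and it is subject to the gotchas described below:
-- run as the workspace owner (e.g. APPS); pass only the AW name, no owner prefix
BEGIN
  FOR aw IN (SELECT aw_name FROM all_aws WHERE aw_version LIKE '9%') LOOP
    dbms_aw.convert(aw.aw_name);
  END LOOP;
END;
/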
Another gotcha while implementing the solution (exec dbms_aw.convert API) is that it does not accept the OWNER name of the workspace in the first argument. Note 390004.1 actually sets a false expectation on this front.
Also, you need to be logged in as the owner of the workspace while issuing the dbms_aw API or you would get this error:
SQL> show user
USER is "SYS"
SQL> exec dbms_aw.convert('ODPCODE');
BEGIN dbms_aw.convert('ODPCODE'); END;
*
ERROR at line 1:
ORA-33262: Analytic workspace ODPCODE does not exist.
ORA-06512: at "SYS.DBMS_AW", line 58
ORA-06512: at "SYS.DBMS_AW", line 134
ORA-06512: at "SYS.DBMS_AW", line 1245
ORA-06512: at line 1
Oh, Oh.. Another Gotcha while implementing the solution..
To make matters worse, through a copy-paste, I specified the owner name along with the workspace name, like this:
SQL> exec dbms_aw.convert('apps.odpcode');
It came back with an error to the effect that you cannot specify the owner name.
After that, the name of the AW got changed to ODPCODE_ (with a trailing underscore), though it did get upgraded to 10.2 as per dba_aws/all_aws.
SQL> select aw_name, aw_version
from all_aws
where aw_name like 'ODP%';
AW_NAME AW_V
------------------------------ ----
ODPCODE_ 10.2
Essentially, the dbms_aw.convert had left the ODPCODE workspace in a transient state and not done a good job of cleaning up after the exception.
A workaround for cleaning up...Well, there is an API called dbms_aw.aw_rename that would do the trick of putting the name back to ODPCODE like this:
SQL> show user
USER is APPS
SQL> exec dbms_aw.aw_rename('ODPCODE_','ODPCODE');
PL/SQL procedure successfully completed.
SQL> select aw_name, aw_version from all_aws where aw_name='ODPCODE';
AW_NAME AW_V
------------------------------ ----
ODPCODE 10.2
Now just to make sure that it was really 10.2, we ran the dbms_aw.convert API again and got this message, which implied that it was already in 10.2 format:
SQL> exec dbms_aw.convert('ODPCODE');
BEGIN dbms_aw.convert('ODPCODE'); END;
*
ERROR at line 1:
ORA-33298: Analytic Workspace ODPCODE is already in the newest format allowed
by the current compatibility setting
ORA-06512: at "SYS.DBMS_AW", line 1267
ORA-06512: at line 1
After this, the MSDRESTORE module: Restore demand plan concurrent job completed successfully.
Conclusion...There were multiple learnings from this Sev 1 experience in the production instance:
- Ask your customer to test exhaustively. In this case, the customer is a very small shop with an IT department of 5-6 people and is yet to graduate to enterprise-level thinking for managing their IT systems.
Unfortunately, in this case, we had very little say in their testing plan.
- In this case, we were caught a little off guard because we did not have exhaustive expertise in OLAP upgrades and had to rely on published Metalink notes, which did not cross-link all the known upgrade issues.
- While executing the dbms_aw APIs, it is best to be logged in as the owner of the analytical workspace being worked on, or you might land up in a worse soup than you signed up for.
Time saving tactic: How we saved 6 hrs of downtime in production 10g upgrade
Preface..So you want to upgrade your database from 9i to 10g. Well, welcome to the club.
If you have RAC, then you will definitely need to install the CRS and database binaries, along with some RDBMS patches.
When we were challenged with keeping the downtime to within 24 hrs by a customer, that set us thinking as to what time saving tactics could be employed to achieve this end.
Divide and conquer the downtime..The old strategy of divide and conquer is time tested and works well in most paradigms. In this article, we demonstrate how it's possible to split the 10g upgrade downtime into two logical parts:
Part 1)
Install the 10g technology stack ahead of time, including the database, any patchsets (usually 10.2.0.3) and RDBMS patches. Before doing this, we shut down the 9i RAC/DB processes. After the 10g CRS/DB technology stack installation, you shut down the 10g CRS, bring up the 9i RAC processes and carry on as if nothing happened.
At this stage, 9i and 10g technology stack will co-exist peacefully with each other. The production system can run on 9i RAC for a week or more till you decide to do the actual upgrade.
Part 2)
In the subsequent main upgrade outage, you can shut down the 9i RAC/DB, bring up the DB using the 10g CRS/DB Oracle home and do the actual upgrade. This outage could last anywhere from 10-24 hrs or even up to 32 hrs depending on your pre-production timing practice. It is assumed that one would do at least 2-3 practice rounds of the upgrade before gaining the confidence to do the production round.
Adopting this split strategy saved us about 6 hrs in the main upgrade downtime window and we were able to do the upgrade in a window of 16-20 hrs. The size of the database was ~1 TB on HP-UX PA RISC 64 bit OS.
How was it done..When you do a 9i->10g upgrade for RAC, the following happens:
1) OCR device location is automatically picked up from /var/opt/oracle/srvConfig.loc (9i setup file)
2) The OCR device's contents is upgraded to 10g format
3) A new file called /var/opt/oracle/ocr.loc is created with the OCR device name
Since we had to preserve the 9i OCR device for running 9i RAC after the 10g CRS/DB techstack installation, we did the following:
1) Got a new set of OCR and Voting device for 10g. This was a separate set of devices in addition to the 9i OCR and voting disks. Then we brought down the 9i CRS processes.
2) copied the contents of 9i OCR into the 10g OCR device using the dd command:
$ whoami
oracle
$ dd if=/dev/rawdev/raw_9i_ocr.dbf of=/dev/rawdev/raw_10g_ocr.dbf bs=1024
3) Edited the srvConfig.loc file to point to the /dev/rawdev/raw_10g_ocr.dbf file
4) Did the 10g CRS installation and ran root.sh that upgraded the OCR device to 10g format
5) Brought down the 10g CRS services with $ORA_CRS_HOME/bin/crsctl stop crs and also disabled the automatic startup of CRS with /etc/init.d/init.crs disable
6) Re-pointed the 9i OCR device back in /var/opt/oracle/srvConfig.loc as /dev/rawdev/raw_9i_ocr.dbf
Then we brought up the 9i CRS services again and then brought up the Database with 9i binaries.
Puzzling RMAN: No channel to restore a backup or copy of datafile?
An RMAN puzzle
In this article, we will look at a puzzling situation that happened at one of our environments, when we were carrying out an RMAN restoration. The RMAN backup was taken using a catalog (not using controlfile) and our mission was to restore PROD to UAT. Simple, right?
Wrong!
An Arrrgh! Moment..
We were using the following restoration script:
RMAN> run
2> {
3> set until time "to_date('05/13/08 14:40:00','MM/DD/YY hh24:mi:ss')";
4> allocate auxiliary channel 'adisk_0' type DISK;
5> allocate auxiliary channel 'adisk_1' type DISK;
6> allocate auxiliary channel 'adisk_2' type DISK;
7> allocate auxiliary channel 'adisk_3' type DISK;
8> DUPLICATE TARGET DATABASE TO UAT;
9> }
The backup location was available and so were adequate backup channels, but we always ended up getting this laconic error:
...
...
Starting restore at 13-MAY-08
released channel: adisk_0
released channel: adisk_1
released channel: adisk_2
released channel: adisk_3
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 05/13/2008 15:25:38
RMAN-03015: error occurred in stored script Memory Script
RMAN-06026: some targets not found - aborting restore
RMAN-06100: no channel to restore a backup or copy of datafile 471
RMAN-06100: no channel to restore a backup or copy of datafile 466
RMAN>
RMAN> **end-of-file**
RMAN>
Recovery Manager complete.
No channel to restore?...Huh? Well, as you can see, the channels were very well defined, and even the RMAN log showed that the channels were being allocated. We even tried increasing the number of channels to 8 or 10, and also tried allocating auxiliary channels (which was irrelevant), but to no avail.
Another option that was tried was a plain and simple restoration using the RMAN> restore datafile 1; command. Although that failed, since the path of datafile 1 had not been switched to UAT's convention, it did not prove anything.
My teammates, Dhandapani Perungulam and Sandeep Rebba were trying out different options. Then someone had the bright idea of checking the timestamp of the completed backup. It was possible that we were trying to restore to a point in time before the backup had completed.
Verifying the RMAN backup completion time from the catalog..First of all, we needed to set the NLS_DATE_FORMAT variable to include the HH24:MI format:
$ export NLS_DATE_FORMAT='dd-mon-rr hh24:mi'
Then we needed to connect to the target (source instance for replication in RMAN terminology) and RMAN catalog:
$ rman catalog rman@rman target sys@prd
Recovery Manager: Release 10.2.0.3.0 - Production on Fri May 23 14:34:02 2008
Copyright (c) 1982, 2005, Oracle. All rights reserved.
rman database Password:
target database Password:
connected to target database: PRD (DBID=2530951715)
connected to recovery catalog database
RMAN>
This command will give us details of the backupsets and the time when the backup completed:
RMAN> list backup of database completed between '12-may-2008' and '14-may-2008';
...
...
...
BS Key Type LV Size Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
2794098 Incr 0 53.28G SBT_TAPE 00:38:54 13-may-08 23:11
BP Key: 2794121 Status: AVAILABLE Compressed: NO Tag: TAG20080513T203101
Handle: racprod1_PRD<PRD_116563:654647578:1>.dbf Media:
List of Datafiles in backup set 2794098
File LV Type Ckp SCN Ckp Time Name
---- -- ---- ---------- --------------- ----
7 0 Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/owad01.dbf
295 0 Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/system06.dbf
379 0 Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/rbs11.dbf
459 0 Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_interface01.dbf
460 0 Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_summary01.dbf
481 0 Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_tx_data12.dbf
509 0 Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_archive02.dbf
BS Key Type LV Size Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
2794099 Incr 0 55.62G SBT_TAPE 00:39:54 13-may-08 23:12
BP Key: 2794122 Status: AVAILABLE Compressed: NO Tag: TAG20080513T203101
Handle: racprod1_PRD<PRD_116562:654647562:1>.dbf Media:
List of Datafiles in backup set 2794099
File LV Type Ckp SCN Ckp Time Name
---- -- ---- ---------- --------------- ----
5 0 Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/system05.dbf
17 0 Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/rbs15.dbf
18 0 Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_interface03.dbf
22 0 Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_summary05.dbf
398 0 Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/discoverer01.dbf
480 0 Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_tx_data11.dbf
508 0 Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_nologging02.dbf
RMAN>
As we can see, 13-may-08 22:32 is the golden time. Anything after that in the until time statement will do.
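Along the same lines, it does not hurt to confirm that backups of the archived logs needed beyond that time also exist in the catalog, since the duplicate will ask for them during recovery; a hedged sketch of the check (the date literals rely on the NLS_DATE_FORMAT set earlier):
RMAN> list backup of archivelog from time '13-may-2008 22:00' until time '14-may-2008 00:00';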
Further progress..So then, we tried with this script:
RMAN>
RMAN> run {
2> allocate auxiliary channel 'adisk_0' type DISK;
3> allocate auxiliary channel 'adisk_1' type DISK;
4> allocate auxiliary channel 'adisk_2' type DISK;
5> allocate auxiliary channel 'adisk_3' type DISK;
6> allocate auxiliary channel 'adisk_4' type DISK;
7>
8> set until time "to_date('13/05/2008 22:40:00','DD/MM/YYYY HH24:MI:SS')";
9>
10> DUPLICATE TARGET DATABASE TO UAT;
11> }
Although we got the following error for the archivelogs, we were able to open the database after manually supplying them and then doing SQL> alter database open resetlogs; :
...
...
Starting recover at 13-MAY-08
starting media recovery
Oracle Error:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/u01/oracle/uatracdata/system01.dbf'
released channel: adisk_0
released channel: adisk_1
released channel: adisk_2
released channel: adisk_3
released channel: adisk_4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 05/13/2008 22:47:51
RMAN-03015: error occurred in stored script Memory Script
RMAN-06053: unable to perform media recovery because of missing log
RMAN-06102: no channel to restore a backup or copy of log thread 2 seq 111965 lowscn 7246877796511
RMAN-06102: no channel to restore a backup or copy of log thread 1 seq 143643 lowscn 7246877796417
Conclusion..Some RMAN errors can be really hard to understand or debug. In particular, "no channel to restore a backup or copy" seems to be a very generic error message in RMAN, which could really mean anything and can be very misleading.
For example, we can see that at the end, it gave the RMAN-06102: no channel to restore a backup or copy of log thread 2 seq 111965 lowscn 7246877796511 error, but that meant that it could not find the archive logs. After supplying the archive logs, the database opened up fine with the resetlogs option.
10gR2 CRS case study: CRS would not start after reboot - stuck at /etc/init.d/init.cssd startcheck
Preface..I had recently done a 10gR2 CRS installation on SuSE Linux 9.3 (2.6.5.7-244 kernel) and noticed that after a reboot of the RAC nodes, the CRS would not come up!
The CSS daemon was stuck at the /etc/init.d/init.cssd startcheck command:
raclinux1:/tmp # ps -ef | grep css
root 6929 1 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root 6960 6928 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 6963 6929 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 7064 6935 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
Debugging..To debug this further, I went to the $ORA_CRS_HOME/log/<nodename>/client directory and checked the latest files there:
raclinux1:/opt/oracle/product/10.2.0/crs/log/raclinux1/client # ls -ltr
total 435
-rw-r----- 1 root root 2561 May 18 23:20 ocrconfig_8870.log
-rw-r--r-- 1 root root 195 May 18 23:22 clscfg_8924.log
-rw-r----- 1 root root 172 May 18 23:29 ocr_15307_3.log
-rw-r----- 1 root root 172 May 18 23:29 ocr_15319_3.log
-rw-r----- 1 root root 172 May 18 23:29 ocr_15447_3.log
...
...
...
drwxr-x--- 2 oracle dba 3472 May 19 08:10 .
drwxr-xr-t 8 root dba 232 May 19 13:50 ..
-rw-r--r-- 1 root root 2946 May 19 14:11 clsc.log
-rw-r--r-- 1 root root 7702 May 19 14:11 css.log
I did a more on clsc.log & css.log and saw the following errors:
$ more clsc.log
...
...
...
2008-05-19 14:11:29.912: [ COMMCRS][1094672672]clsc_connect: (0x81c74b8) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
2008-05-19 14:11:31.582: [ COMMCRS][1094672672]clsc_connect: (0x817e3f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))
2008-05-19 14:11:31.583: [ default][1094672672]Terminating clsd session
$ more css.log
...
...
...
2008-05-19 02:42:48.307: [ OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/ocr2, errno=19, os err string=No such device
2008-05-19 02:42:48.308: [ OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:42:48.308: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:42:48.308: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.
2008-05-19 02:43:41.982: [ OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/ocr2, errno=19, os err string=No such device
2008-05-19 02:43:41.983: [ OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:43:41.983: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:43:41.983: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.
2008-05-19 02:46:40.204: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9
2008-05-19 14:11:28.217: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9
2008-05-19 14:11:37.186: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9
So it was pointing towards the OCR being not available, as could be verified by the /tmp/crsctl.<PID> files too:
raclinux1:/tmp # ls -ltr crsctl*
-rw-r--r-- 1 oracle dba 148 May 19 02:44 crsctl.6826
-rw-r--r-- 1 oracle dba 148 May 19 02:44 crsctl.6679
-rw-r--r-- 1 oracle dba 148 May 19 02:44 crsctl.6673
-rw-r--r-- 1 oracle dba 148 May 19 02:49 crsctl.7784
-rw-r--r-- 1 oracle dba 148 May 19 02:49 crsctl.7890
-rw-r--r-- 1 oracle dba 148 May 19 02:49 crsctl.7794
-rw-r--r-- 1 oracle dba 148 May 19 13:55 crsctl.7034
-rw-r--r-- 1 oracle dba 148 May 19 13:55 crsctl.6886
-rw-r--r-- 1 oracle dba 148 May 19 13:55 crsctl.6883
-rw-r--r-- 1 oracle dba 148 May 19 14:18 crsctl.6960
-rw-r--r-- 1 oracle dba 148 May 19 14:18 crsctl.7064
-rw-r--r-- 1 oracle dba 148 May 19 14:18 crsctl.6963
raclinux1:/tmp # more crsctl.6963
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Permission denied] [13]
Permission issue!..Duh! So it was a permission issue on the OCR disks (at this moment), which could expand into a permission issue for the voting and ASM disks later:
raclinux1:/tmp # ls -ltr /dev/raw/raw*
crw-rw-r-- 1 root disk 162, 9 Nov 18 2005 /dev/raw/raw9
crw-rw-r-- 1 root disk 162, 8 Nov 18 2005 /dev/raw/raw8
crw-rw-r-- 1 root disk 162, 7 Nov 18 2005 /dev/raw/raw7
crw-rw-r-- 1 root disk 162, 6 Nov 18 2005 /dev/raw/raw6
crw-rw-r-- 1 root disk 162, 5 Nov 18 2005 /dev/raw/raw5
crw-rw-r-- 1 root disk 162, 4 Nov 18 2005 /dev/raw/raw4
crw-rw-r-- 1 root disk 162, 3 Nov 18 2005 /dev/raw/raw3
crw-rw-r-- 1 root disk 162, 2 Nov 18 2005 /dev/raw/raw2
crw-rw-r-- 1 root disk 162, 15 Nov 18 2005 /dev/raw/raw15
crw-rw-r-- 1 root disk 162, 14 Nov 18 2005 /dev/raw/raw14
crw-rw-r-- 1 root disk 162, 13 Nov 18 2005 /dev/raw/raw13
crw-rw-r-- 1 root disk 162, 12 Nov 18 2005 /dev/raw/raw12
crw-rw-r-- 1 root disk 162, 11 Nov 18 2005 /dev/raw/raw11
crw-rw-r-- 1 root disk 162, 10 Nov 18 2005 /dev/raw/raw10
crw-rw-r-- 1 root disk 162, 1 Nov 18 2005 /dev/raw/raw1
I enabled read and write permissions on the raw devices using # chmod +rw /dev/raw/raw*, but even after that, the latest /tmp/crsctl.<PID> files being generated showed this message:
raclinux1:/tmp # more crsctl.6960
Failure -2 opening file handle for (vote1)
Failure 1 checking the CSS voting disk 'vote1'.
Failure -2 opening file handle for (vote2)
Failure 1 checking the CSS voting disk 'vote2'.
Failure -2 opening file handle for (vote3)
Failure 1 checking the CSS voting disk 'vote3'.
Not able to read adequate number of voting disks
At this point, I just chowned /dev/raw/raw* to oracle:dba like this:
raclinux1:/tmp # chown oracle:dba /dev/raw/raw*
After 1-2 mins, the CSS came up:
raclinux1:/tmp # ps -ef | grep css
root 6929 1 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root 10900 6929 0 14:39 ? 00:00:00 /bin/sh /etc/init.d/init.cssd daemon
oracle 10980 10900 0 14:40 ? 00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd; /opt/oracle/product/10.2.0/crs/bin/ocssd || exit $?'
oracle 10981 10980 0 14:40 ? 00:00:00 /bin/sh -c ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd; /opt/oracle/product/10.2.0/crs/bin/ocssd || exit $?
oracle 11007 10981 2 14:40 ? 00:00:00 /opt/oracle/product/10.2.0/crs/bin/ocssd.bin
root 12013 7414 0 14:40 pts/2 00:00:00 grep css
raclinux1:/tmp #
The CRS components came up fine automatically:
raclinux1:/opt/oracle/product/10.2.0/crs/bin # ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
The ASM and RAC instances also came up fine:
raclinux1:/opt/oracle/product/10.2.0/crs/bin # ps -ef |grep smon
oracle 12257 1 0 14:41 ? 00:00:00 asm_smon_+ASM1
oracle 13100 1 0 14:41 ? 00:00:02 ora_smon_o10g1
root 32282 7414 0 14:55 pts/2 00:00:00 grep smon
For the long term..To make this change permanent, I put it in the /etc/init.d/boot.local file, along with the modprobe hangcheck-timer command:
raclinux1:/opt/oracle/product/10.2.0/crs/bin # more /etc/init.d/boot.local
#! /bin/sh
#
# Copyright (c) 2002 SuSE Linux AG Nuernberg, Germany. All rights reserved.
#
# Author: Werner Fink <werner@suse.de>, 1996
# Burchard Steinbild, 1996
#
# /etc/init.d/boot.local
#
# script with local commands to be executed from init on system startup
#
# Here you should add things, that should happen directly after booting
# before we're going to the first run level.
#
chown oracle:dba /dev/raw/raw*
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
Conclusion..If simple things like permissions are not correct on the OCR/voting devices, they can hold down the CRS daemons and the ASM/DB instances. It may be necessary to put workarounds in /etc/init.d/boot.local to get around the situation.
Considerations for virtual IP setup before doing the 10gR2 CRS install
Preface
Many a time, simple things can bite you when you are planning something really important - like jumping the gun on the virtual IP setup before installing 10gR2 CRS.
The CRS installation assumes that the virtual IP hostname and IP information exist either in DNS or in the local /etc/hosts file ONLY. It does not expect the virtual IP to be already tied to a network interface on the RAC nodes.
A cryptic error... If the virtual IP is already attached to a network interface during the CRS install, it throws a cryptic error saying "the virtual interface seems to be already assigned to another system in the network", which can be pretty misleading. We faced this situation recently during one of our upgrades:
The problem... The problem was that the racprod1-vip virtual hostname was already associated with a network interface alias lan1:801:
Please note: The hostnames, domain and IPs mentioned in this diagram are completely fictional and have been fudged from their original values. The current representation is enough for the sake of illustration.
The solution... Needless to say, the solution was to dis-associate racprod1-vip from lan1:801, after which the 10gR2 CRS installation proceeded fine.
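A simple pre-install sanity check along these lines (the hostname comes from the fictional example above) can catch this condition early - the VIP name should resolve, but must not already be pingable or plumbed on any interface:
$ nslookup racprod1-vip      # should resolve via DNS or /etc/hosts
$ ping racprod1-vip          # should get NO response yet; interrupt with Ctrl-C
$ netstat -in                # the VIP address should not appear against any interface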
Conclusion.. Many people do not appreciate the finer points involved in a CRS installation, especially around the virtual IP. The actual association of the virtual IP hostname with a network interface is done by the $ORA_CRS_HOME/bin/vipca script, which is automatically run as part of $ORA_CRS_HOME/root.sh.
crsctl start crs does not work in 10gR2
Preface
Nowadays, we are doing a 10g upgrade for one of our clients and hit upon the idea of pre-staging the 10gR2 CRS + DB technology stack on their RAC servers, which are already running 9iR2 RAC with HP ServiceGuard. This is nothing but a downtime reduction technique, which saved about 5-6 hours. Thankfully, the idea worked, but not before some excitement.
Surprise, surprise..A week after doing the 10gR2 CRS + DB installation on the pre-production servers, when we were starting the real database upgrade, we had to bring up the 10gR2 CRS.
Redemption..I was surprised to see that the 10gR2 CRS services would not come up. We had tried the following things:
1) Uncommenting the crs, css, and evm daemons in /etc/inittab
2) Issued /etc/init.d/init.crs enable
3) Issued /etc/init.d/init.crs start
4) Issued $ORA_CRS_HOME/bin/crsctl start crs
I was pretty aghast. Apart from thinking of logging a TAR, on searching Metalink we came across a new command that I had not tried in 10gR1.
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./srvctl status nodeapps -n raclinux1
PRKH-1010 : Unable to communicate with CRS services.
[Communications Error(Native: prsr_initCLSS:[3])]
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./srvctl start nodeapps -n raclinux1
PRKH-1010 : Unable to communicate with CRS services.
[Communications Error(Native: prsr_initCLSS:[3])]
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./srvctl status nodeapps -n raclinux1
PRKH-1010 : Unable to communicate with CRS services.
[Communications Error(Native: prsr_initCLSS:[3])]
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
It was then that we tried crsctl start resources, and the CRS actually came up:
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./crsctl start resources
Starting resources.
Successfully started CRS resources
raclinux1:/opt/oracle/product/10.2.0/CRS/bin # ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
Conclusion..At this point, I am not sure why the behaviour changed in 10gR2, whether it was intentional or unintentional, or whether this is a bug. But I am glad that we have a workaround. Every day is a new learning.
Troubleshooting done to make root.sh work after a 10gR2 CRS (10.2.0.1) installation on HP-UX PA RISC 64-bit OS
Preface..This posting describes the troubleshooting done to make root.sh work after a 10gR2 CRS (10.2.0.1) installation on HP-UX PA RISC 64-bit OS. We were upgrading from 9i (9.2.0.6) to 10gR2.
The problem..
(root@testdb):/stage11i/software# /u01/oracle/testdb/CRS/root.sh
WARNING: directory '/u01/oracle/testdb' is not owned by root
WARNING: directory '/u01/oracle' is not owned by root
WARNING: directory '/u01' is not owned by root
Checking to see if Oracle CRS stack is already configured
Checking to see if any 9i GSD is up
Setting the permissions on OCR backup directory
Setting up NS directories
PROT-3: Failed to create export file
Failed to upgrade Oracle Cluster Registry configuration
The fix was mentioned in Bug 5487374 -- TST&PERF:CRS UPGRADE NEEDS TO CREATE OCR BACKUP DIR DURING OUT OF PLACE UPGRADE
We had to create the sub-directory "crs" under /u01/oracle/testdb/CRS/cdata/crs10g and re-run root.sh; root.sh expects to find a subdirectory named after the previous RAC cluster.
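A minimal sketch of that fix, using the paths from this environment (the subdirectory name should match your previous cluster's name - "crs" in our case):
# as root, before re-running root.sh
mkdir -p /u01/oracle/testdb/CRS/cdata/crs10g/crs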
More trouble..There were some more errors while running root.sh:
...
...
Format of 1 voting devices complete.
/u01/oracle/testdb/CRS/install/rootconfig[938]: /sbin/init.d/init.crs: Execute permission denied.
/u01/oracle/testdb/CRS/install/rootconfig[941]: /sbin/init.d/init.cssd: Execute permission denied.
CRS daemons not set to start. See /var/adm/messages for details.
(root@testdb):/u01/oracle/testdb/CRS/cdata/crs10g# echo $?
1
Fix:
As per SR 6807470.992, we performed the following action plan.
Log in as root and grant the following permissions -
# cd /sbin/init.d
# chmod 755 init.crs
# chmod 755 init.crsd
# chmod 755 init.cssd
# chmod 555 init.evmd
Then check that the permissions are correct -
(root@testdb):/sbin/init.d# ls -rlt init*
-rwxr-xr-x 1 root sys 2226 Mar 3 18:32 init.crs
-rwxr-xr-x 1 root sys 4845 Mar 3 18:32 init.crsd
-rwxr-xr-x 1 root sys 36798 Mar 3 18:32 init.cssd
-rwxr-xr-x 1 root sys 3185 Mar 3 18:32 init.evmd
When we tried to execute root.sh again, we got the following warnings:
(root@testdb):/stage11i/software# /u01/oracle/testdb/CRS/root.sh
WARNING: directory '/u01/oracle/testdb' is not owned by root
WARNING: directory '/u01/oracle' is not owned by root
WARNING: directory '/u01' is not owned by root
Checking to see if Oracle CRS stack is already configured
At this point, we edited the file /var/opt/oracle/scls_scr/testdb/oratest1/cssfatal, changing it from disable to enable on both nodes (there is only one line in this file):
(root@testdb):more /var/opt/oracle/scls_scr/testdb/oratest1/cssfatal
enable
We also granted permissions as follows:
(root@testdb):/var/opt/oracle# chmod -R 777 /var/opt/oracle/scls_scr
Then we started the CRS:
# $ORA_CRS_HOME/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
Troubleshooting case study for 9i RAC ..PRKC-1021 : Problem in the clusterware
Preface..This is a troubleshooting case study for a 9i RAC environment on HP-UX PA RISC 64-bit, which was using the HP ServiceGuard solution (third-party clusterware) for cluster maintenance.
The problem..The problem was that the srvctl and gsdctl commands were giving the following errors:
(oracle@rac2): gsdctl stat
PRKC-1021 : Problem in the clusterware
Failed to get list of active nodes from clusterware
(oracle@rac1): srvctl config
PRKC-1021 : Problem in the clusterware
(oracle@rac1): pwd
/u01/oracle/uatdb/9.2/bin
(oracle@rac1): ./lsnodes
rac1
rac2
app1
app2
Troubleshooting Approach..We followed Note 178683.1 to edit $ORACLE_HOME/bin/gsd.sh as shown below, and traced gsd.sh:
# added -DTRACING.ENABLED=true -DTRACING.LEVEL=2 before -classpath
exec $JRE -DPROGRAM=gsd -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath $CLA
SSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME
$ $ORACLE_HOME/bin/gsd.sh
...
...
[main] [13:15:42:852] [line# N/A] my property portnum=null
[main] [13:15:42:942] [NativeSystem.<init>:123] Detected Cluster
[main] [13:15:42:946] [NativeSystem.<init>:124] Cluster existence = true
[main] [13:15:42:947] [UnixSystem.<init>:118] Going to load SRVM library
[main] [13:15:42:950] [UnixSystem.<init>:118] loaded libraries
[main] [13:15:42:950] [OPSMDaemon.main:726] Initializing the daemon ...
[main] [13:15:42:951] [OPSMDaemon.<init>:188] NativeDebug is set to true
[main] [13:15:42:952] [OPSMDaemon.<init>:188] UnixSystem.initializeDaemon: groupName is opsm
[main] [13:15:42:953] [OPSMDaemon.<init>:188] Unsatisfied Link Error caught. Could not initialize the cluster
[main] [13:15:42:954] [OPSMDaemon.main:726] initializeDaemon status = false
[main] [13:15:42:955] [OPSMDaemon.main:726] Failed to initialize and register with clusterware
[main] [13:15:42:955] [OPSMDaemon.main:726] OPSMErrCode = 1003
[main] [13:15:42:958] [OPSMDaemon.main:729] java.rmi.RemoteException: Unable to initialize with the clusterware
java.rmi.RemoteException: Unable to initialize with the clusterware
at oracle.ops.mgmt.daemon.OPSMDaemon.<init>(OPSMDaemon.java:195)
at oracle.ops.mgmt.daemon.OPSMDaemon.main(OPSMDaemon.java:726)
[main] [13:15:42:958] [line# N/A] Exiting from main..no point trying to start the daemon
At this point, one option was to initialize the srvconfig raw device (OCR) and then add the RAC instances etc. manually using the srvctl add command:
(oracle@rac1): srvconfig -init
java.lang.UnsatisfiedLinkError: readRawObject
at oracle.ops.mgmt.nativesystem.UnixSystem.readRawObject(UnixSystem.java:410)
at oracle.ops.mgmt.rawdevice.RawDevice.readMagicString(RawDevice.java:187)
at oracle.ops.mgmt.rawdevice.RawDeviceVersion.readVersionString(RawDeviceVersion.java:175)
at oracle.ops.mgmt.rawdevice.RawDeviceVersion.isValidConfigDevice(RawDeviceVersion.java:75)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(RawDeviceUtil.java:147)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
Exception in thread "main" (oracle@rac1):
(oracle@rac1): srvconfig -version
java.lang.UnsatisfiedLinkError: readRawObject
at oracle.ops.mgmt.nativesystem.UnixSystem.readRawObject(UnixSystem.java:410)
at oracle.ops.mgmt.rawdevice.RawDevice.readMagicString(RawDevice.java:187)
at oracle.ops.mgmt.rawdevice.RawDeviceVersion.readVersionString(RawDeviceVersion.java:175)
at oracle.ops.mgmt.rawdevice.RawDeviceVersion.isValidConfigDevice(RawDeviceVersion.java:75)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(RawDeviceUtil.java:147)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
If the config file is pointing to a raw device the following type of output should be returned:
$ raw device version "9.0.0.0.0"
Since we were not getting that output, there was either a problem in accessing the OCR raw device or the soft link was not working due to a permissions issue. Outputting the contents of the OCR on the standard output using $ dd if=/dev/orar1/rrawuat.conf bs=1500 showed that the OCR device was readable all right.
$ more /var/opt/oracle/srvConfig.loc
srvconfig_loc=/dev/orar1/rrawuat.conf
$ ls -l /dev/orar1/rrawuat.conf
crw-rw---- 1 oracle dba 64 0x110004 Apr 11 2007 /dev/orar1/rrawuat.conf
Then one idea was to relink the srv* binaries using the make command, but that also resulted in an error:
(oracle@rac2): cd $ORACLE_HOME/srvm/lib
(oracle@rac2): make -f ins_srvm.mk install
nm -gp /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a | grep T | grep Java | awk '{ print "-u " $3 }' > /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def;
/u01/oracle/uatdb/9.2/bin/echodo ld +s -G -b -o libsrvm.sl -c
/u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a -L/u01/oracle/uatdb/9.2/lib32/ -L/u01/oracle/uatdb/9.2/srvm/lib/ -L/usr/lib -lc -lclntsh -lwtc9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9
-lunls9 -lnls9 /opt/nmapi/nmapi2/lib/libnmapi2.sl -lm `cat /u01/oracle/uatdb/9.2/lib/sysliblist` ;
rm /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def
ld +s -G -b -o libsrvm.sl -c /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a -L/u01/oracle/uatdb/9.2/lib32/ -L/u01/oracle/uatdb/9.2/srvm/lib/
-L/usr/lib -lc -lclntsh -lwtc9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 /opt/nmapi/nmapi2/lib/libnmapi2.sl -lm -l:libcl.sl -l:librt.sl -lpthread -l:libnss_dns.1 -l:libdld.sl
ld: Mismatched ABI (not an ELF file) for -lclntsh, found /u01/oracle/uatdb/9.2/lib32//libclntsh.sl
Fatal error.
*** Error exit code 1
Stop.
The turning point..We saw Bug 6281672 and got a hint from there.
We compared the file $ORACLE_HOME/lib32/libsrvm.sl with the one on another RAC system (duat):
(oracle@duat1): ls -l $ORACLE_HOME/lib*/libsrvm*
-rwxr-xr-x 1 oracle dba 57344 Nov 7 20:04 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl
-rwxr-xr-x 1 oracle dba 57344 Nov 7 08:11 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x 1 oracle dba 36864 Nov 7 20:04 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-rwxr-xr-x 1 oracle dba 36864 Nov 7 08:11 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0
(oracle@duat1):
On rac1/2, the /u01/oracle/uatdb/9.2/lib32/libsrvm.sl file was missing. We saw that the .sl and .sl0 files were copies of each other, so on rac1/2 we did the following:
$ cd /u01/oracle/uatdb/9.2/lib32
$ cp libsrvm.sl0 libsrvm.sl
Before:
(oracle@rac2): ls -l $ORACLE_HOME/lib32/libsrvm*
-r-xr-xr-x 1 oracle dba 57344 Oct 17 2004 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x 1 oracle dba 36864 Feb 9 06:49 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-r-xr-xr-x 1 oracle dba 106496 Nov 30 2004 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0
After:
(oracle@rac2): ls -l $ORACLE_HOME/lib32/libsrvm*
-r-xr-xr-x 1 oracle dba 57344 Mar 5 15:14 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl
-r-xr-xr-x 1 oracle dba 57344 Oct 17 2004 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x 1 oracle dba 36864 Feb 9 06:49 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-r-xr-xr-x 1 oracle dba 106496 Nov 30 2004 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0
The fact that the size of libsrvmocr.sl0 did not match that of libsrvmocr.sl did not seem to be a showstopper; Oracle appeared to use libsrvm.sl for the srv*-related commands.
After this, the gsdctl and srvctl commands started working:
(oracle@rac1): gsdctl stat
GSD is running on the local node
(oracle@rac1): srvctl config
uat
(oracle@rac2): gsdctl stat
GSD is not running on the local node
(oracle@rac2):
(oracle@rac2): gsdctl start
Successfully started GSD on local node
(oracle@rac2): srvctl status database -d uat
Instance uat1 is not running on node rac1
Instance uat2 is not running on node rac2
But there was "just" one more problem.. Interestingly, srvctl stop database/instance worked fine at this point, but srvctl start did not.
(oracle@rac2): srvctl start instance -d uat -i uat1
PRKP-1001 : Error starting instance uat1 on node rac1
ORA-00119: invalid specification for system parameter local_listener
ORA-00132: syntax error or unresolved network name 'uat1' reserved.
ORA-01078: failure in processing system parameters local_listener
ORA-00132: syntax error or unresolved network name 'uat1' reserved.
(oracle@rac2): srvctl start instance -d uat -i uat2
PRKP-1001 : Error starting instance uat2 on node rac2
ORA-00119: invalid specification for system parameter local_listener
ORA-00132: syntax error or unresolved network name 'uat2' reserved.
ORA-01078: failure in processing system parameters
SQL> Disconnected
ORA-00132: syntax error or unresolved network name 'uat2' reserved.
initUAT2.ora:
-------------
uat1.local_listener='uat1'
uat2.local_listener='uat2'
uat1.remote_listener='uat2'
uat2.remote_listener='uat1'
(oracle@rac1): strings spfileuat1.ora | grep listener
uat1.local_listener='uat1'
uat2.local_listener='uat2'
uat1.remote_listener='uat2'
uat2.remote_listener='uat1'
The tnsping utility was working for both the UAT1 and UAT2 service names:
(oracle@rac2): tnsping uat1
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:19:36
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522)) (CONNECT_DATA= (SERVICE_NAME=uat) (INSTANCE_NAME=uat1)))
OK (0 msec)
(oracle@rac2): tnsping uat2
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:19:39
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522)) (CONNECT_DATA= (SERVICE_NAME=uat) (INSTANCE_NAME=uat2)))
OK (0 msec)
We added entries for local_listener and remote_listener to $TNS_ADMIN/tnsnames.ora on rac1 (and rac2 respectively) and made sure that tnsping to them was working:
local_listener=
(DESCRIPTION=
(ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522))
)
remote_listener=
(DESCRIPTION=
(ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522))
)
(oracle@rac2): tnsping local_listener
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:44:05
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522)))
OK (10 msec)
(oracle@rac2): tnsping remote_listener
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:44:13
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522)))
OK (10 msec)
To get around this issue gracefully, the srvctl setenv command could be used:
$ srvctl setenv instance -d uat -i uat1 -t TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat1_rac1
$ srvctl getenv instance -d uat -i uat1
TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat1_rac1
$ srvctl setenv instance -d uat -i uat2 -t TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat2_rac2
$ srvctl getenv instance -d uat -i uat2
TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat2_rac2
(oracle@rac1): srvctl start instance -d uat -i uat1
(oracle@rac1): srvctl status instance -d uat -i uat1
Instance uat1 is running on node rac1
(oracle@rac1): srvctl start instance -d uat -i uat2
(oracle@rac1): srvctl status instance -d uat -i uat2
Instance uat2 is running on node rac2
Some pending issues..Some issues were still pending, pointing towards some more misconfiguration, but at least we were able to get past the initial error:
(oracle@rac1): srvconfig -version
java.lang.NullPointerException
at java.text.MessageFormat.format(Compiled Code)
at java.text.MessageFormat.format(MessageFormat.java)
at java.text.MessageFormat.format(MessageFormat.java)
at oracle.ops.mgmt.nls.MessageBundle.getMessage(MessageBundle.java:225)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
Workflow performance case study: Don't Repeat History, Learn From It
A Sanskrit shlok underlining the importance of careful planning
Preface..They say: "Those who do not know history are condemned to repeat it." But why repeat history if you don't have to?
Many questions come to light on this topic of discussion:
- Is there a better way of designing conditional implementation in workflow?
- What are the merits/demerits of repeated checking of a particular condition as opposed to executing something when the actual condition happens? What are the performance implications?
- Are there workflow APIs available which simulate event based subscription model?
- What are the performance advantages of using the basic Workflow APIs that simulate this event based subscription model?
The intended audience is workflow designers, support personnel and Oracle Apps DBAs alike. This case study can potentially help users design their workflows more optimally and succinctly, with minimal impact on execution. The I/O profile and the time taken by the Workflow Background Process bear out all this information. As far as possible, trends have been depicted from the customer's Production system.
The Business Scenario..The business requirement to be implemented arose during the ATP (Available To Promise) phase in the life of an Order line. ATP simply means that we can guarantee the customer that a particular item can be made available from a particular warehouse on a particular date.
When scheduling an Order Line, if a financial hold existed on the line for some reason, the line needed to be unscheduled. Also, as soon as the financial hold was released by the Accounts Receivables department, the requirement was to proceed with the remaining business flow in the OEOL workflow design.
How It Was Designed..This is how it was designed:
Let us translate the flow:
- When the WF background program executes custom ATP process, it checks if the order is on hold or not.
- If the order header or one of the lines is on hold, the custom activity "Unschedule Lines On Hold" will unschedule all those lines and wait for an hour before checking the line status again.
- This loop repeats until all holds on the header/line are released.
Checking its Node Attributes tab:
Now, checking its actual timeout value:
So, the bottom line is that the WAIT activity was timing out after every 15 mins (60 x 0.0416 ~ 15 mins). Just imagine the volume of OEOL item keys at this activity in any given hour, and what kind of load they would be putting on the system.
Now, is that an intelligent way of designing the workflow for this requirement? Some shocking revelations.. A little analysis of the volume in WF_ITEM_ACTIVITY_STATUSES_H (history) table showed something shocking (!!):
The XXONT_CUSTOM_ATP->WAIT step was generating the most history records. In this case, it had created 27 million rows (75% of the entire volume) out of the 35 million rows in the WF_ITEM_ACTIVITY_STATUSES_H table.
PS: This is one of the reasons why this case study is so interesting: its impact was tremendous.
How are history records created?..Let's clear the air and get our basics right about how and when history records are created.
When the workflow engine encounters an activity which has a status of COMPLETE, it moves (figuratively) that record from WF_ITEM_ACTIVITY_STATUSES to WF_ITEM_ACTIVITY_STATUSES_H and creates a new record in WF_ITEM_ACTIVITY_STATUSES with status ACTIVE.
Therefore, the more times a COMPLETed activity is re-visited, the more history records for that activity are created. This can happen when a loop having a WAIT activity with a short timeout is designed. The more the workflow background process flows through the activities of such a loop, the more records will get created in WF_ITEM_ACTIVITY_STATUSES_H.
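A query along the following lines can be used to see which activities are generating the bulk of the history rows in your own instance (the join to WF_PROCESS_ACTIVITIES reflects our understanding of the schema; verify it against your Workflow version):
SELECT pa.process_name, pa.activity_name, COUNT(*) history_rows
  FROM wf_item_activity_statuses_h h,
       wf_process_activities       pa
 WHERE h.item_type = 'OEOL'
   AND pa.instance_id = h.process_activity
 GROUP BY pa.process_name, pa.activity_name
 ORDER BY history_rows DESC;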
The performance implications...Let's think about what the other implications of such poor design are:
- The system keeps looping on idle timeout activity every 15 mins
- Immediately after the timeout completes, the activity becomes eligible to be processed by the workflow.
- The next activity checks whether the hold has been released
- Statistically speaking, as the probability of a financial hold being released is not immediate, the control goes back to the WAIT activity, wherein the cycle repeats
- Every 15 mins, the backlog (timedout or deferred) count of unprocessed OEOL item type bloats
- Due to such frequent time-out activities, other more important processes/activities in the OEOL workflow are not executed in a realistic timeframe
- Valuable system resources are spent on scheduling and executing the loop's activities on the Production machine
- The OS on Production machine spends vital I/O resources in servicing these application calls
This is a vicious cycle. The more data you have in the WF_ITEM_ACTIVITY_STATUSES and WF_ITEM_ACTIVITY_STATUSES_H tables, the more time it takes to query and insert into them (also, the more indexes you have, the more time deletes/updates/inserts take due to the index maintenance required). It just kills online performance at times.
The following diagram bears out the above ideas. It shows the top activity (in terms of number of item_keys) in the breakup of the OEOL backlog. It can be seen that, without fail, XXONT_CUSTOM_ATP was the top wait activity in the OEOL backlog. Not only was it a liability to itself, it was also a liability to other legitimate activities perfectly eligible to be processed:
Design Considerations: The Danger Of Introducing Infinite Loops
There is another big danger here: that of introducing an infinite loop in the execution of the Workflow Background Process (also referred to as FNDWFBG) for OEOL.
Let us assume that there is a standard WAIT activity with a very short timeout/wait time in a loop with a condition checking activity. When the condition is checked, the activity will be deferred or timed out for a very short time.
Meanwhile, the Workflow Background Process is busy processing all the other deferred or timed-out activities. If FNDWFBG is not done processing them by the time the wait/time-out of the earlier activities expires, those activities are re-enqueued and re-processed AGAIN by the very same FNDWFBG job (the Workflow Background Process for, say, item_type=OEOL/OEOH), over and over, for as long as the condition (e.g. "line on hold") is met.
This creates an infinite loop: the FNDWFBG process almost never completes, eating up CPU time, creating massive DBMS archive logs (if archiving is turned on), and pumping rows into the WF activity status tables. It also creates unnecessary rows in the WF_ITEM_ACTIVITY_STATUSES_H table, which can later become the nemesis of online order entry performance in the 11i E-Business Suite.
It is a good idea for workflow activities in custom workflows to have a Wait/Timeout time greater than the longest normal execution time of the Workflow Background Process.
The same issue (an infinite loop) can also occur for FNDWFBG runs with Process Deferred = Yes when an activity has a near-zero Relative (Wait) Time, or has its Wait Mode set to Absolute Date with no Absolute Date specified, which is equivalent to a zero Relative Wait Time.
Hence, these poor design practices should be avoided. Remember: once runtime data for a particular workflow definition has been initiated, correcting the workflow definition later and uploading it will not automatically correct the previous runtime data. So it is prudent to build in good design considerations before letting your production data run on it.
More design considerations..Whenever possible, the standard Wait activity ( WF_STANDARD.WAIT() ) should not be placed in a loop with a small wait time (shorter than several hours). Alternative workflow designs should be evaluated.
WF_STANDARD.WAIT() with a wait time of zero or near zero, or the standard Defer activity ( WF_STANDARD.DEFER() ), should NEVER be placed in a loop in a workflow process design diagram.
Extending the Wait time is only a fallback, to be used when the alternative described below is not acceptable for a very good reason.
The Superior Design Alternative....For the above business scenario, the best solution is not to use a Wait activity in a loop at all. A much more attractive option is to use the published Workflow blocking/continuation APIs: WF_STANDARD.BLOCK() or OE_STANDARD_WF.STANDARD_BLOCK() should be used for stopping the processing flow of a workflow item_key, and WF_ENGINE.COMPLETEACTIVITY() should be used for resuming it.
An additional advantage of using OE_STANDARD_WF.STANDARD_BLOCK is that any OEOL workflow item blocked on it can also be progressed through the Order Entry form via the "Actions" button > Progress Order option. This gives more flexibility to the user/designer.
If one wants to progress the flow of a workflow item programatically, WF_ENGINE.COMPLETEACTIVITY() API can be called. This API notifies the Workflow Engine that the specified activity has been completed for a particular item. It signals the Workflow Engine that an asynchronous activity has been completed. This procedure requires that the activity currently has a status of 'Notified' (Internal status equivalent to BLOCKED). An optional activity completion result can also be passed. The result can determine what transition the process takes next.
Example
/* Complete the BLOCK activity which is used in multiple
* subprocesses in parallel splits. */
wf_engine.CompleteActivity('ORDER', '10045',
'XXONT_CUSTOM_ATP:WAIT', 'null');
To enhance the design AND achieve the same business functionality, we need to replace the WAIT activity with a BLOCK activity which will wait till the last hold is released from the order header/line without causing any looping. The COMPLETEACTIVITY API would be called from the Financial Hold release form at Order Header level.
It's as simple as that. The new design would look like:
Seeing the Performance Benefits in Production..
Adopting this approach will create only 1 record in WF_ITEM_ACTIVITY_STATUSES table and 2 records in WF_ITEM_ACTIVITY_STATUSES_H table.
Contrast this against continuous looping till a condition is met: for each execution of the WF logic, records are created in WF_ITEM_ACTIVITY_STATUS* tables.
As a result of this change, the landscape of OEOL backlog also changed drastically.
The average runtime of the Workflow background Process for OEOL also came down consistently.
Summary of Learnings
- It pays to know the published Workflow APIs. The Workflow Developer Guide is a good place to start. Spend time reading the fine manual. More information can never hurt.
- It pays to spend time and effort early on evaluating the re-design of loops in a workflow design, especially since, once initiated, a specific version of workflow runtime data does not automatically get updated after a newer version of the workflow design is uploaded (through WFLOAD)
- Poor design involving WAIT and poll activities in a loop can cause infinitely running Workflow Background Process (FNDWFBG)
- Short-timeout activities in loops should be replaced with OE_STANDARD_WF.STANDARD_BLOCK or WF_STANDARD.BLOCK APIs and WF_ENGINE.COMPLETEACTIVITY APIs
- Simple design changes can bring HUGE performance benefits, especially when a lot of volume is involved
- One thing learnt in hindsight was that the Development/Testing team needs to be extremely careful to account for all instantiated financial holds. After removing the loop, we ran into a problem where a certain set of financial holds was never being completed. For this set of scenarios (automatically released holds not going through the quick order form) we created a custom concurrent program to identify the blocked lines and call the WF_ENGINE.COMPLETEACTIVITY API; a simplified sketch follows this list
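The sketch below illustrates the idea behind that concurrent program: find the OEOL item keys sitting at the custom BLOCK activity in Notified status and complete them. The activity label XXONT_FIN_HOLD_BLOCK and the omitted hold check are illustrative assumptions, not the client's actual code:
declare
  -- XXONT_FIN_HOLD_BLOCK is a hypothetical label for the custom BLOCK
  -- activity; the real label comes from the customized OEOL workflow.
  cursor c_blocked is
    select ias.item_key
    from   wf_item_activity_statuses ias,
           wf_process_activities pa
    where  ias.item_type        = 'OEOL'
    and    ias.process_activity = pa.instance_id
    and    pa.instance_label    = 'XXONT_FIN_HOLD_BLOCK'
    and    ias.activity_status  = 'NOTIFIED';
begin
  for r in c_blocked loop
    -- A real program would first confirm that all financial holds on the
    -- line/order have actually been released; that check is omitted here.
    wf_engine.completeactivity('OEOL', r.item_key,
                               'XXONT_FIN_HOLD_BLOCK', null);
    commit;
  end loop;
end;
/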
Simply put: know your facts, be smart, and don't repeat history if you don't have to. Above all, do your testing well.
Suggested Related Readings
When Conventional Thinking Fails: A Performance Case Study in Order Management Workflow customization --
Posted on 17 May 2007; 12:09:31 PM
http://blogs.oracle.com/gverma/performance/om_wf_xxont_cascade
A Performance Case Study: Taming the Workflow Purge --
Posted on 21 March 2007; 6:58:11 PM
http://blogs.oracle.com/gverma/2007/03/21#a72
We're Skewed! Using 10053 event for being Histogram-wise --
Posted on 27 March 2007; 3:02:33 PM
When Conventional Thinking Fails: A Performance Case Study in Order Management Workflow customization
Order management workflow interacts with a variety of components like core workflow, order management and pricing application call interfaces.
Customizing it without understanding the implications is a double-edged sword, and the nature of the customization may unwittingly worsen the situation. An undesirable combination of these aspects can create a situation that is very difficult to diagnose and troubleshoot, as this case study demonstrates.
This article is essentially the distilled wisdom of a severity 1 situation at a client site where ordered items could not be shipped to customers because the OM order line workflow background process had a severe performance issue. The business impact was tremendous: $1.5 to 2 million worth of shipping and invoicing was being held up.
Many a time, performance troubleshooting requires some functional insight as well; TKPROF and a 10046 trace are not the panacea for all performance issues. In this case, some insight into the nature of the Order Management APIs was also required to get traction on the performance problem. A painful discovery path followed, riddled with bumps and insightful discoveries. The troubleshooting approach used was really out of the box and involved breaking down the code being executed and using logical deduction.
The generic learnings from the ordeal are presented in this case study. It is hoped that they will help the Oracle Applications user community be conscious of the hidden implications of customizing the OM workflow.
Feeling the pinch...The client had heavily customized their order management workflow during an implementation of the Order to Cash business cycle of Oracle Applications 11i.
Immediately after the April 2007 financial quarter end, the OEOL workflow background process started exhibiting extremely long execution times, such that the backlog of unprocessed OEOLs came to ~150,000. This was a sharp contrast from the normal processing throughput.
This was a moving problem. The hanging of the OEOL workflow background process was quite apparently due to a 'few' order lines, belonging to a few orders. Once the 'problematic' orders' OEOL workflows were either CANCELLED or ERRORED, the backlog started clearing at a slow pace. This went on for a little while, until similar order lines' OEOL workflows seemed to take inordinately long to process and held up the entire workflow queue.
Coming to grips with it.. As a knee-jerk approach, many things were tried, including killing the currently running workflow background process and re-submitting it. The clincher was that once or twice, when the workflow background process was killed, it marked the OEOL ITEM_KEY being processed as ERRORED and hence took it out of the deferred queue. This forced the workflow background process to consider the next deferred OEOL item_keys in line. In ~50% of the cases it would work, and the new thread would start exhibiting a more acceptable throughput. But after a certain point, even killing and re-submitting the workflow background process simply did not work. It did seem that the logic being executed by FNDWFBG was taking extremely long to process one order line.
The observation was baffling. The following query was used to find out approximately which order lines a single thread of the workflow background process was working through:
Query 1:
SQL> select * from (select to_char(enq_time,'dd-mon-rr hh24:mi') ,
user_data
from wf_deferred_table_m where state = 0 and corrid = 'APPSOEOL'
order by priority, enq_time) where rownum < 5;
TO_CHAR(ENQ_TIM USER_DATA(ITEMTYPE, ITEMKEY, ACTID, FUNCTION_NAME, PARAM_LIST, RESULT)
----------------------------------------------------------------------
02-apr-07 14:25 WF_PAYLOAD_T('OEOL', '30273337', 141255, NULL, NULL, NULL)
02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458068', 140850, NULL, NULL, NULL)
02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458075', 140850, NULL, NULL, NULL)
02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458089', 140850, NULL, NULL, NULL)
4 rows selected.
Tying it back to the order management tables, the following query tells us order level details like order_number, line_number, ordered_item etc, which is even more useful:
Query 2:
select distinct oh.org_id, oh.order_number,
ol.line_number, ol.shipment_number,
ol.option_number, ol.ordered_item, ol.line_id, wiash.begin_date,
priority, wdtm.enq_time
from applsys.wf_deferred_table_m wdtm,
applsys.wf_process_activities wpa,
applsys.wf_item_activity_statuses wiash,
apps.oe_order_lines_all ol,
apps.oe_order_headers_all oh
where substr (wdtm.corrid,5) = 'OEOL'
and (wdtm.delay < sysdate or wdtm.delay is null)
and wpa.instance_id=wdtm.user_data.actid
and wiash.item_type=wdtm.user_data.itemtype
and wiash.item_key=wdtm.user_data.itemkey
and wiash.process_activity=wdtm.user_data.actid
and wiash.item_key = ol.line_id
and oh.header_id = ol.header_id
and wdtm.state = 0
group by
oh.org_id, oh.order_number, ol.line_number, ol.shipment_number,
ol.option_number,
ol.ordered_item, ol.line_id, wiash.begin_date, priority, wdtm.enq_time
order by priority, wdtm.enq_time ;
Observing the output over some time made it possible to identify whether the workflow background process was stuck on a specific OEOL workflow.
Limited insight by conventional analysis..
The conventional debugging approaches provided limited insight into the root cause of the issue.
A 10046 trace could not clearly bring out the drill-down of time spent on PL/SQL API calls. In addition, there was an OS file size limit on trace files on most production servers, even when max_dump_file_size=unlimited was used.
Another option was to use dbms_profiler API on FNDWFBG call. It could have worked, except the FNDWFBG API call never completed, so the data could never be collected.
Searching Metalink gives a host of notes for getting OM debug (level 5), Pricing debug etc.:
1. Note 255040.1:bde_wf_data.sql - Profiling Workflow Runtime Data related
2. Note 163208.1:bde_last_analyzed.sql - Verifies Statistics for all
3. Note 353991.1:OMSuiteDataChk.sql and OMSuiteDataChk11i10.sql
This is all good information, except that there can also be too much un-meaningful and un-directed information. Interpreting detailed application debug logs is useful only when it is known what should be checked.
Taking the whitebox approach....
A lot of debug/trace information could be generated, but it was probably best to take the FNDWFBG execution apart in slow motion. Once the execution model is understood, further brainstorming using a divide-and-conquer strategy becomes effective.
An alternate way of looking at FNDWFBG is that it is nothing but a serial simulator of the steps outlined in the Oracle Workflow designer.
A whitebox approach (a software engineering term, as opposed to blackbox) was the need of the hour, coupled with some simple thinking. Using the Workflow Status Monitor from the Workflow Administrator responsibility in Oracle Applications 11i, the step at which an OEOL currently sits can be seen.
This may also be verified from the output of $FND_TOP/sql/wfstat.sql. Combining the observations from Query 2 above with a drill-down via the Workflow Monitor, it was clear that almost all the lines having the performance issue were at the Fulfill-Deferred step, which means that if FNDWFBG were to pick them up, it would then execute the next activity in line, the Fulfill step.
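A quick way to see the current step of any one OEOL item key from SQL*Plus is a drill-down query of this sort (a sketch; &line_id is the order line_id, i.e. the workflow item key):
select pa.process_name||'->'||pa.instance_label step,
       ias.activity_status,
       ias.activity_result_code,
       to_char(ias.begin_date,'DD-MON-RR HH24:MI:SS') begin_date
from   wf_item_activity_statuses ias,
       wf_process_activities pa
where  ias.item_type = 'OEOL'
and    ias.item_key  = '&line_id'
and    ias.process_activity = pa.instance_id
order by ias.begin_date;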
Note: A quick workaround was to 1) kill the FNDWFBG process, 2) SUSPEND the problematic OEOL workflow via the Workflow Monitor (this status prevents any further processing by workflow), and 3) re-submit the FNDWFBG process.
Irreconcilable paradoxes of a Test case..
We really needed to analyze the drill down on the bad test case.
From online examination of the SQLs being executed by the FNDWFBG session, it was seen that a lot of time was being spent on the XXONT Cascade Terms and Agreement step, multiple times, while apparently processing a single OEOL ITEM_KEY. This was a really paradoxical observation and did not make any sense.
The only acid test of this was to call OE_FULFILL_WF.START_FULFILLMENT for specific OEOL line id or ITEM_KEY. The script run_start_fulfillment_with_om_debug.sql (Listing in Appendix A) was used to simulate a single execution of Fulfill step, which in turn called OE_FULFILL_WF.START_FULFILLMENT and did a rollback at the end. One such case ran for ~6 hours and finally finished.
Most surprisingly, an examination of the raw trace file showed that subsequent steps like XXONT Cascade Terms and Agreement and Invoice Interface - Line had indeed been called multiple times. Therefore, it seemed that a single call to OE_FULFILL_WF.START_FULFILLMENT for one OEOL was triggering the sequential execution of the OM workflow for other OEOLs in the same order, which were somehow related to the order line in question.
$ grep -i xxont_cascade_pkg.set_ln_trm_agr_proc prod_ora_677526.trc
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
value="XXONT_CASCADE_PKG"
value="XXONT_CASCADE_PKG"
This simply did not make any sense, because Fulfill and XXONT Cascade Terms and Agreement were discrete steps, as is verifiable from the OM workflow diagram above.
Another very interesting and puzzling observation was that without fail, the top 2 sql calls in TKPROF output (raw trace sorted with sort=prsela,exeela,fchela) showed that ~45% of the time was spent on OE_FULFILL_WF.START_FULFILLMENT API and ~45% of it was spent on the call to xxont_cascade_pkg.set_ln_trm_agr_proc API.
"tkprof output for 10046 of start_fulfillment API call"
Resolving the paradox...
The answer, which is difficult to digest, does make sense. Order management has unique processing requirements for configured ordered items created using Oracle Configurator module. Since OE_FULFILL_WF.START_FULFILLMENT is a workflow aware API (notice the _WF in its name), it was obviously calling more Workflow routines inside it to progress the business flow to next logical set of activities.
We first need to understand the concept of a fulfillment set. A fulfillment set is a set of related order lines that represent a configurable item. Whenever a fulfillment set or a model is ready to fulfill, or to be built on the production floor - i.e., all members' OEOL workflows are internally in FULFILL_LINE:Notified (blocked) status - the last member reaching this activity completes the FULFILL_LINE activities of all the other members before proceeding itself. Since the XXONT Cascade Terms and Agreement step comes right after the FULFILL_LINE step, it is executed for all members of the current model or fulfillment set in that one go.
The relation between other lines whose OM workflow was being resumed and the problematic line in question could be found using the following sql:
Select top_model_line_id from oe_order_lines_all where line_id=<problematic_line>;
This query defines a FULFILLMENT SET (a set of lines which are related together and constitute an orderable item):
Select line_id from oe_order_lines_all where top_model_line_id=<top_model_line_id_from_above_query>;
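The two lookups can also be combined into a single statement, if preferred:
Select line_id
from   oe_order_lines_all
where  top_model_line_id = (select top_model_line_id
                            from   oe_order_lines_all
                            where  line_id = <problematic_line>);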
The user first configures the components of a configurable product/item and then the item is actually built. Depending on the kind of configured model (Assemble to Order -- ATO or Pick to Order -- PTO), the oe_order_lines_all.flow_status_code (order management workflow status) either stays at SHIPPED or BOOKED status, or becomes AWAITING_FULFILLMENT (internally blocked).
Here is an example of this intermediate stage of processing for different kinds of configured items (ATO/PTO/Stock):
Now, since a lot of time was being spent on the XXONT Cascade Terms and Agreement step, the code in the package XXONT_CASCADE_PKG.pkb was reviewed. It could be seen that the OE_ORDER_PUB.PROCESS_ORDER() API was being called to sync up the terms and agreements between the order header and the order line in question.
There were penalties for doing this: changing the payment term and agreement fields on the order line triggered repricing of the order lines. Also, for especially large orders (~4000 lines) with mostly mis-matched payment terms and agreement lines, repeated calls to OE_ORDER_PUB.PROCESS_ORDER() were causing a huge overhead.
The reason why the timings for OE_FULFILL_WF.START_FULFILLMENT() showed up so high in the trace files was that the FULFILL_LINE step synchronizes across all components in a configurable model or fulfillment set. The following output from the workflow query in Appendix B lists the time recorded by workflow for executing the XXONT_PROD_SEC_STANDARD_LINE->FULFILL_LINE step:
Visualizing it in slow motion...
This means that whenever a fulfillment set or a model is ready to fulfill - i.e., all members' OEOL workflows are internally in FULFILL_LINE:Notified (blocked) status - the last member reaching this activity completes the FULFILL_LINE activities of all the other members before proceeding itself.
Since the Cascading activity comes right after FULFILL_LINE, it is executed for all members of the current model or fulfillment set in that one go. The aggregate time for completing all the other related OEOL workflows in the same fulfillment set goes into the timing recorded for the OE_FULFILL_WF.START_FULFILLMENT() step.
The following diagrams attempt to represent the dynamics more clearly:
BEFORE calling OE_FULFILL_WF.START_FULFILLMENT() step:
AFTER calling OE_FULFILL_WF.START_FULFILLMENT() step:
Projecting the poor performance..
Based on the situation described above, a simple projection of the poor performance can be done for a particular kind of order. Let's assume that there are 100 lines in a configurable model/item, and that for 75% of the order lines the payment terms and/or agreements do not match those on the header. For each mis-matched line, the time taken for repricing varied from 150 seconds to 750 seconds.
When we calculate the cumulative time taken to execute the xxont_cascade_pkg.set_ln_trm_agr_proc API for the ~75 mis-matched lines, at ~5 mins per line, the overall time comes to 75 x 5 mins = 375 mins, i.e. well over 6 hours, just for processing the OE_FULFILL_WF.START_FULFILLMENT() step of one model. That was exactly the performance problem being experienced.
Now, just imagine the performance strain if the order had 10 such configurable items.
Mathematically speaking:
The Solution: Simplify by taking the Highest Common Factor...
There were two important dimensions to the solution:
There was a bug in the 11.5.9 Order Management module which prevented the cascading of order header changes to order lines from working properly (the profile option OM: Sales Order Form: Cascade Header Changes to Line should be Yes). This is what prompted the calls to the OE_ORDER_PUB.PROCESS_LINE() APIs in the customization.
Patch 5981436 is to be delivered by Oracle Order Management development to resolve this issue. The fix is also part of 5950291 - 11.5.9 OM CUMULATIVE PATCH. With it, the payment terms and agreements are automatically matched from the order header to the order lines in the majority of cases, except when the user purposely chooses to say No to the cascade option.
The other smart option was to batch-process all the mis-matched component lines of a configured model in ONE call to the PROCESS_LINE() API by passing them as a PL/SQL table argument, and to do this ONLY if the order line in question was for a shippable item. This was especially beneficial since only 1-5% of sub-components in a configured model are shippable, which made the processing quite restrictive. This single change took out 95-97% of the repetitive work being done by the OE_ORDER_PUB.PROCESS_ORDER() calls.
--
-- Calling API to update Terms and/or Agreement value(s) on the order line
--
oe_order_pub.process_line (p_line_tbl => l_line_tbl,
x_line_out_tbl => o_line_tbl,
x_return_status => l_return_status,
x_msg_count => l_msg_count,
x_msg_data => l_msg_data);
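The snippet above shows only the API call itself. A minimal sketch of how the PL/SQL line table might be populated for such a batched call follows; the cursor, the bind values and the column choices are illustrative assumptions, not the actual client code:
declare
  l_line_tbl      oe_order_pub.line_tbl_type;
  o_line_tbl      oe_order_pub.line_tbl_type;
  l_return_status varchar2(1);
  l_msg_count     number;
  l_msg_data      varchar2(2000);
  i               pls_integer := 0;
begin
  -- Collect only the shippable component lines of the model;
  -- :top_model_line_id, :hdr_term_id and :hdr_agreement_id are placeholders,
  -- and the mis-match check against the header is omitted for brevity.
  for r in (select line_id
            from   oe_order_lines_all
            where  top_model_line_id = :top_model_line_id
            and    shippable_flag    = 'Y') loop
    i := i + 1;
    l_line_tbl(i)                 := oe_order_pub.g_miss_line_rec;
    l_line_tbl(i).line_id         := r.line_id;
    l_line_tbl(i).payment_term_id := :hdr_term_id;
    l_line_tbl(i).agreement_id    := :hdr_agreement_id;
    l_line_tbl(i).operation       := oe_globals.g_opr_update;
  end loop;
  -- One call updates all the collected lines instead of one call per line
  oe_order_pub.process_line (p_line_tbl      => l_line_tbl,
                             x_line_out_tbl  => o_line_tbl,
                             x_return_status => l_return_status,
                             x_msg_count     => l_msg_count,
                             x_msg_data      => l_msg_data);
end;
/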
Seeing the performance Boost...
Before putting in the code change:
For a fulfillment set with top_model_line_id=9905479, there were 59 mis-matched lines, and the avg time for processing one call of process_line() was ~ 580 seconds.
After putting in the code change:
The time taken to process the same 59 mis-matched lines of that fulfillment set, passed in one PL/SQL table argument to a single process_line() API call, was ~625 seconds. In other words, roughly 59 x 580 seconds (over 9 hours) of piece-meal processing collapsed into about 10 minutes.
Putting statspack data into perspective...
This also explained why statspack reports showed xxont_cascade_pkg.set_ln_trm_agr_proc as one of the major I/O offenders per execution:
Summary of learnings from the Case study..
- Don't have expensive custom activities defined after the START_FULFILLMENT step in the OM workflow. It multiplies the performance hit many times over, especially when the Oracle Configurator and Order Management modules are being used in tandem. (Better still, don't have expensive custom workflow activities at all)
- Batch processing should be preferred over piece-meal processing, especially when an API provides for it. This removes most of the repetitive processing
- The whitebox (drill down) approach works for taking apart a baffling performance problem. Trying to simulate a test case usually leads to the heart of the performance issue in a more reliable way
- Getting extensive debugs and traces is great, but only when it is known what is being looked for. Asking the right probing questions is very important. Question, question, question.
- A well thought out plan with minimal tracing and a drill-down approach can bring better results than a shotgun or blunderbuss approach
- Sometimes, a high level functional knowledge of the processing being done can be very useful in understanding the nature of problem. A balance between strict technical and pure functional knowledge can be fruitful towards solving performance issues
TRIVIA: Why was the customization put in the first place?
Cascading is the process in which values from the sales order header level automatically populate to the corresponding fields at the sales order line level.
Due to a bug in 11.5.9, Agreement field at Header level was not populating to the child lines of a configuration.
Due to this, the autoinvoicing program split lines into separate invoices by agreement. There were thousands of invoices where the components were being invoiced separately from the top model. It was an invoicing fiasco.
Appendix A:
sandbox:dev> more run_start_fulfillment_with_om_debug.sql
var resultout varchar2(1000)
set serveroutput on time on timing on
prompt
prompt setting userid, responsibility and application id for JHASLAGE for order management super user
prompt
define user_id=7962
define resp_id=52560
define appl_id=660
define item_key=29904309
define activity_id=141236
declare
l_debug_file varchar2(400);
l_org_id NUMBER;
l_count NUMBER;
l_result VARCHAR2(30);
l_spid varchar2(20);
begin
fnd_global.apps_initialize(&user_id,&resp_id,&appl_id);
select p.spid
into l_spid
from v$process p, v$session s
where s.audsid=userenv('sessionid')
and paddr = addr;
OE_Standard_WF.OEOL_SELECTOR
(p_itemtype => 'OEOL'
,p_itemkey => '&&item_key'
,p_actid => 12345
,p_funcmode => 'SET_CTX'
,p_result => l_result
);
oe_debug_pub.debug_off;
--enable OM debug
oe_debug_pub.debug_on;
oe_debug_pub.initialize;
l_debug_file := OE_DEBUG_PUB.Set_Debug_Mode('FILE');
oe_debug_pub.setdebuglevel(5);
dbms_output.put_line('Debug file is : '|| l_debug_file);
--enable trace
EXECUTE IMMEDIATE ('alter session set max_dump_file_size=unlimited');
EXECUTE IMMEDIATE ('alter session set tracefile_identifier=''gverma_start_fulfillment_with_om_debug_&&item_key''');
EXECUTE IMMEDIATE ('ALTER SESSION SET EVENTS ''10046 TRACE NAME CONTEXT FOREVER, LEVEL 12''');
OE_FULFILL_WF.START_FULFILLMENT('OEOL','&&item_key',&&activity_id,'RUN',:resultout);
--disable trace
EXECUTE IMMEDIATE ('ALTER SESSION SET EVENTS ''10046 TRACE NAME CONTEXT OFF''');
--turn debug off
oe_debug_pub.debug_off;
dbms_output.put_line('File name '||OE_DEBUG_PUB.G_DIR||'/'||OE_DEBUG_PUB.G_FILE);
dbms_output.put_line('Trace file generated for this request:'||'/*'||l_spid||'*.trc');
rollback;
EXCEPTION WHEN OTHERS THEN
dbms_output.put_line('reached error in script ' ||sqlerrm);
rollback;
end;
/
select :resultout from dual
/
Appendix B:
col ITEM_KEY for a10 Head "Item Key"
col BEGIN_DATE for a18 head "Begin Date"
col END_DATE for a18 head "End Date"
col ROOT_ACTIVITY for a30 head "Root Activity"
col RESULT for a15
col STATUS for a8 Head Status
col time_taken_mins for 99999 head "Time(Mins)"
col step format a60
set pages 100 time on timing on lines 150
spool get_details_for_wf_step
select ias.item_key,
to_char(ias.begin_date,'DD-MON-RR HH24:MI:SS') begin_date, to_char(ias.end_date,'DD-MON-RR HH24:MI:SS') end_date,
(ias.end_date-ias.begin_date)*24*60 time_taken_mins,
pa.process_name||'->'||pa.instance_label step,
ias.activity_status Status,
ias.activity_result_code Result
from wf_item_activity_statuses ias,
wf_process_activities pa
where ias.item_type = 'OEOL'
and ias.process_activity = pa.instance_id
and ias.begin_date >= sysdate-14
and pa.process_name||'->'||pa.instance_label = 'XXONT_PROD_SEC_STANDARD_LINE->FULFILL_LINE'
and ias.activity_status = 'COMPLETE'
--
-- and the step was not cancelled
--
and ias.activity_result_code != '#FORCE'
--
-- it took more than so many mins to complete this step
--
and (ias.end_date-ias.begin_date)*24*60 > 30
order by (ias.end_date-ias.begin_date)*24*60 desc, ias.begin_date
/
spool off
A SQL Tuning Case Study: Could we K.I.S.S. Please?
Preface They say it pays to K.I.S.S. (Keep It Simple and Smart). The simplest ideas are the greatest ideas. This philosophy also applies to SQL coding, where we tell the Oracle SQL engine what to fetch and, indirectly, how to fetch it. There are great rewards for simple thinking.
Let's discuss this principle through a case study in which the user was trying to fetch very simple data through an overkill query involving RA_* views, which internally used some of the same tables. The SQL engine was therefore just trying to fetch and filter too much data from the same sources. It was certainly not efficient.
In this case study, we see how it is important to understand the "why" of a query rather than the "how". Many times, just rethinking the whole approach pays great dividends. The Poorly Performing Query The client had a custom concurrent program which had this query as the worst offender: it was taking 90% of the time, and the program used to take ~1 hour to finish.
The formatted query is listed below:
SELECT SUBSTR(A_CUST.CUSTOMER_NUMBER,1,8) CUST_NUMBER,
HZ_CUST_SITE_USE.LAST_UPDATE_DATE LAST_UPDATE_DATE,
SUBSTR(A_CUST.CUSTOMER_NUMBER,1,8) BILL_CUST_ID
FROM
RA_CUSTOMERS A_CUST,
HZ_CUST_ACCT_SITES_ALL HZ_CUST_ACCT_SITE,
HZ_CUST_SITE_USES_ALL HZ_CUST_SITE_USE,
RA_ADDRESSES_ALL RAA
WHERE
A_CUST.CUSTOMER_NUMBER = :B1
AND A_CUST.CUSTOMER_ID = HZ_CUST_ACCT_SITE.CUST_ACCOUNT_ID
AND HZ_CUST_ACCT_SITE.CUST_ACCT_SITE_ID =
HZ_CUST_SITE_USE.CUST_ACCT_SITE_ID
AND HZ_CUST_ACCT_SITE.PARTY_SITE_ID = RAA.PARTY_SITE_ID
AND HZ_CUST_ACCT_SITE.STATUS = 'A'
AND HZ_CUST_SITE_USE.STATUS = 'A'
AND HZ_CUST_SITE_USE.SITE_USE_CODE = 'BILL_TO'
AND RAA.ADDRESS1 = 'ICC LEASE'
/
Let's look at the TKPROF stats and the runtime SQL plan used, along with row counts.
It is interesting to observe that while only ~3000 rows are returned as the output of this SQL, a huge number of rows were fetched from one of the HZ* tables.
Let's check the underlying tables used by the views RA_ADDRESSES_ALL and RA_CUSTOMERS. RA_ADDRESSES_ALL was a synonym for RA_ADDRESSES_MORG, while RA_CUSTOMERS was a synonym for RA_HCUSTOMERS.
1 select referenced_name, referenced_type
2 from dba_dependencies
3* where name='RA_HCUSTOMERS'
SQL> /
REFERENCED_NAME REFERENCED_TYPE
----------------------- -----------------
HZ_CUST_ACCOUNTS TABLE
HZ_PARTIES TABLE
SQL> l
1 select referenced_name, referenced_type from dba_dependencies
2 where name='RA_ADDRESSES_MORG'
3* and referenced_type='TABLE'
SQL> /
REFERENCED_NAME REFERENCED_TYPE
------------------------------ -----------------
HZ_PARTY_SITES TABLE
HZ_LOCATIONS TABLE
HZ_CUST_ACCT_SITES_ALL TABLE
HZ_LOC_ASSIGNMENTS TABLE
Let's also check the data volume involved in the base HZ* tables being referred to, and the estimation sample percentage (~70%) used by the Gather Statistics program:
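A dictionary query along these lines can be used to check such volumes and sampling percentages; this is only a sketch (in 11i the HZ tables are owned by the AR schema):
select table_name, num_rows, sample_size,
       round(100*sample_size/nullif(num_rows,0)) pct_sampled,
       last_analyzed
from   dba_tables
where  owner = 'AR'
and    table_name in ('HZ_PARTIES','HZ_CUST_ACCOUNTS','HZ_PARTY_SITES',
                      'HZ_LOCATIONS','HZ_LOC_ASSIGNMENTS',
                      'HZ_CUST_ACCT_SITES_ALL','HZ_CUST_SITE_USES_ALL')
order by num_rows desc;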
Do we need this complexity?
This is a good time for a reality check: this is certainly overkill.
Why would you want to join complex views, each referencing 3-4 base tables internally? Not an intelligent approach, shall we say? Better, why wouldn't one query the same data by understanding the data model relationships between the base tables involved, and see whether the same results can be achieved through them?
Spending time on checking the data volume involved, the % estimation used by Gather Stats, index cardinality etc. would not lead to anything conclusive or to any direct inference in this case.
Another good learning from the http://etrm.oracle.com website (requires a Metalink login: Note 150230.1) regarding the RA_* views is that they should never be used directly in ANY customizations! All the more reason to eliminate the views from the query altogether and find an alternate way of getting the same result.
The Better Way...
Now, only 1 column was really being queried. In spite of this, after understanding the object model relationship diagrams for the HZ* tables, at least five HZ* tables needed to be queried for a supplied customer_number value.
What was the SQL trying to achieve?
For a particular customer_number, it wanted to find the last_update_date for an ACTIVE site which was being used as an ACTIVE BILL_TO location, with an address of ICC LEASE in its information.
Let's look at the index information:
select column_position, column_name
from dba_ind_columns
where index_name='&1'
/
Enter value for 1: HZ_LOCATIONS_N1
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
1 ADDRESS1
SQL> /
Enter value for 1: HZ_PARTY_SITES_U1
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
1 PARTY_SITE_ID
SQL> /
Enter value for 1: HZ_PARTY_SITES_N2
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
1 LOCATION_ID
SQL> /
Enter value for 1: HZ_CUST_ACCT_SITES_N1
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
1 PARTY_SITE_ID
SQL> /
Enter value for 1: HZ_CUST_SITE_USES_N1
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
1 CUST_ACCT_SITE_ID
2 SITE_USE_CODE
3 PRIMARY_FLAG
The same SQL query could be re-written to fetch the same data using the base HZ_* tables, after connecting the foreign key relationships.
Such an approach had many advantages:
- The RA_* views are avoided, as they are not supported for use outside the standard 11i products; also, it is highly inefficient to access 7 tables/views to get 1 column
- The new query fetches less than half the number of consistent read buffers compared to the previous query (verified using the SQL*Plus set autotrace command)
- The new query has just 14 access steps, as opposed to 21 access steps for the previous un-tuned query
- The new query uses the unique primary key indexes of three of the five tables and foreign key indexes for the remaining two, which is a really selective way of querying data
The performance improvement was very visible: the improved query ran within 2 minutes. Compare that with its previous cousin, which used to run for ~1 hour.
SELECT SUBSTR(hca.account_NUMBER,1,8) CUST_NUMBER,
hcsu.LAST_UPDATE_DATE LAST_UPDATE_DATE,
SUBSTR(hca.account_NUMBER,1,8) BILL_CUST_ID
FROM
hz_cust_accounts hca,
hz_cust_acct_sites_all hcas,
hz_cust_site_uses_all hcsu,
hz_party_sites hps,
hz_locations loc
WHERE
--
-- Joining all the tables through primary keys
-- and foreign key relationships first
--
hca.account_number = :B1
and hca.cust_account_id = hcas.cust_account_id
and hcas.party_site_id = hps.party_site_id
and hca.party_id = hps.party_id
and hps.location_id = loc.location_id
and hcas.cust_acct_site_id = hcsu.cust_acct_site_id
--
-- the actual data query criteria starts here
--
AND hcas.STATUS = 'A'
AND hcsu.STATUS = 'A'
AND hcsu.SITE_USE_CODE = 'BILL_TO'
AND loc.ADDRESS1 = 'ICC LEASE'
--
-- Could perhaps add this criteria too ???
-- AND hcas.bill_to_flag = 'Y'
--
/
Before:
After:
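The consistent-gets comparison mentioned in the advantages list above can be reproduced from SQL*Plus with autotrace; a minimal sketch (the bind value is just a placeholder customer number):
variable B1 varchar2(30)
exec :B1 := '12345678'
set autotrace traceonly statistics
-- run the original RA_* view based query and note "consistent gets"
-- run the rewritten HZ_* table query and compare
set autotrace off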
Summary It pays to ask the "why" questions sometimes, rather than focusing on the "how" and "what" and going on a wild goose chase. The "why" questions can get you to the answer more quickly by making you re-think the strategy.
Einstein himself said, "everything should be made as simple as possible, but no simpler."
The defence rests.
Related Links
Other Performance Related Articles