Eric S. Emrick
Getting a Handle on Logical I/O
As it turned out, the partitioning approach did indeed reduce the CBC collisions, although another form of contention surfaced as a corollary: cache buffer handles latch collisions. I must admit I had very limited knowledge of buffer handles prior to being made aware of this situation. My colleague pointed me to a very interesting article on Jonathan Lewis' site that gives a pithy description of buffer handles. I highly recommend you carve out a few minutes to read it. Not only might you learn something about buffer handles, you might also be surprised that the more traditional notions of logical I/O do not really suffice. I was first properly introduced to the buffer is pinned count statistic during a Hotsos training course. Essentially, this statistic indicates the presence of latch-reduced logical I/O.
While, generally speaking, Oracle recommends that hidden parameters not be changed, sometimes they need to be modified to accommodate very specific issues your database is encountering. In this particular case, increasing the value of the _db_handles_cached parameter got rid of the newly surfaced collisions on the cache buffer handles latch. I love learning from others’ experiences. It is amazing how many interesting little tales such as this exist. Also, this type of unforeseen contention shifting reinforces the need to properly test production changes - or maybe better said, the ability to properly test production changes.
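For reference, if you need to verify the current setting of a hidden parameter like this one before (or after) changing it, a quick sketch along these lines works when run as SYS; the x$ column names are as I recall them from 9i/10g, so verify against your release:

-- Sketch: inspect a hidden parameter's current value (run as SYS).
select p.ksppinm parameter, v.ksppstvl value, v.ksppstdf is_default
from x$ksppi p, x$ksppcv v
where p.indx = v.indx
and p.ksppinm = '_db_handles_cached';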
Low Cardinality != Bitmap Index
What I found were hundreds of thousands of db file sequential read wait events to insert a single row. I checked the data dictionary for any supporting indexes and found 10 indexes on the table, 4 of which were bitmap indexes. Fortunately, this was a 10g database, so the object numbers associated with the sequential reads were easily plucked from the trace file using a simple AWK script.
wait #22: nam='db file sequential read' ela= 377 file#=17 block#=20988904 blocks=1 obj#=725386 tim=2691112678912
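Once the obj# values were plucked from the trace, mapping them back to their segments was a simple dictionary lookup, something along these lines (725386 being the obj# from the wait line above):

-- Map the obj# reported in the 10046 trace back to its owner, name and type.
select owner, object_name, object_type
from dba_objects
where object_id = 725386;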
I found that nearly 99.99% of these wait events were owed to this object, a bitmap index. This application is not your standard OLTP, as the underlying table gets loaded with thousands of rows each day via SINGLE ROW inserts. The dreaded concurrency and deadlocking issues did not come into play, well, because the load process is single threaded. However, all queries against this table need to perform very quickly, so in that sense it has an OLTP face. Here is the rub. First, I asked if these indexes (in particular the bitmap indexes) could be dropped prior to the "load" and recreated after. The answer I received was essentially, "no, that is the way the application works." I then asked them to tell me why this index was a bitmap index. The developer stated the rationale was the fact that the data was uniformly distributed over 6 distinct values. I suppose that seems reasonable. I then asked the developer if this column was used in join conditions for other queries. The answer was a resounding NO.
Not to my surprise, the index rebuilt as a standard b*tree was just as efficient and lacked the horrific index maintenance overhead associated with SINGLE ROW inserts. The only reason the index had been defined as a bitmap index was its cardinality and nothing more. I had them drop the bitmap index. The load that had been taking 20+ hours to complete finished in under a minute. The lesson here is: know your data, know your code, and then evaluate the use of bitmap indexes to support your table access. The simple fact of low cardinality does not alone justify the use of a bitmap index. As a matter of fact, this bitmap index was so chubby that when it was re-created post load it came back 99% smaller. I suppose that is another point: bitmap indexes aren't necessarily space savers either if used in an improper context.
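If you want to quantify that kind of bloat in your own environment, comparing the index segment footprint before and after the rebuild is a quick check; a small sketch, with a hypothetical index name:

-- Hypothetical index name; run before and after the rebuild and compare.
select segment_name, blocks, round(bytes/1024/1024) mb
from dba_segments
where segment_name = 'MY_BITMAP_IX';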
BTW, the hundreds of thousands of block reads were not what you might have thought: locks against rows carrying the same bitmap as the inserted value for the bitmap column. Oracle was ranging over the index nonsensically, looking for the proper place to dump the row. As the hundreds of thousands of sequential reads rolled by, not a single TM lock was obtained and ZERO db block changes had accumulated. It was only when the row finally inserted that a few block changes showed up. This is just another example of a peculiarity of bitmap indexes that can crop up if they are used in the wrong context.
Database Continuity
I was recently discussing this desire with a colleague. I told him that I felt there was a need for a really good backup and recovery book. Actually, I expounded a bit and said that there is a need for a good database continuity book. It just feels that backup and recovery is an overused phrase for a dramatically underutilized and uncultivated set of skills. After all, how frequently are we involved in backup and recovery exercises? I would guess that backup and recovery activities comprise less than 5% of the time spent by a DBA during the course of any given year. That would be less than 100 hours in a full work year. I suspect it could be much less for some.
Isn't spending little or no time on backup and recovery a good thing? It does imply our systems are resilient and that few faults surface that require us to exercise our recovery plan. And, in the age of RMAN, we simply don't have to worry about the nuances of recovery, right? RMAN knows exactly what is needed for restoration, and all the DBA needs to do is execute a few commands to restore and recover the database. But what technology has afforded us in ease of backup configuration and redundant infrastructure, it has equally taken away in our ability to confidently take control when up against a critical database recovery scenario. In short, we become complacent and our knowledge of backup and recovery diminishes over time. How confident are we that our backup strategy meets the recovery point (RPO) and recovery time (RTO) objectives of our business?
I digress. Let's get back to the conversation with my colleague and this notion of database continuity. I defined database continuity for him as follows: database continuity is a superset of knowledge, processes and tools that fulfill the data protection requirements of an organization. By consequence, backup and recovery become processes within the continuity methodology. Database continuity is a broadened perspective of Oracle database recovery and is intended to include disaster recovery, standby databases, archive log management, user-managed backups, RMAN, RPO and RTO, etc. Each of these aspects of database continuity requires the DBA to have a firm understanding of Oracle database recovery. If we truly understand recovery, these different continuity dimensions converge rapidly: you can plug your knowledge of recovery into any dimension. So, while the notion of database continuity has greater breadth at face value, it can be reduced to recovery mechanics, constructs and objectives.
That being said, I have many ideas about a book on Oracle database continuity. However, I want to hear from you. What do you find lacking in the backup and recovery books on the market? Maybe one text speaks to an aspect for which you wish the author had given more detail. Or maybe there is an overindulgence in certain topics that you wish had been left out. What material would help you retain and reuse your recovery knowledge? I am not out to write a book on RMAN or Data Guard; thousands of pages have already been devoted to the treatment of those technologies. I view guides on such topics as utilities to effect my recovery objectives and mobilize my recovery knowledge.
RMAN, RAC, ASM, FRA and Archive Logs
Update: 03/26/2008
A colleague of mine has been doing some testing using RMAN, RAC, ASM and the FRA for archive log management. He has also tested the integration of Data Guard into this configuration. To be more precise, he has tested using an FRA residing in an ASM disk group as the only local archive log destination. In addition to the local destination, each archive log is sent to the standby destination. Based on his testing, this approach is rather robust. The archive logs are backed up via the "BACKUP RECOVERY AREA" command on a regular schedule. This enables the FRA's internal algorithm to remove archive logs that have been backed up once the space reaches 80% full; no manual intervention is required to remove the archive logs. Moreover, the archive logs in this configuration will only be automatically deleted from the FRA if both of the following are true: 1) the archive log has been backed up, satisfying the retention policy, and 2) the archive log has been sent to the standby. When there is a gap issue with the standby database, the archive logs are read from the FRA and sent to the standby. It works really nicely!
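For reference, the destination setup he tested boils down to something like the following; the disk group and standby service names are illustrative, and the archive log backups themselves are driven by RMAN's BACKUP RECOVERY AREA command on a schedule:

-- Sketch of the archive log destinations described above (names/sizes are illustrative).
alter system set db_recovery_file_dest_size = 100G;
alter system set db_recovery_file_dest = '+FRA';
alter system set log_archive_dest_1 = 'LOCATION=USE_DB_RECOVERY_FILE_DEST';
alter system set log_archive_dest_2 = 'SERVICE=standby_db LGWR ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)';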
Last Blog Entry (sysdate-364)
Physical Standby Turbo Boost
In some cases you can dramatically speed up your recovery time by copying a small subset of your production database to your standby environment and resuming recovery. For example, if a large percentage of your database's write activity is absorbed by a small subset of your database, you are primed for a standby recovery turbo boost. Notice I did not say a small percentage of your data files. After all, you could have 90% of your writes going to 10% of your data files, but those data files might comprise 90% of your database footprint. In most cases a small percentage of your database files equates to a small subset of your database, but not always.
If a vast majority of writes go against a small subset of your database, how would copying these files to your standby give your recovery a boost? During recovery if Oracle does not need to recover a file it won't. All of those redo entries dedicated to recovering those files will just get passed over. Knowing this simple fact can help you get your physical standby database back on track to meet the needs of your business quickly.
The first order of business is to determine whether this write-skew condition exists in your database, that is, to identify the files that, if copied to your standby, would benefit your recovery time the most. Fortunately, this information can be easily gathered using the v$filestat and v$datafile dynamic performance views in your production database. The following query will get you the top N most heavily written files in your database.
select * from
(select a.name, b.phyblkwrt from v$datafile a, v$filestat b
where a.file# = b.file# order by 2 desc)
where rownum <= N;
If you know the data files that are getting written to the most in production then you also know the most frequently written to files on your standby during recovery. If Oracle can skip over redo entries during recovery then you avoid all of that physical and logical I/O against your standby data files. To recover a database block you have to perform a read and a write of that block. If your writes are somewhat evenly distributed amongst the files in your database then it will be more difficult to get that turbo boost. But, if 60+% of your database writes are absorbed by <= 10% of the database footprint you could gain a significant boost in the recovery time by shipping those files to your standby.
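To put actual numbers on that skew, the earlier query can be extended with an analytic function; a rough sketch:

-- Percentage of total physical block writes attributable to each data file.
select a.name, b.phyblkwrt,
round(100 * ratio_to_report(b.phyblkwrt) over (), 2) pct_of_writes
from v$datafile a, v$filestat b
where a.file# = b.file#
order by b.phyblkwrt desc;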
I know this is a rather short post, but this little tidbit just might help you get out of a physical standby database recovery dilemma.
Logical Reads and Orange Trees
If I had to gather up 313 oranges from an orchard using a basket that could only hold 25 oranges, then it would take me at least 13 visits to one or more trees to complete the task. Don't count the trees. Count the visits.
Oracle Riddles: What's Missing From This Code?
/* Script blog.sql
spool blog.out
set feed on echo on;
select * from v$version;
drop table mytable;
create table mytable (col1 number) tablespace users;
insert into mytable values (3);
commit;
begin
for i in 1..15 loop
insert into mytable select * from mytable;
commit;
end loop;
end;
/
analyze table mytable compute statistics;
select count(*) from mytable;
select blocks from dba_tables where table_name = 'MYTABLE';
select blocks from dba_segments where segment_name = 'MYTABLE';
select index_name from user_indexes where table_name = 'MYTABLE';
set autot traceonly;
select * from mytable;
set autot off;
REM Bookends to show no DML or DDL statement has been executed.
select statistic#, value from v$mystat where statistic# in (4,134);
... missing statement
REM Bookends to show no DML or DDL statement has been executed.
select statistic#, value from v$mystat where statistic# in (4,134);
set autot traceonly;
select * from mytable;
set autot off;
select blocks from dba_tables where table_name = 'MYTABLE';
select blocks from dba_segments where segment_name = 'MYTABLE';
select index_name from user_indexes where table_name = 'MYTABLE';
select count(*) from mytable;
spool off;
End Script blog.sql */
/* Output
oracle@eemrick:SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi
PL/SQL Release 10.2.0.1.0 - Production
CORE 10.2.0.1.0 Production
TNS for Solaris: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production
5 rows selected.
oracle@eemrick:SQL> drop table mytable;
Table dropped.
oracle@eemrick:SQL> create table mytable (col1 number) tablespace users;
Table created.
oracle@eemrick:SQL> insert into mytable values (3);
1 row created.
oracle@eemrick:SQL> commit;
Commit complete.
oracle@eemrick:SQL> begin
2 for i in 1..15 loop
3 insert into mytable select * from mytable;
4 commit;
5 end loop;
6 end;
7 /
PL/SQL procedure successfully completed.
oracle@eemrick:SQL> analyze table mytable compute statistics;
Table analyzed.
oracle@eemrick:SQL> select count(*) from mytable;
COUNT(*)
----------
32768
1 row selected.
oracle@eemrick:SQL> select blocks from dba_tables where table_name =
'MYTABLE';
BLOCKS
----------
61
1 row selected.
oracle@eemrick:SQL> select blocks from dba_segments where segment_name =
'MYTABLE';
BLOCKS
----------
64
1 row selected.
oracle@eemrick:SQL> select index_name from user_indexes where table_name =
'MYTABLE';
no rows selected
oracle@eemrick:SQL> set autot traceonly;
oracle@eemrick:SQL> select * from mytable;
32768 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1229213413
-----------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
-----------------------------------------------------------------------------
0 SELECT STATEMENT 32768 65536 26 (4) 00:00:01
1 TABLE ACCESS FULL MYTABLE 32768 65536 26 (4) 00:00:01
-----------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
2248 consistent gets
0 physical reads
0 redo size
668925 bytes sent via SQL*Net to client
24492 bytes received via SQL*Net from client
2186 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
32768 rows processed
oracle@eemrick:SQL> set autot off;
oracle@eemrick:SQL> REM Bookends to show no DML or DDL statement has been
executed.
oracle@eemrick:SQL> select statistic#, value from v$mystat where statistic#
in (4,134);
STATISTIC# VALUE
---------- ----------
4 18 <-- Statistic #4 is user commits
134 461920 <-- Statistic #134 is redo size
2 rows selected.
oracle@eemrick:SQL> ... missing echo of statement
oracle@eemrick:SQL> REM Bookends to show no DML or DDL statement has been
executed.
oracle@eemrick:SQL> select statistic#, value from v$mystat where statistic#
in (4,134);
STATISTIC# VALUE
---------- ----------
4 18
134 461920
2 rows selected.
oracle@eemrick:SQL> set autot traceonly;
oracle@eemrick:SQL> select * from mytable;
32768 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1229213413
-----------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
-----------------------------------------------------------------------------
0 SELECT STATEMENT 32768 65536 26 (4) 00:00:01
1 TABLE ACCESS FULL MYTABLE 32768 65536 26 (4) 00:00:01
-----------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
173 consistent gets
0 physical reads
0 redo size
282975 bytes sent via SQL*Net to client
1667 bytes received via SQL*Net from client
111 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
32768 rows processed
oracle@eemrick:SQL> set autot off;
oracle@eemrick:SQL> select blocks from dba_tables where table_name =
'MYTABLE';
BLOCKS
----------
61
1 row selected.
oracle@eemrick:SQL> select blocks from dba_segments where segment_name =
'MYTABLE';
BLOCKS
----------
64
1 row selected.
oracle@eemrick:SQL> select index_name from user_indexes where table_name =
'MYTABLE';
no rows selected
oracle@eemrick:SQL> select count(*) from mytable;
COUNT(*)
----------
32768
1 row selected.
oracle@eemrick:SQL> spool off;
End Output */
Clue: The missing statement is not "alter system set do_less_work = true;"
Increasing the Longevity of Your CPU
One of my current assignments is to evaluate the potential to increase CPU headroom on a server running a large OLTP Oracle database. Of course, any project such as this is typically motivated by the desire to save money by forgoing a seemingly imminent hardware upgrade. Realizing more CPU headroom for your Oracle database server can be achieved through, but is not limited to, the following approaches:
1. Add more same-speed CPUs to your existing server.
2. Replace your existing CPUs with faster CPUs.
3. Replace your existing CPUs with a greater number of faster CPUs.
4. Commission a new server platform with more same-speed and/or faster CPUs.
5. Commission a new server platform with a greater number of slower CPUs.
6. Chronologically distribute the load on the system to avoid spikes in CPU.
7. Reduce the work required of the system to satisfy the business.
More often than not, I suspect approaches 1-5 are chosen. I am of the opinion that 1) and 3) are more predictable when trying to evaluate the expected CPU headroom yield. Propositions 2), 4) and 5) can be a little less predictable. For example, if I double the speed of my current CPUs, will I yield the upgraded CPU cycles as headroom? That is, if I am running 10x500MHz and upgrade to 10x1GHz, will I now have an additional 5GHz of headroom? It has been my experience that upgrades such as these do not produce such predictable results, especially if your current box approximates 100% utilization. Certainly, moving to a new server with a greater number of same-speed and/or faster CPUs is a tricky proposition. New servers need to be tested with production volume and great rigor. While at face value 500MHz would appear to be universally "portable" to any server, there are many other factors that can influence your CPU horsepower: memory architecture, amount of processor cache, processor crosstalk, etc. Options 1-5 can all be very costly and in some cases yield undesirable and unpredictable results.
If you have the luxury of distributing the load on your system to avoid spikes in CPU then that is a great first option. It could buy you more time to evaluate a longer-term solution. For example, shifting any batch jobs to off-peak OLTP hours might give you immediate relief.
What if we can simply "do less" to satisfy the needs of the business? This concept is not new to most Database Administrators and rings rather cliché. After all, aren't we brow-beaten by the dozens of books and countless articles that speak to database and SQL optimization? The "do less" principle is very sound, but it can be intractable. Reducing the work required of an application often requires management support and can run into political obstacles at every turn. Getting Application Developers and Database Administrators to work in lockstep can require a significant effort. But if Management, Developers and Database Administrators buy into a synergistic endeavor, the benefits can be amazing, and can save the company a large sum of money.
If you are lucky enough to be working on a project where the common goal of all parties is to reduce the CPU load on your system then I have learned a few things that I hope can help you.
Identify the Targets for Optimization
Identify those SQL statements that contribute the most to the CPU load on your database server. These are usually the statements that produce the most logical I/O in your database. Take caution when trying to identify them: you shouldn't focus solely on the statements with the highest logical I/O (LIO) to execution ratio, because often you will find statements that are well optimized but are executed with extremely high frequency. Look for the aggregate LIO footprint of a SQL statement. Without Statspack or AWR this analysis might be very difficult. However, if you collect this diagnostic data you can use the LEAD (or LAG) analytic function to craft a nice SQL statement that identifies the top CPU-consuming statements on your system (join stats$sql_summary and stats$snapshot).
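As a rough sketch of that analysis (assuming the 9i-style hash_value column in stats$sql_summary; the statement identifier column differs by Statspack release, so adjust to yours), the per-interval deltas can be computed like this:

-- Per-snapshot deltas of logical I/O and executions per statement (LAG shown; LEAD works too).
select snap_time, hash_value,
buffer_gets - lag(buffer_gets) over (partition by hash_value order by snap_id) gets_delta,
executions - lag(executions) over (partition by hash_value order by snap_id) exec_delta
from (select ss.snap_id, sn.snap_time, ss.hash_value, ss.buffer_gets, ss.executions
from stats$sql_summary ss, stats$snapshot sn
where ss.snap_id = sn.snap_id
and ss.dbid = sn.dbid
and ss.instance_number = sn.instance_number)
order by gets_delta desc nulls last;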
Don’t limit your SQL statement identification to just those statements flagged by your analysis as a top CPU consumer. Go another step and identify the most frequently executed statements. Some of the most frequently executed statements are the most optimized on your system. These statements if executed by many programs concurrently can influence concurrency and thusly CPU load. One approach I took recently identified the top 20 CPU consuming statements during a 12 hour window of each week day. I then ran the same analysis against the most frequently executed statements on the system. The results yielded only 31 distinct statements as 9 were on both lists. The amazing thing is that, on average, these 31 statements contributed to 58% of all logical reads on the system and 59% of all executions. Keep in mind that there were over 20 thousand distinct statements cached in the Shared Pool. It is rather amazing that such a small subset of the application footprint contributed so greatly to the aggregate load.
Ask The Right Questions
The identification phase is crucial as you want to optimize that which will yield the greatest benefit. Subsequent to the identification phase the Database Administrators and Developers can sit and discuss approaches to reduce the load incurred by these SQL statements. Here are some of the key points I have taken away during such collaborative efforts.
1. Is there a better execution plan for the statement? Optimization is often achieved by rewriting the query to get at a better execution plan. While I don't like hinting code, hints can relieve pressure in a pinch.
2. Does the statement need to be executed? If you see SQL statements that seldom or never return rows (rows/exec approaches 0), there is a possibility they can be eliminated from your application.
3. Does the statement need to be executed so frequently? You might be surprised that Developers often have other application-side caching techniques that can dramatically reduce the frequency of a statement’s execution against the database. Or, the application might simply call the statement needlessly. It doesn’t hurt to ask!
4. Are the requirements of the business immutable? Sometimes you can work an optimization by simply redefining what is required. This is not the tail wagging the dog here. It is possible that the business would be completely happy with a proposed optimization. For example, can the query return just the first 100 rows found instead of all rows?
5. Do the rows returned to the application need to be sorted? Highly efficient SQL statements can easily have their CPU profile doubled by sorting the output.
6. Are all columns being projected by a query needed? If your application retrieves the entire row but only needs a very small subset of the attributes, it is possible you could satisfy the query using index access alone.
7. Is the most suitable SQL statement being executed to meet the retrieval requirements of the application? Suitability is rather vague but could apply to: the number of rows fetched, any misplaced aggregation, insufficient WHERE clause conditions etc.
8. Are tables being joined needlessly? I have encountered statements that, the Developers determined, join a table and project some of its attributes without ever using that data upon retrieval. The inclusion of another table in such a manner can dramatically increase the logical I/O required. This is extremely difficult for a DBA to discern without intimate application code knowledge.
9. How well are your indexes clustered with your table(s)? Sometimes data reorganization techniques can greatly reduce the logical I/O required of a SQL statement, and sometimes IOTs prove to be very feasible solutions to poorly performing queries.
10. Can I add a better index or compress/rebuild an existing index to reduce logical I/O? Better indexing and/or index compression could take a query that required 10 logical I/O operations down to 5 or 6. This might feel like a trivial optimization. But if that statement is executed 300 times each second, it could save your system 1,500 logical I/Os per second. Never discount the benefit of a 50% reduction in an already seemingly optimized statement (see the sketch after this list).
11. Can I reorganize a table to reduce logical I/O?
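As a small illustration of point 10, Oracle can estimate the benefit of key compression for an existing index before you commit to it; a sketch with a hypothetical index name (note that VALIDATE STRUCTURE, when not run online, blocks DML, so do it off-peak):

-- Hypothetical index name. VALIDATE STRUCTURE populates index_stats with the
-- optimal compression prefix length and the estimated space savings.
analyze index my_app_ix validate structure;

select name, opt_cmpr_count, opt_cmpr_pctsave
from index_stats;

-- If the estimate looks worthwhile, rebuild with that prefix length.
alter index my_app_ix rebuild compress 2;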
I suspect most of us have read that 80% of optimization is application centric (I tend to feel that the percentage is higher). Usually the implication is that the SQL being generated and sent to the database is 80% of the target for optimization. More specifically, optimization requires the tuning of SQL 80% of the time. However, don’t limit your efforts to optimize your application to “tuning the SQL.” Sometimes a portion of your optimization will include “tuning the algorithms” used by the application. Needless execution and improper execution of SQL statements can be equally destructive. Hardware resources, in particular CPUs, can be very expensive to purchase and license for production Oracle databases. It is well worth the effort to at least investigate the possibilities of increasing CPU headroom by decreasing CPU utilization.
Update: An astute reader suggested I mention Cary Millsap's Optimizing Oracle Performance with regard to this topic. I highly recommend reading this book as it weighs in heavily on Oracle optimization and Method R. Trust me, if you have optimization on the brain, don't miss this read.
Oracle Riddles: What's The Point?
Don't Get Caught With Your GUI Down
I am not basing my opinion on one simple question concerning user-managed backups, but a host of other questions given as mental exercises. What are some of the Oracle wait events? What do they represent? How would you go about troubleshooting systemic response time degradation in your production database? What is extended SQL tracing and why use it? Time after time candidates struggled to give lucid, well thought out responses. A vast majority of responses could be summarized as, "I would go into OEM and check for A or B." I don't have a problem with using OEM, but usually the A’s and B’s had little relevance to the question.
The herd of available DBAs who are able to navigate the database using native SQL to get at critical performance diagnostic information has thinned dramatically. Sometimes I wonder what would happen to some of the shops supported by the candidates I interview if OEM, Database Control or Grid Control took the night off. When relegated to digging into the database and troubleshooting armed only with a SQL prompt, many appear to be lost. I certainly appreciate what the Oracle GUI database management tools bring to the table. I even like them. My point is, don't throw away your shovel just because you have a snow blower. The day will come when your GUI will fail you and it will be just you and your SQL prompt.
P.S.> Oracle does not lock the content of its data files during the course of a user-managed hot backup. Actually, Oracle only freezes one thing: the master checkpoint SCN inside the file header. Other constructs in the file header stay mutable. Blocks in data files being backed up can be modified as per normal database operation. The changes to those blocks are indeed recorded in the redo, but they are not replayed when END BACKUP is issued. More redo is generated because Oracle must accommodate the potential presence of fractured blocks in the copied files (it logs whole block images on first change during the backup).
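If you have never watched this first-hand, a minimal user-managed hot backup sketch looks like the following (the tablespace name is just an example, and the database must be in ARCHIVELOG mode); v$backup shows which files are currently in backup mode:

alter tablespace users begin backup;

-- Copy the tablespace's data files with OS tools (cp, tar, etc.) while writes continue.

select file#, status, change#
from v$backup
where status = 'ACTIVE';

alter tablespace users end backup;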
SQL Gone Wild!
The child table had millions of rows that would have been orphaned had the delete succeeded. Keep in mind the constraint was NOT defined with ON DELETE CASCADE. Also, a single-column index on the child table was associated with the child key. The stage was set for a swift and proper refusal by Oracle to perform our delete. But this did not happen. Oracle was visiting ALL of the child rows before returning ORA-02292 "... - child record found." Yes, each and every child index entry was being visited. My colleague opened an SR with a very elegant little test case that reproduces the problem. Here it is. Try it for yourself and watch the trace with wonder and amazement. We have performed the test in 8i, 9i and 10g with the same results.
DROP TABLE CHILD;
DROP TABLE PARENT;
CREATE TABLE PARENT (COL1 NUMBER);
ALTER TABLE PARENT ADD CONSTRAINT PARENT_PK PRIMARY KEY (COL1);
CREATE TABLE CHILD (COL1 NUMBER);
CREATE INDEX CHILD_IX_01 ON CHILD (COL1);
ALTER TABLE CHILD ADD CONSTRAINT CHILD_FK_01 FOREIGN KEY (COL1) REFERENCES PARENT;
INSERT INTO PARENT VALUES (999999999999);
INSERT INTO CHILD VALUES (999999999999);
COMMIT;
-- Insert approximately 1 million records into CHILD
begin
for i in 1..20 loop
insert into child select * from child;
commit;
end loop;
end;
/
alter session set events '10046 trace name context forever, level 12';
DELETE FROM PARENT WHERE COL1 = 999999999999;
Why doesn't Oracle stop once it encounters the first index entry indicating a foreign key violation has just occurred? Isn't a single found entry sufficient to fail my statement? It seems a bit indulgent to check each and every child row irrespective of my barbaric attempt to break my own business rules. Is it a classic case of stupid is as stupid does? Nope. It is a good old fashioned Oracle bug.
By the way, the Oracle support analyst recommended putting the index associated with a child key in a read only tablespace as a workaround. Think about that for a second...
Training Class (Final Day)
- Tuning Block Space Usage.
- Tuning I/O.
- Tuning PGA and Temporary Space.
- Performance Tuning: Summary.
I found the Tuning I/O lecture somewhat interesting. The first portion of the lecture focused on the advantages and disadvantages of the various forms of RAID protection. While informative, I could have spent 5 minutes on Google had I not already been armed with knowledge of this technology. The remainder of the lecture focused on ASM (Automatic Storage Management). This rather non-trivial 10g feature sounds very cool: define some disk group(s), the relevant protection and striping granularity, and let Oracle do all of the I/O tuning. Of course, this is a severe oversimplification of what it really does (or doesn't, as your mileage may vary). But the point is, this feature is supposed to free the DBA from the oftentimes laborious chore of tuning the I/O subsystem. Truthfully, I think the degree to which Oracle touts the hands-off nature of this feature is overstated, especially for busy production systems. Neither I nor anyone in the class had worked with the product. Consequently, I suspect there are probably very few shops out there migrating their production databases to ASM. Is it more of a political battle? After all, if DBAs are able to someday create and manage the logical volumes/file systems, the System Administrators might feel a little encroached upon. It is just a hunch, but widespread conversions to ASM will probably not happen anytime soon. Does anyone reading this blog have any good/bad experience with ASM in a production environment? I am very interested in your feedback.
The most engaging lecture of the day was Tuning Block Space Usage. I am really keen on the Automatic Segment Space Management (ASSM) feature. This feature warrants serious consideration given the upside: free list elimination and a considerably more robust approach to reusing blocks for inserts. As much as I liked the discussion on ASSM, the subsequent topic grabbed my utmost attention: segment shrinking. What a great (and might I add way overdue) feature. If one of my production environments were on 10g today, I could see using this tool to reclaim vast amounts of space in some of my very large heap tables, index-organized tables and indexes. Oracle claims that the majority of the work can be done online. Moreover, the indexes associated with your heap tables are still usable even after the row movement inherent to the SHRINK has completed. I like the idea of having the freedom to perform these "online" activities, but I still prefer to perform these kinds of operations during quiet periods. The course material gives a fantastic, albeit brief, description of the mechanics. Very nice, Oracle! Once again, are there any readers of this blog who have experience with this feature and want to share it?
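For the curious, the shrink itself is only a couple of statements (table name hypothetical; the segment must reside in an ASSM tablespace and row movement must be enabled first):

alter table big_heap_table enable row movement;

-- COMPACT defers moving the high water mark; CASCADE also shrinks dependent indexes.
alter table big_heap_table shrink space compact;
alter table big_heap_table shrink space cascade;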
The final two lectures, Tuning PGA and Temporary Space and Performance Tuning Summary, were good, but not great. The material seemed to belabor a few points.
In summary, if you are considering taking this course I think you are best served if you do not have much 10g experience in production environments. If your experience with 10g and some of the "tuning" features is even moderate, I recommend you not take the course. Your time would be better spent reading up on this material in the Oracle documentation set.
Eric's rating of the course: B+.
Training Class (Day 3)
- Tuning the Shared Pool.
- Tuning the Buffer Cache.
- Automatic Shared Memory Management.
- Checkpoint and Redo Tuning.
Apparently, Oracle is migrating some of its serialization protection from latches to mutexes. For example, the structures previously protected by the library cache pin latch are now protected by a mutex, as evidenced by the cursor:pin S wait event. Actually, there are several mutexes and mutex-related wait events new to 10g. For example:
- cursor:mutex indicates mutex waits on parent cursor operations and statistic block operations.
- cursor:pin events are waits for cursor pin operations (library cache pin now protected by mutex).
There are a couple of interesting facts about Oracle and mutexes. A mutex get is about 30-35 instructions, compared to 150-200 instructions for a latch get. Also, a mutex is around 16 bytes in size, compared to 112 bytes for a latch in release 10.2 (in prior releases, a latch was about 200 bytes).
One of the appeals of the mutex, per the documentation, is the reduced potential for false contention. That is, a mutex can protect a single structure, oftentimes stored with the structure it protects, whereas latches often protect many structures (see the cache buffers chains latch) and can yield what the documentation calls false contention. It is called false contention because "the contention is for the protection mechanism rather than the target object you are attempting to access." This all sounds really great, right? Well, maybe. If Oracle moves to more widespread use of mutexes instead of latches to protect target objects, that would mean a boatload more mutexes. I am sure the porters at Oracle are not intending to use mutexes exclusively in the future. But I can see how contention in Oracle could be dramatically reduced at the cost of CPU cycles and memory. What would happen if Oracle protected each buffer with a mutex? While each mutex is less expensive than an individual latch with regard to memory and CPU, you would need considerably more mutexes for each replaced latch. Fifty mutexes used to replace a single latch could run the CPU up considerably for the "same" application workload.
I have one final note on mutexes. As of version 10.2.0.2 a SELECT against V$SQLSTAT and searches of child cursor lists are mutex protected.
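If you want to see whether mutex contention is showing up on your own 10.2 system, the new v$mutex_sleep view summarizes it; for example:

-- Which mutex types and code locations are sleeping the most (10.2 and later).
select mutex_type, location, sleeps, wait_time
from v$mutex_sleep
order by sleeps desc;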
I found the Tuning the Buffer Cache discussion somewhat interesting. Unless you have been hiding under a rock the past 4-5 years, I am sure you have heard the Oracle experts preaching the notion that ratios are not very helpful in diagnosing the health of a database. In particular, the buffer cache hit ratio is frequently tagged as meaningless. A smile came to my face when I read the following excerpt from the course material:
"A badly tuned database can still have a hit ratio of 99% or better...hit ratio is only one part in determining tuning performance...hit ratio does not determine whether a database is optimally tuned..."
Oracle is finally teaching what the experts have been saying for years!
I have been to several Hotsos events/training classes. They often talk about the need to include the buffer is pinned count statistic in the tally for logical reads. These operations are simply latch-reduced logical reads. Why doesn't Oracle integrate this information into their course material or documentation set? They still only claim that db block gets and consistent gets constitute logical reads. I monitored a process recently in one of my production environments and noticed the process did 2 buffer is pinned count logical reads for every 1 (db block gets + consistent gets). That is a substantial percentage of work owed to operations not officially categorized as a measure of work by Oracle.
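If you want to run the same tally for your own session, something like this works; joining v$mystat to v$statname avoids hard-coding statistic numbers, which shift between releases:

-- Session-level logical read tally, including the latch-reduced pinned-buffer visits.
select sn.name, ms.value
from v$mystat ms, v$statname sn
where ms.statistic# = sn.statistic#
and sn.name in ('db block gets', 'consistent gets', 'buffer is pinned count');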
Lastly, the on-topic impromptu discussions were fruitful. That always makes the training session more interesting :)
Instructive Presentation on Logical I/O Mechanics
Training Class (Day 2)
- Metrics, Alerts and Baselines.
- Using Statspack.
- Using Automatic Workload Repository.
- Reactive Tuning.
Having limited exposure to 10g in any true production environment, I found 75% of these topics interesting (the Statspack chapter was not valuable to me). I really like what Oracle has accomplished in 10g with regard to the gathering and reporting of statistics and metrics (the rates of change for given statistics). About 5 years ago I wrote a utility for 9i that allowed me to compare Oracle-captured statistics and wait event durations to similar reference points. This utility, which I dubbed AppSnap (written in PL/SQL), captured the statistics and wait event durations each hour and calculated and stored the deltas in a separate tablespace. This permitted me to compare what is considered "typical" load to current load and evaluate the deviations rather quickly. I also wrote a Unix shell script reporting tool called Instance Health that reports the deltas each hour as they relate to what I call peer hours. For example, each hour a report is generated as a text file and stored in a log directory. The most recent delta is compared to the same hour of day for the past 30 days, the same hour of day and day of week for the past 12 weeks, and all hours for the past 30 days. This has proved to be very valuable for detecting systemic anomalies after application upgrades, etc.
Okay. Now Oracle has come along with 10g and provides the same functionality (albeit not for free). I appreciate the graphical conveyance of this type of analysis provided by Enterprise Manager. Shoot, Oracle even calculates the variance within the sampled timeframe for each metric. This is really cool because you can easily write a query that ascertains whether some metric is statistically anomalous (i.e., +/- 3 standard deviations from its norm). At first glance, some of the AWR reports are not very intuitive, but the more you stare at them the more sense they appear to make. The Active Session History reporting is also a very nice feature (once again, not free).
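As a quick illustration of that idea, one could compare a metric's current value from v$sysmetric to its AWR history in dba_hist_sysmetric_summary; a rough sketch (the metric name is just an example, group_id 2 is typically the 60-second system metric group, and the Diagnostics Pack license applies):

-- Flag a metric whose current value is more than 3 standard deviations
-- away from its historical hourly average.
select c.metric_name, c.value current_value,
avg(h.average) hist_avg, stddev(h.average) hist_stddev
from v$sysmetric c, dba_hist_sysmetric_summary h
where c.metric_name = h.metric_name
and c.group_id = 2
and c.metric_name = 'Logical Reads Per Sec'
group by c.metric_name, c.value
having abs(c.value - avg(h.average)) > 3 * stddev(h.average);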
If you already have considerable work experience with AWR/ASH/ADDM then this class probably won't provide you much value. The course does go into the mechanics of the data capturing and touches rather superficially on the reporting capabilities. So there is a good chance you probably have more knowledge about these products than this class affords. However, if you are like me and have yet to dig in your heels on a 10g production environment this class could serve as a very nice primer.
Well, I am off to day 3 of this training class.
Training Class (Day 1)
The first day was just okay for me. I learned a few things but I felt the class dragged a bit: 30 minutes getting to know each other and too many unrelated tangents or related, yet gratuitous, topic embellishments. I don't mind an occasional anecdotal deviation as it relates to the topic at hand, but those that are completely off topic really slow down the course. You know when you are reading a story and everything seems to flow nicely (proper balance of dialogue and narrative), then you run into a couple pages of narrative? It really puts the brakes on the interest.
I tend to evaluate courses on:
- The conveyed knowledge of the instructor.
- The presentation of the subject matter.
- The quality of the course material.
- The value of the impromptu discourse along the way (icing on the cake stuff).
Based on the first day, I feel the instructor has a good command of the subject matter and adds value to the class based on relevant experience. I have been to several Oracle classes where the instructor did nothing more than read the slides verbatim and/or appeared to have little relevant experience. Aside from the introductions, we spent the remainder of the day on two topics: a performance tuning overview and 10g statistics/wait events. The study material that accompanies the course is rather good; I have taken the liberty to skip ahead and get into the meat of the course material. I am looking forward to the class tomorrow as I feel we will be digging our heels in a bit more (or at least I hope).
Given the sparse amount of substantive material covered on the first day, I don't have any really interesting takeaways. I'll give the first day of class a B.
iLoveit
I really like the iPod - versatile, convenient and just loads of entertainment/educational potential for my ~2 hours of public transit commute each day.
With more and more news and entertainment mediums adding video Podcasts this little device has become a cool way to pass the commuting time. I enjoy reading on my commute, but watching an interesting show or funny movie can pass the time at record clips. I won't be surprised if I ask the engineer to make another loop so I can finish watching my movie before heading into work :)
I wonder when Mr. Kyte will be Podcasting some material...
I'll be in Oracle training next week attending the Oracle Database 10g: Performance Tuning (Database) class. It should be a really good course. I'll be blogging some of the interesting takeaways from the class, so stay "tuned"!
Cool Illusion
10,000 Visits and Counting
Seriously, it has been a fun 4+ months of blogging. I haven't been able to blog as much as I would have liked over the past 2 months. The fact that I still desire to blog is a good sign though. Time and other constraints in life always crop up. Anyway, it has been really fun and I hope some of you have enjoyed your visits. I have received some very interesting emails and questions from Oracle enthusiasts all over the world.
I want to send one special thanks out to Tom Kyte for recommending I start a blog (and placing me on his metablog even though I stole the naming style of his blog) and another to Doug Burns who featured my blog on his site several months back, giving my traffic a jump start.
Actually, Tom, my first choice for a blog name was The Emusing Blog :)