Skip navigation.

Chen Shapira

Syndicate content I'm just a simple DBA on a complex production system
Writing about all things production. Especially Oracle databases.
Updated: 2 hours 7 min ago

Mad Troubleshooting Skillz!

Mon, 2009-11-23 21:06

Last week I attended Tanel Poder’s Advanced Oracle Troubleshooting seminar, organized by NoCOUG.

Well, actually it was organized by me, with lots of help from Iggy and the rest of NoCOUG. Organizing a seminar was not trivial, but wow – it was totally worth it!

When the seminar was over, we asked the attendees to fill a small survey and tell us what was good and what was bad. Turns out that there was just a single answer for “What was the best thing about the seminar”. The answer is – Tanel.

He is clearly an expert. He loves what he is doing. He can think on his feet and answer all sorts of more-or-less related questions from the audience. He is extremely generous with his time and his knowledge and his scripts (I know lots of DBAs who guard their scripts with their lives, Tanel happily shares them in his blog). Tanel is also funny, entertaining and the course is pretty well structured. He also talks very fast (and seems somewhat obsessed about not wasting a single millisecond) – you won’t believe how much you’ll learn in two days.

I had some trouble explaining explaining to my team and boss what I learned:
“Well, the first half day was troubleshooting hangs and slowdowns using v$views”
“But you knew that before!”
“We also learned lots of Oracle internals. How the shared pool really works, and how SQL is really processed. Lots of cool stuff.”
“OK, but what is it good for?”
“Knowing how stuff works is always good. Anyway, I also learned to use Unix tools to debug problems. Like dumping process stuck to see where it hangs! Also, we learned how to handle free memory issues. Remember that awful leak we had on that test server?”
“No.”
“Oh! Look! Shiny scripts!”

But the proof of the pudding is in the eating. No one can argue the fact that last week I already managed to troubleshoot and solve two problems that other team members failed to make much progress on. I did it very quickly too.
Now here is the strange thing – the two problems were in areas of Oracle that Tanel very explicitly did not mention during the seminar. Streams and Clusterware. I did not even use any of his scripts to shoot them. And yet I’m still convinced that the reason I was so effective in solving those problems is directly related to the seminar. How is that? Here are the important non-technical things I learned at the seminar:

  1. Systematic approach – You don’t work off lists, you don’t waste time by looking at random places and you don’t guess (much). You gather symptoms, you use them to pinpoint the problem and you use the pinpointed knowledge to work your way toward a solution. The last part sometimes involves Oracle support. I knew about the systematic approach thing before, but two days of looking at someone demonstrating it makes a difference.
  2. Don’t believe anyone (except the OS) – Users lie, other DBAs lie, even Oracle sometimes lies. Always crosscheck and double check the facts. No one lies intentionally, but the result is still misleading.
  3. Problems have causes. I know it sounds funny, but very often we stop troubleshooting too soon, attributing a problem to mysterious unknown forces or at least say “well, I don’t know how to know this” and leave things at that. Tanel went farther than anyone I’ve ever seen by saying “I have to know why this behaves like that” and when Oracle doesn’t tell him, he goes to the OS, or the network, or writes his own tools. Thats a good lesson – don’t take no for an answer.
  4. All DBAs have tons of troubleshooting scripts. Real experts have scripts with very short names and very flexible arguments. They also have a script for reminding them how to use their scripts
  5. I no longer view trouble as something annoying that wastes my time and prevents me from doing stuff I want to do. Instead every trouble is now cherished as an opportunity to practice what I learned, learn more and polish my skills.

I highly recommend Tanel’s course to DBAs who want to suddenly become the best troubleshooters in their team. Its not a comfortable position to be in (suddenly a lot more trouble finds its way to you), but it can be lots of fun.

Categories: DBA Blogs

The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation

Tue, 2009-11-03 20:03

I just troubleshooted a server where the amounts of redo generated suddenly exploded to the point of running out of disk space.

After I was done, the problem was found and the storage manager pacified, I decided to save the queries I used. This is a rather common issue, and the scripts will be useful for the next time.

It was very embarrassing to discover that I actually have 4 similar but not identical scripts for troubleshooting redo explosions. Now I have 5 :)

Here are the techniques I use:

  1. I determine whether there is really a problem and the times the excessive redo was generated by looking at v$log_history:
    select trunc(FIRST_TIME,'HH') ,sum(BLOCKS) BLOCKS , count(*) cnt
    ,sum(decode(THREAD#,1,BLOCKS,0)) Node1_Blocks
    ,sum(decode(THREAD#,1,1,0)) Node1_Count
    ,sum(decode(THREAD#,2,BLOCKS,0)) Node2
    ,sum(decode(THREAD#,2,1,0)) Node2_Count
    ,sum(decode(THREAD#,3,BLOCKS,0)) Node3
    ,sum(decode(THREAD#,3,1,0)) Node3_Count
    from v$archived_log
    where FIRST_TIME >sysdate -30
    group by trunc(FIRST_TIME,'HH')
    order by trunc(FIRST_TIME,'HH') desc
    
  2. If the problem is still happening, I can use Tanel Poder’s Snapper to find the worse redo generating sessions. Tanel explains how to do this in his blog.
  3. However, Snapper’s output is very limited by the fact that it was written specifically for situations where you cannot create anything on the DB. Since I’m normally the DBA on the servers I troubleshoot, I have another script that actually gets information from v$session, sorts the results, etc.
    create global temporary table redo_stats
    ( runid varchar2(15),
      sid number,
      value int )
    on commit preserve rows;
    
    truncate table redo_stats;
    
    insert into redo_stats select 1,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size'
    
    commit;
    
    insert into redo_stats select 2,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size'
    
    commit;
    
    select *
            from redo_stats a, redo_stats b,v$session s
           where a.sid = b.sid
           and s.sid=a.sid
             and a.runid = 1
             and b.runid = 2
             and (b.value-a.value) > 0
           order by (b.value-a.value)
    
  4. Last technique is normally the most informative, and requires a bit more work than the rest, so I save it for special cases. I’m talking about using logminer to find the offender in the redo logs. This is useful when the problem is no longer happening, but we want to see what was the problem last night. You can do a lot of analysis with the information in logminer, so the key is to dig around and see if you can isolate a problem. I’m just giving a small example here. You can filter log miner data by users, segments, specific transaction ids – the sky is the limit.
    -- Here I pick up the relevant redo logs.
    -- Adjust the where condition to match the times you are interested in.
    -- I picked the last day.
    select 'exec sys.dbms_logmnr.add_logfile(''' || name ||''');'
    from v$archived_log
    where FIRST_TIME >= sysdate-1
    and THREAD#=1
    and dest_id=1
    order by FIRST_TIME desc
    
    -- Use the results of the query above to add the logfiles you are interest in
    -- Then start the logminer
    exec sys.dbms_logmnr.add_logfile('/u10/oradata/MYDB01/arch/1_8646_657724194.dbf');
    
    exec sys.dbms_logmnr.start_logmnr(options=>sys.dbms_logmnr.DICT_FROM_ONLINE_CATALOG);
    
    -- You can find the top users and segments that generated redo
    select seg_owner,seg_name,seg_type_name,operation ,min(TIMESTAMP) ,max(TIMESTAMP) ,count(*)
    from v$logmnr_contents
    group by seg_owner,seg_name,seg_type_name,operation
    order by count(*) desc
    
    -- You can get more details about the specific actions and the amounts of redo they caused.
    select LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE,count(*) cnt ,sum(RBABYTE) as RBABYTE ,sum(RBABLK) as RBABLK
    from v$logmnr_contents
    group by LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE
    order by count(*)
    
    -- don't forget to close the logminer when you are done
    exec sys.dbms_logmnr.end_logmnr;
    
  5. If you have ASH, you can use it to find the sessions and queries that waited the most for “log file sync” event. I found that this has some correlation with the worse redo generators.

    -- find the sessions and queries causing most redo and when it happened
    
    select SESSION_ID,user_id,sql_id,round(sample_time,'hh'),count(*) from V$ACTIVE_SESSION_HISTORY
    where event like 'log file sync'
    group by  SESSION_ID,user_id,sql_id,round(sample_time,'hh')
    order by count(*) desc
    
    -- you can look the the SQL itself by:
    select * from DBA_HIST_SQLTEXT
    where sql_id='dwbbdanhf7p4a'
    
Categories: DBA Blogs

Thinking about Design Patterns

Mon, 2009-11-02 21:46

I have this friend who is an ambitious young corporate climber. When I was started out as a team lead, I was totally overwhelmed by the office politics I was suddenly exposed to. Naturally, I turned to my friend for advice. She told me to read Machiaveli’s “The Prince”. So I did. Its an interesting read, but not that amazing as far as management advice goes. When I later tried to discuss the book with my friend, I found out that she never actually read the book herself – she just thought it is good advice.

6 years later and I still believe people when they tell me that I have to read a book. I’m naive like that.

So I spent the last two weeks reading “The Timeless Way of Building”. Its an architecture book. Architecture as in cities and buildings. The reason I spent two weeks reading a professional book intended for a different profession is that at some point in history (1994, I believe), some people though that the ideas in the book are relevant to software development. These days you can’t really be a Java developer without being fully fluent in the development pattern language.

And of course, everyone was saying “You have to read The Timeless Way of Building. It will change the way you think about software.”. From my days in development, I still remember quite a bit of stuff about design patterns, and I never really liked that particular approach to software development, but I didn’t really figure out why. After reading the authoritative source on patterns, I can say the following:

  1. Christopher Alexander had some good ideas about patterns. The book is readable to non-architects and is very enlightening. I recommend reading it if you are interested in what makes some cities and buildings feel better than others.
  2. I am pretty sure that his advice on how to design good buildings is not really applicable to software development field.
    A lot of his ideas are based on the fact (which he did the research to prove) that people intuitively know how buildings should be designed, and that when you ask a large number of people “How do you imagine you will feel in such room?”, you’ll get an overwhelming consensus. This is far from being true in the software field.
  3. What is common known today as software design patterns is so far removed from what Christopher Alexander recommended for architects, that software developers should really go and find a different name for what they are doing.

I’m still rather shocked by the differences between Christopher Alexander’s patterns, and what design patterns look like today. It is not few tiny differences that occur whenever ideas are translated from one domain to another. Some of the changes are profound.

First of all, Christopher Alexander says that patterns describe the way people already do things. “Night Life” and “Parallel Streets” existed before the book “A Pattern Language” was written. I’m not at all sure this is the case for design patterns. People buy design pattern books to learn the patterns themselves, not just the language or which patterns are better than others.

Second, patterns should have an intuitive meaning and intuitive name. Again, you don’t need a book to know what is a “Bus Stop” or “Small Parking Lots”. You may want to read the book to find out why they are a good idea, or how to make a good bus stop, but you know what it is. I don’t believe that anyone knew what is an “Abstract Factory” before reading a document about design patterns. Even patterns that have been used for decades got a fancy name. It can take a while to figure out that a Singleton is a global variable. One of the simplest and most common patterns in software development “A function that does exactly one task” is missing from software design patterns. “A Loop” is also a pattern which is missing in action. All this gives the wrong impression that patterns are very complicated and something that can be mastered by experts only – which is exactly the opposite of what Christopher Alexander intended.

Third, patterns are abstract concepts. They are always implemented in a different way, because the entire idea is to be sensitive to the context, which is never the same twice. There is a pattern called “Six-foot balcony”, but it would be wrong to mass-manufacture six-foot balconies and start attaching them to buildings. Six-foot balcony is the idea, the exact shape of the balcony will be designed to match the building, the view, the trees, the sun, etc.
So it is rather annoying to discover that all patterns have “implementation examples”, which developers enjoy copying into their code. I’m all for code reuse, but this is not patterns mean.Wikipedia has a decent description of what defines a pattern, and “being implementable in one or two simple classes that can be copy-pasted” is not part of it.

Executive summary: “Timeless Way of Building” is an interesting book on architecture, with some good insights about how humans like to live and a bit of a Zen feel. You will not learn anything about software development from reading it. If you already know software design patterns, you will be struck by how different the ideas in the book seem.

Categories: DBA Blogs

Lessons From OOW09 #3 – What’s New with Streams?

Mon, 2009-10-19 19:24

The big news with streams is the Golden Gate acquisition. Oracle acquired Golden Gate, a smaller company that has a solution very similar to what Oracle Streams is doing. During OpenWorld Oracle announced that Golden Gate is the new strategic direction and that future development will be towards integrating Streams with Golden Gate. Oracle will still support streams, fix bugs, etc – but don’t expect new features other than the Golden Gate integration.

I missed the session where Golden Gate explain what its all about, but I’m planning to invite Golden Gate representative to give us a presentation and explain exactly what they do.

In the mean while, interesting new stuff with the old streams:

  1. One to many replication should be faster and more reliable in 11gR2.
  2. New statement DML handlers – allows manipulating LCRs using SQL statements instead of the PL/SQL code used in the past. According to Oracle it should be 4 times faster this way. One common use case of DML handlers that can now be implemented with the statement handlers is converting “delete” statements into an update that marks a row as “deleted” by modifying a varchar in a column.
  3. Keep Columns – new rule based transformation that allows you to specify which columns should be preserved in an LCR. In the past you can specify columns to drop, but not which columns to keep.
  4. Built in script for recording changes in a log table. This is an incredibly common use case for Streams, and Oracle now has a single command that automatically sets up the necessary statement handles and keep columns. Just call DBMS_STREAMS_ADM.MAINTAIN_CHANGE_TABLE.
  5. XSTREAM – Oracle says its a new API for allowing external programs to insert and pull changes from streams. Its cool, but I’m not convinced its new, since I’ve heard about a similar feature in previous versions under a different name.
  6. Streams performance advisor – thats an 11gR1 feature, but I didn’t know about it. Its a PL/SQL package and a bunch of views that can be used to report on streams performance. It should also be able to detect your topology automatically. You use this by running DBMS_PERFORMANCE_ADVISOR_ADM.ANALYZE_CURRENT_PERFORMANCE and checking a bunch of views. Documented here.
  7. If I followed correctly, the new advisor package uses the UTL_SPADV package, that you can also use for performance analysis or monitoring.
  8. DBMS_COMPARISON – a package that can be used to compare tables between two servers (one should be 11g and the other 10g or 11g). Can be useful when troubleshooting streams.
  9. Streams cross DB tracing – allows tracking a message when trying to troubleshoot apply that doesn’t work properly. You enable message tracking in a session (set message tracking) and then you monitor the views to see the actions that happen to it.
  10. 11g has greatly improved and more detailed apply error messages. This is probably my favorite new feature :) Most of the time I no longer need to print the LCR to debug the issue.
  11. Not sure if this is 11gR1 or R2 – but apparently propagation is no longer managed by “jobs” but has now moved to the new scheduler, making it much more manageable.
  12. Bunch of nice improvements. I’m looking forward to seeing what Golden Gate is doing and why it is so much better.

Categories: DBA Blogs

Lessons From OOW09 #2 – Consolidation Tips

Sun, 2009-10-18 23:54

The session was called “All in One” and it was given by Husnu Sensoy. A young and very accomplished DBA from Turkey. I chatted with him during ACE dinner and it turns out we have many colleagues in common. This was probably the most useful presentation I’ve heard this OpenWorld. As I am going into a large consolidation project for next year, I am glad I can learn from the experience of someone like Husnu who already gone through this and is very willing to share the experience.

His presentation is shared on his blog, so I’ll just give the parts that I consider to be highlights. You will probably learn more by reading his slides. He had tons of good content and he talks very very fast, so I’m sure I missed a bunch of good stuff. Since his content is readily available, I’m mixing in a lot of my own thoughts here.

The problems he set out to solve:

  • Too many DBs and too few DBAs.
  • Some servers are doing almost nothing.
  • Some servers have no HA.

Pick candidates for consolidation based on: License costs, utilization, data center location, dependencies, I/O characteristics, risk levels.

The driver for our consolidation were license costs – our new machines had two quad cores instead of one dual core, so license costs suddenly quadrapled and we were forced into cost saving consolidations. We mostly used data center location and risk levels to decide on the plan. Our most OLTP system, the one that is most sensitive to slow-downs, will remain unconsolidated for now.

Prior to consolidation collect lots of system/performance metrics. They will help pick candidates, plan capacity, test and later troubleshoot.

Don’t forget to talk to DBAs and business reps when making the consolidation plans, they will have their own ideas and this can be important input.

Additive linear models are recommended for capacity planning. He gave lots of guidelines on how to do this. Pages 26-36 in his PDF have the details. I could have sworn he recommended to stay below the 65% utilization when planning for CPU capacity, but I cannot see it in his slides. In any case – do this, because any higher than that and the linear additive model is questionable.

Also pay attention to the part about preferring larger servers and less RAC nodes, since RAC adds complexity. And to the part about every storage system delivering about 70-80% of spec. Actually, this is more true for the EMC system he used. Our Netapps seem to be up to spec.

Don’t mix sequential and random IO (i.e. OLTP and DW) is a good idea. A lot of places can’t really do this because of the way their apps are designed.

Benchmarking the new system to test the capacity plan is a great idea. I’d love to see more concrete information on how to benchmark, maybe a whole other presentation on this. One of the things that worry me most about our consolidation plans is that I’m not sure how good our tests will be. Husnu recommended HammerOra, which I’ll check out.

Crash tests. We did those ages ago when we moved to RAC architecture and then again for Netapp clusters. Was lots of fun and maybe its time to do this again. Husnu advised to ask support for a list of good test scenarios. I recommend taking your sysadmins, storage admins and net admins for few beers and asking them for scenarios – they generally come up with very creative stuff.

Good tip: In 11g, memory target does not work with huge pages. You are using huge pages, right?

Write the backup and recovery document as the first document on the new system.

Pages 76-86 have good advice on merging databases. We’ll be doing that too and I was glad to see that we came up with the same plans and same problems as Husnu.

His last advice is the best: “You never know how long this is going to take.” So true! Who could have known that we will be delayed for 3 month by an IT security group that popped up from no-where with requirement that we will pass certain security audits that we’ve never heard of. Life in a big organization can be full of surprises, so be prepared :)

Categories: DBA Blogs

Lessons From OOW09 #1 – Shell Script Tips

Fri, 2009-10-16 19:46

During OpenWorld I went to a session about shell scripting. The speaker, Ray Smith, was excellent. Clear, got the pace right, educating and entertaining.

His presentation was based on the book “The Art of Unix Programming” by one Eric Raymond. He recommended reading it, and I may end up doing that.

The idea is that shell scripts should obey two important rules:

  1. Shell scripts must work
  2. Shell scripts must keep working (even when Oracle takes BDUMP away).

Hard to object to that :)

Here’s some of his advice on how to achieve these goals (He had many more tips, these are just the ones I found non-trivial and potentially useful. My comments in italics.)

  1. Document dead ends, the things you tried and did not work, so that the next person to maintain the code won’t try them again.
  2. Document the script purpose in the script header, as well as the input arguments
  3. Be kind – try to make the script easy to read. Use indentation. Its 2009, I’m horrified that “please indent” is still a relevant tip.
  4. Clean up temporary files you will use before trying to use them:

    function CleanUpFiles {
    [ $LOGFILE ] && rm -rf ${LOGFILE}
    [ $SPOOLFILE ] && rm -rf ${SPOOLFILE}
    }
  5. Revisit old scripts. Even if they work. Technology changes. This one is very controversial – do we really need to keep chasing the latest technology?
  6. Be nice to the users by working with them – verify before taking actions and keep user informed of what the script is doing at any time. OPatch is a great example.
  7. Error messages should explain errors and advise how to fix them
  8. Same script can work interactively or in cron by using: if [ tty -s ] …
  9. When sending email notifying of success or failure, be complete. Say which host, which job, what happened, how to troubleshoot, when is the next run (or what is the schedule).
  10. Dialog/Zenity – tools that let you easily create cool dialog screens
  11. Never hardcode passwords, hostname, DB name, path. Use ORATAB, command line arguments or parameter files.I felt like clapping here. This is so obvious, yet we are now running a major project to modify all scripts to be like that.
  12. Be consistent – try to use same scripts whenever possible and limit editing permissions
  13. Use version control for your scripts. Getting our team to use version control was one of my major projects this year.
  14. Subversion has HTTP access, so the internal knowledge base can point at the scripts. Wish I knew that last year.
  15. Use deployment control tool like CFEngine. I should definitely check this one out.
  16. Use getopts for parameters. Getopts looked to complicated when I first checked it out, but I should give it another try.
  17. Create everything you need every time you need it. Don’t fail just because a directory does not exist. Output what you just did.
  18. You can have common data files with things like hosts list or DB lists that are collected automatically on regular basis and that you can then reference in your scripts.
  19. You can put comments and descriptions in ORATAB
Categories: DBA Blogs

Most Important Thing I’ve Learned at OOW09

Tue, 2009-10-13 08:11

It is only Tuesday morning, OOW is not even half-way through. But there is something I’ve learned on Monday morning, and it left such a huge impression, that I know right now that nothing else that will happen until Thursday can top this.

Jonathan Lewis gave a presentation on “How to be an Expert” in the Unconference. Later I went on to discuss the issue of expertise with him, especially how much work it takes to become an expert. He told me that this morning (Meaning Monday morning) he was preparing a demo for a presentation, and after testing it on 9i, 10g and 11g, he noticed something strange in the way 10g behaved. He then went on to spend the next hour figuring out the strange behavior he saw. Even though it was not part of the presentation and not one was really expecting him to solve this problem.

This story was like a lightbulb going off in my head.

Because I wish I was sure that I would do just what he did. I am geek enough that there is a possibility I would do it. But there is also a voice in my head that tells me things like “Why are you wasting time just playing with this? Its not like solving this problem is going to be useful in any way. You have more important things to do, stop playing!” (Yes, I’m hearing voices. Don’t you?)

And the problem is that very often that internal manager is correct. There are more important things to do. Always. Stop playing with that interesting issue with the TCP/IP stack, reboot that machine already and solve the next SR, it is urgent and has to get done by lunch. Jonathan repeated that several times in his presentation – DBAs are under a lot of pressure not to be experts.

But I also believe that much of life is a story we tell ourselves. Its invented. Not only the relative importance of understanding the system and solving SRs fasters is an invention, also the importance of having our manager approve of the speed in which we solve SRs is something we decide on.

I can invent a story in which I am an Oracle expert. And as an Oracle expert I take time to understand how things behave and why they behave the way they do. Because this is how experts work.

Of course experts also take into account the wishes and desires of the people who pay them, but it cannot replace the importance of really understanding things. Because in the long term, people do pay you to be an expert and understand what is Oracle really doing.

I really hope this post can be as much of an epiphany to others as Jonathan Lewis’s presentation was to me. Invent yourself as an expert. Take time to learn, research, think, play and understand.

Categories: DBA Blogs

Visualization Session – The Slides

Tue, 2009-10-13 07:49

The “Visualization Session” at OOW Unconference was great. Thanks to everyone who showed up for the lively discussion. It was probably the most fun I’ve ever had at a presentation.
Also thanks for the fine folks whom I later met at the OTN lounge and explained that they wanted to attend my presentation but the OTN lounge had free beer and I did not. I’ll see what I can do about the beer next year.

For those who missed the presentation whether due to beer or to distance from OpenWorld, you can get my slides here. As usual for my presentations, I’m not sure if my slides are meaningful without me standing next to them. It is just a bunch of graphs without the stories. If you really want to hear the stories, you can invite me to speak at your usergroup :)

Categories: DBA Blogs

Visualization Session at OOW Unconference

Thu, 2009-10-08 17:02

OOW is next week, and on Monday, October 12, 4:00pm PST, I’ll be giving my unconference session about visualization.

Visualization is a big topic. What I’m going to talk about is:

  1. Why visualization is such an amazing tool for DBAs. Not just for newbies, it is actually more effective for experienced DBAs. I’ll explain why.
  2. Lots of examples of how you can use different graphs to explore your data set from different angles. I’ll show you some graphs you are familiar with (Histograms), and some that you probably don’t use (QQNorm). I’ll explain when to use each graph type.
  3. Stories about visualization – how a visualization mistake caused a manager to panic and cancel an important maintenance for instance.
  4. Some suggestions on how to format graphs so they’ll have maximum effect in documents, emails and presentations.
  5. If anyone is interested, I can show how I work with R to do visual data analysis. But its not in the slides, so you’ll have to ask for it.

One thing I will not mention:

Edward Tufte.

I know that everyone who talks about visualization talks about him, but I won’t. He does those amazing graphs and demonstrates how powerful visualization can be. I use Excel and R to create my graphs. They are not beautiful or amazing, and they can still be very very effective. This means that you will be able to easily reproduce everything I do in my presentation. You don’t need to be a talented designer to use graphs effectively.

See you on Monday!

Categories: DBA Blogs

One More Thing Everyone Should Know About Queues

Wed, 2009-10-07 18:11

I already posted two things everyone should know about queues, but the incidents of the last month made me realize I missed another very important queue fact.

Don’t double the load on a server and expect that response times will be “a little slower”. If the server is above 30-40% capacity, doubling the load is likely to be catastrophic.

Suppose you have two servers, for example a Netapp heads, that are operating at 48% capacity each, and suddenly one of them fails and you have to move all the load to one of them. This means that you have one server at close to 100% capacity. Users will not complain that things are a bit slow, they will complain that nothing is working.

Someone once told me that if I have two RAC nodes running at 100% CPU each, I do not have high availability. The truth is that you stop having high availability long before the 200% CPU point.

Oh, and in case anyone wonders what we did about the overloaded Netapp. We stopped every non-essential process reading from this Netapp. This included snapmirrors, non-production databases, exports, vmwares, and probably more stuff than I know. This moved utilization down to 60% point and life was good (except that we weren’t too comfortable about lack of snapmirrors and exports).

Categories: DBA Blogs

One of those… Months?

Wed, 2009-10-07 16:28

I’ve had one of those days, and even some weeks like that, but its the first time we have an entire murphy month – where everything possible goes wrong.

Lets see the list:

  1. DBA accidentally dropped production schema. He thought he was on test DB, of course. We are very proud that we managed to restore said schema with no data loss.
  2. One of our databases magically lost the storage network. No idea why or how. Reboot solved it.
  3. 8 hours downtime caused by a faulty switch. We have high availability, so we automatically failed over to the secondary switch. The secondary switch immediately failed too. Since we test the failover regularly, this can only be described as unbelievably bad luck.
  4. One of our Netapp heads failed. Again, we have high availability, so we fail over to the second head. Except that after we fixed the first head, it refused to recognize the disks. According to Netapp, the first head has to run a “rebuild” on the disks, so it can figure out again where is our data. We could have done it with few hours of downtime, but we already had a lot of downtime this month. So we opted for online rebuild. Which is as fun as online rebuild of indexes. Online rebuild of each disk takes around 12 hours. We have 14 disks. It was the week of unbelievable IO latency. The only upside is that for one week the DBA team was not the target for performance complaints.
  5. Bunch of smaller things: DBA who accidentally reset passwords in 20 servers, backups that stopped working, ORA-600 on capture process for our largest streams customer, accidentally exposing data of one customer to another, etc.

It should be obvious that the gods are out to get us. So much bad luck in one month cannot be accidental or random.

Since this run of production crashes coincides with the Jewish “Day of Atonement” (Yom Kippur) and the preceding repentance days period, the solution seemed obvious – I should repent my sins, promise never to repeat them, and pray for atonement. In Judaism any transgression of law is considered a sin. Even if it is not a moral hazard or if it was done by innocent mistake.

So consider this the reverse of new year’s resolutions. What I resolve not to do next year:

  1. Install new servers and publish them as production before I verify that backups and monitors indeed work on these servers.
  2. Undocumented changes on production servers.
  3. Accuse developers of being stupid and lazy. Not even if I find a nice way to paraphrase this.
  4. Ignore large infrastructural problems just because I prefer to work on something else.
  5. Ignore mysterious production glitches, just because they don’t happen a lot.

These steps should help our production be more stable next year. The more positive resolutions will wait for January :)

Categories: DBA Blogs

Kosher Visualization

Tue, 2009-09-22 18:04

I’m working on my visualization presentation (OOW unconference, October 12, 4PM – don’t miss it!), and one of the topics I keep rethinking is how to present results of research in a visual way. Especially when the report or presentation is for non-technical management.

It is perfectly easy to take true data and arrange it on a chart in a way that “proves” whatever it is you want to show. But is it wrong? Is there a one true way to display data and everything else is a lie and a distortion?

Lets look at a handy example: http://www.cunningham.me.uk/wordpress/2007/07/11/how-to-lie-with-statistics-as-shown-by-the-bbc/

In the example (look at the graphs with the red lines), scientists measured a temperature increase of 0.5 degrees over a period of 30 years. The blogger thinks that the first chart lies – because they make a tiny change look scary by changing the scale of the chart. He then shows the “correct” chart where the scale is changed to the point that the temperature increase is barely noticeable.

But is it really that straight forward? Another point of view can be that temperature change of 0.5 degrees over 30 years is a huge deal, and is indeed scary and the graph was scaled to make the correct scientific view more visible. By rescaling the graph you are actually obscuring an important truth and misleading the audience.

What is the truth? I’m just a simple DBA, I’ve no idea about global warming.

But when I do research about a performance issue and then I write a report about the results of my research, and I use charts to demonstrate the important points in my results – I find it legitimate to scale the graphs in a way that makes the important points as clear as possible. If my graphs don’t demonstrate my points in the clearest way possible then I’m doing a bad job.

However, to keep myself ethical, I follow few rules about these modifications:

  1. You are 100% sure, to the best of your knowledge and research, that the point you are making is indeed correct. You are not allowed to hide data just because you did not do a very good job at collecting or analyzing it.
  2. You mention the modification in the report or presentation. You make the original data available to anyone who wants to verify your results.
  3. You have very good reasons for the modifications you did and you feel comfortable presenting them to anyone who questions your charts.
  4. You will be extra careful when rescaling data that is displayed as two dimensional shapes, and make sure that the proportions between the rescaled areas indeed reflect the proportions of the data. Because in 2D small changes are doubled.

You’ll notice that my advice is somewhat subjective – that because I don’t really see an objective way to differentiate between “highlighting an important truth” and “making mountains out of molehills”. You did the research, you know if 0.002ms increase in storage network round-trip time is a big deal or not, and you should decide how to display it. Obviously, if you manage to find a clear and unquestionable way to display your results, so much the better.

Categories: DBA Blogs

Good Stuff

Wed, 2009-09-16 17:14

Oracle Open World! I’ll be there, and so will lots of other cool people. Don’t miss the blogger meetup where we’ll all hang out :) Don’t miss the unconference. The line-up is better than what I see at most events – Greg Rahn, Cary Millsap, Kelvin Closson, Rob van Wijk, Alex Gorbatchev, Richard Foote and I will all be there.

Dr. Neil Gunther! One of the top most performance specialists. His blog is not easy to read, and is not strictly Oracle related, but I’m always glad I take the time to read it because I learn so much. Its also quite entertaining (for load testing nerds). For example: “Without knowing any details, I can see is that the test rig was driven into saturation, starting with the first concurrent request! Therefore, the first data points provide all the comparison information. The other measurements are redundant (log axis or no). So, what’s the point of the plot?”. Oh, and he also has good twittings!

Exadata v2! A DB server so fast the only way to describe it is ridiculous! There’s still not a lot of technical information out there about it, but the FAQ is a good start.

Advanced Oracle Troubleshooting Seminar at NoCoug Unbelievable, but two month before the event 50% of the sits are already taken. If you are interested, you should probably hurry up. Early bird registration ends in a week. Don’t say I didn’t warn you.

Shell tricks! Don’t know about you, but I still do my scripting with BASH. Jared Still posted some useful tricks.

Please post more cool stuff in the comments. Also suggestions for books I should read on my daily 3 hour train commute to OpenWorld will be nice.

Categories: DBA Blogs

Two Things Everyone Should Know About Queues

Wed, 2009-09-16 16:38

If you are in the performance business, you should know a lot about queues. How to use them to find performance problems, predict issues, plan your capacity, model your load test results, etc. Queues are just a part of what you should know and be comfortable discussing.

But what if you are not a performance professional? What if you are a sales person or a manager or a dentist? Do you still need to understand queues?

Obviously not everyone should know queues at a precise mathematical level. But queues are everywhere, and sometimes I wish people around me understood queues better. It’ll make it easier for me to explain things. There are two things I think everyone should know about queues:

  1. If it takes me one hour on average to handle a request, and I get one request every hour – most of the time requests will be delayed due to queueing and backlog. Running your DBAs (or servers, or doctors, or toll-booths) at full utilization with every minute accounted for means queueing and delays.
  2. If there are multiple servers (or DBAs or DMV clerks), the most efficient way to get service is to arrange all the requests in a single queue and have all servers accept requests from that queue. The way supermarkets do it – a different queue per cashier is inefficient. Deciding that you want all your requests to be handled by a specific DBA because she is better looking is also less efficient than entering the request in the general DBA queue.

Spread the word :)

Categories: DBA Blogs

I Can Has Training Budget

Fri, 2009-09-11 16:37

We know how it goes – there is a recession, and companies try to reduce expanses. The next thing you know, your training budget is all gone. Or maybe there is some training budget left, but now 6 DBAs share a sum that is not enough for one Oracle University course. How do you convince your managers that paying for your training is the best investment they can make?

Start by convincing yourself. Remember that your manager probably got to his position because he is good at reading people, so if you don’t really want the training, or don’t really believe you need this training, he may see that and you lost. You have to be 100% sure that you want this training because it will really allow you to improve the way you work.

As an example, lets assume you want to go to Linux Administration course. Its an interesting case, because it is not even evident that a DBA should go to such course.

Then think about your boss for a bit – what parts of the job are most important to him? what are his pet projects? pet peeves.

Once you have your desire for the course and your bosses desires in mind, make a list of all the benefits you can see from going to the course. The important thing is to highlight how the things you want to learn will help with the projects that are most important to your boss, or will address his specific pain points.

So, if your boss loves automation say: “I will learn more shell linux tools so I’ll be able to write better automation scripts”.
If he is a capacity planning person, say: “I will be able to better monitor the OS so we can be more proactive about provisioning”.
If he is a big fan of RAC, say: “With my improved Linux knowledge, I’ll be able to understand low-leve clusterware issues and solve them faster!”

Now you need to decide if you make your pitch face to face or by email. I prefer email. Information I put in the email:
* Course title and instructor (or school name)
* Dates/Times
* Location
* Price
* The list of 3-5 reasons I need this course (as you prepared in the previous paragraph).

Until he makes his decision, keep mentioned once or twice a day how the things you do now will be much better after you take the course: “I still don’t understand how to debug coredumps after the process crashes, but the Linux course may help”, “It takes me 2 hours to copy old files to the second disk, but I’ll probably learn how to do it faster in the Linux course”. Don’t force it, but keep an eye open for opportunities to explain and demonstrate the value you see in the course.

And a questionable tactics that sometimes works: Get an ultra-expensive course rejected before asking for a reasonably-priced course. “I can totally understand you don’t have the budget to send me to Collaborate in Denver, but what about one day training given by our local usergroup at a near-by location?”. I’m not sure if this tactic works because the manager feels guilty about rejecting my request, or if the lower-price seminar just looks better in comparison. I’m not even sure if I recommend it, really. Consider and act at your own risk ;)

Categories: DBA Blogs

OOW09 – Tradition Edition

Wed, 2009-09-09 16:35

This year will be my third time I’m attending Oracle Open World. When you do something every year for 3 years in a row, you develop few traditions around it.

Even though I know I always have an amazing time there, I’m always worried before. I remember the commute, and the fatigue and the boring marketing contents. Somehow the memories of great discussions in the OTN lounge with amazing people are less vivid. So being anxious before is definitely a tradition.

Some traditions do not continue – this year there seem to be no blogger meeting. I guess I’ll need to be a bit more proactive about meeting my online colleagues. Like, email everyone to check if they will attend OOW and ask if they want to date me :) You can also leave a comment here if you want to hang out together.

A tradition I hope not to continue is over-scheduling sessions. I looked for presenters I know, especially those I enjoyed in previous years. Some Streams and RAC 11gr2 sessions, to make sure I keep on top of my favorite technologies. I made it a habit to attend “Current Trends in Real World Performance” session – it is consistently the most enlightening session in OpenWorld. I’ll probably rewrite my schedule few times before the conference, and few times a day during the conference. Thats traditional too.

I’m excited to continue the tradition (started last year) of giving an Unconference session at OpenWorld. Last year was my first ever Oracle presentation – I gave a live demo of streams configuration and troubleshooting. It was wonderful. This year I feel like a veteran presenter – I gave 4 presentations at conferences in the last year. I am going to talk about graphical methods (under the sexy name – visualization). To be honest, I still don’t know what exactly I’ll talk about. I have lots of ideas – using charts to explore the data and solve problems, using charts to prove a point in reports and presentations, how not to lie or confuse when charting data. I plan for lots of examples. I’m looking forward to cooking all these ingredients into one delicious presentation.

I’m presenting on Monday, 4pm – looking forward to see you all there, because meeting amazing people is my favorite OOW tradition.

Categories: DBA Blogs

Real Life Block Corruption (Maybe)

Thu, 2009-09-03 19:28

What’s the worst thing that can happen to a database? I think most DBAs will agree that block corruption is a good candidate on the list. When DBAs debate the soundness of their backup policy, corrupted blocks are often used as test cases and rhetoric devices: “Keep just 3 days of backup? But what if a block is corrupted on Saturday and we don’t find out until Monday?”.

Until this week, I only knew about block corruptions from my certification studies and from recovery practices (using dd to corrupt blocks is a common gambit).

We had a block corruption this week. At least, we think we did – neither us, nor Oracle support are 100% certain. It was nothing like the text books described.

On Saturday, our DB crashed. The error in the alert log indicated a corrupted block. We restarted the DB, and…. did nothing. My manager sent me an email asking me to open a ticket to Oracle about this. I saw the email on Monday, failed to realize the importance of the problem (I suck!) and proceeded to work on other tasks.

On Tuesday the DB crashed again. This time it also sprouted endless Ora-600 [2662] error message once it started. We gave it another restart, this time it started fine. I did open the ticket to Oracle. Priority 1. We ran a bunch of verifications – RMAN validation, DBV, analyzing bunch of tables and indexes.

RMAN and DBV did not detect any issues. Full export completed successfully. No one is actually certain this is a block corruption. The only strangeness was an index that appeared in DBA_INDEXES but did not exist when we tried to run analyze. We asked our sys admins to check the machine, the OS and the connected storage.

On Wednesday the server crashed again. Again a corrupt block. Different file this time. Oracle supports found that one of the millions of ORA-600 and ORA-7445 errors we’ve seen could be related to a SQL parsing bug and suggested a patch.

We’ve had it. In an emergency 10 hour maintenance, we used export/import to move all the schemas to a new DB server.

We hope this is the end of the problem, but we can’t really tell. Which is exactly how real DBA life is so different from textbook descriptions and recovery practices.

Categories: DBA Blogs

Automatic Maintenance Tasks

Wed, 2009-08-26 19:48

Automatic Maintenance Tasks is a new 11g feature which I recently noticed. Its a bit embarrassing, since I’ve had 11g in production for nearly a year and apparently I’ve been using the feature all along.

I discovered the feature when I noticed that the automatic statistics gathering job is running several times on a weekend, instead of just once as it did in 10g. Then I discovered that the job has a very strange name starting with ORA$, and that the name changes every time the job runs.

Turns out that Oracle’s automatic jobs are not longer jobs. They are now Maintenance Tasks.

Here’s how Oracle defines the tasks:
“When a maintenance window opens, Oracle Database creates an Oracle Scheduler job for each maintenance task that is scheduled to run in that window. Each job is assigned a job name that is generated at runtime. All automated maintenance task job names begin with ORA$AT. For example, the job for the Automatic Segment Advisor might be called ORA$AT_SA_SPC_SY_26. When an automated maintenance task job finishes, it is deleted from the Oracle Scheduler job system. However, the job can still be found in the Scheduler job history.”

And here’s the reason my statistics ran several times on a weekend:
“In the case of a very long maintenance window, all automated maintenance tasks except Automatic SQL Tuning Advisor are restarted every four hours. This feature ensures that maintenance tasks are run regularly, regardless of window size”

What practically changed? Almost nothing, we had schedule windows in 10g, and the maintenance jobs (not tasks) ran within the defined windows. I’ve no clue why this change was necessary.

It definitely looks like infrastructure prepared for a future cool feature. At present, it just looks weird. For instance:

  • You can’t add tasks. Oracle has 3 predefined tasks – statistics, space advisor and tuning advisor. You can add or remove maintenance windows and define in which window to run each task, but you can’t add your own task.
  • There are lots of seemingly unnecessary definitions around. For example, from the dictionary tables, you can see there are task clients and task jobs. Currently it looks like they are the same thing, since there is a one-to-one relation between them, but it probably won’t stay this way.
  • The documentation doesn’t document much. There are fields such as client attributes with values that are not really explained anywhere.
  • The API is really weak. As I said, you can’t do much beyond enable/disable tasks in specific schedules

So far, it looks like this feature adds confusion but no value. I hope Oracle will do something fun with it in the future.

Categories: DBA Blogs

I Love NoCoug Training Days

Mon, 2009-08-24 20:08

Sometimes, life kind of loops on itself. A circle closes. You find yourself at the same spot you were two years ago, but from a completely different viewpoint.

Flashback two years and few month back: I’ve recently relocated to the US. I’m somewhat of an Oracle newbie, but I know that there are all kinds of cool DBA stuff going on, and I desperately want to be part of it. My colleagues tell me that HotSoS seminars are the best, so I asked my boss to send me to one. The request was left hanging in the air for weeks, just to be rejected. Because it was far away and I had to fly there and it was just too expensive. I was in tears.

And then I got this email from the local user group – NoCoug. They said they are doing a training day with Kyle Hailey. Kyle Hailey of Oak-Table fame was my hero at the time. I just finished reading Oracle Insights, and I was deeply impressed by his story of the program that could connect to the SGA directly through shared memory. I was all “Wow! Kyle Hailey! Only an hour drive away! And it costs just 250$! My boss will have to approve it! Hell, I’ll even pay for it myself!”.

And my boss did approve it. I went to the training day, and it was amazing. I learned more at that day than at the week long classes I took when I learned to be a DBA. What I learned then is still useful to me, almost every day on the job.

Obviously, I was deeply thankful to NoCoug for making it possible for me to attend this amazing event for a price my boss agreed to pay.

Back to present day: I am the training day coordinator for NoCoug. I want to create the same experience for every other DBA in our region. Top-notch training event at a price that won’t make your boss blink.

Everyone who knows me will laugh at the idea of me coordinating a training day. I usually can’t coordinate my own breakfast. But this is so important to me – every DBA should be able to be better at his job by learning from the best experts.

I also knew just who should lead my first training event – Tanel Poder is one of the best experts I know (Probably at the top 3 of my personal ranking), his blog and scripts and systematic troubleshooting ideas completely changed the way DBAs work. In a very good way.

And he agreed to give his famous “Advanced Oracle Troubleshooting” at North California, and we agreed on dates and prices, and I found a location. The impossible happened and I almost coordinated a training day.

Now I just need people to register so the event can really happen. I desperately want everyone to know about this event. I know it can improve the way people work so much its really a shame if someone will miss the opportunity. So even though I’m just a simple DBA and not a marketing expert, I’m going to do my best and annoy the hell out of everyone just to make sure that every single DBA in North California will know about this event. I even put a small funny looking ad in my blog.

If you can help me here by spreading the word to your NorCal friends – I’ll really appreciate the help.

P.S:
Advice on how to do non-annoying marketing for the event will also be appreciated. I know some of you have been promoting your own events for years. Please share your experience!

Categories: DBA Blogs

Preparing to Clone!

Mon, 2009-08-17 16:54

We are moving to this new data center. The new servers are in the new data centers, just waiting for Oracle to be installed on them. We have about 50 new servers to install.

Obviously we want to install them as quickly as possible, and taking as little human-work time as possible. To achieve this, we are checking to options – scripted silent installation and installation cloning.

I’m checking the installation cloning part, and I’m using Oracle’s Universal Installer User Guide for the process. One of the first steps in the process is:

At the source, you run a script called prepare_clone.pl.
This is a Perl script that prepares the source for cloning by recording the information that is required for cloning. This is generally located in the following location: $ORACLE_HOME/clone/bin/prepare_clone.pl.

When I tried to run the script, I found out that oracle user did not have execute permissions on the file. Then I found out that the script had the location of PERL hardcoded to the wrong location. Finally, I found out that the file had the usual amount of comments for an Oracle script, but only one line of code:
exit 0;

I found this incredibly amusing, so I decided to blog on this. While blogging, I took a closer look at the documentation, and found the following comment:

The need to perform the preparation phase depends on the Oracle product that you are installing. This script needs to be executed only for the Application Server Cloning. Database and CRS Oracle home Cloning does not need this.

I guess the joke is on me. Serves me right for not reading the instructions carefully.

Categories: DBA Blogs