Cary Millsap
How Slow Programs Are Like Christmas: a Sales Pitch
My company, Method R Corporation, makes systems faster. And we make people who make systems faster, faster. We train people to become performance optimization heroes.
Here is my story about why you should be interested in our training.
%%
Slow programs remind me of Christmas when I was a kid. In early December, my parents would put a present for me under our Christmas tree. It would be wrapped up so I couldn't see what was inside it. The unwrapping would not happen until Christmas morning, December 25. (That was the best-case scenario. If my Dad couldn't be home because of work, then Christmas morning would come a day or two late.)
So, every day, for nearly a month, I would see that present under the tree and wonder what was in it.
I'd take any clue I could find. What shape is it? What does it weigh? What does it sound like when I shake it? (Sometimes, my Mom and Dad would prohibit me from shaking it.) No matter how desperate the curiosity, all I could do was guess.
When Christmas came, I'd finally get to tear the paper off, and I would now see, plain as day, what had been in that box the whole time. All the clues and possibilities would collapse into a single reality. Finally, there was no more mystery, no more guessing.
Slow programs are like that. The clues aren't enough. You guess a lot. But with slow programs, there's no specially designated morning when your programs reveal their mysteries to you. They just keep irritating you, with no end in sight. You need somebody who knows how to tear the wrapping paper off those programs so you can see what they're doing wrong.
That's the somebody I like being. The role scares a lot of people, but it doesn't scare me. That's because I trust my preparation. I know that I have three particular assets that tilt the game in my favor.
Those assets are knowledge, tools, and community. With the knowledge and tools I have, I don't get stumped very often. But when I do, I have a network of friends who'll help me out. My friends and I can solve just about anything. These three assets are huge for both my effectiveness and my confidence.
These three assets aren't just for me. They're for you, too.
That's the aim of my online course called "Mastering Oracle Trace Data." In this course, I've bundled everything you need to claim those three assets for your own:
- You'll learn the details about Oracle traces and the stories they're trying to tell you. This is your knowledge asset.
- You'll have, for the duration of your choosing, full-feature access to Method R Workbench, the most comprehensive software in the world for mining, managing, and manipulating Oracle traces. This is your tools asset.
- You'll have access to our Slack channel, a global community of Oracle trace enthusiasts that can help you whenever you get stumped. You won't be alone. You'll have people who are there for you. This is your community asset.
If you're interested in becoming a more effective and confident optimizer, you can get started now. Just visit our course page for details.
%%
That's my story. I hope you'll contact me at method-r.com if you're interested.
And if you like stories like this, you'll find a lot more in my How To Make Things Faster book, available wherever books are sold.
A Design Decision
This week, my team at Method R devoted some time to an enhancement request that required an interesting design decision. This post is about the analysis behind that decision.
The enhancement request was for our flagship product called Method R Workbench. It's an application that people use to mine, manage, and manipulate Oracle trace files.
One of its features, called mrskew, is a tool that allows a Workbench user to filter, group, and sort the raw dbcall and syscall data that Oracle Database processes write to their trace files. You can use mrskew from within the Workbench application, or from a *nix command line.
Here's an example of a mrskew command. It's what you would use to find out how long your program spent reading Oracle blocks from secondary storage. It will show you which blocks it read, how long they took, and how many times each block was read:
mrskew --name='db.*read' \
  --group='sprintf("%8d %8d", $p1, $p2)' \
  x_ora_1492.trc
Here's the output:
sprintf("%8d %8d", $p1, $p2) DURATION % CALLS MEAN ... ---------------------------- -------- ------ ----- -------- 2 2 0.072918 1.0% 26 0.002805 ... 33 698186 0.051940 0.7% 1 0.051940 ... 50 339841 0.049261 0.7% 1 0.049261 ... ...
The important thing in this report is the meaning of the $p1 and $p2 variables. The combination of these two variables happens to represent the data block address (the file number and block number) of an Oracle block that was read by some kind of an Oracle read call. It would be nice for the report to tell you that instead of just telling you that the first two columns of numbers are the output of an sprintf function call.
We have a command-line option for that. The ‑‑group-label option lets you assign your own title for the group column. So, with some careful character counting, you could use ‑‑group-label='    FILE    BLOCK' to get exactly the heading you want:
    FILE    BLOCK DURATION      % CALLS     MEAN ...
----------------- -------- ------ ----- --------
       2        2 0.072918   1.0%    26 0.002805 ...
      33   698186 0.051940   0.7%     1 0.051940 ...
      50   339841 0.049261   0.7%     1 0.049261 ...
...
That makes sense. Now it's easy to see that Oracle has read one block (file #2, block #2) 26 times, consuming a total of 0.072918 seconds reading it.
The group label fits the output only because of the careful character counting. The enhancement request was to allow the ‑‑group-label option to take an expression, not just a string. Like this:
--group-label='sprintf("%8s %8s", "FILE", "BLOCK")'
That way, the requester could print out the header he wanted, perfectly aligned, by just syncing his ‑‑group‑label expression to his ‑‑group expression, without having to count space characters that are literally invisible.
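In other words, the requested behavior would let you write the whole report, label and all, something like this (a sketch that simply combines the two examples above; it assumes the expression gets evaluated before the label is printed):
mrskew --name='db.*read' \
  --group='sprintf("%8d %8d", $p1, $p2)' \
  --group-label='sprintf("%8s %8s", "FILE", "BLOCK")' \
  x_ora_1492.trc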
It's a smart idea. The group label option should have been designed that way from the beginning. We eagerly approved the enhancement request and began thinking about the design.
When we thought it through, we ended up with two different ideas about how we could implement it:
- Redefine ‑‑group‑label to take an expression instead of a string. mrskew will calculate the value of the expression before printing the column label.
- Create a new option, say, ‑‑new‑group‑label, that takes an expression as its argument. And leave ‑‑group‑label as it is.
The first idea is how the enhancement request was worded. The second idea entered our minds because the first idea creates a compatibility problem: if we change the spec of the existing ‑‑group‑label option, it will break some existing mrskew scripts. For example, these will work in Workbench 9.2.5:
--group-label=FILE
--group-label="FILE BLOCK"
But if we redefine ‑‑group‑label to take an expression instead of a string, then these won't work anymore. People will need to quote their string expressions like this:
--group-label='"FILE"'
--group-label='"FILE BLOCK"'
In the end, we decided to redefine the existing option and live with the compatibility breach.
The way we make decisions like this is that we create strenuous arguments for each idea. Here are some of the arguments we considered en route to our decision.
First, the customer experience (cognitive expenditure).
Everyone who participated in the debate had the customer experience foremost in mind. But how can we objectively measure "customer experience"? How do you structure a scientific debate about the superiority of one experience over another?
One way to do it is to measure cognitive expenditure—the amount of mental effort that a user has to invest to get the desired outcome from our software. We want to minimize cognitive expenditure, to maximize a customer's return on investment of effort.
We began by realizing that responding to this enhancement request with one of our two ideas would necessarily force the user into one of two new regimes:
- The syntax of ‑‑group-label has changed.
- There's a new ‑‑new-group-label option.
In regime 1, our users would have to learn the new syntax. That's a cognitive expenditure. But it's a one-time expenditure, which is good. The new syntax would be consistent with the existing ‑‑group syntax, which is actually a cognitive savings for our users over what we have now. However, if a customer had saved any scripts that used the old syntax, then the customer would have to convert those scripts. That's a cognitive expenditure in a loop (one for each script), which is bad.
In regime 2, our users would have to learn about ‑‑new-group‑label, which is a cognitive expenditure. They'd still have to remember (or relearn) ‑‑group‑label, too, which is a cognitive expenditure similar to the one in regime 1. They wouldn't have to modify any old scripts, but they would have to choose between ‑‑group‑label and ‑‑new-group‑label every time they wrote a script in the future. That's another cognitive expenditure in a loop (one for each script), which is bad.
Second, the developer experience (technical debt).
We also need to consider the developer's experience. We don't want to create code that increases technical debt and makes the product unnecessarily difficult to support.
If we redefine ‑‑group-label, there's no long-term effect to worry about. But if we add ‑‑new‑group‑label to the story, I would expect people to wonder, why are there two such similar group label options, when one (the one that takes an expression) is clearly superior? And why does the inferior one have the better name?
At some point in the future, I envision wanting to clean up the cruft and have just the one group label feature. Naturally, the right name for it would be ‑‑group‑label. But of course, changing the spec that way would introduce a compatibility problem. To make things worse, this would occur in the future when—one would hope, if our business is growing—such a decision would impact even more customers than it would today. So then, why create the cruft in the first place? It'll be a worse problem later than it is now.
The question that really seals the deal is, who will the change actually affect? It's a probability question about customer experiences.
Most users who use the Workbench application will never experience our group label option directly. It's there for everybody to use, but our Workbench has so many predefined reports built into it that most users never need to touch the group label option themselves. When they do need to modify it, they're usually tweaking a report that we've predefined for them, which is a low–cognitive-expenditure experience.
In the end, Method R bears almost the entire cost of the ‑‑group‑label redefinition. It required us to revise:
- The code for mrskew ‑‑group‑label
- The prepackaged actions in Method R Workbench
- The mrskew .rc files
- The mrskew manual page
- The Mastering Oracle Trace Data book
- The Mastering Oracle Trace Data course
Most users will experience the benefit of the ‑‑group‑label change, without ever knowing that, once upon a time, it changed. And that's the way we want it. We want the product to be as smart as possible so that our customers get the most out of their investment, both cognitive and financial.
Catchphrase
It finally happened.
Several years ago, I asked my friend Mauro Pagano to help me spread a new catchphrase. I had made up this little saying. It was weird enough that I knew, if I heard it in the wild somewhere, then it had to have come from me. It would be a fun little experiment.
The only problem with this new catchphrase was the "catch" part. It was a perfectly good "phrase," but we were having a hard time coming up with situations where we could use it. We agreed we'd keep on the lookout.
Skip forward several years.
Today, my colleague Jeff Holt and I had lunch at Weinberger's Deli, as we do at least every week or two. We got there just a few minutes after they opened, but people were already queued out the door. That's ok, the line would work its way down shortly.
As we advanced nearer the counter, I found myself increasingly distracted by The Table Situation. No matter how full Weinberger's is, we're almost always able to find a table. But today, since everyone ahead of us had just sat down, it was easy to imagine that no tables would be opening up anytime soon.
This wasn't just idle worry. It was required prep for the dreaded question, "For here? Or to go?" It was already 96°F outside, on its way to 106°F, so we'd prefer not to sit out there. But we would prefer even more not to take our lunch back to the office in bags. We took a risk, mitigated by the two open tables outside. We answered, "For here."
While I was paying for my order, a guy finished his lunch, leaving an open two-top. Victory. We would eat inside, comfortably, at a table.
I shouldn't be surprised. I almost always get a table. The whole thing—worry, worry, worry, get a table—it happens nearly every time. But somehow, I'm almost always surprised.
But today, in a burst of excitement, a little program that's been running in the background of my mind for years popped its confetti cannon: THIS is the perfect situation for my catchphrase!
So I used it. Here it is:
It's like mouse balls. You think it can't possibly work, but it almost always does.
A Side Trip to Wonderhell
I should be jubilant. But I'm not.
I'm habitually non-jubilant. I don't know why.
This weekend, I returned from the kind of trip that I don't take very often: an all-by-myself vacation. This year, I attended the IPMS/USA National Convention 2023 in sunny San Marcos, Texas. IPMS is the International Plastic Modelers' Society. The convention is three things: a contest, a meetup, and a vendor showcase. This is the first time I've ever attended.
The contest is a big deal. Some people spend years preparing for it. And I won two gold medals for a model (a "replica") that took me seven years to make. It was awarded 1st place in its category and then it was named the best of its class: one of only eight such awards among a field of over 3,000 entries.
I should be jubilant. But I'm not.
There are modelers out there who have worked hard over the past who-knows-how-long, suffering now that their work didn't result in the level of appreciation they had hoped for. They probably have a right to be offended that I'm not jubilant.
One of the reasons I'm not jubilant is that I promised myself before the show that I wouldn't get too emotionally wrapped up in how my entry was judged. There are just too many things that could conspire—rightfully or not—to deny my little model a prize. So, before the show, I preloaded my mind with the following sentence:
When you put your ego in other people's hands, you're asking for trouble.
My wife and I have similar conversations with our kids. As she says, sometimes we need to ride the monorail, not the roller coaster. For example, when you believe you're awesome because a local sportswriter says that you are, then it's also tempting to believe that you're trash when that same writer doesn't include you in his all-area team. If you get too high or too low because of how someone else defines you, you're almost guaranteed to crash eventually.
That sentence of mine sustained me for the first three days of the show. The 3,000+ models on display in the contest room were absolutely overwhelming. There's no way I could walk in that room and feel like a legitimate competitor, no matter how hard I knew I had worked on my model. It would have been ludicrous to think I was going to win something. Even when I did, it felt like it couldn't be real.
Now that I'm home, I'm in this weird predicament of having summited an audacious, multi-year goal, certainly a good feeling, but simultaneously there's this looming feeling of "now what?" Sure, I won this huge award three days ago, but what have I done lately?!
It makes me feel like there's something wrong with me. But I know I'm not the only one. What I'm feeling happens so often, it even has a name: Laura Gassner Otting calls it Wonderhell.
At the end of all this, the feeling I'm trying to savor and cultivate is gratitude. Thankful for the awards, of course, but also thankful for what I really was hoping for when I signed up for the show. What really drove me while I was building my model was the hope that people might look at my model, enjoy it, and want to talk about how I built it.
That conversation happens every month in my club (my model couldn't have been what it was without my club). And it happened enough times in San Marcos to make it fully worth submitting myself to the harrowing competition part of the convention.
Maybe I'll do it again someday, I'm not sure. But for now, my monthly club meetings will be enough. Right now, I need to focus on improving my business so much that someday soon I can blog about being oddly not jubilant about that.
A Better Way to Think about %
A lot of people get confused by the "%" symbol. I can understand why. Even https://en.wikipedia.org/wiki/Percentage seems way more confusing than it should be.
Well, maybe I can help.
Here's a simplifying little idea that I learned by reading ISO 80000-1:
The percent symbol (%) is just a constant, just like π or e. Its value is 0.01 (or 1/100, if you prefer).
Let me show it to you in a table. Maybe that'll clear it up:
Symbol   Value
π        ≈ 3.14159
e        ≈ 2.71828
%        = 0.01
What this means is that anywhere you see the "%" symbol, you're free to substitute the value 0.01 if you want.
So, how does that help? Well, it gives you a simple rule you can apply instead of having to intuit how to convert something to or from a percentage.
For example, I used to find myself wondering, "If I want to convert this percentage to a real number, do I multiply by 100? Or divide?" I hate memorizing crap like that.
But knowing that % = 0.01 makes it easy. For example, converting 42% to a number without the % sign, I simply substitute, like this:
42% = 42(0.01) = 0.42
When you know that % = 0.01, it's easy to see that 100% is just another way of expressing the number 1:
100% = 100(0.01) = 1
Converting a number to a percentage is easy, too.
I can of course multiply anything I want by 100% and still have the same quantity I started with. Here's how to convert 0.0005 to a percentage:
0.0005 = 0.0005 × 1
= 0.0005 × 100%
= (0.0005 × 100)%
= 0.05%
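Put generally, the whole trick is just two substitutions of the same constant (here, x and y stand for any numbers):
x% = x(0.01)
y = y × 100% = (y × 100)%
That's all the "rule" there is: replace % with 0.01 to drop the sign, and multiply by 100% (which is just 1) to add it.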
Yep, ISO 80000-1... I don't do everything it says, but this percentage thing was a nice revelation.
Version 1 Is Never the Answer
I create for a living. I make presentations and software and books... Even my hobbies are creation hobbies. I make tools and furnishings and, these days, “museum quality aviation replicas.”
Many of the things I’ve made have worked out really well, but one result I can count on: my version 1 of pretty much anything is rarely going to be a keeper. My good work is almost always a v2, or v3, or beyond. When I succeed, it’s usually more a persistence thing than a genius thing.
For the past couple of weeks, I’ve been working on a 20-minute session I’ll present at P99Conf. Creating a presentation, for me, is a mostly private experience. I simulate in my mind what an audience might like to see. When I either think I’m finished or feel stuck, the next step in my process is to find an audience and go through the material. In other words, I use the material in front of other people.
Today, because I felt stuck, I showed my P99Conf material to the audience on my Tuesday Zoom call.
Here’s what I learned. The slides I had made were awful. Not nearly relevant enough for my audience. I was already off the rails by the fourth slide. The discussion, though, helped put me on the right track. The group was very helpful, identifying a couple of slides that were actually pretty good, and suggesting fixes for the worst parts. I went into the call stuck. I came out inspired, ready to get back to work.
The thing that I suspect might surprise you, that did not surprise me, is how this process felt. I did not feel insulted or denigrated or “vulnerable.” The feelings I did have, because of the call, were these:
- Gratitude – I was appreciative that the group would lend me their time
- Motivation – I was inspired now, and interested; instead of stuck
- Shame – I was a tiny bit embarrassed that my v1 hadn’t been better
- Confidence – I knew my work was going to be better because of the review
I did not feel surprised, because I don’t expect any v1 to be a production release. Most of my v1’s I expect to throw away. A v1 is simply part of the process; you can’t have v2 without v1. More precisely, and this is the critical realization in the whole process:
You can’t have v2 without feedback, and you can’t have feedback without v1.
My job, then, is not to try to create a v1 that is production worthy. It is to create a v1 that is feedback-worthy. And then go get the feedback. That feedback loop is essential. It’s not a step that I aspire to ever “optimize away.” It is a vital part of the process that I embrace.
If you’re interested in this story, then you might be interested in a paper I wrote in 2011, called “My Case for Agile.”
Why the Oak Table Was So Great
This weekend, I watched a wonderful TEDx video by Barbara Sher, called “Isolation Is the Dream-Killer, Not Your Attitude.” Please watch this video. It’s 21 minutes, 18 seconds long.
It reminded me about what was so great about the Oak Table. That’s right: was. It’s not anymore.
Here’s what it was. People I admired, trusted, and liked would gather at a home in Denmark owned by a man named Mogens Nørgaard. Mogens is the kindest and most generous host I have ever encountered. He would give his whole home—every inch—to keep as many of us as he could, for a week, once or twice a year. Twenty, maybe thirty of us. We ate, drank, and slept, all for free, as much and for as long as we wanted.
And the “us” in that sentence was no normal, regular, everyday “us.” It was Tom Kyte, Lex de Haan, Anjo Kolk, Jonathan Lewis, Graham Wood, Tanel Põder, Toon Koppelaars, Chris Antognini, Steve Adams, Stephan Haisley, James Morle, John Beresniewicz, Jože Senegačnik, Bryn Llewellyn, Tuomas Pystynen, Andy Zitelli, Johannes Djernæs, Michael Möller, Dan Norris, Carel-Jan Engel, Pete Sharman, Tim Gorman, Kellyn Pot'Vin, Alex Gorbachev, Frits Hoogland, Karen Morton, Robyn Sands, Greg Rahn, and—my goodness—I’m leaving out even more people than I’m listing.
We spent a huge amount of our time sitting together at Mogens’s big oak table, which was big enough for about eight people. Or, in actuality, about twice that. We’d just work. And talk. If there wasn’t a meal on the table, then it would be filled with laptops and power cords covering every square inch. Oops, I mean millimeter. That table had millimeters.
And here’s what was so great about the Oak Table: you could say what you wanted—whatever it was!—and you could have it. You could just say your dream and your obstacle, and someone around the table would know how to make your dream come true.
It’s tricky even trying to remember good examples of people’s dreams, because I’m so far removed from it now. Some of them were nerdy things like, “I wonder how long an Oracle PARSE call would take if we did a 256-table join?” You’d hear, “Hmm, interesting. I think I have a test for that,” and then the next thing you know, Jonathan Lewis would be working on your problem. Or, “Hey, does anyone know how to do such-and-such in vim?” And Johannes Djernæs or Michael Möller would show you how easy it was.
I got into a career-saving conversation late one night with Robyn Sands. She had asked, “Is anybody else having trouble finding good PL/SQL developers? I can’t figure out where they are, if there even are any. Are there?” We talked for a while about why they were so scarce, and then I connected the dots that, hey, I have two superb PL/SQL developers at home on the bench, and I had been desperately trying to find them good work. The story that Robyn and I started over beers at about 3:00am resulted in a superb consumer femtocell device for Robyn and a year’s worth of much-needed revenue for my tiny little team.
It was a world where you could have anything you want. Better yet, it was a world where you could dream properly. Today, in isolation, it’s hard to even dream right. After nearly two years of being locked away, I can barely conceive of a world that’s plentiful and joyous like those Oak Table years. I feel much smaller now. (Oh, and it wasn’t COVID-19 that killed that Oak Table experience. It died years before that—but, obviously, it’s a factor today.)
I want it back. I want my friends back. How are we going to do this?
The New Book Is Ready: “Faster: How to Optimize a System”
My new book called Faster: How to Optimize a System is ready! You can buy the PDF now at method-r.com. The hardcover format should be available at Amazon sometime around mid-January.
The physical book will be about 250 pages long, in a normal 6 × 9-inch business book format. There are 109 chapters, each of which conveys either a story or a lesson. Most of the lessons are only a couple pages long. A few of the stories are longer than that, but the story chapters are little adventures that read quickly.
So far, the people I’ve shown it to seem to like it. I can’t wait for you to see it.
- Faster PDF
- Faster hardcover (coming January 2022)
How I Spent My COVID Vacation
I haven’t blogged in, what, ...forever. A couple of years.
I have been writing, though. Just not here. A lot, actually. Here’s some of the stuff I’ve done during my lapse here:
- 2020-01-28 – “Solving the unsolvable performance problem” (2 pages)
- 2020-02-14 – “Method R Workbench: the pesky, intermittent performance problem” (video 4:57)
- 2020-03-11 – “Preventing the post-production performance problem” (2 pages)
- 2020-04-23 – “Better testing, better risk reduction” (2 pages)
- 2020-05-07 – “Some things you probably didn’t know about tracing (Dallas edition)” (video 1:03:27)
- 2020-06-10 – “Death to the health check… Long live the health check” (2 pages)
- 2020-07-01 – Method R Workbench 9.0, a huge new release of my company’s flagship software system. Since 2001, this software has grown from a tkprof replacement to a system for mining and managing 10,000s of trace files at once.
- 2020-07-01 – “Method R Workbench 9: a whole new way to see Oracle performance” (video 5:36)
- 2020-08-26 – “Some things you probably didn’t know about tracing (Chicago edition)” (video 1:07:34)
- 2021-01-19 – CMG IMPACT 2021 “Three tricky performance problems solved with Oracle trace data” (ad video 6:12)
- 2021-08-09 – Method R Trace 21.2, a re-imagining of our trace file collector extension for Oracle SQL Developer.
- 2021-08-11 – “Method R Workbench video tips” (8 videos, each 2:21 or less)
Sometimes the Simplest Things Are the Most Important
I can remember telling her during dinner, wow, just look at us. Sitting in an aluminum tube going 500 miles per hour, 40,000 feet off the ground. It’s 50° below zero out there. Thousands of gallons of kerosene are burning in huge cans bolted to our wings, making it all go. Yet here we sit in complete comfort, enjoying a glass of wine and a steak dinner. And just three feet away from us in that lavatory right there, a grown man is evacuating his bowels.
I said, you know, out of all the inventions that have brought us to where we are here today, the very most important one is probably that wall.
Words I Don’t Use, Part 5: “Wait”
The Oracle Wait Interface
In 1991, Oracle Corporation released some of the most important software instrumentation of all time: the wait statistics that were implemented in Oracle 7.0. Here’s part of the story, in Juan Loaiza’s words, as told in Nørgaard et al. (2004), Oracle Insights: Tales of the Oak Table.
This stuff was developed because we were running a benchmark that we could not get to perform. We had spent several weeks trying to figure out what was happening with no success. The symptoms were clear—the system was mostly idle—we just couldn’t figure out why.
We looked at the statistics and ratios and kept coming up with theories, the trouble was that none of them were right. So we wasted weeks tuning and fixing things that were not the problem. Finally we ran out of ideas and were forced to go back and instrument the code to figure out what the problem was.
Once the waits were instrumented the problem was diagnosed in minutes. We were having “free buffer” waits because the DBWR was not writing blocks fast enough. It’s amazing how hard that was to figure out with statistics, and how easy it was to figure out once the waits were instrumented.
...In retrospect a lot of the names could be greatly improved. The wait interface was added after the freeze date as a “stealth” project so it did not get as well thought through as it should have. Like I said, we were just trying to solve a problem in the course of a benchmark. The trouble is that so many people use this stuff now that if you change the names it will break all sorts of tools, so we have to leave them alone.
Before Juan’s team added this code, the Oracle kernel would show you only how much time its user calls (like parse, exec, and fetch) were taking. The new instrumentation, which included a set of new fixed views like v$session_wait and new WAIT lines in our trace files, showed how much time Oracle’s system calls (like reads, writes, and semops) were taking.
The Working-Waiting Model
The wait interface begat a whole new mental model about Oracle performance, based on the principle of working versus waiting:
Response Time = Service Time + Wait Time
In this formula, Oracle defines service time as the duration of the CPU used by your Oracle session (the duration Oracle spent working), and wait time as the sum of the durations of your Oracle wait events (the duration that Oracle spent waiting). Of course, response time in this formula means the duration spent inside the Oracle Database kernel.
Why I Don’t Say Wait, Part 1
There are two reasons I don’t use the word wait. The first is simply that it’s ambiguous.
The Oracle formula is okay for talking about database time, but the scope of my attention is almost never just Oracle’s response time—I’m interested in the business’s response time. And when you think about the whole stack (which, of course, you do; see holistic), there are events we could call wait events all the way up and down:
- The customer waits for an answer from a user.
- The user waits for a screen from the browser.
- The browser waits for an HTML page from the application server.
- The application server waits for a database call from the Oracle kernel.
- The Oracle kernel waits for a system call from the operating system.
- The operating system’s I/O request waits to clear the device’s queue before receiving service.
- ...
Why I Don’t Say Wait, Part 2
There is a deeper problem with wait than just ambiguity, though. The word wait invites a mental model that actually obscures your thinking about performance.
Here’s the problem: waiting sounds like something you’d want to avoid, and working sounds like something you’d want more of. Your program is waiting?! Unacceptable. You want it to be working. The connotations of the words working and waiting are unavoidable. It sounds like, if a program is waiting a lot, then you need to fix it; but if it’s working a lot, then it is probably okay. Right?
Actually, no.
The connotations “work is virtuous” and “waits are abhorrent” are false connotations in Oracle. One is not inherently better or worse than the other. Working and waiting are not accurate value judgments about Oracle software. On the contrary, they’re not even meaningful; they’re just arbitrary labels. We could just as well have been taught to say that an Oracle program is “working on disk I/O” and “waiting to finish its CPU instructions.”
The terms working and waiting really just refer to different subroutine call types:
“Oracle is working”  means  “your Oracle kernel process is executing a user call”
“Oracle is waiting”  means  “your Oracle kernel process is executing a system call”
The working-waiting model implies a distinction that does not exist, because these two call types have equal footing. One is no worse than the other, except by virtue of how much time it consumes. It doesn’t matter whether a program is working or waiting; it only matters how long it takes.
Working-Waiting Is a Flawed Analogy
The working-waiting paradigm is a flawed analogy. I’ll illustrate. Imagine two programs that consume 100 seconds apiece when you run them:
Program A                          Program B
Duration  Call type                Duration  Call type
      98  system calls (waiting)         98  user calls (working)
       2  user calls (working)            2  system calls (waiting)
     100  Total                          100  Total
To improve program A, you should seek to eliminate unnecessary system calls, because that’s where most of A’s time has gone. To improve B, you should seek to eliminate unnecessary user calls, because that’s where most of B’s time has gone. That’s it. Your diagnostic priority shouldn’t be based on your calls’ names; it should be based solely on your calls’ contributions to total duration. Specifically, conclusions like, “Program B is okay because it doesn’t spend much time waiting,” are false.
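To make the arithmetic concrete, here is a tiny sketch. Suppose you reduce a program’s run time to a little two-column text file (the name profile.txt and its layout, seconds followed by a call-type label, are made up for illustration). Ranking the lines by their share of the total duration ignores the labels entirely, which is exactly the point:
awk 'NR == FNR { total += $1; next } { printf "%6.1f%%  %s\n", 100 * $1 / total, $0 }' profile.txt profile.txt | sort -rn
For program A, that prints 98.0% beside the system calls and 2.0% beside the user calls; the words working and waiting never enter into the prioritization.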
A Better Model
I find that discarding the working-waiting model helps people optimize better. Here’s how you can do it. First, understand the substitute phrasing: working means executing a user call, and waiting means executing a system call. Second, understand that the excellent ideas people use to optimize other software are excellent ideas for optimizing Oracle, too.
Oracle’s wait interface is vital because it helps us measure an Oracle program’s complete execution duration—not just Oracle’s user calls, but its system calls as well. But I avoid saying wait to help people steer clear of the incorrect bias introduced by the working-waiting analogy.
Words I Don’t Use, Part 4: “Expert”
When I was a young boy, my dad would sometimes drive me to school. It was 17 miles of country roads and two-lane highways, so it gave us time to talk.
At least once a year, and always on the first day of school, he would tell me, “Son, there are two answers to every test question. There’s the correct answer, and there’s the answer that the teacher expects. ...They’re not always the same.”
He would continue, “And I expect you to know them both.”
He wanted me to make perfect grades, but he expected me to understand my responsibility to know the difference between authority and truth. My dad thus taught me from a young age to be skeptical of experts.
The word expert always warns me of a potentially dangerous type of thinking. The word is used to confer authority upon the person it describes. But it’s ideas that are right or wrong; not people. You should evaluate an idea on its own merit, not on the merits of the person who conveys it. For every expert, there is an equal and opposite expert; but for every fact, there is not necessarily an equal and opposite fact.
A big problem with expert is corruption—when self-congratulators hijack the label to confer authority upon themselves. But of course, misusing the word erodes the word. After too much abuse within a community, expert makes sense only with finger quotes. It becomes a word that critical thinkers use only ironically, to describe people they want to avoid.
Words I Don’t Use, Part 3: “Best Practice”
The “best practice” serves a vital need in any industry. It is the answer to, “Please don’t make me learn about this; just tell me what to do.” The “best practice” is a fine idea in spirit, but here’s the thing: many practices labeled “best” don’t deserve the adjective. They’re often containers for bad advice.
The most common problem with “best practices” is that they’re not parameterized like they should be. A good practice usually depends on something: if this is true, then do that; otherwise, do this other thing. But most “best practices” don’t come with conditions of execution—they often contain no if statements at all. They come disguised as recipes that can save you time, but they often encourage you to skip past thinking about things that you really ought to be thinking about.
Most of my objections to “best practices” go away when the practices being prescribed are actually good. But the ones I see are often not, like the old SQL “avoid full-table scans” advice. Enforcing practices like this yields applications that don’t run as well as they should and developers that don’t learn the things they should. Practices like “Measure the efficiency of your SQL at every phase of the software life cycle,” are actually “best”-worthy, but alas, they’re less popular because they sound like real work.
Words I Don’t Use, Part 2: “Holistic”
When people use the word “holistic” in my industry (Oracle), it means that they’re paying attention to not just an individual subcomponent of a system, but to a whole system, including (I hope) even the people it serves.
But trying to differentiate technology services by saying “we take a holistic view of your system” is about like differentiating myself by saying I’ll wear clothes to work. Saying “holistic” would make it look like I’ve only just recently become aware that optimizing a system’s individual subsystems is not a reliable way to optimize the system itself. This should not be a distinctive revelation.
Words I Don’t Use, Part 1: “Methodology”
The first word I’ll discuss is methodology. Yes. I made a shirt about it.
Approximately 100% of the time in the [mostly non-scientific] Oracle world that I live in, when people say “methodology,” they’re using it in the form that American Heritage describes as a pretentious substitute for “method.” But methodology is not the same as method. Methods are processes. Sequences of steps. Methodology is the scientific study of methods.
I like this article called “Method versus Methodology” by Peter Klein, which cites the same American Heritage Dictionary passage that I quoted on page 358 of Optimizing Oracle Performance.
Messed-Up App of the Day: Tables of Numbers
Quick: which database is consuming the most space? It’s harder than it looks.
Database                  Total Size   Total Storage
-------------------- --------------- ---------------
SAD99PS                    635.53 GB         1.24 TB
ANGLL                        9.15 TB         18.3 TB
FRI_W1                       2.14 TB         4.29 TB
DEMO                         6.62 TB        13.24 TB
H111D16                      7.81 TB        15.63 TB
HAANT                         1.1 TB          2.2 TB
FSU                          7.41 TB        14.81 TB
BYNANK                       2.69 TB         5.38 TB
HDMI7                      237.68 GB       476.12 GB
SXXZPP                     598.49 GB         1.17 TB
TPAA                         1.71 TB         3.43 TB
MAISTERS                   823.96 GB         1.61 TB
p17gv_data01.dbf            800.0 GB         1.56 TB
Did you come up with ANGLL? If you didn’t, then you should look again. If you did, then what steps did you have to execute to find the answer?
I’m guessing you did something like I did:
- Skim the entire list. Notice that HDMI7 has a really big value in the third column.
- Read the column headings. Parse the difference in meaning between “size” and “storage.” Realize that the “storage” column is where the answer to a question about space consumption will lie.
- Skim the “Total Storage” column again and notice that the wide “476.12” number I found previously has a GB label beside it, while all the other labels are TB.
- Skim the table again to make sure there’s no PB in there.
- Do a little arithmetic in my head to realize that a TB is 1000× bigger than a GB, so 476.12 is probably not the biggest number after all, in spite of how big it looked.
- Re-skim the “Total Storage” column looking for big TB numbers.
- The biggest-looking TB number is 15.63 on the H111D16 row.
- Notice the trap on the ANGLL row: only three significant digits show in the “18.3” figure, which looks physically the same size as the figures “1.24” and “4.29” directly above and below it; but realize that 18.3 (which should have been rendered “18.30”) is an order of magnitude larger.
- Skim the column again to make sure I’m not missing another such number.
- The answer is ANGLL.
Rendering the table differently makes your readers’ (plural!) job much easier:
Database         Size (TB) Storage (TB)
---------------- --------- ------------
SAD99PS                .64         1.24
ANGLL                 9.15        18.30
FRI_W1                2.14         4.29
DEMO                  6.62        13.24
H111D16               7.81        15.63
HAANT                 1.10         2.20
FSU                   7.41        14.81
BYNANK                2.69         5.38
HDMI7                  .24          .48
SXXZPP                 .60         1.17
TPAA                  1.71         3.43
MAISTERS               .82         1.61
p17gv_data01.dbf       .80         1.56
This table obeys an important design principle:
The amount of ink it takes to render each number is proportional to its relative magnitude.
I fixed two problems: (i) now all the units are consistent (I have guaranteed this feature by adding a unit label to the header and deleting all labels from the rows); and (ii) I’m showing the same number of significant digits for each number. Now, you don’t have to do arithmetic in your head, and now you can see more easily that the answer is ANGLL, at 18.30 TB.
Let’s go one step further and finish the deal. If you really want to make it as easy as possible for readers to understand your space consumption problem, then you should sort the data, too:
Database         Size (TB) Storage (TB)
---------------- --------- ------------
ANGLL                 9.15        18.30
H111D16               7.81        15.63
FSU                   7.41        14.81
DEMO                  6.62        13.24
BYNANK                2.69         5.38
FRI_W1                2.14         4.29
TPAA                  1.71         3.43
HAANT                 1.10         2.20
MAISTERS               .82         1.61
p17gv_data01.dbf       .80         1.56
SAD99PS                .64         1.24
SXXZPP                 .60         1.17
HDMI7                  .24          .48
Now, your answer comes at a glance. Think back to the comprehension steps that I described above. With the table here, you only need to:
- Notice that the table is sorted in descending numerical order.
- Comprehend the column headings.
- The answer is ANGLL.
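By the way, none of this cleanup has to happen by hand. If a report like the first table reaches you as plain text, a one-liner can normalize the units and sort the rows before you share it. Here’s a sketch that handles just the storage column (the file name storage.txt and its whitespace-separated layout of name, size, size unit, storage, and storage unit are assumptions for illustration):
awk '{ tb = ($5 == "GB") ? $4 / 1000 : $4; printf "%-16s %12.2f\n", $1, tb }' storage.txt | sort -k2 -rn
That gets you consistent units, a consistent number of decimal places, and a descending sort, which is most of what the final table above does.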
Good design is a matter of consideration, and even of conservation. If spending 10 extra minutes formatting your data better saves 1,000 readers 2 minutes each, then you’ve saved the world 1,990 minutes of wasted effort.
But good design is also a very practical matter for you personally, too. If you want your audience to understand your work, then make your information easier for them to consume—whether you’re writing email, proposals, reports, infographics, slides, or software. It’s part of the pathway to being more persuasive.
Fail Fast
I think I can help explain why the principle of “fail fast” is so important, and maybe I can help you explain it, too.
Software developers know about fail fast already, whether they realize it or not. Yesterday was a prime example for me. It was a really long day. I didn’t leave my office until after 9pm, and then I turned my laptop back on as soon as I got home to work another three hours. I had been fighting a bug all afternoon. It was a program that ran about 90 seconds normally, but when I tried a code path that should have been much faster, I could let it run 50 times that long and it still wouldn’t finish.
At home, I ran it again and left it running while I watched the Thunder beat the Spurs, assuming the program would finish eventually, so I could see the log file (which we’re not flushing often enough, which is another problem). My MacBook Pro ran so hard that the fan compelled my son to ask me why my laptop was suddenly so loud. I was wishing the whole time, “I wish this thing would fail faster.” And there it is.
When you know your code is destined to fail, you want it to fail faster. Debugging is hard enough as it is, without your stupid code forcing you to wait an hour just to see your log file, so you might gain an idea of what you need to go fix. If I could fail faster, I could fix my problem earlier, get more work done, and ship my improvements sooner.
But how does that relate to wanting my business idea to fail faster? Well, imagine that a given business idea is in fact destined to fail. When would you rather find out? (a) In a week, before you invest millions of dollars and thousands of hours into the idea? Or (b) In a year, after you’ve invested millions of dollars and thousands of hours?
I’ll take option (a) a million times out of a million. It’s like asking if I’d like a crystal ball. Um, yes.
The operative principle here is “destined to fail.” When I’m fixing a reported bug, I know that once I create a reproducible test case for that bug, my software will fail. It is destined to fail on that test case. So, of course, I want my process of creating the reproducible test case, my software build process, and my program execution itself to all happen as fast as possible. Even better, I wish I had come up with the reproducible test case a year or two ago, so I wouldn’t be under so much pressure now. Because seeing the failure earlier—failing fast—will help me improve my product earlier.
But back to that business idea... Why would you want a business idea to fail fast? Why would you want it to fail at all? Well, of course, you don’t want it to fail, but it doesn’t matter what you want. What if it is destined to fail? It’s really important for you to know that. So how can you know?
Here’s a little trick I can teach you. Your business idea is destined to fail. It is. No matter how awesome your idea is, if you implement your current vision of some non-trivial business idea that will take you, say, a month or more to build, without refining or evolving your original idea at all, your idea will fail. It will. Seriously. If your brain won’t permit you to conceive of this as a possibility, then your brain is actually increasing the probability that your idea will fail.
You need to figure out what will make your idea fail. If you can’t find it, then find smart people who can. Then, don’t fear it. Don’t try to pretend that it’s not there. Don’t work for a year on the easy parts of your idea, delaying the inevitable hard stuff, hoping and praying that the hard stuff will work its way out. Attack that hard stuff first. That takes courage, but you need to do it.
Find your worst bottleneck, and make it your highest priority. If you cannot solve your idea’s worst problem, then get a new idea. You’ll do yourself a favor by killing a bad idea before it kills you. If you solve your worst problem, then find the next one. Iterate. Shorter iterations are better. You’re done when you’ve proven that your idea actually works. In reality. And then, because life keeps moving, you have to keep iterating.
That’s what fail fast means. It’s about shortening your feedback loop. It’s about learning the most you can about the most important things you need to know, as soon as possible.
So, when I wish you fail fast, it’s a blessing; not a curse.
Loss Aversion and the Setting of DB_BLOCK_CHECKSUM
Clients are always concerned about the performance impact of features like this. Several years ago, I met a lot of people who had—in response to some expensive advice with which I strongly disagreed—turned off redo logging with an underscore parameter. The performance they would get from doing this would set the expectation level in their mind, which would cause them to resist (strenuously!) any notion of switching this [now horribly expensive] logging back on. Of course, it makes you wish that it had never even been a parameter.
I believe that the right analysis is to think clearly about risk. Risk is a non-technical word in most people’s minds, but in finance courses they teach that risk is quantifiable as a probability distribution. For example, you can calculate the probability that a disk will go bad in your system today. For disks, it’s not too difficult, because vendors do those calculations (MTTF) for us. But the probability that you’ll wish you had set db_block_checksum=full yesterday is probably more difficult to compute.
From a psychology perspective, customers would be happier if their systems had db_block_checksum set to full or typical to begin with. Then in response to the question,
“Would you like to remove your safety net in exchange for going between 1% and 10% faster? Here’s the horror you might face if you do it...”
...I’d wager that most people would say no, thank you. They will react emotionally to the idea of their safety net being taken away.
But with the baseline of its being turned off to begin with, the question is,
“Would you like to install a safety net in exchange for slowing your system down between 1% and 10%? Here’s the horror you might face if you don’t...”
...I’d wager that most people would answer no, thank you, even though this verdict is the opposite of the one I predicted above. They will react emotionally to the idea of their performance being taken away.
Most people have a strong propensity toward loss aversion. They tend to prefer avoiding losses over acquiring gains. If they already have a safety net, they won’t want to lose it. If they don’t have the safety net they need, they’ll feel averse to losing performance to get one. It ends up being a problem more about psychology than technology.
The only tools I know to help people make the right decision are:
- Talk to good salespeople about how they overcome the psychology issue. They have to deal with it every day.
- Give concrete evidence. Compute the probabilities. Tell the stories of how bad it is to have insufficient protection. Explain that any software feature that provides a benefit is going to cost some system capacity (just like a new report, for example), and that this safety feature is worth the cost. Make sure that when you size systems, you include the incremental capacity cost of switching to db_block_checksum=full.
When you read David’s article, you are going to see heavy quoting of my post here in his intro. He did that with my full support. (He wrote his article when my article here wasn’t an article yet.) If you feel like you’ve read it before, just keep reading. You really, really need to see what David has written, beginning with the question:
If I’ve never faced a corruption, and I have good backup strategy, my disks are mirrored, and I have a great database backup strategy, then why do I need to set these kinds of parameters that will impact my performance?
Enjoy.
The “Two Spaces After a Period” Thing
Here’s the story.
When you type, you’re inputting data into a machine. I know you like feeling like you’re in charge, but really you’re not in charge of all the rules you have to follow while you’re inputting your data. Other people—like the designers of the machine you’re using—have made certain rules that you have to live by. For example, if you’re using a QWERTY keyboard, then the ‘A’ key is in a certain location on the keyboard, and whether it makes any sense to you or not, the ‘B’ key is way over there, not next to the ‘A’ key like you might have expected when you first started learning how to type. If you want a ‘B’ to appear in the input, then you have to reach over there and push the ‘B’ key on the keyboard.
In addition to the rules imposed upon you by the designers of the machine you’re using, you follow other rules, too. If you’re writing a computer program, then you have to follow the syntax rules of the language you’re using. There are alphabet and spelling and grammar rules for writing in German, and different ones for English. There are typographical rules for writing for The New Yorker, and different ones for the American Mathematical Society.
A lot of people who are over about 40 years old today learned to type on an actual typewriter. A typewriter is a machine that used rods and springs and other mechanical elements to press metal dies with backwards letter shapes engraved onto them through an inked ribbon onto a piece of paper. Some of the rules that governed the data input experience on typewriters included:
- You had to learn where the keys were on the keyboard.
- You had to learn how to physically return the carriage at the end of a line.
- You had to learn your project’s rules of spelling.
- You had to learn your project’s rules of grammar.
- You had to learn your project’s rules of typography.
On your typewriter, you might not have realized it, but you did adhere to some typography rules. They might have included:
- Use two carriage returns after a paragraph.
- Type two spaces after a sentence-ending period.
- Type two spaces after a colon.
- Use two consecutive hyphens to represent an em dash.
- Make paragraphs no more than 80 characters wide.
- Never use a carriage return between “Mr.” and the proper name that follows, or between a number and its unit.
- Double-space all paragraph text.
Most people who didn’t write for different publishers got by just fine on the one set of typography rules they learned in high school. To them, it looked like there were only a few simple rules, and only one set of them. Most people had never even heard of a lot of the rules they should have been following, like rules about widows and orphans.
In the early 1980s, I began using computers for most of my work. I can remember learning how to use word processing programs like WordStar and Sprint. The rules were a lot more complicated with word processors. Now there were rules about “control keys” like ^X and ^Y, and there were no-break spaces and styles and leading and kerning and ligatures and all sorts of new things I had never had to think about before. A word processor was much more powerful than a typewriter. If you did it right, typesetting could make your work look like a real book. But word processors revealed that typesetting was way more complicated than just typing.
Doing your own typesetting can be kind of like doing your own oil changes. Most people prefer to just put gas in the tank and not think too much about the esoteric features of their car (like their tires or their turn signal indicators). Most people who went from typewriters to word processors just wanted to type like they always had, using the good-old two or three rules of typography that had been long inserted into their brains by their high school teachers and then committed by decades of repetition.
Donald Knuth published The TeXBook in 1984. I think I bought it about ten minutes after it was published. Oh, I loved that book. Using TeX was my first real exposure to the world of actual professional-grade typography, and I have enjoyed thinking about typography ever since. I practice typography every day that I use Keynote or Pages or InDesign to do my work.
Many people don’t realize it, but when you type input into programs like Microsoft Word, you should follow typography rules, including these:
- Never enter a blank line (edit your paragraph’s style to manipulate its spacing).
- Use a single space after a sentence-ending period (the typesetter software you’re using will make the amount of space look right as it composes the paragraph).
- Use a non-breaking space after a non-sentence-ending period (so the typesetter software won’t break “Mr. Harkey” across lines).
- Use a non-breaking space between a number and its unit (so the typesetter software won’t break “8 oz” across lines).
- Use an en dash—not a hyphen—to specify ranges of numbers (like “3–8”).
- Use an em dash—not a pair of hyphens—when you need an em dash (like in this sentence).
- Use proper quotation marks, like “this” and ‘this’ (or even « this »).
So, it’s always funny to me when people get into heated arguments on Facebook about using one space or two after a period. It’s the tiniest little tip of the typography iceberg, but it opens the conversation about typography, for which I’m glad. In these discussions, two questions come up repeatedly: “When did the rule change? Why?”
Well, the rule never did change. The next time I type on an actual typewriter, I will use two spaces after each sentence-ending period. I will also use two spaces when I create a Courier font court document or something that I want to look like it was created in the 1930s. But when I work on my book in Adobe InDesign, I’ll use one space. When I use my iPhone, I’ll tap in two spaces at the end of a sentence, because it automatically replaces them with a period and a single space. I adapt to the rules that govern the situation I’m in.
It’s not that the rules have changed. It’s that the set of rules was always a lot bigger than most people ever knew.
What I Wanted to Tell Terry Bradshaw
When I was little, Terry Bradshaw was my enemy because, unforgivably to a young boy, he and his Pittsburgh Steelers kept beating my beloved Dallas Cowboys in Super Bowls. As I grew up, though, his personality on TV talk shows won me over, and I enjoy watching him to this day on Fox NFL Sunday. After learning a little bit about his life, I’ve grown to really admire and respect him.
I had heard that he owned a ranch not too far from where I live, and so I had it in mind that inevitably I would meet him someday, and I would say thank you. One day I had that chance.
I completely blew it.
My wife and I saw him at the theater one day, standing by himself not far from us. It seemed like if I were to walk over and say hi, maybe it wouldn’t bother him. So I walked over, a little bit nervous. I shook his hand, and I said, “Mr. Bradshaw, hi, my name is Cary.” I would then say this:
I was a big Roger Staubach fan growing up. I watched Cowboys vs. Steelers like I was watching Good vs. Evil.
But as I’ve grown up, I have gained the deepest admiration and respect for you. You were a tremendous competitor, and you’re one of my favorite people to see on TV. Every time I see you, you bring a smile to my face. You’ve brought joy to a lot of people.
I just wanted to say thank you.
Yep, that’s what I would say to Terry Bradshaw if I got the chance. But that’s not how it would turn out. How it actually went was like this, …my big chance:
Me: I was a big Roger Staubach fan growing up.
TB: Hey, so was I!
Me: (stunned)
TB: (turns away)
The End
I was heartbroken. It bothers me still today. If you know Terry Bradshaw or someone who does, I wish you would please let him know. It would mean a lot to me.
…I did learn something that day about the elevator pitch.