Skip navigation.

Tanel Poder

Syndicate content
Linux, Oracle, Hadoop.
Updated: 10 hours 9 min ago

My New Youtube Channel

Mon, 2015-11-23 22:30

I have created a new youtube channel – and have uploaded some videos there already! Bookmark & Subscribe here:

More stuff is coming over the next weeks & months :-)


NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Troubleshooting Another Complex Performance Issue – Oracle direct path inserts and SEG$ contention

Tue, 2015-11-10 18:35

Here’s an updated presentation I first delivered at Hotsos Symposium 2015.

It’s about lots of concurrent PX direct path insert ant CTAS statements that, when clashing with another bug/problem, caused various gc buffer busy waits and enq: TX – allocate ITL entry contention. This got amplified thanks to running this concurrent workload on 4 RAC nodes:

When reviewing these slides, I see there’s quite a lot that needs to be said in addition to what’s on slides, so this might just mean a (Powerpoint) hacking session some day!

NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Related Posts

SQL Monitoring in Oracle Database 12c

Thu, 2015-10-29 12:53

Here’s my latest OOW presentation – SQL Monitoring in Oracle Database 12c:

[direct link]

You can download all my scripts from



NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Related Posts

Connecting Hadoop and Oracle

Tue, 2015-10-27 18:06

Here are the slides of my yesterday’s OakTableWorld presentation. They also include a few hints about what our hot new venture Gluent is doing (although bigger annoucements come later this year).

[direct link]

Also, if you are at Oracle OpenWorld right now, my other presentation about SQL Monitoring in 12c is tomorrow at 3pm in Moscone South 103. See you there!


NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Related Posts

My Oracle OpenWorld presentations

Fri, 2015-10-23 20:44

Oracle OpenWorld is just around the corner – I will have one presentation at OOW this year and another at the independent OTW event:

Connecting Oracle with Hadoop

Real-Time SQL Monitoring in Oracle Database 12c

  • Conference: OpenWorld
  • Time: Wednesday, 28 Oct, 3:00pm
  • Location: Moscone South 103
  • Abstract: Click here (sign up to guarantee a seat!)

I plan to hang out at the OTW venue on Monday and Tuesday, so see you there!


NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Related Posts

Advanced Oracle Troubleshooting v2.5 (with 12c stuff too)

Fri, 2015-10-09 01:58

It took a while (1.5 years since my last class – I’ve been busy!), but I am ready with my Advanced Oracle Troubleshooting training (version 2.5) that has plenty of updates, including some more modern DB kernel tracing & ASH stuff and of course Oracle 12c topics!

The online training will take place on 16-20 November & 14-18 December 2015 (Part 1 and Part 2).

The latest TOC is below:

Seminar registration details:

A notable improvement of AOT v2.5: now attendees will get downloadable video recordings after the sessions for personal use! So, no crappy streaming with 14-day expiry date, you can download the video MP4 files straight to your computer or tablet and keep for your use forever!

I won’t be doing any other classes this year, but there will be some more (pleasant) surprises coming next year ;-)

See you soon!

NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Related Posts

RAM is the new disk – and how to measure its performance – Part 2 – Tools

Mon, 2015-09-21 02:20

In the previous article I explained that the main requirement for high-speed in-memory data scanning is column-oriented storage format for in-memory data. SIMD instruction processing is just icing on the cake. Let’s dig deeper. This is a long post, you’ve been warned.

Test Environment

I will cover full test results in the next article in this series. First, let’s look into the test setup, environment and what tools I used for peeking inside CPU hardware.

I was running the tests on a relatively old machine with 2 CPU sockets, with 6-core CPUs in each socket (2s12c24t):

$ egrep "MHz|^model name" /proc/cpuinfo | sort | uniq -c
     24 cpu MHz		: 2926.171
     24 model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

The CPUs support SSE4.2 SIMD extensions (but not the newer AVX stuff):

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq

Even though the /proc/cpuinfo above shows the CPU clock frequency as 2.93GHz, these CPUs have Intel Turboboost feature that allows some cores run at up to 3.33GHz frequency when not all cores are fully busy and the CPUs aren’t too hot.

Indeed, the turbostat command below shows that the CPU core executing my Oracle process was running at 3.19GHz frequency:

# turbostat -p sleep 1
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             6.43 3.02 2.93   0  93.57   0.00   0.00   59   0.00   0.00
 0   0   0   4.49 3.19 2.93   0  95.51   0.00   0.00   46   0.00   0.00
 0   1   1  10.05 3.19 2.93   0  89.95   0.00   0.00   50
 0   2   2   2.48 3.19 2.93   0  97.52   0.00   0.00   45
 0   8   3   2.05 3.19 2.93   0  97.95   0.00   0.00   44
 0   9   4   0.50 3.20 2.93   0  99.50   0.00   0.00   50
 0  10   5 100.00 3.19 2.93   0   0.00   0.00   0.00   59
 1   0   6   6.25 2.23 2.93   0  93.75   0.00   0.00   44   0.00   0.00
 1   1   7   3.93 2.04 2.93   0  96.07   0.00   0.00   43
 1   2   8   0.82 2.15 2.93   0  99.18   0.00   0.00   44
 1   8   9   0.41 2.48 2.93   0  99.59   0.00   0.00   41
 1   9  10   0.99 2.35 2.93   0  99.01   0.00   0.00   43
 1  10  11   0.76 2.36 2.93   0  99.24   0.00   0.00   44

I will come back to this CPU frequency turbo-boosting later when explaining some performance metrics.

I ran the experiments in Oct/Nov 2014, so used a relatively early Oracle version with a bundle patch (19189240) for in-memory stuff.

The test was deliberately very simple as I was researching raw in-memory scanning and filtering speed and was not looking into join/aggregation performance. I was running the query below with different hints and parameters to change access path options:

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

I used the CUSTOMERS table of Swingbench Sales History schema. I deliberately didn’t use COUNT(*), but COUNT(col) on an actual column “cust_valid” that was nullable, so values in that actual column had to be accessed for correct counting.

Also, I picked the last column in the table as accessing columns in the physical “end” of a row (in row-oriented storage format) would cause more memory/cache accesses and CPU execution branch jumps due to the run-length encoded structure of a row in a datablock. Of course this depends on number of columns and width of the row too, plus hardware characteristics like cache line size (64 bytes on my machine).

Anyway, querying the last column helps to illustrate better what kind of overhead you may be suffering from when filtering that 500-column fact table using columns in the end of it.

SQL> @desc ssh.customers_nopart
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      CUST_ID                         NOT NULL NUMBER
    2      CUST_FIRST_NAME                 NOT NULL VARCHAR2(20)
    3      CUST_LAST_NAME                  NOT NULL VARCHAR2(40)
   22      CUST_EFF_TO                              DATE
   23      CUST_VALID                               VARCHAR2(1)

The table has 69,642,625 rows in it and its segment size is 1613824 blocks / 12608 MB on disk (actual used space in it was slightly lower due to some unused blocks in the segment). I set the table PCTFREE to zero to use all space in the blocks. I also created a HCC-compressed copy of the same table for comparison reasons.

SQL> @seg tanel.customers_nopart

-------- ------ -------------------------  ------------- ----------
   12608 TANEL  CUSTOMERS_NOPART           TABLE            1613824
    6416 TANEL  CUSTOMERS_NOPART_HCC_QL    TABLE             821248

I made sure that the test tables were completely cached in Oracle buffer cache to eliminate any physical IO component from tests and also enabled in-memory columnar caching for the CUSTOMERS_NOPART table.

SQL> @imseg %.%

---------- ---------- ----- ------------- ------------------- ------------- ------
     12608       5913  100% TANEL         CUSTOMERS_NOPART    TABLE         COMPLE
---------- ----------
     12608       5913
CPU Activity Measurement Tools

In addition to the usual suspects (Oracle SQL Monitoring reports and Snapper), I used the awesome Linux tool called perf, but not in the typical way you might have used it in past.

On Linux, perf can be used for profiling code executing on CPUs by sampling the instruction pointer and stack backtraces (perf top), but also for taking snapshots of internal CPU performance counters (perf stat). These CPU performance counters (CPC) tell us what happened inside the CPU during my experiments.

This way we can go way deeper than the high level tools like top utility or getrusage() syscall would ever allow us to go. We’ll be able to measure what physically happened inside the CPU. For example: for how many cycles the CPU core was actually doing useful work pushing instructions further in the execution pipeline vs. was stalled, waiting for requested memory to arrive or for some other internal condition to come true. Also, we can estimate the amount of traffic between the CPU and main memory, plus CPU cache hits/misses at multiple cache levels.

Perf can do CPC snapshotting and accounting also at OS process level. This means you can measure the internal CPU/memory activity of a single OS process under examination and that was great for my experiment.

Note that these kinds of tools are nothing new, they’ve been around with CPU vendor code profilers ever since CPUs were instrumented with performance counters (but undocumented in the early days). Perf stat just makes this stuff easily accessible on Linux. For example, since Solaris 8, you could use cputrack for extracting similar process-level CPU counter “usage”, other platforms have their own tools.

I used the following command (-p specifies the target PID) for measuring internal CPU activity when running my queries:

perf stat -e task-clock,cycles,instructions,branches,branch-misses \
          -e stalled-cycles-frontend,stalled-cycles-backend \
          -e cache-references,cache-misses \
          -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \
          -p 92106 sleep 30

In RHEL6 equivalents (and later) you can use perf stat -d option for getting similar detailed output without specifying all the counters separately – but I was on OEL5.8. Also, different CPU versions support different performance counters. Read the manuals and start from simpler stuff.

Below is an example output from one test run – where I ran a full table scan against the last column of a regular row-oriented table (all cached in buffer cache) and took a perf stat snapshot of the entire SQL execution. Note that even though the table was cached in Oracle’s in-memory column store, I had disabled its use with the NO_INMEMORY hint, so this full table scan was done entirely via traditional Oracle buffer cache (no physical IOs!):

 Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz                     [33.33%]
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn [40.00%]
     7,386,220,210 branches                  #  269.828 M/sec                   [39.99%]
        22,056,397 branch-misses             #    0.30% of all branches         [40.00%]
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle    [40.00%]
    58,627,393,395 stalled-cycles-backend    #   67.83% backend  cycles idle    [40.00%]
       256,440,384 cache-references          #    9.368 M/sec                   [26.67%]
       222,036,981 cache-misses              #   86.584 % of all cache refs     [26.66%]
       234,361,189 LLC-loads                 #    8.562 M/sec                   [26.66%]
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits    [ 6.67%]
        18,493,582 LLC-stores                #    0.676 M/sec                   [ 6.67%]
         3,233,231 LLC-store-misses          #    0.118 M/sec                   [ 6.67%]
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec                   [13.33%]
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits   [20.00%]
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec                   [26.66%]

      30.000601214 seconds time elapsed

I ran perf for 30 seconds for the above experiment, kicked it running just before executing the Oracle SQL and it finished right after the SQL had completed.

Let’s go through some of the above metrics – top down. I’m explaining these metrics at a fairly high level and in the context of my experiment – fully measuring a single SQL execution in a single Oracle process:

Basic CPU Performance Counter Reference
  1. task-clock (~27373 milliseconds)
    – This is a software event and shows how much the target Linux task (my Oracle process) spent running on CPU during the SQL execution, as far as the OS scheduler knows (roughly 27 seconds on CPU).
    – So while perf took a 30 second snapshot of my process, my test SQL ran in a couple of seconds shorter time (so didn’t run on CPU all 30 seconds). That should explain the “0.912 CPU utilized” derived metric above.
  2. cycles – (86B cycles)
    – This hardware metric shows how many CPU cycles did my process (running a SQL statement) consume during perf runtime.
    – dividing 86B CPU cycles with 27 CPU seconds gives that the CPU core must have operated at around 3.1 GHz speed (on average) during my SQL run.
    – Remember, earlier in this article I used turbostat to show how these 2.93 GHz CPU cores happened to be running at 3.19 GHz frequency thanks to turbo-boost!
  3. instructions – (32B instructions)
    – This hardware metric shows how many instructions the CPU managed to successfully execute (and retire). This is where things get interesting:
    – It’s worth mentioning that modern CPUs are superscalar and pipelined. They have multiple internal execution units, can have multiple instructions (decoded to µops) executed in its pipeline, memory loads & stores happening concurrently and possibly out-of-order – instruction level parallelism, data level parallelism.
    – When dividing 32B executed instructions with 86B CPU cycles we see that we managed to execute only ~0.37 instructions per CPU cycle (IPC) on average!
    – When inverting this number we get 86B/32B=2.69 Cycles Per Instruction (CPI). So, on average, every CPU instruction took ~2.69 CPU cycles to execute! We’ll get to the “why” part later.
  4. branches – (7.3B branches)
    – This hardware metric shows how many branches the execututed code took
    – A branch is basically a jump (unconditional JMP instruction or a conditional jump like JZ, JNZ and many more – this is how basic IF/THEN/ELSE, CASE and various LOOP statmenents work at CPU level).
    – The more decision-points in your code, the more branches it takes.
    – Branches are like speedbumps in a CPU execution pipeline, obstructing the execution flow and prefetching due to the uncertainty of which branch will be taken.
    – That’s why features like branch prediction with speculative execution are built into modern CPUs to alleviate this problem.
    – Knowing that we scanned through roughly 70M rows in this table, this is over 100 branches taken per row scanned (and counted)!
    – Oracle’s traditional block format rows are stored in run-length encoded format, where you know where the following column starts only after reading (and testing) the previous column’s length byte(s). The more columns you need to traverse, the more branches you’ll take per row scanned.
  5. branch-misses – (22M, 0.3% of all branches)
    – This hardware metric shows how many times the CPU branch predictor (described above) mispredicted which branch would be taken and causing a pipeline stall.
    – Correctly predicting branches (where will the code execution jump next?) is good, as this allows to speculatively execute upcoming instructions and prefetch data required by them.
    – However, the branch predictor doesn’t always predict the future correctly, despite various advancements in modern branch prediction, like branch history tables and branch target buffers etc.
    – In a branch misprediction case, the mispredicted branch state has to be discarded, pipeline flushed and the correct branch’es instructions will be fetched & put into the start of execution pipeline (in short: mispredictions waste CPU cycles).
  6. stalled-cycles-frontend – (~76.6M cycles, 88.7% of all cycles)
    – This hardware metric shows for how many cycles the front-end of the CPU were spent stalled, not producing new µops into pipeline for backend execution.
    – The front-end of an Intel pipelined CPU is basically the unit that fetches the good old x86/x86_64 instructions from L1 instruction cache (or RAM if needed), decodes these to RISC-like µops (newer CPUs also cache those µops) and puts these into backend instruction queue for execution.
    – The front-end also deals with remembering taken brances and branch prediction (decoding and sending the predicted branch’es instructions into the backend).
    – The frontend can stall due to various reasons, like instruction cache misses (waiting for memory lines containing instructions to arrive from RAM or a lower level cache), branch mispredictions or simply because the backend can not accept more instructions into its pipeline due to some bottlenecks there.
  7. stalled-cycles-backend – (~58.6M cycles, 67.8% of all cycles)
    – This hardware metric shows how many cycles in the back-end of the CPU were spent stalled instead of advancing the pipeline.
    – The backend of the CPU is where actual computation on data happens – any computation referencing main memory (not only registers) will have to wait until the referenced memory locations have arrived in CPU L1 cache
    – A common reason for backend stalls is due to waiting for a cache line to arrive from RAM (or a lower level cache), although there are many other reasons.
    – To reduce memory-access related stalls, the program should be optimized to do less memory accesses or switch to more compact data structures to avoid loading data it doesn’t need.
    – Simpler, more predictable data structures also help as the CPU hardware prefetcher may detect an “array scan” and start prefetching required memory lines in advance.
    – In the context of this blog series – sequentially scanning and filtering a column of a table’s data is good for reducing memory access related CPU stalls. Walking through random pointers of linked lists (cache buffers chains) and skipping through row pointers in blocks, plus many columns’ length bytes before getting to your single column of interest causes memory access related stalls.
  8. cache-references – (256M references)
    – Now we get into a series of CPU cache traffic related metrics, some of these overlap
    – This metric shows how many Last Level Cache accesses (both read and write) were done.
    – The memory location that CPU tried to access was not in a higher level (L1/L2) cache, thus the lowest cache, Last Level Cache, was checked.
    – Last Level Cache, also called LLC or Lower Level Cache or Longest Latency Cache is usually L3 cache on modern CPUs (although there are some hints that some perf versions still report L2 cache as LLC). I need to read some more perf source code to figure this out, but for this experiment’s purposes it doesn’t matter much if it’s L2 or L3. If I scan through a multi-GB table, it won’t fit into either level cache anyway.
  9. cache-misses – (222M misses)
    – This metric shows how many times the cache reference could not be satisified by the Last Level Cache and therefore RAM access was needed.
  10. LLC-loads – (234M loads)
    – The following 4 metrics just break down the above two in more detail.
    – This metric shows how many times a cache line was requested from LLC as it wasn’t available (or valid) in a higher level cache.
  11. LLC-load-misses – (218M misses)
    – This metric shows how many LLC loads could not be satisfied from the LLC and therefore RAM access was needed.
  12. LLC-stores – (18M stores)
    – This metric shows how many times a cache line was written into a LLC.
  13. LLC-store-misses – (3M misses)
    – This metric shows how many times we had to first read the cache line into LLC before completing the LLC write.
    – This may happen due to partial writes (for example: cache line size is 64 bytes and not currently present in LLC and the CPU tries to write into first 8 bytes of the line).
    – This metric may get incremented due to other cache coherency related reasons where the store fails as other CPU(s) currently own the memory line and have locked and modified it since it was loaded into current CPU cache.
  14. L1-dcache-loads – (7300M loads)
    – The following 3 metrics are similar as above, but for the small (but fast) L1 cache.
    – This metric shows how many times the CPU attempted to load a cache line from L1-data cache into a register.
    – The dcache in the metric name means data accesses from memory (icache means instruction cache – memory lines fetched from L1I cache with instructions for execution).
    – Note how the L1D cache loads metric is way higher than LLC-loads (7300M vs 234M) as many of the repeated tight loops over small internal memory structures can be satisfied from L1 cache.
  15. L1-dcache-load-misses – (305M misses)
    – This metric shows how many data cache loads from L1D cache couldn’t be satisfied from that cache and therefore a next (lower) level cache was needed.
    – If you are wondering how come the L1D cache load misses is much larger than the LLC-loads (305M vs 234M – shouldn’t they be equal), one explanation is that as there’s a L2 cache between L1 & L3, some of the memory accesses got satisfied in L2 cache (and some more explanations illustrating the complexity of CPU cache metrics are here).
  16. L1-dcache-prefetches – (37M prefetches)
    – This metric shows how many cache lines the CPU prefetched as the L1D cache prefetch (DCU prefetcher).
    – Usually this simple prefetcher just fetches the next cache line to the “currently” accessed one.
    – It would be interesting to know if this prefetcher is smart enough to prefetch previous cache lines as regular row-formatted Oracle datablocks are filled from bottom up (this does not apply to the column-oriented stuff).
    – If the full table scan code walks the block’s row directory so that it jumps to the bottom of the block first and works its way upwards, this means that some memory accesses will look like scanning backwards – and may affect prefetching.


I hope that this is a useful reference when measuring what’s going on inside a CPU. This is actually pretty basic stuff in the modern CPU world, there’s much more that you can measure in CPUs (via raw performance counters for example) and also different tools that you can use, like Intel VTune. It’s not trivial though, as at such low level even different CPU models by the same vendor may have different meaning (and numbering & flags) for their performance counters.

I won’t pretend to be a CPU & cache coherency expert, however these basic metrics and my understanding looks to be correct enough for comparing different Oracle access paths and storage formats (more about this in next parts of the series).

One bit of warning: It’s the best to run these experiments on a bare-metal server, not in a virtual machine. This is a low-level measurement exercise and in a VM you could suffer from all kinds of additional noise, plus some of the hardware counters would not be available for perf. Some hypervisors do not allow the guest OS to access hardware performance counters by default. One interesting article (by Frits Hoogland) about running perf in VMs is here.

Ok, enough writing for today! I actually started this post more than a month ago and it got way longer than planned. In the next part of this series I will interpret this post’s full-table scan SQL CPU metrics using the above reference (and explain where the bottleneck/inefficiency is). And in Part 4 I’ll  show you all the metrics from a series of experiments – testing memory access efficiency of different Oracle data access paths (indexes vs full table scan vs HCC vs in-memory column store).

Update: I have corrected a couple of typos, we had 86B cpu cycles and 32B instructions instead of 86M/32M as I had mistakenly typed in before. The instructions-per-cycle ratio calculation and the point remains the same though. Thanks to Mark Farnham for letting me know.

NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Related Posts

We are hiring!

Thu, 2015-09-17 15:32

Gluent – where I’m a cofounder & CEO – is hiring awesome developers and (big data) infrastructure specialists in US and UK!

We are still in stealth mode, so won’t be detailing publicly what exactly we are doing ;-)

However, it is evident that the modern data platforms (for example Hadoop) with their scalability, affordability-at-scale and freedom to use many different processing engines on open data formats are turning enterprise IT upside down.

This shift has already been going on for years in large internet & e-commerce companies and small startups, but now the shockwave is arriving to all traditional enterprises too. And every single one of them must accept it, in order to stay afloat and win in the new world.

Do you want to be part of the new world?



NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

RAM is the new disk – and how to measure its performance – Part 1 – Introduction

Sun, 2015-08-09 17:26

RAM is the new disk, at least in the In-Memory computing world.

No, I am not talking about Flash here, but Random Access Memory – RAM as in SDRAM. I’m by far not the first one to say it. Jim Gray wrote this in 2006: “Tape is dead, disk is tape, flash is disk, RAM locality is king” (presentation)

Also, I’m not going to talk about how RAM is faster than disk (everybody knows that), but in fact how RAM is the slow component of an in-memory processing engine.

I will use Oracle’s In-Memory column store and the hardware performance counters in modern CPUs for drilling down into the low-level hardware performance metrics about CPU efficiency and memory access.

But let’s first get started by looking a few years into past into the old-school disk IO and index based SQL performance bottlenecks :)

Have you ever optimized a SQL statement by adding all the columns it needs into a single index and then letting the database do a fast full scan on the “skinny” index as opposed to a full table scan on the “fat” table? The entire purpose of this optimization was to reduce disk IO and SAN interconnect traffic for your critical query (where the amount of data read would have made index range scans inefficient).

This special-purpose approach would have benefitted your full scan in two ways:

  1. In data warehouses, a fact table may contain hundreds of columns, so an index with “only” 10 columns would be much smaller. Full “table” scanning the entire skinny index would still generate much less IO traffic than the table scan, so it became a viable alternative to wide index range scans and some full table scans (and bitmap indexes with star transformations indeed benefitted from the “skinniness” of these indexes too).
  2. As the 10-column index segment is potentially 50x smaller than the 500-column fact table, it might even fit entirely into buffer cache, should you decide so.

This is all thanks to physically changing the on-disk data structure, to store a copy of only the data I need in one place (column pre-projection?) and store these elements close to each other (locality).

Note that I am not advocating the use of this as a tuning technique here, but just explaining what was sometimes used to make a handful critical queries fast at the expense of the disk space, DML, redo and buffer cache usage overhead of having another index – and why it worked.

Now, why would I worry about this at all in a properly warmed up inmemory database, where the disk IO is not at the critical path of data retrieval at all? Well, now that we have removed the disk IO bottleneck, we inevitably hit the next slowest component as a bottleneck and this is … RAM.

Sequentially scanning RAM is slow. Randomly accessing RAM lines is even slower! Of course this slowness is all relative to the modern CPUs that are capable of processing billions of instructions per core every second.

Back to Oracle’s In-Memory column store example: Despite all the marketing talk about loop vectorization with CPU SIMD processing extensions, the most fundamental change required for “extreme performance” is simply about reducing the data traffic between RAM and CPUs.

This is why I said “SIMD would be useless if you waited on main memory all the time” at the Oracle Database In-Memory in Action presentation at Oracle OpenWorld (Oct 2014):

Oracle In-Memory in Action presentation

The “secret sauce” of Oracle’s in-memory scanning engine is the columnar storage of data, the ability to (de)compress it cheaply and accessing only the filtered columns’ memory first, before even touching any of the other projected columns required by the query. This greatly reduces the slow RAM traffic, just like building that skinny index reduced disk I/O traffic back in the on-disk database days. The SIMD instruction set extensions are just icing on the columnar cake.

So far this is just my opinion, but in the next part I will show you some numbers too!


NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Related Posts