Tuning Goldilocks SQL

... then Goldilocks went into the bears' Data Centre and there were 3 Oracle databases. The first was a Data Warehouse. Goldilocks checked the AWR, but all the SQLs were to-o-o-o-o-o-o-o big; they all used full scans, hash joins, bitmap-index combining, partition pruning and parallelism and couldn't be tuned any more. So Goldilocks went to the second Oracle database. It was an OLTP system with hundreds of concurrent users. Goldilocks fired up SQL Tuning Advisor, but all the SQLs were to-o-o-o-o-o-o-o small; they used unique index scans and cluster-joins and couldn't be tuned any more. So Goldilocks went to the third Oracle database. It was an Operational Data Store with a rolling 3 month retention. Goldilocks found SQLs that were joining a million rows with Nested Loops joins, buffer cache hit ratios of 50%, and under-utilised disk. She smiled, opened up Tom Kyte's Expert Oracle eBook on her second monitor and got to work. This database was ju-u-u-u-u-u-u-u-st right ...

Hang around me and you will quickly get sick of hearing how there are two types of SQL: High volume and Low volume. High volume SQLs process more than 10% of the data in a table and are best performed with Full Scans and Hash Joins. Low volume SQLs process less than 1% of the data in a table and are best performed with index scans and Nested Loop joins. Somewhere between these two figures (depending on many factors) is a break-even point where the full scan/hash join is just as fast as the index scan/nested loop join, so it hardly matters which approach is used because they will perform about the same.

I have been known to harp on about it, so it will please many people to know that I have received my comeuppance. After years of ignoring Goldilocks SQL - not too big and not too small - I have discovered that high volume and low volume techniques are not "just as good" as each other but "just as bad". A Goldilocks SQL will have worse throughput (rows per second) than a high-volume SQL when it full-scans, because it discards a higher proportion of unwanted rows. And it will have a worse cache-hit ratio than a high-concurrency low-volume SQL when it index-scans, because the long index scans continually age data out of the Buffer Cache. By any metric, Goldilocks SQL just plain sucks!

Baseline

The poor join performance experienced in my latest project (a datamart ETL) was unlike any other I've ever seen. My standard approach to designing an ETL job is to select rows from a single source table, join to secondary tables to pick up reference data for the transformation, then join to the target table (the one that will be updated) to determine whether the row already exists and whether it has changed. Something like this:

SELECT    *
FROM      test_join_src  src
LEFT JOIN test_join_ref1 ref1 ON ref1.pk1 = src.fk1
LEFT JOIN test_join_ref2 ref2 ON ref2.pk2 = src.fk2
LEFT JOIN test_join_tgt  tgt  ON tgt.pk = src.pk

Since we're talking about a Goldilocks example, let's stipulate that TEST_JOIN_TGT is too large to perform a Full Table Scan - so we must use an indexed nested loop join. This would be a typical execution plan.

-----------------------------------------------------------------------------------------
| Id  | Operation                      | Name              | Rows  | Bytes | Cost (%CPU)|
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                   |  9990 |  2029K| 58694   (1)|
|   1 |  NESTED LOOPS OUTER            |                   |  9990 |  2029K| 58694   (1)|
|   2 |   NESTED LOOPS OUTER           |                   |  9761 |  1553K| 39136   (1)|
|   3 |    NESTED LOOPS OUTER          |                   |  9761 |  1010K| 19579   (1)|
|   4 |     TABLE ACCESS FULL          | TEST_JOIN_SRC     |  9761 |   581K|    22   (5)|
|   5 |     TABLE ACCESS BY INDEX ROWID| TEST_JOIN_REF2    |     1 |    45 |     2   (0)|
|   6 |      INDEX UNIQUE SCAN         | TEST_JOIN_REF2_PK |     1 |       |     1   (0)|
|   7 |    TABLE ACCESS BY INDEX ROWID | TEST_JOIN_TGT     |     1 |    57 |     2   (0)|
|   8 |     INDEX UNIQUE SCAN          | TEST_JOIN_TGT_PK  |     1 |       |     1   (0)|
|   9 |   TABLE ACCESS BY INDEX ROWID  | TEST_JOIN_REF1    |     1 |    45 |     2   (0)|
|  10 |    INDEX UNIQUE SCAN           | TEST_JOIN_REF1_PK |     1 |       |     1   (0)|
-----------------------------------------------------------------------------------------

For testing, we'll use the following setup (a sketch of a matching build script follows the list):

  • TEST_JOIN_SRC: 10,000 rows
  • TEST_JOIN_REF1: 1,000 rows
  • TEST_JOIN_REF2: 1,000 rows
  • TEST_JOIN_TGT: 10,000,000 rows
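
The article doesn't include its build script, so here is a minimal sketch of how a matching schema might be generated. The table and column names come from the join SQL above; the padding widths, generator idiom and exact key ranges are my own assumptions. The source keys follow the Case 1 distribution described below (1% overlap with the target).

CREATE TABLE test_join_ref1 AS
SELECT ROWNUM AS pk1, RPAD('x',40) AS padding1
FROM   dual CONNECT BY LEVEL <= 1000;

CREATE TABLE test_join_ref2 AS
SELECT ROWNUM AS pk2, RPAD('x',40) AS padding2
FROM   dual CONNECT BY LEVEL <= 1000;

-- 10M rows via a cross-join of two smaller row generators, which avoids
-- one enormous CONNECT BY.
CREATE TABLE test_join_tgt AS
SELECT ROWNUM AS pk, RPAD('x',50) AS padding
FROM   (SELECT 0 AS n FROM dual CONNECT BY LEVEL <= 10000)
,      (SELECT 0 AS m FROM dual CONNECT BY LEVEL <= 1000);

-- Case 1 distribution: the first 100 keys (1%) fall inside the target's
-- key range; the remaining 9,900 sit above its maximum, as they would for
-- a sequence-allocated key.
CREATE TABLE test_join_src AS
SELECT 9999900 + ROWNUM        AS pk
,      MOD(ROWNUM,1000) + 1    AS fk1
,      MOD(ROWNUM*7,1000) + 1  AS fk2
,      RPAD('x',50)            AS padding
FROM   dual CONNECT BY LEVEL <= 10000;

ALTER TABLE test_join_ref1 ADD CONSTRAINT test_join_ref1_pk PRIMARY KEY (pk1);
ALTER TABLE test_join_ref2 ADD CONSTRAINT test_join_ref2_pk PRIMARY KEY (pk2);
ALTER TABLE test_join_tgt  ADD CONSTRAINT test_join_tgt_pk  PRIMARY KEY (pk);

-- (Gather optimizer stats on all four tables before tracing.)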

Let's look at two variants:

  1. ETL INSERT - 1% of rows in TEST_JOIN_SRC exist in TEST_JOIN_TGT. This corresponds to the most common ETL case that performs primarily INSERTs.
  2. ETL MERGE - 99% of rows in TEST_JOIN_SRC exist in TEST_JOIN_TGT. This corresponds to the less common ETL case that performs primarily UPDATEs.

CASE 1 - ETL INSERT


    Below are the TKPROF stats including wait events for the SQL and plan above, where just 1% of the rows in TEST_JOIN_SRC exist in TEST_JOIN_TGT. This is typical of ETL jobs in a Data Warehouse that receives delta files (change data only) on event-based data. The bulk of the data is new events, with a very small number of old events updated.

    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.02       0.01          0          5          0           0
    Execute      1      0.00       0.00          0          0          0           0
    Fetch        1      1.13       3.10        182      60182          0           0
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total        3      1.15       3.12        182      60187          0           0
    
    Elapsed times include waiting on following events:
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      SQL*Net message to client                       1        0.00          0.00
      db file sequential read                       180        0.10          2.00
      db file scattered read                          1        0.01          0.01
      SQL*Net message from client                     1        3.73          3.73

    Wow. An ETL job processing 10,000 rows in 3.12 seconds (3205 rows per second) is pretty good. Granted, this is downhill with a tail wind because:

    • We've only performed the data acquisition for the ETL - not the load.
    • 99% of those rows in TEST_JOIN_SRC have a PK value that is outside the MIN/MAX of TEST_JOIN_TGT.PK. This means the index scan can short-circuit at the root block without even scanning through the branches of the b-tree index. Still, this is not unusual for an ETL loading new rows with a sequentially allocated key; the keys for all of the new rows we receive each day are higher than those of previous days.
    • Those two reference tables are small and rapidly cached.

    Why so fast? After the first few rows were processed, every block that the SQL needed had already been loaded into the Buffer Cache. For 10,000 rows, Oracle only went to disk 180 times to retrieve 182 blocks. Compare this to the 60,187 hits on the buffer cache: that's a buffer cache hit ratio of 99.7%. No wonder it's fast.

    Even though this may not be representative of the throughput of an entire ETL job, it establishes a benchmark against which we can compare more complex distributions of data using the same SQL and plan. Let's see what happens when we increase the disk I/O.

CASE 2 - ETL MERGE


    This time 99% of rows in TEST_JOIN_SRC will exist in TEST_JOIN_TGT, so the index scans on TEST_JOIN_TGT_PK will result in table lookups, which in turn will generate a lot more db file sequential read wait events.
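
    The article doesn't show how this distribution was generated; a plausible rebuild of the source table - my own reconstruction, not the original script - spreads 9,900 of the keys evenly through the target's key range and leaves 100 above its maximum:

    TRUNCATE TABLE test_join_src;

    INSERT INTO test_join_src
    SELECT CASE WHEN ROWNUM <= 9900
                THEN ROWNUM * 1010       -- 99%: scattered evenly through 1..10M
                ELSE 10000000 + ROWNUM   -- 1%: genuinely new keys
           END                     AS pk
    ,      MOD(ROWNUM,1000) + 1    AS fk1
    ,      MOD(ROWNUM*7,1000) + 1  AS fk2
    ,      RPAD('x',50)            AS padding
    FROM   dual CONNECT BY LEVEL <= 10000;

    COMMIT;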

    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.00       0.00          0          0          0           0
    Execute      1      0.00       0.00          0          0          0           0
    Fetch        1      3.83      70.43      10097      69964          0           0
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total        3      3.83      70.43      10097      69964          0           0
    
    Elapsed times include waiting on following events:
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      SQL*Net message to client                       2        0.00          0.00
      SQL*Net message from client                     2       42.84         48.49
      db file sequential read                     10017        0.30         67.65
      db file scattered read                          5        0.01          0.07

    And there you go: 10017 trips to disk in 67.65 seconds. 6.8ms per disk read is a bit fast for the NAS on my system, so I suspect a bit of disk caching has occurred here. 11ms is the normal latency for a read on this hardware, so this test case should typically run in about 110 seconds instead of 70 seconds. Still, even at 110 seconds, an ETL job processing these 10,000 rows would have a throughput of 90 rows per second. That's not fantastic, but then an ETL that primarily updates existing data is a tougher-than-average job, and you expect worse-than-average performance.

    It's also worth noting that the scattered distribution of rows in TEST_JOIN_TGT is unnatural; we would typically expect a better clustering factor amongst our updated rows. Our even spread of keys across the entire table - combined with the flush of the buffer cache I performed before the test - results in a worst-case buffer cache hit ratio (85.5%). Under "normal" circumstances, we could reasonably expect 100+ rows per second, which is in an acceptable range.

The Problem (a.k.a. ... but the 3rd SQL was ju-u-u-u-u-u-u-st .... well .... WRONG!)

These are nice benchmarks, and they are typical of the type of ETL job you usually stumble across in a walk in the woods. But they bear (pun intended) about as much resemblance to the SQL in my latest project as papa-bear's porridge does to baby-bear's.

The SQLs I was trying to tune were identical in structure, but not in distribution. Specifically, the reference tables were not nice and small and quickly cached; they were big, fat, 10M row monsters just like the target table. Let's see what happens when we change the distribution of data in these tables.

CASE 3 - Sparse Lookups


    This case increases the two reference tables from 1,000 to 10,000,000 rows. To make it worse, the foreign key columns in TEST_JOIN_SRC have been distributed amongst those values to minimise clustering. As a result, we expect the PK index scans and resultant table lookups to be picking up lots of new blocks from disk, with minimal benefit from the buffer cache.
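
    Once again the generator below is my own reconstruction rather than the article's script: grow each reference table to 10,000,000 rows, then scatter the source's foreign keys across the full range. The multipliers are arbitrary values coprime with 10,000,000, so the MOD expression maps keys one-to-one but far apart:

    DROP TABLE test_join_ref1;
    CREATE TABLE test_join_ref1 AS
    SELECT ROWNUM AS pk1, RPAD('x',40) AS padding1
    FROM   (SELECT 0 AS n FROM dual CONNECT BY LEVEL <= 10000)
    ,      (SELECT 0 AS m FROM dual CONNECT BY LEVEL <= 1000);
    ALTER TABLE test_join_ref1 ADD CONSTRAINT test_join_ref1_pk PRIMARY KEY (pk1);
    -- ... TEST_JOIN_REF2 rebuilt the same way ...

    UPDATE test_join_src
    SET    fk1 = MOD(pk * 999983, 10000000) + 1
    ,      fk2 = MOD(pk * 999979, 10000000) + 1;
    COMMIT;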

    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.03       0.04          1          1          0           0
    Execute      1      0.00       0.00          0          0          0           0
    Fetch        1     12.91     483.63      44907      89988          0           0
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total        3     12.94     483.67      44908      89989          0           0
    
    Elapsed times include waiting on following events:
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      db file sequential read                     44903        0.42        475.70
      SQL*Net message to client                       1        0.00          0.00
      db file scattered read                          2        0.00          0.00

    As an ETL job, those 10,000 source rows were processed in 483 seconds, or just over 20 rows per second. That is abysmal. Very few ETLs would be considered viable at 20 rows per second. The reason is pretty straightforward: 44907 visits to disk at 10.6ms per visit.

    Compared to Case 1 (ETL INSERT), we have:

    • The same SQL
    • The same execution plan
    • The same number of rows processed
    • Performance degradation of 15,500% (155x worse)

Q: Given the same SQL, same plan, and same data volumes, does the distribution of data affect performance?
A: Yep. Lots.

The Solution

The system that I worked on (which prompted this article) had a job that joined 10 large tables in this way, and plenty of other similar jobs as well. As an ETL designer more accustomed to the distributions of data shown in CASE 1 above, I was expecting a nice steady 500+ rows per second. Imagine my response to throughputs of 14 rows per second.

What's happening here is fairly straightforward. When combined with a high buffer cache hit ratio - such as in Case 1 above - indexed nested loops joins are fine. But when scanning a broad range of uncached data, the buffer-cache hit ratio drops and the SQL becomes non-viable.

Indexed nested-loops SQL reads from disk one block at a time. The time taken to perform such a read (disk latency) does not really depend on the amount of data returned; full table scans read many blocks at a time yet achieve much the same per-read latency. We cannot use full table scans here though; they are even slower because they read the entire table and discard the >99% of rows that are unwanted.

Indexed nested-loops SQLs that read uncached data are nothing new; they are pretty standard for OLTP systems where dozens or even hundreds of users are updating different transactions at once. And therein lies the solution: "at once". All of those OLTP users are connected in separate sessions and are issuing requests to read from disk concurrently. Modern disks are capable of handling many concurrent requests without an appreciable drop in performance, so parallelism holds the key to this solution.

Parallel Nested Loops

I was of the mistaken belief that Indexed Nested Loops joins could not be parallelised. This was a false deduction based on the fact that Oracle can only parallelise Full Table Scans, Fast Full Index Scans, and Partitioned Range Scans; the unique index scans at the heart of an Indexed Nested Loops join cannot themselves be parallelised.

Of course you can use a Full Table Scan as the driving table of an indexed nested-loops SQL (see the plan above) and that Full Table Scan can be parallelised. But is this any help? The full scan is not the problem; the uncached index lookups are the problem. What I did not realise is that when Oracle spawns parallel slaves to perform the full table scan, each of those slaves also takes responsibility for the other steps in the plan, including the unique index scans. So whilst a single unique index scan cannot be parallelised, the thousands of unique scans that are nested beneath a Full Table Scan CAN be parallelised.
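
As an aside, the same effect can be had without altering the table by hinting the driving table's scan - a sketch, with a purely illustrative degree:

SELECT    /*+ PARALLEL(src, 8) */ *
FROM      test_join_src  src
LEFT JOIN test_join_ref1 ref1 ON ref1.pk1 = src.fk1
LEFT JOIN test_join_ref2 ref2 ON ref2.pk2 = src.fk2
LEFT JOIN test_join_tgt  tgt  ON tgt.pk = src.pk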

Let's take a look at that plan after enabling parallel query on the Full Table Scan.

ALTER TABLE TEST_JOIN_SRC PARALLEL;

SELECT    *
FROM      test_join_src  src
LEFT JOIN test_join_ref1 ref1 ON ref1.pk1 = src.fk1
LEFT JOIN test_join_ref2 ref2 ON ref2.pk2 = src.fk2
LEFT JOIN test_join_tgt  tgt  ON tgt.pk = src.pk

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name              | Rows  | Bytes | Cost (%CPU)|    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                   |  9990 |  2029K|   511   (1)|        |      |            |
|   1 |  PX COORDINATOR                  |                   |       |       |            |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10000          |  9990 |  2029K|   511   (1)|  Q1,00 | P->S | QC (RAND)  |
|   3 |    NESTED LOOPS OUTER            |                   |  9990 |  2029K|   511   (1)|  Q1,00 | PCWP |            |
|   4 |     NESTED LOOPS OUTER           |                   |  9761 |  1553K|   342   (1)|  Q1,00 | PCWP |            |
|   5 |      NESTED LOOPS OUTER          |                   |  9761 |  1010K|   172   (1)|  Q1,00 | PCWP |            |
|   6 |       PX BLOCK ITERATOR          |                   |       |       |            |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL         | TEST_JOIN_SRC     |  9761 |   581K|     2   (0)|  Q1,00 | PCWP |            |
|   8 |       TABLE ACCESS BY INDEX ROWID| TEST_JOIN_REF2    |     1 |    45 |     2   (0)|  Q1,00 | PCWP |            |
|   9 |        INDEX UNIQUE SCAN         | TEST_JOIN_REF2_PK |     1 |       |     1   (0)|  Q1,00 | PCWP |            |
|  10 |      TABLE ACCESS BY INDEX ROWID | TEST_JOIN_TGT     |     1 |    57 |     2   (0)|  Q1,00 | PCWP |            |
|  11 |       INDEX UNIQUE SCAN          | TEST_JOIN_TGT_PK  |     1 |       |     1   (0)|  Q1,00 | PCWP |            |
|  12 |     TABLE ACCESS BY INDEX ROWID  | TEST_JOIN_REF1    |     1 |    45 |     2   (0)|  Q1,00 | PCWP |            |
|  13 |      INDEX UNIQUE SCAN           | TEST_JOIN_REF1_PK |     1 |       |     1   (0)|  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------

Note the Parallel Slaves step (PX SEND); it doesn't just surround the Full Table Scan (Step 7), it surrounds the entire query. This means that all of the Index Scan and Table Access steps (Steps 8-13) are performed in parallel. Let's see how this performs.

CASE 4 - Sparse Lookups using Parallel Query


    This is the same SQL and data as in the abysmal Case 3 above, except with Oracle Parallel Query enabled. Once again, I have flushed the buffer cache first to level the playing field.

    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.04       0.08          5          5          0           0
    Execute      1      0.19      13.38          0          3          0           0
    Fetch        1      2.55      22.65          0          0          0           0
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total        3      2.78      36.13          5          8          0           0
    
    Elapsed times include waiting on following events:
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      os thread startup                             128        0.10         11.88
      PX Deq: Join ACK                                7        0.00          0.00
      PX Deq: Parse Reply                            11        1.04          1.16
      PX Deq Credit: send blkd                      139        0.00          0.03
      SQL*Net message to client                       1        0.00          0.00
      PX Deq: Execute Reply                        3573        1.93         20.24
      PX Deq: Table Q Normal                          1        0.00          0.00
      PX Deq: Signal ACK                              1        0.00          0.00
      latch: session allocation                       1        0.01          0.01
      SQL*Net message from client                     1        0.38          0.38

    Unfortunately, TKPROF doesn't capture the wait events of the parallel threads, so we cannot see the number of physical reads from disk. According to the doco, it should have created a trace file for each parallel slave, but in my case it didn't.

    Still, a 92% improvement. Not too shabby. For our 10,000 rows we now have a throughput of 277 rows per second; that's well within the acceptable range given my minimum acceptable benchmark of 100 rows/s, especially given this is a worst-case scenario with an empty buffer cache and a very low level of index clustering.

Conclusion

The dichotomy of High-Volume and Low-Volume SQL is over. Goldilocks SQL is genuinely a third class of SQL that has its own tuning method.

Comments

First, I think this article is brilliant; it taught me a lot. But second, I did wonder about a couple of other approaches. Am I right in thinking that the problem in case 3 is partly due to the reference table blocks not being cached, so that each INDEX UNIQUE SCAN is followed by a TABLE ACCESS BY INDEX ROWID that requires a physical read? Parallel query speeds things up by brute force. However, it requires Enterprise Edition licences, and I'm not sure it would scale up too well when you think of the impact on CPU and I/O. How about these alternatives:

Case 5: restructure the reference tables as IOTs. I can see that this would not be appropriate if the reference tables have many columns, but if they are sufficiently few (or if sensible use can be made of an overflow segment) there might be a considerable saving from cutting out the TABLE ACCESS BY INDEX ROWID.
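
For concreteness, a sketch of what the IOT version might look like (the _iot suffix and column list are illustrative only, following the article's schema):

CREATE TABLE test_join_ref1_iot
( pk1      NUMBER       PRIMARY KEY
, padding1 VARCHAR2(40)
)
ORGANIZATION INDEX;

-- Populate from the heap table; a real rebuild would keep the original name.
INSERT INTO test_join_ref1_iot
SELECT pk1, padding1 FROM test_join_ref1;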

Case 6: restructure the reference tables as single table hash clusters, which would replace both the INDEX UNIQUE SCAN and the TABLE ACCESS BY INDEX ROWID with TABLE ACCESS HASH.
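
Again for concreteness, a hedged sketch (the sizing values are purely illustrative; SIZE would be tuned to the row length and block size):

CREATE CLUSTER test_join_ref1_cl (pk1 NUMBER)
SIZE 8192 SINGLE TABLE HASHKEYS 100000 HASH IS pk1;

CREATE TABLE test_join_ref1_hash
( pk1      NUMBER
, padding1 VARCHAR2(40)
)
CLUSTER test_join_ref1_cl (pk1);

-- Populate from the existing table; a real rebuild would keep the original name.
INSERT INTO test_join_ref1_hash
SELECT pk1, padding1 FROM test_join_ref1;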

If one has EE licences, use parallel query as well! If it is feasible to provide the data sets, I would be interested in doing the test myself - though I'm not sure I could obtain a suitable environment just now.

The Enterprise licence issue didn't occur to me; I've always been lucky enough to work for large wealthy clients who buy Oracle with all the trimmings. But the point is well taken. I am pretty confident neither would come close to Parallel, but it is still theoretically possible to get improvements approaching 50%.

I would be very wary of an IOT approach. They are great for PK access - agreed - but they are punishingly slow for selective access on alternate indexes. I think there would be unsatisfyingly few situations where IOTs would do more good than harm. I'm also wary of setting precedents that other people copy without understanding the caveats.

The hash cluster possibility is more interesting. Coincidentally, I built a Data Mart a while ago where I built all of the dimensions as Hash Clusters for just this purpose. I went to write up the experience for an article on this site, but was unable to create artificial test cases where the hash clusters materially out-performed unique indexes. In the end, the article morphed into a completely different topic (http://www.orafaq.com/node/1446): World's fastest scalable join.

During that exercise, I suspect that I did not adequately manage the physical disk cache and the buffer cache, and my volumes were too small. Certainly my buffer cache hit ratio was too high to see a difference. It would be interesting to have another go with the Goldilocks example.

My only reservations on the cluster example are:
- You cannot partition a hash cluster
- INSERT performance is impacted
You'd need to evaluate these costs when deciding to hash cluster a table as part of a Goldilocks solution.

I might have a go at this. Otherwise, stay tuned and I will see if I can post the test data generators.

Thanks for the feedback. Nice to know someone's reading out there.

Here is the output from the hash example. I set up both REF tables in Hash Clusters with a key size of 1 block and 100,000 hashkeys (these tables previously occupied about 70K blocks). Here is the trace:

SELECT    /*+ first_rows*/ *
FROM      test_join_src  src
LEFT JOIN test_join_ref1 ref1 ON ref1.pk1 = src.fk1
LEFT JOIN test_join_ref2 ref2 ON ref2.pk2 = src.fk2
LEFT JOIN test_join_tgt  tgt  ON tgt.pk = src.pk
where rownum > 1
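-- (ROWNUM > 1 is never true, so the join executes in full but returns no rows to the client)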

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.21       0.37         86        183          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1     10.06     306.30      29100      69768          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3     10.27     306.67      29186      69951          0           0

Misses in library cache during parse: 1
Optimizer mode: FIRST_ROWS
Parsing user id: 140

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  COUNT  (cr=69768 pr=29100 pw=0 time=306303222 us)
      0   FILTER  (cr=69768 pr=29100 pw=0 time=306303206 us)
  10000    NESTED LOOPS OUTER (cr=69768 pr=29100 pw=0 time=321080054 us)
  10000     NESTED LOOPS OUTER (cr=49887 pr=19531 pw=0 time=218657909 us)
  10000      NESTED LOOPS OUTER (cr=29997 pr=10034 pw=0 time=103099253 us)
  10000       TABLE ACCESS FULL TEST_JOIN_SRC (cr=88 pr=6 pw=0 time=58280 us)
   9907       TABLE ACCESS BY INDEX ROWID TEST_JOIN_TGT (cr=29909 pr=10028 pw=0 time=93153961 us)
   9907        INDEX UNIQUE SCAN TEST_JOIN_TGT_PK (cr=20002 pr=9942 pw=0 time=92078523 us)(object id 142723)
  10000      TABLE ACCESS HASH TEST_JOIN_REF2 (cr=19890 pr=9497 pw=0 time=105084541 us)
  10000     TABLE ACCESS HASH TEST_JOIN_REF1 (cr=19881 pr=9569 pw=0 time=106784586 us)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                     29101        0.50        299.55
  SQL*Net message to client                       1        0.00          0.00
  db file scattered read                          2        0.00          0.01
  SQL*Net message from client                     1       13.89         13.89

So it compares favourably with the indexed NL (484 sec), a 37% improvement. I think if I had clustered TEST_JOIN_TGT as well, it would have been freakishly close to my estimate of 50% improvement.

Verdict: Hash Clusters - definitely a tool to drop in the Goldilocks kit.

Ross,

Very informative article. Can you comment on the degree of parallelism and how to determine the most appropriate setting for that, for your test case?

--Andrew

Thanks for reading and responding, Andrew.

I have not done much experimenting with degrees of parallelism, and I certainly did not experiment with this test case.

My approach to parallelism has been to enable it on tables and indexes, and let Oracle sort out the number of parallel threads it uses. On my system - when nothing else is contending for parallel threads - Oracle allocates 128 to a SQL, with a maximum of 640 for the instance. In my parallel test case above it used 128 threads and the server was otherwise idle.

If you would like me to speculate (and let's face it, it's pretty hard to stop me), I would say that you could still get massive performance gains in this test case from far fewer slaves. My thinking here is:
- The critical path is the disk latency (round trip to disk). This represented 476 seconds out of a 484 second query.
- We mitigate this by parallelising those round trips.
- Too much parallelism will clog either the network (my storage is NAS) or the disk.
- If the disk was infinitely scalable, we would halve the disk latency with 2 slaves, cut it to one third with 3 slaves, etc.
- With my parallel example, the 128 slaves spent 20 seconds acquiring the data. Based on test case 3, I expect most of this was disk latency.
- 20 seconds is about 1/24th of the non-parallel 476 seconds spent retrieving from disk, whereas each thread processed 1/128th of the data.

My guess is that disk latency / parallelism related performance gains on my system tails off at around 20 parallel slaves.

So, I could probably be more frugal in my use of parallel resources by overriding the degree of parallelism to (say) 24. But this is a special case where we are dealing with disk latency. If the parallelism were being used to break down a GROUP BY - or some other intensive process - then the ideal degree might change.
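
For this test case, that override would be something like:

ALTER TABLE test_join_src PARALLEL 24;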

About all I can say about the degree of parallelism is that Oracle seems to manage it pretty well. If there is a lot of competition for parallel slaves, then the initialisation parameter parallel_adaptive_multi_user can be used to automatically limit the number of slaves allocated to each process.

I haven't yet found a need - in 10g - to manually throttle parallelism in Oracle.