8 Bulk Update Methods Compared

What I love about writing SQL Tuning articles is that I very rarely end up publishing the findings I set out to achieve. With this one, I set out to demonstrate the advantages of PARALLEL DML, didn't find what I thought I would, and ended up testing 8 different techniques to find out how they differed. And guess what? I still didn't get the results I expected. Hey, at least I learned something.

As an ETL designer, I hate updates. They are just plain nasty. I spend an inordinate proportion of the design time of an ETL system worrying about the relative proportion of rows inserted vs. updated. I worry about how ETL tools apply updates (did you know DataStage applies updates singly, but batches inserts in arrays?), how I might cluster together rows that are subject to updates, and what I might do if I just get too many updates to handle.

It would be fair to say I obsess about them. A little bit.

The two most common forms of Bulk Updates are:

  1. Update (almost) every row in the table. This is common when applying data patches and adding new columns.
  2. Updating a small proportion of rows in a very large table.

Case 1 is uninteresting. The fastest way to update every row in the table is to rebuild the table from scratch; all of the methods below will perform worse.

Case 2 is common in Data Warehouses and overnight batch jobs. We have a table containing years worth of data, most of which is static; we are updating selected rows that were recently inserted and are still volatile. This case is the subject of our test. For the purposes of the test, we will assume that the target table of the update is arbitrarily large, and we want to avoid things like full-scans and index rebuilds.

And the nominees are...

The methods covered include both PL/SQL and SQL approaches. I want to test on a level playing field and remove special factors that unfairly favour one method, so there are some rules:

  • Accumulating data for the update can be arbitrarily complex. SQL updates can have joins with grouping and sub-queries and what-not; PL/SQL can have cursor loops with nested calls to other procedures. I'm not testing the relative merits of how to accumulate the data, so each test will use pre-prepared update data residing in a Global Temporary Table.
  • Some methods - such as MERGE - allow the data source to be joined to the update target using SQL. Other methods don't have this capability and must use Primary Key lookups on the update target. To make these methods comparable, the "joinable" techniques will use a Nested Loops join to most closely mimic the Primary Key lookup of the other methods. Even though a Hash join may be faster than Nested Loops for some distributions of data, that is not always the case and - once again - we're assuming an arbitrarily large target table, so a full scan is not necessarily feasible.
  • Having said that we're not comparing factors outside of the UPDATE itself, some of the methods do have differences unrelated to the UPDATE. I have included these deliberately because they are reasonably common and have different performance profiles; I wouldn't want anyone to think that because their statements were *similar* to those shown here that they have the same performance profile.

The 8 methods I am benchmarking here are as follows (in rough order of complexity):

  1. Explicit Cursor Loop
  2. Implicit Cursor Loop
  3. UPDATE with nested SET subquery
  4. BULK COLLECT / FORALL UPDATE
  5. Updateable Join View
  6. MERGE
  7. Parallel DML MERGE
  8. Parallel PL/SQL

For all of the tests, the following table structures will be used:

TEST{n} (Update Source) - 100K rows             TEST (Update target) - 10M rows

Name                           Type             Name                           Type             
------------------------------ ------------     ------------------------------ ------------
PK                             NUMBER           PK                             NUMBER
FK                             NUMBER           FK                             NUMBER
FILL                           VARCHAR2(40)     FILL                           VARCHAR2(40)

The data has the following characteristics:

  • TEST.PK will contain values 0 .. 9999999, but not in that order.
  • TEST.PK is poorly clustered. It is generated by reversing the digits of LPAD(ROWNUM, 7, '0'). PK values of 1, 2, and 3 are adjacent in the primary key index but one million rows apart in the table.
  • TEST.FK will contain values 0 .. 99, unevenly distributed to favour lower numbers.
  • For the first round of testing, the column FK will be indexed with a simple b-tree index.
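
As an aside, a data set with these characteristics could be generated in one pass. This is only a sketch of my assumptions, not the script actually used: the skewed FK distribution is simplified to a uniform MOD here, and REVERSE is an undocumented Oracle function used for brevity.

```sql
-- Sketch: 10M-row target with a poorly clustered primary key.
-- Reversing the zero-padded row number scatters index-adjacent PK values
-- roughly one million rows apart in the table (e.g. 1 -> 1000000, 2 -> 2000000).
CREATE TABLE test AS
SELECT TO_NUMBER(REVERSE(LPAD(ROWNUM - 1, 7, '0'))) AS pk
,      MOD(ROWNUM, 100)                             AS fk   -- the real test skews this to low values
,      RPAD('x', 40, 'x')                           AS fill
FROM   dual
CONNECT BY LEVEL <= 10000000;

ALTER TABLE test ADD CONSTRAINT test_pk PRIMARY KEY (pk);
CREATE INDEX test_fk_i ON test (fk);
```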

Method 1: Explicit Cursor Loop

Not many people code this way, but there are some Pro*C programmers out there who are used to Explicit Cursor Loops (OPEN, FETCH and CLOSE commands) and translate these techniques directly to PL/SQL. The UPDATE portion of the code works in an identical fashion to the Implicit Cursor Loop, so this is not really a separate "UPDATE" method as such. The interesting thing about this method is that it performs a context switch between PL/SQL and SQL for every FETCH, making it less efficient than the bulk-fetching alternatives. I include it here because it allows us to compare the cost of context switches to the cost of updates.

DECLARE
    CURSOR c1 IS
        SELECT *
        FROM test6;

    rec_cur c1%rowtype;
BEGIN
    OPEN c1;
    LOOP
        FETCH c1 INTO rec_cur;
        EXIT WHEN c1%notfound;

        UPDATE test
        SET    fk = rec_cur.fk
        ,      fill = rec_cur.fill
        WHERE  pk = rec_cur.pk;
    END LOOP;
    CLOSE C1;
END;
/

Method 2: Implicit Cursor Loop

This is the simplest PL/SQL method and very common in hand-coded PL/SQL applications. Update-wise, it looks as though it should perform the same as the Explicit Cursor Loop. The difference is that the Implicit Cursor internally performs bulk fetches, which should be faster than the Explicit Cursor because of the reduced context switches.

BEGIN
    FOR rec_cur IN (
        SELECT *
        FROM test3
    ) LOOP
        UPDATE test
        SET    fk = rec_cur.fk
        ,      fill = rec_cur.fill
        WHERE  pk = rec_cur.pk;
    END LOOP;
END;
/

Method 3: UPDATE with nested SET subquery

This method is pretty common. I generally recommend against it for high-volume updates because the SET sub-query is nested, meaning it is performed once for each row updated. To support this method, I needed to create an index on TEST8.PK.

UPDATE test
SET    (fk, fill) = (
           SELECT test8.fk, test8.fill
           FROM   test8
           WHERE  pk = test.pk
)
WHERE  pk IN (
           SELECT pk
           FROM   test8
);

Method 4: BULK COLLECT / FORALL UPDATE

This one is gaining in popularity. Using BULK COLLECT and FORALL statements is the new de-facto standard for PL/SQL programmers concerned about performance because it reduces context switching overheads between the PL/SQL and SQL engines.

The biggest drawback to this method is readability. Since Oracle does not yet provide support for record collections in FORALL, we need to use scalar collections, making for long declarations, INTO clauses, and SET clauses.

DECLARE
    CURSOR rec_cur IS
    SELECT *
    FROM test4;

    TYPE num_tab_t IS TABLE OF NUMBER(38);
    TYPE vc2_tab_t IS TABLE OF VARCHAR2(4000);

    pk_tab NUM_TAB_T;
    fk_tab NUM_TAB_T;
    fill_tab VC2_TAB_T;
BEGIN
    OPEN rec_cur;
    LOOP
        FETCH rec_cur BULK COLLECT INTO pk_tab, fk_tab, fill_tab LIMIT 1000;
        EXIT WHEN pk_tab.COUNT() = 0;

        FORALL i IN pk_tab.FIRST .. pk_tab.LAST
           UPDATE test
            SET    fk = fk_tab(i)
            ,      fill = fill_tab(i)
            WHERE  pk = pk_tab(i);
    END LOOP;
    CLOSE rec_cur;
END;
/

Method 5: Updateable Join View

This is really a deprecated pre-9i method; the modern equivalent is the MERGE statement. This needs a unique index on TEST1.PK in order to enforce key preservation.

UPDATE (
        SELECT /*+ ordered use_nl(old)*/ new.pk as new_pk
        ,       new.fk as new_fk
        ,       new.fill as new_fill
        ,       old.*
        FROM test1 new
        JOIN test old ON (old.pk = new.pk)
)
SET fk = new_fk
,   fill = new_fill
/

Method 6: MERGE

The modern equivalent of the Updateable Join View. Gaining in popularity due to its combination of brevity and performance, it is primarily used to INSERT and UPDATE in a single statement. We are using the update-only version here. Note that I have included a FIRST_ROWS hint to force an indexed nested loops plan. This keeps the playing field level when comparing to the other methods, which also perform primary key lookups on the target table. A Hash join may or may not be faster; that's not the point - I could increase the size of the target TEST table to 500M rows and the Hash join would certainly be slower.

MERGE /*+ FIRST_ROWS*/ INTO test
USING test2 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill
/

Here is the Explain Plan:

-------------------------------------------------------------------------------
| Id  | Operation                      | Name    | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                |         |   130K|  9921K|   258K  (1)|
|   1 |  MERGE                         | TEST    |       |       |            |
|   2 |   VIEW                         |         |       |       |            |
|   3 |    NESTED LOOPS                |         |   130K|    11M|   258K  (1)|
|   4 |     TABLE ACCESS FULL          | TEST2   |   128K|  6032K|   172   (5)|
|   5 |     TABLE ACCESS BY INDEX ROWID| TEST    |     1 |    48 |     2   (0)|
|   6 |      INDEX UNIQUE SCAN         | TEST_PK |     1 |       |     1   (0)|
-------------------------------------------------------------------------------

Method 7: Parallel DML MERGE

Now we're getting clever... This is the MERGE example on steroids. It uses Oracle's Parallel DML capability to spread the load over multiple slave threads.

ALTER SESSION ENABLE PARALLEL DML;

MERGE /*+ first_rows parallel(test) parallel(test5) */ INTO test
USING test5 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill
/

Note the differences in the Explain Plan.

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)|    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                       |          |   109K|  8325K|  1880   (1)|        |      |            |
|   1 |  PX COORDINATOR                       |          |       |       |            |        |      |            |
|   2 |   PX SEND QC (RANDOM)                 | :TQ10002 |   109K|    10M|  1880   (1)|  Q1,02 | P->S | QC (RAND)  |
|   3 |    INDEX MAINTENANCE                  | TEST     |       |       |            |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE                        |          |   109K|    10M|  1880   (1)|  Q1,02 | PCWP |            |
|   5 |      PX SEND RANGE                    | :TQ10001 |   109K|    10M|  1880   (1)|  Q1,01 | P->P | RANGE      |
|   6 |       MERGE                           | TEST     |       |       |            |  Q1,01 | PCWP |            |
|   7 |        PX RECEIVE                     |          |   109K|    10M|  1880   (1)|  Q1,01 | PCWP |            |
|   8 |         PX SEND HYBRID (ROWID PKEY)   | :TQ10000 |   109K|    10M|  1880   (1)|  Q1,00 | P->P | HYBRID (ROW|
|   9 |          VIEW                         |          |       |       |            |  Q1,00 | PCWP |            |
|  10 |           NESTED LOOPS                |          |   109K|    10M|  1880   (1)|  Q1,00 | PCWP |            |
|  11 |            PX BLOCK ITERATOR          |          |       |       |            |  Q1,00 | PCWC |            |
|  12 |             TABLE ACCESS FULL         | TEST5    |   107K|  5062K|     2   (0)|  Q1,00 | PCWP |            |
|  13 |            TABLE ACCESS BY INDEX ROWID| TEST     |     1 |    48 |     0   (0)|  Q1,00 | PCWP |            |
|  14 |             INDEX UNIQUE SCAN         | TEST_PK  |     1 |       |     0   (0)|  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------

Method 8: Parallel PL/SQL

This is much easier to do with DataStage than with native PL/SQL. The goal is to have several separate sessions applying UPDATE statements at once, rather than using the sometimes restrictive Parallel DML alternative. It's a bit of a kludge, but we can do this in PL/SQL using a parallel-enabled pipelined table function. Here's the function:

CREATE OR REPLACE FUNCTION test_parallel_update (
	test_cur IN SYS_REFCURSOR
) 
RETURN test_num_arr
PARALLEL_ENABLE (PARTITION test_cur BY ANY)
PIPELINED
IS
	PRAGMA AUTONOMOUS_TRANSACTION;

	TYPE num_tab_t IS TABLE OF NUMBER(38);
	TYPE vc2_tab_t IS TABLE OF VARCHAR2(4000);

	pk_tab NUM_TAB_T;
	fk_tab NUM_TAB_T;
	fill_tab VC2_TAB_T;

	cnt INTEGER := 0;
BEGIN
	LOOP
		FETCH test_cur BULK COLLECT INTO pk_tab, fk_tab, fill_tab LIMIT 1000;
		EXIT WHEN pk_tab.COUNT() = 0;

		FORALL i IN pk_tab.FIRST .. pk_tab.LAST
			UPDATE test
			SET    fk = fk_tab(i)
			,      fill = fill_tab(i)
			WHERE  pk = pk_tab(i);

		cnt := cnt + pk_tab.COUNT;
	END LOOP;

        CLOSE test_cur;

	COMMIT;
	PIPE ROW(cnt);
	RETURN;
END;
/
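
The test_num_arr collection type returned by the function is not shown above; it would need to exist at schema level before the function will compile. An assumed definition:

```sql
-- Assumed schema-level collection type for the pipelined function's return
CREATE OR REPLACE TYPE test_num_arr AS TABLE OF NUMBER;
/
```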

Note that it receives its data via a Ref Cursor parameter. This is a feature of Oracle's parallel-enabled functions; they will apportion the rows of a single Ref Cursor amongst many parallel slaves, with each slave running over a different subset of the input data set.

Here is the statement that calls the Parallel Enabled Table Function:

SELECT sum(column_value)
FROM   TABLE(test_parallel_update(CURSOR(SELECT * FROM test7)));

Note that we are using a SELECT statement to call a function that performs an UPDATE. Yeah, I know, it's nasty. You need to make the function an AUTONOMOUS TRANSACTION to stop it from throwing ORA-14551 (cannot perform a DML operation inside a query). But just bear with me; it is the closest PL/SQL equivalent I can make to a third-party ETL tool such as DataStage with native parallelism.

And on to the testing....

ROUND 1

In this test, we apply the 100K updated rows in Global Temporary Table TEST{n} to permanent table TEST. There are 3 runs:

  • Run 1: The buffer cache is flushed, and about 1 hour of unrelated statistics gathering was run to age out the disk cache.
  • Run 2: The buffer cache is flushed, and the disk cache has been aged out with about 5-10 mins of indexed reads. Timings indicate that the disk cache was still partially populated with blocks used by the query.
  • Run 3: The buffer cache is pre-salted with the table and index blocks it will need, so it should perform very little disk IO.
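
For reference, flushing Oracle's buffer cache between runs is a single command. It clears only Oracle's own cache; the disk/SAN cache has to be aged out with unrelated workload, as described above.

```sql
-- Requires ALTER SYSTEM privilege; intended for test systems only
ALTER SYSTEM FLUSH BUFFER_CACHE;
```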

                                      RUN 1   RUN 2   RUN 3  
-----------------------------------   -----   -----   -----
1. Explicit Cursor Loop               931.3   783.2    49.3
2. Implicit Cursor Loop               952.7   672.8    40.2
3. UPDATE with nested SET subquery    941.4   891.5    31.5
4. BULK COLLECT / FORALL UPDATE       935.2   826.0    27.9
5. Updateable Join View               933.2   741.0    28.8
6. MERGE                              854.6   838.5    28.4
7. Parallel DML MERGE                  55.7    46.1    47.7
8. Parallel PL/SQL                     28.2    27.2     6.3

The things I found interesting from these results are:

  • Amongst the non-parallel methods (1-6), context switches only make a significant and noticeable difference with cached data. With uncached data, the cost of disk reads so far outweighs the context switches that they are barely noticeable. Context switching - whilst important - is not really a game-changer. This tells me that you should avoid methods 1 and 2 as a best practice, but it is probably not cost-effective to re-engineer existing method 1/2 code unless your buffer cache hit ratio is 99+% (i.e. like RUN 3).

  • There were no significant differences between the 6 non-parallel methods, but that is not to say the choice is unimportant. All of these benchmarks perform Primary Key lookups of the updated table; however, it is possible to run methods 5 and 6 as hash joins with full table scans. If the proportion of blocks updated is high enough, the hash join can make an enormous difference in the run time. See Appendix 1 for an example.

  • Parallel updates are a game changer. The reason for this is disk latency. Almost all of the time for RUN 1 and RUN 2 of the non-parallel methods is spent waiting for reads and writes on disk. The IO system of most computers is designed to serve many requests at a time, but no ONE request can utilise ALL of the resources. So when an operation runs serially, it only uses a small proportion of the available resources. If there are no other jobs running then we get poor utilisation. The parallel methods 7 and 8 allow us to tap into these under-utilised resources. Instead of issuing 100,000 disk IO requests one after the other, these methods allow (say) 100 parallel threads to perform just 1000 sequential IO requests.

  • Method 8, which is the equivalent of running many concurrent versions of Method 4 with different data, is consistently faster than Oracle's Parallel DML. This is worth exploring. Since the non-parallel equivalents (Methods 4 and 6) show no significant performance difference, it is reasonable to expect that parallelising these two methods will yield similar results. I ran some traces (see Appendix 2) and found that the Parallel Merge was creating too many parallel threads and suffering from latch contention. Manually reducing the number of parallel threads made it perform similarly to the Parallel PL/SQL method. The lesson here is that too much parallelism is a bad thing.
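
Capping the degree of parallelism is just a matter of supplying an explicit degree in the hint. A sketch of the manually throttled MERGE - the degree of 8 is illustrative, not a recommendation:

```sql
-- Same statement as Method 7, but with the degree of parallelism fixed
-- instead of left to Oracle's default (which over-allocated slaves here)
MERGE /*+ first_rows parallel(test, 8) parallel(test5, 8) */ INTO test
USING test5 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill;
```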

ROUND 2

Let's see how a Foreign Key constraint affects things. For this round, I have created a parent table and a Foreign Key on the FK column.

For brevity, this time we'll just flush the buffer cache and run about 5 minutes worth of indexed reads to cycle the disk cache.

                                      RUN 1  RUN 2
-----------------------------------   -----  -----
1. Explicit Cursor Loop               887.1  874.6
2. Implicit Cursor Loop               967.0  752.1
3. UPDATE with nested SET subquery    920.1  795.2
4. BULK COLLECT / FORALL UPDATE       840.9  759.2
5. Updateable Join View               727.5  851.8
6. MERGE                              807.8  833.6
7. Parallel DML MERGE                  26.8   29.2
8. Parallel PL/SQL                     25.3   23.8

Summary of findings:

  • It looks as though there is a small premium associated with checking the foreign key, although it does not appear to be significant. It's worth noting that the parent table in this case is very small and quickly cached; a very large parent table would result in a considerably greater number of cache misses and resultant disk IO. Foreign keys are often blamed for bad performance; whilst they can be limiting in some circumstances (e.g. direct path loads), updates are not greatly affected when the parent tables are small.

  • I was expecting the Parallel DML MERGE to be slower. According to the Oracle® Database Data Warehousing Guide - 10g Release 2, INSERT and MERGE are "Not Parallelized" when issued against the child of a Foreign Key constraint, whereas parallel UPDATE is "supported". As a test, I issued a similar MERGE statement and redundantly included the WHEN NOT MATCHED THEN INSERT clause: it was not parallelized and ran slower. The lesson here: there may be merit in applying an upsert (insert else update) as an update-only MERGE followed by an INSERT instead of using a single MERGE.
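
A hedged sketch of that split upsert, using the same tables as these tests; the NOT EXISTS filter is one illustrative way to isolate the unmatched rows:

```sql
-- Step 1: update-only MERGE, which remains parallelisable under the FK
MERGE /*+ first_rows parallel(test) */ INTO test
USING test5 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill;

-- Step 2: separate INSERT of the rows the MERGE did not match
INSERT INTO test (pk, fk, fill)
SELECT new.pk, new.fk, new.fill
FROM   test5 new
WHERE  NOT EXISTS (
           SELECT NULL
           FROM   test
           WHERE  test.pk = new.pk
       );
```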

ROUND 3

The two things I hear most about Bitmap indexes are that:

  • They are inappropriate for tables that undergo concurrent updates, and
  • They are slow to update.

Surely no comparison of update methods could possibly be complete without a test of Bitmap index maintenance.

In this round, I have removed the Foreign Key used in Round 2 and included a Bitmap index on TEST.FK.

                                      RUN 1  RUN 2
-----------------------------------   -----  -----
1. Explicit Cursor Loop               826.0  951.2
2. Implicit Cursor Loop               898.7  877.2
3. UPDATE with nested SET subquery    588.9  633.4
4. BULK COLLECT / FORALL UPDATE       898.0  926.7
5. Updateable Join View               547.8  687.1
6. MERGE                              689.3  763.4
7. Parallel DML MERGE                  30.2   28.4
8. Parallel PL/SQL                    ORA-00060: deadlock detected

Well, if further proof were needed that Bitmap indexes are inappropriate for tables that are maintained by multiple concurrent sessions, surely this is it. The deadlock error raised by Method 8 occurred because bitmap indexes are locked at the block level, not the row level. With hundreds of rows represented by each block in the index, the chances of two sessions attempting to lock the same block are quite high. The very clear lesson here: don't update bitmap indexed tables in parallel sessions; the only safe parallel method is Parallel DML.

The other interesting outcome is the differing impact of the bitmap index on set-based updates vs transactional updates (SQL solutions vs PL/SQL solutions). PL/SQL solutions incur a penalty when updating bitmap indexed tables: a single bitmap index added around 10% to their overall runtime, whereas the set-based (SQL) solutions ran faster than in the B-Tree index case above. Although not shown here, this effect is magnified with each additional bitmap index. Given that most bitmap-indexed tables have several such indexes (bitmap indexes are designed to be of most use in combination), this makes PL/SQL virtually non-viable as a means of updating a large number of rows.

SUMMARY OF FINDINGS

  • Context Switches in cursor loops have greatest impact when data is well cached. For updates with buffer cache hit-ratio >99%, convert to BULK COLLECT or MERGE.
  • Use MERGE with a Hash Join when updating a significant proportion of blocks (not rows!) in a segment.
  • Parallelize large updates for a massive performance improvement.
  • Tune the number of parallel query servers by looking for latch contention and thread startup waits.
  • Don't rashly drop Foreign Keys without benchmarking; they may not be costing very much to maintain.
  • MERGE statements that UPDATE and INSERT cannot be parallelised when a Foreign Key is present. If you want to keep the Foreign Key, you will need to use multiple concurrent sessions (insert/update variant of Method 8) to achieve parallelism.
  • Don't use PL/SQL to maintain bitmap indexed tables; not even with BULK COLLECT / FORALL. Instead, INSERT transactions into a Global Temporary Table and apply a MERGE.
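
The last point, sketched out; the staging table name test_stage is an assumption for illustration:

```sql
-- Assumed Global Temporary Table to stage the transactional changes
CREATE GLOBAL TEMPORARY TABLE test_stage (
    pk   NUMBER
,   fk   NUMBER
,   fill VARCHAR2(40)
) ON COMMIT PRESERVE ROWS;

-- Transactions insert their changes here instead of updating TEST directly
INSERT INTO test_stage (pk, fk, fill) VALUES (:pk, :fk, :fill);

-- Then one set-based MERGE maintains the bitmap indexes in bulk
MERGE INTO test
USING test_stage new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill;
```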


APPENDIX 1 - Nested Loops MERGE vs. Hash Join MERGE

Although we are updating only 1% of the rows in the table, those rows are almost perfectly distributed throughout the table. As a result, we end up updating almost 100% of the blocks. This makes it a good candidate for hash joins and full scans to out-perform indexed nested loops. Of course, as you decrease the percentage of blocks updated, the balance will swing in favour of Nested Loops; but this trace demonstrates that MERGE definitely has its place in high-volume updates.

MERGE /*+ FIRST_ROWS*/ INTO test
USING test2 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill

-------------------------------------------------------------------------------
| Id  | Operation                      | Name    | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                |         | 95331 |  7261K|   191K  (1)|
|   1 |  MERGE                         | TEST    |       |       |            |
|   2 |   VIEW                         |         |       |       |            |
|   3 |    NESTED LOOPS                |         | 95331 |  8937K|   191K  (1)|
|   4 |     TABLE ACCESS FULL          | TEST2   | 95331 |  4468K|   170   (3)|
|   5 |     TABLE ACCESS BY INDEX ROWID| TEST    |     1 |    48 |     2   (0)|
|   6 |      INDEX UNIQUE SCAN         | TEST_PK |     1 |       |     1   (0)|
-------------------------------------------------------------------------------

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.01          0          4          1           0
Execute      1     57.67     829.77      95323     383225     533245      100000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2     57.68     829.78      95323     383229     533246      100000

Misses in library cache during parse: 1
Optimizer mode: FIRST_ROWS
Parsing user id: 140

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  MERGE  TEST (cr=383225 pr=95323 pw=0 time=127458586 us)
 100000   VIEW  (cr=371028 pr=75353 pw=0 time=619853020 us)
 100000    NESTED LOOPS  (cr=371028 pr=75353 pw=0 time=619653018 us)
 100000     TABLE ACCESS FULL TEST2 (cr=750 pr=386 pw=0 time=505310 us)
 100000     TABLE ACCESS BY INDEX ROWID TEST (cr=370278 pr=74967 pw=0 time=615942540 us)
 100000      INDEX UNIQUE SCAN TEST_PK (cr=200015 pr=227 pw=0 time=4528703 us)(object id 141439)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file scattered read                         37        0.20          0.72
  db file sequential read                     94936        0.39        781.52
  buffer exterminate                              1        0.97          0.97
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        0.05          0.05
********************************************************************************

MERGE INTO test
USING test2 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill

---------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)|
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT      |       | 95331 |  7261K|       | 46318   (3)|
|   1 |  MERGE               | TEST  |       |       |       |            |
|   2 |   VIEW               |       |       |       |       |            |
|   3 |    HASH JOIN         |       | 95331 |  8937K|  5592K| 46318   (3)|
|   4 |     TABLE ACCESS FULL| TEST2 | 95331 |  4468K|       |   170   (3)|
|   5 |     TABLE ACCESS FULL| TEST  |    10M|   458M|       | 16949   (4)|
---------------------------------------------------------------------------

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.03       0.40          1          4          1           0
Execute      1     54.50     123.48      94547      82411     533095      100000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2     54.53     123.88      94548      82415     533096      100000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 140

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  MERGE  TEST (cr=82411 pr=94547 pw=0 time=123480418 us)
 100000   VIEW  (cr=75424 pr=74949 pw=0 time=48081374 us)
 100000    HASH JOIN  (cr=75424 pr=74949 pw=0 time=47981370 us)
 100000     TABLE ACCESS FULL TEST2 (cr=750 pr=335 pw=0 time=1207771 us)
9999999     TABLE ACCESS FULL TEST (cr=74674 pr=74614 pw=0 time=10033917 us)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                     19606        0.37         41.24
  db file scattered read                       4720        0.52         34.20
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        0.03          0.03

That's a pretty significant difference: the same method (MERGE) is 6-7 times faster when performed as a Hash Join. Although the number of physical disk blocks and Current Mode Gets are about the same in each test, the Hash Join method performs multi-block reads, resulting in fewer visits to the disk.

All 8 methods above were benchmarked on the assumption that the target table is arbitrarily large and the subset of rows/blocks to be updated are relatively small. If the proportion of updated blocks increases, then the average cost of finding those rows decreases; the exercise becomes one of tuning the data access rather than tuning the update.

APPENDIX 2 - Parallel DML vs. PARALLEL PL/SQL

Why is the Parallel PL/SQL (Method 8) approach much faster than the Parallel DML MERGE (Method 7)? To shed some light, here are some traces. Below we see the trace from the Parallel Coordinator session of Method 7:

MERGE /*+ first_rows */ INTO test
USING test5 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.02       0.02          0          4          1           0
Execute      1      1.85      57.91          1          7          2      100000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      1.87      57.94          1         11          3      100000

Misses in library cache during parse: 1
Optimizer mode: FIRST_ROWS
Parsing user id: 140

Rows     Row Source Operation
-------  ---------------------------------------------------
    128  PX COORDINATOR  (cr=7 pr=1 pw=0 time=57912088 us)
      0   PX SEND QC (RANDOM) :TQ10002 (cr=0 pr=0 pw=0 time=0 us)
      0    INDEX MAINTENANCE  TEST (cr=0 pr=0 pw=0 time=0 us)(object id 0)
      0     PX RECEIVE  (cr=0 pr=0 pw=0 time=0 us)
      0      PX SEND RANGE :TQ10001 (cr=0 pr=0 pw=0 time=0 us)
      0       MERGE  TEST (cr=0 pr=0 pw=0 time=0 us)
      0        PX RECEIVE  (cr=0 pr=0 pw=0 time=0 us)
      0         PX SEND HYBRID (ROWID PKEY) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
      0          VIEW  (cr=0 pr=0 pw=0 time=0 us)
      0           NESTED LOOPS  (cr=0 pr=0 pw=0 time=0 us)
      0            PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0             TABLE ACCESS FULL TEST5 (cr=0 pr=0 pw=0 time=0 us)
      0            TABLE ACCESS BY INDEX ROWID TEST (cr=0 pr=0 pw=0 time=0 us)
      0             INDEX UNIQUE SCAN TEST_PK (cr=0 pr=0 pw=0 time=0 us)(object id 141439)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                         1        0.02          0.02
  reliable message                                1        0.00          0.00
  enq: RO - fast object reuse                     1        0.00          0.00
  os thread startup                             256        0.09         23.61
  PX Deq: Join ACK                                7        0.00          0.00
  PX Deq: Parse Reply                            15        0.09          0.19
  PX Deq Credit: send blkd                       35        0.00          0.00
  PX qref latch                                   5        0.00          0.00
  PX Deq: Execute Reply                        1141        1.96         30.30
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        0.05          0.05

We can see here that the Parallel Coordinator spent 23.61 seconds (of the 57.94 elapsed) simply starting up the parallel threads, and another 30.3 seconds waiting for them to do their stuff.

And here are the wait events for just ONE of the parallel threads from the same test case:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  cursor: pin S wait on X                         3        0.02          0.06
  PX Deq: Execution Msg                          16        1.96         10.94
  PX Deq: Msg Fragment                            2        0.00          0.00
  latch: parallel query alloc buffer              7        5.89          7.52
  db file sequential read                       825        0.10         12.00
  read by other session                          17        0.06          0.18
  log buffer space                                1        0.03          0.03
  PX Deq Credit: send blkd                        1        0.02          0.02
  PX Deq: Table Q Normal                         28        0.19          0.35
  latch: cache buffers chains                     1        0.01          0.01
  db file parallel read                           1        0.11          0.11

From this, we can see that of the 30.3 seconds the Coordinator spent waiting for the parallel threads, this one spent 7.52 seconds waiting for shared resources (latches) held by other parallel threads, and just 12 seconds reading blocks from disk.

For comparison, here is the trace of the Coordinator session of a Parallel PL/SQL run:

SELECT  sum(column_value)
FROM    TABLE(test_parallel_update(
                CURSOR(SELECT * FROM TEST7)
        ))

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.05       0.13          7         87          1           0
Execute      1      0.20      12.47          0          3          0           0
Fetch        2      0.21      20.13          0          0          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.46      32.74          7         90          1           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 140

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=3 pr=0 pw=0 time=32609316 us)
    128   PX COORDINATOR  (cr=3 pr=0 pw=0 time=252152371 us)
      0    PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
      0     SORT AGGREGATE (cr=0 pr=0 pw=0 time=0 us)
      0      VIEW  (cr=0 pr=0 pw=0 time=0 us)
      0       COLLECTION ITERATOR PICKLER FETCH TEST_PARALLEL_UPDATE (cr=0 pr=0 pw=0 time=0 us)
      0        PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0         TABLE ACCESS FULL TEST7 (cr=0 pr=0 pw=0 time=0 us)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  reliable message                                1        0.00          0.00
  enq: RO - fast object reuse                     1        0.00          0.00
  os thread startup                             128        0.10         11.85
  PX Deq: Join ACK                                4        0.00          0.00
  PX Deq: Parse Reply                            46        0.00          0.10
  PX Deq Credit: send blkd                      128        0.00          0.06
  SQL*Net message to client                       2        0.00          0.00
  PX Deq: Execute Reply                         143        1.96         19.86
  PX Deq: Signal ACK                              4        0.00          0.00
  enq: PS - contention                            2        0.00          0.00
  SQL*Net message from client                     2        0.03          0.06

And the wait events for a single parallel thread:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  PX Deq: Execution Msg                           8        0.13          0.22
  PX Deq: Msg Fragment                            1        0.00          0.00
  library cache load lock                         3        0.19          0.30
  db file sequential read                       872        0.11         16.47
  read by other session                          13        0.06          0.25
  latch: cache buffers chains                     3        0.01          0.02

The Parallel PL/SQL spent just 11.85 seconds starting parallel threads, compared to 23.61 seconds for PARALLEL DML. I noticed from the trace that PARALLEL DML used 256 parallel threads, whereas the PL/SQL method used just 128. Looking more closely at the trace files, I suspect that the PARALLEL DML used 128 readers and 128 writers, although it is hard to be sure. Whatever Oracle is doing here, there is certainly a significant cost to opening parallel threads.

Also, looking at the wait events for the Parallel PL/SQL slave thread, we see no evidence of resource contention as we did in the PARALLEL DML example.

In theory, we should be able to reduce the cost of thread startup and also reduce contention by reducing the number of parallel threads. Knowing from above that the parallel methods were 10-20 times faster than the non-parallel methods, I suspect that the benefits of parallelism diminish beyond about 32 parallel threads. In support of that theory, here is a trace of a PARALLEL DML test case with 32 parallel threads:

First, the Parallel Coordinator:

MERGE /*+ first_rows parallel(test5 32) parallel(test 32) */ INTO test
USING test5 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk
,       fill = new.fill

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.02       0.02          0          4          1           0
Execute      1      0.55      31.14          0          7          2      100000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.57      31.17          0         11          3      100000

Misses in library cache during parse: 1
Optimizer mode: FIRST_ROWS
Parsing user id: 140

Rows     Row Source Operation
-------  ---------------------------------------------------
     32  PX COORDINATOR  (cr=7 pr=0 pw=0 time=30266841 us)
      0   PX SEND QC (RANDOM) :TQ10002 (cr=0 pr=0 pw=0 time=0 us)
      0    INDEX MAINTENANCE  TEST (cr=0 pr=0 pw=0 time=0 us)(object id 0)
      0     PX RECEIVE  (cr=0 pr=0 pw=0 time=0 us)
      0      PX SEND RANGE :TQ10001 (cr=0 pr=0 pw=0 time=0 us)
      0       MERGE  TEST (cr=0 pr=0 pw=0 time=0 us)
      0        PX RECEIVE  (cr=0 pr=0 pw=0 time=0 us)
      0         PX SEND HYBRID (ROWID PKEY) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
      0          VIEW  (cr=0 pr=0 pw=0 time=0 us)
      0           NESTED LOOPS  (cr=0 pr=0 pw=0 time=0 us)
      0            PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0             TABLE ACCESS FULL TEST5 (cr=0 pr=0 pw=0 time=0 us)
      0            TABLE ACCESS BY INDEX ROWID TEST (cr=0 pr=0 pw=0 time=0 us)
      0             INDEX UNIQUE SCAN TEST_PK (cr=0 pr=0 pw=0 time=0 us)(object id 141439)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  reliable message                                1        0.00          0.00
  enq: RO - fast object reuse                     1        0.00          0.00
  os thread startup                              64        0.09          5.89
  PX Deq: Join ACK                                9        0.00          0.00
  PX Deq: Parse Reply                            18        0.04          0.06
  PX Deq: Execute Reply                         891        1.78         24.18
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        0.03          0.03

And the wait events for one of those threads:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  cursor: pin S wait on X                         1        0.02          0.02
  PX Deq: Execution Msg                          34        0.11          0.16
  db file sequential read                      2419        0.30         22.21
  read by other session                           5        0.02          0.05
  buffer busy waits                               1        0.00          0.00
  PX Deq Credit: send blkd                        1        0.00          0.00
  PX Deq: Table Q Normal                          4        0.18          0.19

Note that in this case:

  • OS thread startup of just 5.89 seconds
  • No more resource contention
  • Performance now in line with that of the PARALLEL PL/SQL solution
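As an aside, one setup step is worth making explicit because Oracle fails silently without it: parallel DML is disabled by default, and a PARALLEL hint alone only parallelises the read side of the statement. For the PARALLEL DML test cases, the session needs something like the following (a sketch using the article's tables):

```sql
-- Without this, the MERGE below would read TEST5 in parallel but
-- perform the update serially through the Coordinator.
ALTER SESSION ENABLE PARALLEL DML;

MERGE /*+ parallel(test 32) parallel(test5 32) */ INTO test
USING test5 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk   = new.fk
,       fill = new.fill;

-- After parallel DML, the session must commit (or roll back) before
-- it can query the modified table again.
COMMIT;
```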

Comments

I find this a very good article and I would like to explore some of the results of the tests. Can you provide the code used for the tests? For example, when I try to replicate method 8, the function won't compile because the return type (test_num_arr) is not defined. I'm a little unclear what this is and how I should define it.

Thanks,
Jesse

I don't have the original code, but TEST_NUM_ARR is just a Nested Table of integers:

CREATE OR REPLACE TYPE test_num_arr AS TABLE OF INTEGER;

Should do the trick, I think.
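For anyone else trying to reproduce Method 8, here is a minimal sketch of the kind of parallel-enabled pipelined function the trace above implies. The table and column names (TEST, TEST7, pk, fk, fill) follow the article; the body is a reconstruction under those assumptions, not the original code:

```sql
CREATE OR REPLACE TYPE test_num_arr AS TABLE OF INTEGER;
/

CREATE OR REPLACE FUNCTION test_parallel_update (
        test_cur IN SYS_REFCURSOR
) RETURN test_num_arr
PARALLEL_ENABLE (PARTITION test_cur BY ANY)  -- lets Oracle run N copies, each fed a slice of the cursor
PIPELINED
IS
        PRAGMA AUTONOMOUS_TRANSACTION;  -- DML is not allowed in a function called from SELECT
                                        -- unless it runs in its own transaction
        test_rec test7%ROWTYPE;
        num      INTEGER := 0;
BEGIN
        LOOP
                FETCH test_cur INTO test_rec;
                EXIT WHEN test_cur%NOTFOUND;
                UPDATE test
                SET    fk   = test_rec.fk
                ,      fill = test_rec.fill
                WHERE  pk   = test_rec.pk;
                num := num + 1;
        END LOOP;
        CLOSE test_cur;
        COMMIT;          -- required before the autonomous transaction ends
        PIPE ROW (num);  -- each parallel slave returns its update count
        RETURN;
END;
/
```

It is then invoked via a parallel query, as in the traced statement, so that the SUM(column_value) totals the update counts returned by the slaves:

```sql
SELECT  sum(column_value)
FROM    TABLE(test_parallel_update(
                CURSOR(SELECT * FROM test7)
        ));
```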