Jonathan Lewis
Merge emergency
When using the merge command to process small amounts of data be careful that you don’t manage to enable parallel DML or you may find that you add a lot of empty space to the object. This could become very wasteful if you were running a loop that processed a few rows on every iteration.
Explanation
A poster on one of the Oracle forums recently complained that their “Oracle Always Free Database in the Cloud” was using up the available space at a ridiculous rate and showed us some SQL they were using to populate a table, and some figures about the database space allocation and usage after their code had inserted just 20,000 fairly ordinary-sized rows.
Looking at the numbers I noticed a strange coincidence and invented a slightly silly hypothesis (so I thought) of how some code might be causing something strange to happen. Then I thought – what the heck, it should only take about 5 minutes to model that hypothesis. So here’s a model, running on an instance of 19.11 and a tablespace using automatic segment space management (ASSM) with uniform 1MB extents:
rem
rem Script: merge_parallel_3.sql
rem Author: Jonathan Lewis
rem Dated: Sept 2025
rem Purpose:
rem
rem Last tested
rem 19.11.0.0
rem To be tested
rem 23.6(ai)
rem 23.3.0.0
rem 12.2.0.1
rem
alter session enable parallel dml;
create table t1(n1 number)
parallel(degree 2)
/
begin
<<loop>>
for i in 1..20 loop
merge into t1
using dual
on (t1.n1 = loop.i)
when not matched then
insert (n1) values (loop.i);
commit;
end loop;
end;
/
execute dbms_stats.gather_table_stats(user,'t1')
The question: after inserting 20 rows (using this merge command), how much space has been allocated to the table? Here are three calls to check what space has been used, followed by the results:
select segment_name, extents, blocks
from user_segments
where segment_name = 'T1'
/
select table_name, num_rows, blocks
from user_tables
where table_name = 'T1'
/
@dbms_space_use_assm_embedded test_user t1 table
The last call is to a variant of a script I published (and updated) several years ago that calls the dbms_space package to report the level of fullness of blocks in an ASSM segment.
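If you don’t have a copy of that script to hand, here’s a minimal sketch (my own illustration, not the published script) of the underlying dbms_space.space_usage() call that it wraps, run against the t1 table above:
set serveroutput on
declare
        v_unf   number;  v_unf_b  number;
        v_fs1   number;  v_fs1_b  number;
        v_fs2   number;  v_fs2_b  number;
        v_fs3   number;  v_fs3_b  number;
        v_fs4   number;  v_fs4_b  number;
        v_full  number;  v_full_b number;
begin
        dbms_space.space_usage(
                segment_owner      => user,
                segment_name       => 'T1',
                segment_type       => 'TABLE',
                unformatted_blocks => v_unf,  unformatted_bytes => v_unf_b,
                fs1_blocks         => v_fs1,  fs1_bytes         => v_fs1_b,
                fs2_blocks         => v_fs2,  fs2_bytes         => v_fs2_b,
                fs3_blocks         => v_fs3,  fs3_bytes         => v_fs3_b,
                fs4_blocks         => v_fs4,  fs4_bytes         => v_fs4_b,
                full_blocks        => v_full, full_bytes        => v_full_b
        );
        dbms_output.put_line('Full        : ' || v_full || ' / ' || v_full_b);
        dbms_output.put_line('Unformatted : ' || v_unf  || ' / ' || v_unf_b);
end;
/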
SEGMENT_NAME EXTENTS BLOCKS
------------------------- ---------- ----------
T1 21 2688
1 row selected.
TABLE_NAME NUM_ROWS BLOCKS
------------------------- ---------- ----------
T1 20 2684
1 row selected.
Full : 20 / 163,840
Freespace 1 ( 0 - 25% free) : 0 / 0
Freespace 2 ( 25 - 50% free) : 0 / 0
Freespace 3 ( 50 - 75% free) : 0 / 0
Freespace 4 ( 75 - 100% free) : 0 / 0
Freespace 5 ( Empty ) : 0 / 0
Unformatted : 2,564 / 21,004,288
PL/SQL procedure successfully completed.
Segment Total blocks: 2,688
Object Unused blocks: 0
HWM extent: file id: 36
HWM extent: block id: 2,688
HWM extent: high block: 128
PL/SQL procedure successfully completed.
Twenty rows, twenty-one extents of 1MB each: every time I’ve inserted a row Oracle has added another extent to the table before inserting the row. This is an unfortunate side effect of combining parallel DML with a low-volume merge. With a parallel-enabled “insert as select” each PX process wants to use its own extent to insert a subset of data, so each insert is going to request (at least) one more extent to be added to the segment. It looks as if the parallel merge uses the core of the same algorithm, with the result that we add at least one extent every time we merge, and if we’re only merging a tiny number of rows then we waste a lot of space.
You’ll notice, by the way, that the segment consists of 2,688 blocks, of which 20 are full and 2,564 are unformatted – leaving 104 blocks “unaccounted”: that’s 5 blocks per extent that the PX process sets up because it “thinks” that it’s starting a whole new segment. From a data block dump of the first 5 blocks of each extent we can see that these blocks are, in order:
type: 0x20=FIRST LEVEL BITMAP BLOCK
type: 0x20=FIRST LEVEL BITMAP BLOCK
type: 0x21=SECOND LEVEL BITMAP BLOCK
type: 0x22=THIRD LEVEL BITMAP BLOCK
type: 0x23=PAGETABLE SEGMENT HEADER
The 3rd level bitmap (L3 bitmap) is an unexpected occurrence. It’s something you would normally expect to see only in extremely large objects, but presumably it appears here as a defensive strategy for parallel processing. (For less extreme data volumes the segment header blocks take on the role of the first L3 bitmap block.)
Footnote
This phenomenon was one that Christo Kutrovsky pointed out to me nearly 12 years ago. It was only after I had described my thought experiment (and the explanation wasn’t entirely correct) that I remembered parallel execution adding extra extents, searched my library for any scripts including both the words “merge” and “parallel”, and rediscovered the notes we had made in 2014 on what was going on.
Footnote 2
The notes I found in the 2014 scripts included some comments about autoallocate extents rather than uniform extents, and how “extent trimming” would reduce any extra extent to the smallest multiple of 64KB, but also pointed out that this might not avoid the wastage since Oracle could only use the rest of the next 960KB if it actually wanted to allocate further extents of 64KB.
Update (already)
Since I had a model that I could re-run I decided to spend a few minutes checking the effect of using the system/autoallocate option for extent sizing (which means, in general, Oracle will start at 64KB extents, then jump to 1MB, then 8MB, then 64MB extents). All I had to do was create a suitable tablespace, grant myself a quota on it, and reference it in my “create table” statement. Here are the results from the last three calls to the database:
SEGMENT_NAME EXTENTS BLOCKS
------------------------- ---------- ----------
T1 21 168
1 row selected.
TABLE_NAME NUM_ROWS BLOCKS
------------------------- ---------- ----------
T1 20 168
1 row selected.
Full : 20 / 163,840
Freespace 1 ( 0 - 25% free) : 0 / 0
Freespace 2 ( 25 - 50% free) : 0 / 0
Freespace 3 ( 50 - 75% free) : 0 / 0
Freespace 4 ( 75 - 100% free) : 0 / 0
Freespace 5 ( Empty ) : 45 / 368,640
Unformatted : 0 / 0
PL/SQL procedure successfully completed.
Segment Total blocks: 168
Object Unused blocks: 0
HWM extent: file id: 14
HWM extent: block id: 2,688
HWM extent: high block: 8
PL/SQL procedure successfully completed.
A little detailed checking (but not using the 10391 trace) showed that the PX processes appeared to have allocated 1MB extents *** which were then trimmed back to 64KB, returning 15/16ths of that megabyte to the tablespace free space (where it would be virtually unusable). At the end of the test the highwater mark (HWM) for the segment was in the same block (of a different file number, of course); the table was growing by 1MB extents on the same 1MB boundaries, then trimming them back to 64KB for each row inserted, so although the table/segment reported itself as holding far fewer blocks it was still (effectively) denying use of the same amount of space to the database – unless you were planning to create lots of very small tables.
*** The difference in figures between my results and the (larger scale) results from the OP, which suggested something like 16 blocks “used” per row, may be a hint that their Cloud database was allocating 8MB extents (which would create 16 L1 bitmap blocks) before trimming them back to 1MB.
Final update
The OP had worked around this unexpected behaviour in their Python code by including the call:
cur.execute("ALTER SESSION DISABLE PARALLEL DML")
A final comment from Chris Saxon of Oracle Corp., however, suggested that the underlying “parallel DML enabled” could have been due to the choice of service that the OP was using to connect to the database. In Autonomous Database some of the services enable parallel DML by default (notwithstanding the long-standing statement from Oracle that “parallel DML is not enabled by default because of side effects, including deadlocks”). The link to the manuals supplied by Chris points out that in 23ai you no longer need to commit or rollback before you can access a table that your parallel DML has modified.
The HIGH and MEDIUM services enable parallel DML by default, the LOW, TP, and TP_URGENT services do not. (And the same footnote in the manuals points out that the DW (presumably HIGH and MEDIUM) services will ignore your parallel hints and do their own thing!)
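As a quick sanity check of what your own session is doing, you can query v$session for its parallel-execution status; this is a minimal sketch of my own, not something from the original thread:
select  pdml_status, pddl_status, pq_status
from    v$session
where   sid = sys_context('userenv','sid')
/
If pdml_status reports ENABLED when you haven’t issued “alter session enable parallel dml” yourself then something (a service setting, a logon trigger, the client tool) has done it for you.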
rowlen surprise
Here’s a little detail about Oracle’s calculation of user_tables.avg_row_len that I hadn’t noticed before – and I really should have noticed it years ago, so maybe I’m the only person who’s surprised by it.
The figure isn’t necessarily telling you about the number of bytes stored in the row.
We’ve seen plenty of oddities relating to LOBs and LOB locators, of course, and there are little glitches due to chained or migrated rows, but in this note I’m thinking only of columns that are simple data types and complete rows that are single row pieces.
From time to time people ask about estimating the “correct” size for a table – usually in the context of segments that seem to be much bigger than they should be and space that has gone “missing”. The usual response to such a request is to suggest dividing the block size (minus a couple of hundred bytes for overheads) by the avg_row_len then adjusting for the pctfree to get a typical “rows per block” estimate. Dividing the user_tables.num_rows by this “rows per block” gives you the number of blocks needed to hold the table data; add a couple of percent to cater for space management bitmap blocks and there’s your (rough) estimate.
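Here’s a minimal sketch of that arithmetic as a single query against the data dictionary; it assumes an 8KB block size, roughly 200 bytes of block overhead, and the “couple of percent” allowance for bitmap blocks mentioned above:
select
        table_name,
        num_rows,
        blocks,
        ceil(
                1.02 * num_rows /
                floor((8192 - 200) * (100 - pct_free) / 100 / avg_row_len)
        )       estimated_blocks
from    user_tables
where   table_name = 'T1'
/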
Responding to a question on one of the Oracle forums I suddenly realised that this is not quite correct – and here’s a little script (tested, so far, on 19.11 [ed: and 23.9 by a reader]) to demonstrate why not.
rem
rem Script: rowlen_oddity.sql
rem Author: Jonathan Lewis
rem Dated: Sept 2025
rem Purpose:
rem
rem Last tested
rem 23.9.0.25.7
rem 19.11.0.0
rem To be tested
rem
create table t1
as
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
rownum id,
rownum + 0.25 id2,
rownum + 0.50 id3,
lpad(rownum,10,'0') v1,
lpad('x',30,'x') padding
from
generator v1,
generator v2
where
rownum <= 1e4 -- > comment to avoid WordPress format issue
;
spool rowlen_oddity.lst
column table_name format a12
column column_name format a32
column data_default format a40
set linesize 132
set feedback 6
prompt ====================
prompt Table after creation
prompt ====================
select table_name, avg_row_len from user_tables where table_name = 'T1';
prompt ==============================
prompt Effects of adding column group
prompt ==============================
execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for columns (id, id2) size 1')
select column_name, internal_column_id, avg_col_len
from user_tab_cols
where table_name = 'T1'
and column_name like 'SYS_STU%'
order by
internal_column_id;
select table_name, avg_row_len from user_tables where table_name = 'T1';
prompt ======================================
prompt Effects of adding other extended stats
prompt ======================================
execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for columns (v1 || id) size 1')
select column_name, internal_column_id, avg_col_len
from user_tab_cols
where table_name = 'T1'
and column_name like 'SYS_STU%'
order by
internal_column_id;
select table_name, avg_row_len from user_tables where table_name = 'T1';
prompt ===========================================
prompt Effects of adding a "proper" virtual column
prompt ===========================================
alter table t1 add sum_column invisible generated always as (id + id2 + id3) virtual;
execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')
select column_name, internal_column_id, avg_col_len
from user_tab_cols
where table_name = 'T1'
and (column_name like 'SYS_STU%' or column_name = 'SUM_COLUMN')
order by
internal_column_id;
select table_name, avg_row_len from user_tables where table_name = 'T1';
prompt ==========================================
prompt Effects of adding a "function-based index"
prompt ==========================================
create index t1_i1 on t1(upper(v1));
execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')
select column_name, internal_column_id, avg_col_len , data_default
from user_tab_cols
where table_name = 'T1'
order by
internal_column_id;
select table_name, avg_row_len from user_tables where table_name = 'T1';
All I’ve done is create a simple table with a few numeric columns and a couple of character columns, then I’ve reported the average row length for that table. After that I’ve added a column group (created implicitly by the method_opt clause in the dbms_stats.gather_table_stats call), a second set of extended stats for a column concatenation, a declared virtual column, and a function-based index.
After each addition to the table I’ve reported the average column length for the column I’ve added and the table’s average row length. Here’s an extract from the output showing just the row lengths (with a note of the activity up to that point):
SQL> host grep T1 rowlen_oddity.lst
T1 56 -- table created
T1 68 -- column group added
T1 83 -- extended stats added
T1 89 -- virtual column added
T1 100 -- function-based index added
Every time I’ve done something to introduce some new statistic Oracle has stored metadata in the data dictionary but no actual data in the table – nevertheless user_tables.avg_row_len tells us that the row gets longer each step we take. As a quick cross check, here’s the full list of column names with their “internal” column id and nominal (though never stored) length:
COLUMN_NAME INTERNAL_COLUMN_ID AVG_COL_LEN DATA_DEFAULT
-------------------------------- ------------------ ----------- ----------------------------------------
ID 1 4
ID2 2 5
ID3 3 5
V1 4 11
PADDING 5 31
SYS_STUEOQD16MRLF_3QF0IRTJ0NSQ 6 12 SYS_OP_COMBINED_HASH("ID","ID2")
SYS_STUUW7XIA$3151$$167LVPSG8U 7 15 "V1"||TO_CHAR("ID")
SUM_COLUMN 8 6 "ID"+"ID2"+"ID3"
SYS_NC00009$ 9 11 UPPER("V1")
The last 4 rows are the metadata entries in the order we generated them – and if you check their column lengths against the changes in the average row length you’ll see the exact correspondence.
The moral of the blog note is this: if you’re trying to estimate how much space your table should use by multiplying the row length by the number of rows, check if any of its columns are metadata-only and subtract their avg_col_len from the table’s avg_row_len before doing any other arithmetic.
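Here’s a minimal sketch of that adjustment, assuming (as was the case in this demo) that every metadata-only column shows up in user_tab_cols with virtual_column = 'YES':
select
        table_name,
        avg_row_len,
        avg_row_len - (
                select  nvl(sum(avg_col_len),0)
                from    user_tab_cols
                where   table_name = 'T1'
                and     virtual_column = 'YES'
        )       adjusted_row_len
from    user_tables
where   table_name = 'T1'
/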
SQL Developer and CBO
In recent posts I’ve started with the conclusion so that you can decide very quickly whether or not the article is likely to be relevant to your current requirements.
SQL Developer (I have been told) always uses the character data type to pass bind variables to the database and leaves it up to the database/optimizer to deal with any conversions it might need. If the code is then used with the correct data type for the bind variables the optimizer will probably produce different cardinality estimates for exactly the same code, and changes in cardinality estimates may result in changes in execution plans.
Even when plans with the correct bind variable datatypes match plans with the character datatype they may still perform differently for “exactly the same” inputs. (On the plus side, the plans with the correct datatype are more likely to have the better performance.)
When checking execution plans from memory (dbms_xplan.display_cursor()) always check the Predicate Information section; there is also a peeked_binds option to the format call that makes differences in bind datatypes very obvious.
Preamble
At the POUG2025 (Polish Oracle User Group) conference earlier this month, I attended a presentation by Monika Lewandowska on the special treatment Oracle gives the nvl() operator. While showing us an execution plan that demonstrated an important point, Monika mentioned in passing that she hadn’t yet worked out why one of the cardinality estimates (E-rows) showed the value it did.
I looked at the plan and, since we’d been given all the necessary details of the data, found that I could state very confidently that I didn’t know why the number had appeared either.
After getting home, building a smaller model of the dataset, and exchanging a couple of emails with Monika I discovered the (horrible) answer. If you want to build the model, the script is at the end of this note, I’m just going to show you the query I executed (SQL*Plus, 19.11 and 23.6) and two (different) execution plans:
select /* PIDN */
count(*),
max(product_name) max_name
from prod
where product_id = nvl(:nProdId, product_id)
/
The table prod holds 10M rows, the value of column product_id is 1 for half the rows, and unique even numbers for the other half. There is a hybrid histogram on product_id, so the optimizer is (approximately, of course) aware of the data pattern, and product_id is declared not null.
You’ll notice that the query includes a bind variable :nProdID, and a call to nvl() on that bind variable. Here’s the plan, pulled from memory by a call to dbms_xplan.display_cursor(), when I supplied NULL for that variable:
---------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:08.25 | 29717 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:08.25 | 29717 |
| 2 | VIEW | VW_ORE_1807A7AF | 1 | 10M| 10M|00:00:07.09 | 29717 |
| 3 | UNION-ALL | | 1 | 10M| 10M|00:00:05.54 | 29717 |
|* 4 | FILTER | | 1 | | 0 |00:00:00.01 | 0 |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED| PROD | 0 | 1 | 0 |00:00:00.01 | 0 |
|* 6 | INDEX RANGE SCAN | PROD_I1 | 0 | 1 | 0 |00:00:00.01 | 0 |
|* 7 | FILTER | | 1 | | 10M|00:00:02.64 | 29717 |
| 8 | TABLE ACCESS FULL | PROD | 1 | 10M| 10M|00:00:01.13 | 29717 |
---------------------------------------------------------------------------------------------------------------------
(If you’re not familiar with the transformation that the optimizer has applied to the original query then you may want to read this note on OR-expansion of nvl() predicates, or this more recent note on the topic by Monika.)
The plan is exactly what I expected, which didn’t correspond to the cardinality estimate that Monika’s data had shown. So, after an exchange of email, I repeated the test but switched to “Monika-mode” to get the following plan (which is only slightly different):
---------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:10.04 | 29717 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:10.04 | 29717 |
| 2 | VIEW | VW_ORE_1807A7AF | 1 | 2478K| 10M|00:00:08.87 | 29717 |
| 3 | UNION-ALL | | 1 | 2478K| 10M|00:00:07.33 | 29717 |
|* 4 | FILTER | | 1 | | 0 |00:00:00.01 | 0 |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED| PROD | 0 | 1 | 0 |00:00:00.01 | 0 |
|* 6 | INDEX RANGE SCAN | PROD_I1 | 0 | 1 | 0 |00:00:00.01 | 0 |
|* 7 | FILTER | | 1 | | 10M|00:00:04.42 | 29717 |
|* 8 | TABLE ACCESS FULL | PROD | 1 | 2478K| 10M|00:00:02.89 | 29717 |
---------------------------------------------------------------------------------------------------------------------
Note how the cardinality estimate (E-Rows) for operation 8 has dropped to roughly a quarter of the estimate shown in the previous plan – and that 1/4 is the effect that Monika had shown in the original presentation. What’s going on?
The correct response you should give at this point is: “Where’s the Predicate Information?”
If you translated “product_id = nvl(:nprodid, product_id)” into a procedural form you might write it as:
if :nprodid is null then
        product_id = product_id
else
        product_id = :nprodid
end if
So my first thought about the difference between the two results was that Monika was running a version of Oracle that actually generated the product_id = product_id predicate then did some arithmetic that behaved as if it was comparing two different columns. That would explain why we both saw the estimated cardinality as (very close to) 1/4 of the expected value.
You could check my claim about the 1/4 estimate by creating the table with a duplicate of the product_id column, including its histogram, and checking the estimate for the predicate product_id = product_id_dup. Roughly speaking: because the histogram tells the optimizer that half the data has the value 1 and the other half has no repeats then in the Cartesian product half the data from the “left hand table” will match half the data of the “right hand table” – and that “half of a half” is where the 1/4 drops out of the arithmetic.
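Here’s a minimal sketch of that cross-check, assuming the prod table from the model at the end of this note; the names prod2 and product_id_dup are purely illustrative:
create table prod2 as
select  p.*, p.product_id product_id_dup
from    prod p
/
execute dbms_stats.gather_table_stats(user,'prod2',method_opt=>'for all columns size skewonly')
explain plan for
select  count(*)
from    prod2
where   product_id = product_id_dup
/
select * from table(dbms_xplan.display);
If the claim is right, the Rows estimate for the full tablescan of prod2 should come out at roughly a quarter of the 10M rows in the table.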
The email we exchanged was about the Oracle version, the predicate information and the client application.
- We were using different versions of Oracle – both 19c but different RUs
- We were using different applications – I was using SQL*Plus, she was using SQL Developer
- The two sets of predicate information did not match; here they are (based on my sample data):
Predicate Information (identified by operation id): (SQL*Plus)
---------------------------------------------------
4 - filter(:NPRODID IS NOT NULL)
6 - access("PRODUCT_ID"=:NPRODID)
7 - filter(:NPRODID IS NULL)
Predicate Information (identified by operation id): ("SQL Developer mode")
---------------------------------------------------
4 - filter(:NPRODID IS NOT NULL)
6 - access("PRODUCT_ID"=TO_NUMBER(:NPRODID))
7 - filter(:NPRODID IS NULL)
8 - filter("PRODUCT_ID"=TO_NUMBER(TO_CHAR("PRODUCT_ID")))
My guess wasn’t right, but it was in the right ballpark. The problem wasn’t a strange optimizer bug dependent on the version of Oracle, but it was about the arithmetic of a predicate very similar to “product_id = product_id”.
That predicate had appeared because (I have been told) SQL Developer always uses a character datatype to supply bind variables to the server and lets the server sort out any problems of coercion; so the “Monika-mode” hack I had used from SQL*Plus was to declare variable nProdId varchar2(30) rather than variable nProdId number.
Because :nprodid was a character bind variable the mechanism the optimizer had to use to deal with the predicate product_id = nvl(:nprodid, product_id) was to convert the right-hand expression to a number, producing the predicate product_id = to_number(nvl(:nprodid, product_id)), except you can’t apply to_number() to a number, so the optimizer had to convert the numeric column inside the nvl() to a character, leading to the rewritten predicate product_id = to_number(nvl(:nprodid, to_char(product_id))) before the _or_expand_nvl_predicate transformation is applied.
If you want to confirm that Oracle goes through this intermediate step you can generate the CBO (10053) trace, and you will find in it the text (reformatted here for readability):
Transformed predicate
"PROD"."PRODUCT_ID"=TO_NUMBER(NVL(:B1,TO_CHAR("PROD"."PRODUCT_ID")))
to
:B1 IS NOT NULL AND "PROD"."PRODUCT_ID"=TO_NUMBER(:B2)
OR :B3 IS NULL AND "PROD"."PRODUCT_ID"=TO_NUMBER(TO_CHAR("PROD"."PRODUCT_ID"))
As for the arithmetic, there are some functions where function(column) is treated as “unknown, unpeekable value” leading to a fixed selectivity guess and there are some functions where the selectivity of function(column) is assumed to be the same as the selectivity of the underlying column – in this case the optimizer does the predicate arithmetic of product_id = product_id without realising that it’s (effectively) calculating for a predicate that it would normally eliminate.
Consequences
Changes in cardinality estimates can lead to changes in execution plans. This was a very simple query with a very obvious change in estimate, but no change in plan; in production databases the variation in estimate may be far less obvious. If you use SQL Developer to create and test SQL and then embed the SQL in some other tool (with the correct bind declarations) for production execution you may have difficulty working out why “it works okay in test but picks the wrong plan in production”.
There are variations on this theme; the two plans I’ve shown have the same plan hash value – but do different amounts of work because of the calls to conversion functions. They would also appear as two separate child cursors if one session was using character binds and the other was using numeric binds. Have you ever heard the complaint: “This query runs faster for userX than userY, even though the plan is the same and they’re both using the same {some column name/description}”? Maybe they’re using different tools to run the query.
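If you want to check whether that’s what has happened, a minimal sketch of the test is to look for a bind mismatch across the child cursors; the sql_id here is just a placeholder for the statement you’re investigating:
select  sql_id, child_number, bind_mismatch
from    v$sql_shared_cursor
where   sql_id = '&m_sql_id'
order by
        child_number
/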
This is just another reason for getting into the habit of: “always check the Predicate Information”. In fact this prompts me to add a comment about including the “peeked_binds” format option in the call to dbms_xplan.display_cursor() et al., which would give you a very clear declaration of the source of the problem (if any peeked binds appear). Here are the two sections from my demo:
Peeked Binds (identified by position):
--------------------------------------
1 - :1 (NUMBER): (null)
Peeked Binds (identified by position):
--------------------------------------
1 - :1 (VARCHAR2(30), CSID=873): (null)
While SQL Developer makes it a little awkward to test your SQL with numeric bind variables, SQL*Plus has a similar limitation with date bind variables. Although, as we saw above, you need only write variable n1 number in SQL*Plus to create a numeric bind variable, the equivalent variable d1 date results in the following response (in 23.6):
Usage: VAR[IABLE] [ <variable> [ NUMBER | CHAR | CHAR (n [CHAR|BYTE]) |
VARCHAR2 (n [CHAR|BYTE]) | NCHAR | NCHAR (n) |
NVARCHAR2 (n) | CLOB | NCLOB | BLOB | BFILE |
REFCURSOR | BINARY_FLOAT | BINARY_DOUBLE |
BOOLEAN | VECTOR] ]
If you want to test your code to generate execution plans with date bind variables (from SQL*Plus) you’ll have to create a PL/SQL block that defines a variable, populates it, then does something that forces the statement to be parsed.
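Here’s a minimal sketch of that workaround; the table and date column (all_objects and created) are just convenient stand-ins, the point is that the query embedded in the block is parsed with d1 as a genuine date bind:
set serveroutput on
declare
        d1              date := sysdate - 7;
        v_count         number;
begin
        select  count(*)
        into    v_count
        from    all_objects
        where   created > d1;
        dbms_output.put_line('Objects created in the last 7 days: ' || v_count);
end;
/
You can then find the child cursor for the embedded query in v$sql and report its plan from memory in the usual way.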
Addendum 2
The link I posted to Bluesky publicising this note has generated a number of responses, covering SQL Developer for VS Code, “runscript”, and SQLcl (so far). I’ll draw together all the bits some time in the next few days; meanwhile, following an initial response about “Run Script (F5)”, Monika sent me an email about that feature of SQL Developer and I’ll just quote it here (with the reminder that SQL Developer is constantly being enhanced, so the following may not be true for every version):
If you run this:
variable nProdId number
exec :nProdId := null
select /* PIDN */
count(*),
max(product_name) max_name
from prod
where product_id = nvl(:nProdId, product_id)
/
as a script (F5) in SQL Developer it sends the bind as a number. Unfortunately I work with SQL Developer for VS Code and here it doesn’t work that way.
The model
If you want to build the data set, here is the test script:
rem
rem Script: monika_nvl_3a.sql
rem Author: Jonathan Lewis
rem Dated: Sept 2025
rem Purpose:
rem
rem Last tested
rem 23.6.0.0
rem 19.11.0.0
drop table prod purge;
set linesize 132 pagesize 120 trimspool on tab off
execute dbms_random.seed(0)
prompt ====================================
prompt create table PROD: 50 * 200,000 rows
prompt Half the rows have product_id = 1
prompt The rest hold distinct even numbers
prompt ====================================
create table prod
as
with data as (
select rownum rn
from dual
connect by
level <= 2e5
),
products as (
select 'Product ' || to_char(rownum,'FM000') product_name
from dual
connect by
level <= 50
)
select
case when mod(rownum,2) = 1 then 1 else rownum end product_id,
prd.product_name
from
products prd,
data dat
/
alter table prod modify (product_id not null);
create index prod_i1 on prod(product_id);
spool monika_nvl_3a.lst
select *
from (
select
product_id,
count(*) product_count,
sum(count(*)) over () total_rows
from
prod
group by
product_id
order by
product_id
)
where
rownum < 11
/
select count(*) from prod;
prompt ======================
prompt Collecting histograms
prompt ======================
exec dbms_stats.gather_table_stats(user, 'prod', method_opt => 'for all columns size skewonly')
select column_name, histogram, num_buckets
from user_tab_cols
where table_name = 'PROD'
/
alter system flush shared_pool;
alter session set statistics_level = all;
set timing on
set serveroutput off
column max_name format a32
prompt =========================================================
prompt Test for NULL (product not supplied): nProdId as number
prompt E-Rows is 10M, time to run ca. 9 seconds.
prompt Note the simple filter predicate, and the absence of a
prompt predicate for op. 8 when product_id is declared not null.
prompt If not so declared we'd see "product_id is not null"
prompt ===========================================================
variable nProdId number
exec :nProdId := null
select /* PIDN */
count(*),
max(product_name) max_name
from prod
where product_id = nvl(:nProdId, product_id)
/
select *
from dbms_xplan.display_cursor( format=> 'allstats last peeked_binds')
/
prompt =========================================================
prompt Test for NULL (product not supplied): nProdId as varchar2
prompt E-Rows is ca. 2.5M, time to run ca. 11 seconds.
prompt Note the more complex filter predicates, especially op. 8
prompt =========================================================
variable nProdId varchar2(30)
exec :nProdId := null
-- alter session set events '10053 trace name context forever';
select /* PIDN */
count(*),
max(product_name) max_name
from prod
where product_id = nvl(:nProdId, product_id)
/
-- alter session set events '10053 trace name context off';
select *
from dbms_xplan.display_cursor( format=> 'allstats last peeked_binds')
/
spool off
Bind Peeking
Here’s a simple detail about bind peeking (during optimisation) that makes a big difference to “fetch first/next” queries. The optimizer knows about your bind types, and can peek at the bind values – but that doesn’t mean it will do something sensible with them. Here’s a little model to demonstrate the problem – starting with a trivial table creation statement and a simple “old-fashioned” query:
rem
rem Script: bind_peeking_wasted.sql
rem Author: Jonathan Lewis
rem Dated: Aug 2025
rem Purpose:
rem
rem Last tested
rem 23.6.0.24.10
create table t1
as
select * from all_objects
;
alter table t1 add constraint t1_pk primary key(object_id);
set serveroutput off
variable b1 number
exec :b1 := 20
set feedback only
select *
from (
select object_id, object_name, object_type, owner
from t1
order by
object_id
)
where
rownum <= :b1
/
set feedback on
select * from table(dbms_xplan.display_cursor(format=>'peeked_binds'));
Since the optimizer is able to peek at the number variable :b1 it could choose to use an index full scan with stop key to fetch the first 20 ordered rows and, as you can see below, that’s exactly what it did:
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
SQL_ID 3vd5qn3smsw5s, child number 0
-------------------------------------
select * from ( select object_id, object_name, object_type, owner
from t1 order by object_id ) where rownum <= :b1
Plan hash value: 3766500789
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3 (100)| |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 20 | 3160 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID| T1 | 63684 | 3855K| 3 (0)| 00:00:01 |
| 4 | INDEX FULL SCAN | T1_PK | 20 | | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Peeked Binds (identified by position):
--------------------------------------
1 - :B1 (NUMBER): 20
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=:B1)
As you can see the bind value 20 is echoed in the Rows prediction for the index full scan (even though the prediction for the table access doesn’t follow the logic). Now change the code so that the rownum limit is dictated by summing two variables:
variable b2 number
variable b3 number
exec :b2 := 15; :b3 := 5
set feedback only
select *
from (
select object_id, object_name, object_type, owner
from t1
order by
object_id
)
where
rownum <= :b2 + :b3
/
set feedback on
select * from table(dbms_xplan.display_cursor(format=>'peeked_binds'));
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
SQL_ID 4sjjdyq59rwyw, child number 0
-------------------------------------
select * from ( select object_id, object_name, object_type, owner
from t1 order by object_id ) where rownum <= :b2 + :b3
Plan hash value: 270731910
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 1342 (100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 63684 | 9826K| | 1342 (1)| 00:00:01 |
|* 3 | SORT ORDER BY STOPKEY| | 63684 | 3855K| 4776K| 1342 (1)| 00:00:01 |
| 4 | TABLE ACCESS FULL | T1 | 63684 | 3855K| | 390 (1)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=:B2+:B3)
3 - filter(ROWNUM<=:B2+:B3)
In this case the index scan doesn’t happen, and the arithmetic predicting 20 rows doesn’t appear anywhere in the plan. In fact we don’t see the peeked binds reported, even though I specifically requested them and even though they will be visible in the view v$sql_bind_capture.
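For reference, here’s a minimal sketch of that cross-check against v$sql_bind_capture; the sql_id is a placeholder for the statement above:
select  child_number, position, name, datatype_string, value_string
from    v$sql_bind_capture
where   sql_id = '&m_sql_id'
order by
        child_number, position
/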
Has the optimizer not looked at them, or has it forgotten them, or has it simply not reported them because it didn’t know how to use them? A quick check of the 10053 (CBO) trace file shows that the optimizer does actually take note of the values fairly early on in its processing:
*******************************************
Peeked values of the binds in SQL statement
*******************************************
----- Bind Info (kkscoacd) -----
PAYLOAD
----------------------------------------------------------------------------------------------------
Bind#0
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=03 fl2=1000000 frm=00 csi=00 siz=48 off=0
kxsbbbfp=7fce857384b8 bln=22 avl=02 flg=05
value=15
Bind#1
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=03 fl2=1000000 frm=00 csi=00 siz=0 off=24
kxsbbbfp=7fce857384d0 bln=22 avl=02 flg=01
value=5
After this, though, there are no further indications of the values being used.
The reason this omission is worth pursuing shows up in an article I wrote a couple of months ago about the “fetch next” syntax. If you use bind variables for the offset and fetch size then Oracle will rewrite your SQL to produce a predicate which effectively reduces to: row_number() <= :offset + :fetch_size and, as with the rownum clause above, will not work out that the sum of the two bind variables means that an indexed access path would be the best choice.
In fact, although I’ve chosen to demonstrate the problem with a sum of two bind variables, any arithmetic applied at this point is enough to disable the use of peeked values – “:b1 + 0” and “:b1 * 1” both result in a critical loss of information even though one feels that the optimizer ought to be able to transform the constants out of the unparsed query in these extreme cases.
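If you want to see the effect for yourself, here’s a minimal sketch using the same t1 and :b1 as above; as described, you should find the plan reverting to the full tablescan with the sort, and no peeked binds reported:
set feedback only
select *
from (
        select object_id, object_name, object_type, owner
        from t1
        order by
                object_id
)
where
        rownum <= :b1 + 0
/
set feedback on
select * from table(dbms_xplan.display_cursor(format=>'peeked_binds'));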
Pagination Cost – 2
This note is a follow-on to a note I published a couple of years ago, and was prompted by a question on the MOS community forum (needs an account) about the performance impact of using bind variables instead of literal values in a clause of the form: offset 20 rows fetch next 20 rows only
The issue on MOS may have been to do with the complexity of the view that was being queried, so I thought I’d take a look at what happened when I introduced bind variables to the simple tests from the previous article. Here’s the (cloned) script with the necessary modification:
rem
rem Script: fetch_first_offset_3.sql
rem Author: Jonathan Lewis
rem Dated: May 2025
rem
create table t1
as
select
*
from
all_objects
where rownum <= 50000
order by
dbms_random.value
/
create index t1_i1 on t1(object_name);
alter session set statistics_level = all;
set serveroutput off
column owner format a32
column object_type format a12
column object_name format a32
spool fetch_first_offset_3.lst
prompt ===================================
prompt SQL with literals (non-zero offset)
prompt ===================================
select
owner, object_type, object_name
from
t1
order by
object_name
offset
10 rows
fetch next
20 rows only
/
select * from table(dbms_xplan.display_cursor(format=>'+cost allstats last peeked_binds'));
variable offset_size number
variable fetch_size number
begin
:offset_size := 10; :fetch_size := 20;
end;
/
prompt ==============
prompt SQL with binds
prompt ==============
alter session set events '10053 trace name context forever';
select
owner, object_type, object_name
from
t1
order by
object_name
offset
:offset_size rows
fetch next
:fetch_size rows only
/
alter session set events '10053 trace name context off';
select * from table(dbms_xplan.display_cursor(format=>'+cost allstats last peeked_binds'));
I’ve created a simple data set by copying 50,000 rows from the view all_objects and creating an index on the object_name column then, using two different strategies, I’ve selected the 21st to 30th rows in order of object_name. The first strategy uses literal values in the offset and fetch first/next clauses to skip 10 rows then fetch 20 rows; the second strategy creates a couple of bind variables to specify the offset and fetch sizes.
Here’s the execution plan (pulled from memory, with the rowsource execution statistics enabled) for the example using literal values:
SQL_ID d7tm0uhcmpwc4, child number 0
-------------------------------------
select owner, object_type, object_name from t1 order by object_name
offset 10 rows fetch next 20 rows only
Plan hash value: 3254925009
-----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers | Reads |
-----------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 33 (100)| 20 |00:00:00.01 | 35 | 333 |
|* 1 | VIEW | | 1 | 30 | 33 (0)| 20 |00:00:00.01 | 35 | 333 |
|* 2 | WINDOW NOSORT STOPKEY | | 1 | 30 | 33 (0)| 30 |00:00:00.01 | 35 | 333 |
| 3 | TABLE ACCESS BY INDEX ROWID| T1 | 1 | 50000 | 33 (0)| 30 |00:00:00.01 | 35 | 333 |
| 4 | INDEX FULL SCAN | T1_I1 | 1 | 30 | 3 (0)| 30 |00:00:00.01 | 5 | 28 |
-----------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(("from$_subquery$_002"."rowlimit_$$_rownumber"<=30 AND
"from$_subquery$_002"."rowlimit_$$_rownumber">10))
2 - filter(ROW_NUMBER() OVER ( ORDER BY "OBJECT_NAME")<=30)
As you can see, the optimizer has used (started) an index full scan to access the rows in order of object_name, but the A-Rows column tells you that it has passed just 30 rowids (the 10 to be skipped plus the 20 to be fetched) up to its parent (table access) operation, and the table access operation has passed the required columns up to its parent (window nosort stopkey) which can conveniently discard the first 10 rows that arrive and pass the remaining 20 rows up and on to the client without actually doing any sorting.
You can also see in the Predicate Information that the window operation has used the row_number() function to limit itself to the first 30 (i.e. 10 + 20) rows, passing them up to its parent where the “30 rows” predicate is repeated with a further predicate that eliminates the first 10 of those rows, leaving only the 20 rows requested.
So what does the plan look like when we switch to bind variables?
SQL_ID 5f85rkjc8bv8a, child number 0
-------------------------------------
select owner, object_type, object_name from t1 order by object_name
offset :offset_size rows fetch next :fetch_size rows only
Plan hash value: 1024497473
--------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 933 (100)| 20 |00:00:00.03 | 993 | 990 | | | |
|* 1 | VIEW | | 1 | 50000 | 933 (1)| 20 |00:00:00.03 | 993 | 990 | | | |
|* 2 | WINDOW SORT PUSHED RANK| | 1 | 50000 | 933 (1)| 30 |00:00:00.03 | 993 | 990 | 11264 | 11264 |10240 (0)|
|* 3 | FILTER | | 1 | | | 50000 |00:00:00.02 | 993 | 990 | | | |
| 4 | TABLE ACCESS FULL | T1 | 1 | 50000 | 275 (1)| 50000 |00:00:00.01 | 993 | 990 | | | |
--------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(("from$_subquery$_002"."rowlimit_$$_rownumber" <= GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:OFFSET_SIZE))),0)+:FETCH_SIZE AND
"from$_subquery$_002"."rowlimit_$$_rownumber" > :OFFSET_SIZE))
2 - filter(ROW_NUMBER() OVER ( ORDER BY "OBJECT_NAME") <= GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:OFFSET_SIZE))),0)+:FETCH_SIZE)
3 - filter(:OFFSET_SIZE < GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:OFFSET_SIZE))),0)+:FETCH_SIZE)
This looks like bad news – we haven’t taken advantage of an index to avoid visiting and sorting all the rows in the table, operation 4 shows us a table scan passing 50,000 rows through a filter up to the window sort at operation 2 which discards the 49,970 rows we definitely don’t want before passing the remaining 30 rows to the view operation that discards the first 10 that we needed to skip. Why don’t we see the far more efficient index scan?
You may have noticed a couple of oddities in the Predicate Information.
- Wherever you see the :offset_size bind variable the optimizer has wrapped it in to_number(to_char()) – why?! My first thought about this was that the double conversion made it impossible for the optimizer to peek at the value and use it to get a better estimate of cost, but that’s (probably) not why the index full scan disappeared.
- The offset and fetch first are both supposed to be numeric (according to the tram-tracks in the manual) so it seems a little strange that Oracle subjects just one of them to a double conversion.
- What is that filter() in operation 3 actually trying to achieve? If you tidy up the messy bits it’s just checking two bind variables to make sure that the offset is less than the offset plus fetch size. This is just an example of “conditional SQL”. In this case it’s following the pattern for “columnX between :bind1 and :bind2” – allowing Oracle to short-circuit the sub-plan if the value of bind2 is less than that of bind1 (see the sketch after this list). (It wasn’t needed for the example where we used literals because Oracle could do the arithmetic at parse time and see that 10 was – and always would be – less than 30.)
- What are the checks actually saying about the optimizer’s (or developer’s) expectation for the way you might use the feature? The generated SQL actually allows for negative, non-integer values here. Negative offsets are replaced by zero, negative fetch sizes result in the query short-circuiting and returning no data (in fact any fetch size strictly less than 1 will return no rows).
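Here’s a minimal sketch of that “conditional SQL” pattern, using the t1 table from this article; with :lo greater than :hi the extra FILTER operation lets Oracle short-circuit at run time so the table access below it never starts:
variable lo number
variable hi number
exec :lo := 30; :hi := 20
select  count(*)
from    t1
where   object_id between :lo and :hi
/
select * from table(dbms_xplan.display_cursor(format=>'allstats last'));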
Hoping to find further clues about the poor choice of plan, I took a look at the “UNPARSED QUERY” from the CBO (10053) trace, and cross-checked against the result from using the dbms_utility.expand_sql_text() procedure; the results were (logically, though not cosmetically) the same. Here, with a little extra cosmetic tidying, is the SQL the optimizer actually works with:
select
a1.owner owner,
a1.object_type object_type,
a1.object_name object_name
from (
select
a2.owner owner,
a2.object_type object_type,
a2.object_name object_name,
a2.object_name rowlimit_$_0,
row_number() over (order by a2.object_name) rowlimit_$$_rownumber
from
test_user.t1 a2
where
:b1 < greatest(floor(to_number(to_char(:b2))),0)+:b3
) a1
where
a1.rowlimit_$$_rownumber <= greatest(floor(to_number(to_char(:b4))),0) + :b5
and a1.rowlimit_$$_rownumber > :b6
order by
a1.rowlimit_$_0
;
It’s fascinating that the optimizer manages to expand the original two bind variables to six bind variables (lots of duplication) and then collapse them back to two named bind variables for the purposes of reporting the Predicate Information. For reference:
- b3 = b5 = fetch_size
- b1 = b2 = b4 = b6 = offset_size
The where clause of the inline view (line 15 of the statement above) is clearly the source of the “conditional SQL” filter predicate at operation 3 of the previous execution plan, so I thought I’d try running this query (pre-defining all 6 bind variables correctly) to see if I could get the index-driven plan by modifying that line.
My first attempt was simply to remove the (highly suspect) to_number(to_char()) – but that didn’t help. Then I thought I’d make it really simple by getting rid of the greatest(floor()) functions – and that didn’t help either. Finally I decided to change what was now :b4 + :b5 to a single bind variable :b7 with the right values – and that’s when I got the plan I wanted:
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 20 |00:00:00.01 | 35 |
|* 1 | VIEW | | 1 | 30 | 20 |00:00:00.01 | 35 |
|* 2 | WINDOW NOSORT STOPKEY | | 1 | 30 | 30 |00:00:00.01 | 35 |
|* 3 | FILTER | | 1 | | 30 |00:00:00.01 | 35 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 1 | 50000 | 30 |00:00:00.01 | 35 |
| 5 | INDEX FULL SCAN | T1_I1 | 1 | 30 | 30 |00:00:00.01 | 5 |
--------------------------------------------------------------------------------------------------
Of course this doesn’t help answer the question – how do I make the query faster – it just highlights where in the current transformation the performance problem appears. Maybe it’s a pointer to some Oracle developer that there’s some internal code that could be reviewed – possibly for a special (but potentially common) pattern. Perhaps there’s a point of interception where a fairly small, isolated piece of code could be modified to give the optimizer the simpler expression during optimisation.
As for addressing the problem of finding a “client-oriented” mechanism, I found two solutions for my model. First add the (incomplete, but currently adequate) hint /*+ index(t1) */ to the SQL to get:
---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 50311 (100)| 20 |00:00:00.01 | 25 |
|* 1 | VIEW | | 1 | 50000 | 50311 (1)| 20 |00:00:00.01 | 25 |
|* 2 | WINDOW NOSORT STOPKEY | | 1 | 50000 | 50311 (1)| 20 |00:00:00.01 | 25 |
|* 3 | FILTER | | 1 | | | 20 |00:00:00.01 | 25 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 1 | 50000 | 50311 (1)| 20 |00:00:00.01 | 25 |
| 5 | INDEX FULL SCAN | T1_I1 | 1 | 50000 | 339 (1)| 20 |00:00:00.01 | 5 |
---------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(("from$_subquery$_002"."rowlimit_$$_rownumber"<=GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:B1))),0
)+:B2 AND "from$_subquery$_002"."rowlimit_$$_rownumber">:B1))
2 - filter(ROW_NUMBER() OVER ( ORDER BY "OBJECT_NAME")<=GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:B1))),0)+:
B2)
3 - filter(:B1<GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:B1))),0)+:B2)
As you can see we now do the index full scan, but it stops after only 20 rowids have been passed up the plan. This isn’t a good solution, of course, since (a) it’s specific to my model and (b) the estimates still show the optimizer working on the basis of handling and forwarding 50,000 rows (E-rows).
The alternative was to tell the optimizer that since we’re doing pagination queries we’re only expecting to fetch a little data each time we execute the query – let’s add the hint /*+ first_rows(30) */ which gives us:
---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 33 (100)| 20 |00:00:00.01 | 25 |
|* 1 | VIEW | | 1 | 30 | 33 (0)| 20 |00:00:00.01 | 25 |
|* 2 | WINDOW NOSORT STOPKEY | | 1 | 30 | 33 (0)| 20 |00:00:00.01 | 25 |
|* 3 | FILTER | | 1 | | | 20 |00:00:00.01 | 25 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 1 | 50000 | 33 (0)| 20 |00:00:00.01 | 25 |
| 5 | INDEX FULL SCAN | T1_I1 | 1 | 30 | 3 (0)| 20 |00:00:00.01 | 5 |
---------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(("from$_subquery$_002"."rowlimit_$$_rownumber"<=GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:B1))),0
)+:B2 AND "from$_subquery$_002"."rowlimit_$$_rownumber">:B1))
2 - filter(ROW_NUMBER() OVER ( ORDER BY "OBJECT_NAME")<=GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:B1))),0)+:
B2)
3 - filter(:B1<GREATEST(FLOOR(TO_NUMBER(TO_CHAR(:B1))),0)+:B2)
This is likely to be a much better strategy than “micro-management” hinting; and it may even be appropriate to set the optimizer_mode at the session level with a logon trigger: first_rows_10 or first_rows_100 could well be a generally acceptable choice if most of the queries tend to be about reporting the first few rows of a large result set. A key point to note is that both E-Rows and Cost are reasonably representative of the work done, while the corresponding figures when we hinted the use of the index were wildly inaccurate.
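Here’s a minimal sketch of what such a logon trigger might look like; the account name is purely hypothetical, and you would only want to do this for sessions that mostly run pagination-style queries:
create or replace trigger paging_app_logon
after logon on database
begin
        if sys_context('userenv','session_user') = 'PAGING_APP' then
                execute immediate 'alter session set optimizer_mode = first_rows_100';
        end if;
end;
/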
Coalesce SQ bug
The “coalesce subquery” transformation introduced in the 11gR2 timeline can produce wrong results and you will find some bug reports and patches on MOS spread over versions from 12cR1 onwards. If you find that you are hitting a bug for which there appears to be no patch there are several options for disabling the feature, with different levels of granularity.
- Set hidden parameter _optimizer_coalesce_subqueries to false to disable all subquery coalescing (see the session-level examples after this list)
- Set fix_control to ‘10216738:0’ to disable a particular ANY/ALL class of subquery coalescing
- Use the no_coalesce_sq(@query_block) hint to stop specific subqueries from being coalesced
- Set the optimizer_features_enable to 11.2.0.4 (not advised) to stop the optimizer from trying so hard.
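For reference, here’s what the first two options look like as session-level commands; this is a minimal sketch, and since these are hidden (underscore) settings you should test them on a non-production system first:
alter session set "_optimizer_coalesce_subqueries" = false;
alter session set "_fix_control" = '10216738:0';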
Here’s an example of a problem with subquery coalescing that’s resurfaced in various forms, with patches that had varying degrees of success, since Oracle 12c. This example has been tested up to Oracle 19.11 and 23.7. It appeared as a question on the Oracle customer forum a couple of weeks ago where the OP had modelled their production problem on a copy of the sample scott.emp table (see footnote 1 for a script to recreate the table).
rem
rem Script: coalesce_sq_bug.sql
rem Author: Adrián Gómez Brandón / jpl
rem Dated: 13th March 2025
rem
rem Last tested
rem 23.7.0.0
rem 19.11.0.0
rem
select
/*+
-- optimizer_features_enable('11.2.0.4')
-- opt_param('_optimizer_coalesce_subqueries','false')
-- opt_param('_fix_control','10216738:0')
-- no_coalesce_sq(@em)
-- no_coalesce_sq(@me)
*/
e1.empno,
e2.empno
from emp e1,
emp e2
where
e1.empno != e2.empno
and (e1.empno, e2.empno) not in (select /*+ qb_name(em) */ e3.empno, e3.mgr from emp e3 where e3.mgr is not null)
--
-- predicate for correct result (156 rows)
--
and (e2.empno, e1.empno) not in (select /*+ qb_name(me) */ e4.empno, e4.mgr from emp e4 where e4.mgr is not null)
--
-- predicate for wrong results (169 rows)
--
-- and (e1.empno, e2.empno) not in (select /*+ qb_name(me) */ e4.mgr, e4.empno from emp e4 where e4.mgr is not null)
--
/
There are several lines commented out in the query – mostly comments or hints – but the lines of interest are the two alternative “not in” predicates (one labelled “predicate for correct result”, the other “predicate for wrong results”). One of them is currently a comment, the other is not. If you examine the two lines you will see that they say the same thing in two different ways. Simplifying, one says (a, b) not in (x, y) while the other says (b, a) not in (y, x). Unfortunately, though, one of them produces the wrong result.
Here are the execution plans – each preceded by the predicate that produced the plan:
Right result
(e2.empno, e1.empno) not in (select /*+ qb_name(me) */ e4.empno, e4.mgr from emp e4 where e4.mgr is not null)
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 182 | 14196 | 13 (0)| 00:00:01 |
|* 1 | HASH JOIN RIGHT ANTI | | 182 | 14196 | 13 (0)| 00:00:01 |
|* 2 | TABLE ACCESS FULL | EMP | 13 | 338 | 3 (0)| 00:00:01 |
|* 3 | HASH JOIN ANTI | | 182 | 9464 | 10 (0)| 00:00:01 |
| 4 | NESTED LOOPS | | 182 | 4732 | 7 (0)| 00:00:01 |
| 5 | INDEX FULL SCAN | EMP_PK | 14 | 182 | 1 (0)| 00:00:01 |
|* 6 | INDEX FAST FULL SCAN| EMP_PK | 13 | 169 | 0 (0)| 00:00:01 |
|* 7 | TABLE ACCESS FULL | EMP | 13 | 338 | 3 (0)| 00:00:01 |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("E1"."EMPNO"="E3"."EMPNO" AND "E2"."EMPNO"="E3"."MGR")
2 - filter("E3"."MGR" IS NOT NULL)
3 - access("E2"."EMPNO"="E4"."EMPNO" AND "E1"."EMPNO"="E4"."MGR")
6 - filter("E1"."EMPNO"<>"E2"."EMPNO")
7 - filter("E4"."MGR" IS NOT NULL)
Wrong result
(e1.empno, e2.empno) not in (select /*+ qb_name(me) */ e4.mgr, e4.empno from emp e4 where e4.mgr is not null)
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 182 | 9464 | 10 (0)| 00:00:01 |
|* 1 | HASH JOIN ANTI | | 182 | 9464 | 10 (0)| 00:00:01 |
| 2 | NESTED LOOPS | | 182 | 4732 | 7 (0)| 00:00:01 |
| 3 | INDEX FULL SCAN | EMP_PK | 14 | 182 | 1 (0)| 00:00:01 |
|* 4 | INDEX FAST FULL SCAN| EMP_PK | 13 | 169 | 0 (0)| 00:00:01 |
|* 5 | TABLE ACCESS FULL | EMP | 13 | 338 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("E1"."EMPNO"="E3"."EMPNO" AND "E2"."EMPNO"="E3"."MGR")
4 - filter("E1"."EMPNO"<>"E2"."EMPNO")
5 - filter("E3"."MGR" IS NOT NULL)
The first (correct) plan shows us two hash anti-joins (i.e. “not in first SQ” and “not in second SQ”). The second plan shows us only one hash anti-join, and the predicate information tells us it’s referencing the occurrence of emp with the alias e3. It looks as if the optimizer has “lost” the second subquery – but how could that happen?
The biggest clue is in the Outline Data section (which I didn’t show). The Outline Data for the second query includes the hints COALESCE_SQ(@”EM”) and COALESCE_SQ(@”ME”). The optimizer thinks the two subqueries can be collapsed into a single subquery that will still get the correct answer. The optimizer was wrong and the effect of the logic it has used in the transformation has (effectively) left us with just the first of the two subqueries – and too many rows in the result set.
Workarounds
After seeing the coalesce_sq() hints in the Outline Data my first thought was simply to add the two no_coalesce_sq() hints you see in the SQL I showed above. This gave me the right answer, with the same plan as the working alternative.
It then occurred to me to check whether I actually needed both no_coalesce_sq() hints – and I didn’t, either one would do just as well on its own. This isn’t really surprising: the hint (in both its positive and negative versions) operates at the Query Block level and names a query block that should/should not be coalesced; since I have just two subqueries that could be considered for coalescing, if I tell the optimizer that one of them must not be coalesced then there is nothing left that the optimizer could use to coalesce the other subquery with. I would have had to be more fussy if there had been 3 or more subqueries for the optimizer to choose from.
Another option, of course, is to disable the subquery coalescing feature completely – which could be done at the system level with a startup parameter or through an alter system call, or it could be done at the session level with an alter session call, or, as I’ve shown here, it’s a parameter that can be set for the query with the opt_param(‘_optimizer_coalesce_subqueries’,’false’) hint.
I spent a little time searching MOS for any appearance of this problem, and did find a handful of bug reports about similar problems, and patch notes about fixes, and patches that superseded patches because of errors that still appeared when the first patch was in place. So I passed the test case on to Oracle and it’s now in the system as an (unpublished) bug, but I got a very rapid reply that a workaround mentioned in some of the other MOS bug notes was appropriate for this example – a less aggressive blocking action, disabling only part of the subquery coalescing code: set fix control 10216738 to zero (hence the second opt_param() hint above).
If you check the v$system_fix_control view you’ll see that 10216738 has the description: “Toggels [sic] subquery coalescing for ANY and ALL subqueries”. Remembering that “not in ()” is equivalent to “!= ALL()” we can recognize that this is the area of code where we’re having a problem, so this is probably the minimal blocking that we can do to avoid the problem.
There was one other hack that I’d tried before getting the information about the fix control – I had re-run the query several times taking the optimizer_features_enable backwards through terminal versions of Oracle until it stopped producing the wrong results – hence the optimizer_features_enable(‘11.2.0.4’) hint, which you probably won’t want to use in a production system. One of the side effects of going back to the 11.2.0.4 optimizer features was that fix control 10216738 dropped to zero, so the fix control is clearly the best option until Oracle comes up with a working patch. My preference would be to do this through the fix_control startup parameter.
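For reference, here’s a minimal sketch (my own statements, not the hinted SQL from the test case) of the three levels at which fix control 10216738 could be set to zero – per statement, per session, or system-wide:
select  /*+ opt_param('_fix_control','10216738:0') */
        count(*)
from    dual
/
alter session set "_fix_control" = '10216738:0';
alter system  set "_fix_control" = '10216738:0' scope = both;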
Footnote 1 (“scott.emp”)
rem Script: build_emp.sql
rem Dated: Mar 2004
rem
drop table emp;
create table emp (
empno number(4) not null,
ename varchar2(10),
job varchar2(9),
mgr number(4),
hiredate date,
sal number(7, 2),
comm number(7, 2),
deptno number(2)
);
insert into emp values
(7369, 'SMITH', 'CLERK', 7902,
to_date('17-dec-1980', 'dd-mon-yyyy'), 800, null, 20);
insert into emp values
(7499, 'ALLEN', 'SALESMAN', 7698,
to_date('20-feb-1981', 'dd-mon-yyyy'), 1600, 300, 30);
insert into emp values
(7521, 'WARD', 'SALESMAN', 7698,
to_date('22-feb-1981', 'dd-mon-yyyy'), 1250, 500, 30);
insert into emp values
(7566, 'JONES', 'MANAGER', 7839,
to_date('2-apr-1981', 'dd-mon-yyyy'), 2975, null, 20);
insert into emp values
(7654, 'MARTIN', 'SALESMAN', 7698,
to_date('28-sep-1981', 'dd-mon-yyyy'), 1250, 1400, 30);
insert into emp values
(7698, 'BLAKE', 'MANAGER', 7839,
to_date('1-may-1981', 'dd-mon-yyyy'), 2850, null, 30);
insert into emp values
(7782, 'CLARK', 'MANAGER', 7839,
to_date('9-jun-1981', 'dd-mon-yyyy'), 2450, null, 10);
insert into emp values
(7788, 'SCOTT', 'ANALYST', 7566,
to_date('09-dec-1982', 'dd-mon-yyyy'), 3000, null, 20);
insert into emp values
(7839, 'KING', 'PRESIDENT', NULL,
to_date('17-nov-1981', 'dd-mon-yyyy'), 5000, null, 10);
insert into emp values
(7844, 'TURNER', 'SALESMAN', 7698,
to_date('8-sep-1981', 'dd-mon-yyyy'), 1500, 0, 30);
insert into emp values
(7876, 'ADAMS', 'CLERK', 7788,
to_date('12-jan-1983', 'dd-mon-yyyy'), 1100, null, 20);
insert into emp values
(7900, 'JAMES', 'CLERK', 7698,
to_date('3-dec-1981', 'dd-mon-yyyy'), 950, null, 30);
insert into emp values
(7902, 'FORD', 'ANALYST', 7566,
to_date('3-dec-1981', 'dd-mon-yyyy'), 3000, null, 20);
insert into emp values
(7934, 'MILLER', 'CLERK', 7782,
to_date('23-jan-1982', 'dd-mon-yyyy'), 1300, null, 10);
--
-- Move emp to pack the data into the first few blocks
-- (and gather statistics)
--
alter table emp move online;
alter table emp add constraint emp_pk primary key(empno);
ORA-01652
This is just a reminder and a quick note about the Oracle error:
ORA-1652: unable to extend temp segment by 128 in tablespace {not the TEMP tablespace}
Since the named tablespace is not a temporary tablespace, the error has occurred as a data segment was being created, rebuilt, moved, or extended – and the error message doesn’t tell you which segment. But (as with many other error messages) things have changed with the most recent version(s) of Oracle. Here’s the same error message from 23.6 when I forced it to happen with a simple “create table as select”:
ORA-01652: unable to grow table TEST_USER.T2 in tablespace TEST_8K_ASSM by 1MB
during operation with SQL ID : gxt04puvaugw5,
temp space used by session : 0 (MB)
Help: https://docs.oracle.com/error-help/db/ora-01652/
The alert log has an additional line:
ORA-01652 addlInfo tsn:7 objn:76831 objd:76831 user:TEST_USER obj:T2 subobj: type:2
So the reporting is much better – you even get the sql_id of the statement that caused the error – which you might be able to find in the library cache.
If you’re not on the latest version, though, or don’t get to see the error message in the alert log until hours later, or if there’s nothing in the library cache when you go searching, how do you discover the SQL that caused the error (or even identify the object if you’re on an older version of Oracle)?
Unless you are subject to very frequent recurrence of the error you could just dump an errorstack whenever the error occurs:
alter system set events '1652 trace name errorstack level 1';
Now, when the error occurs the guilty session will dump a trace file – which will be reported with the ORA-01652 error message in the alert log. The file will probably be several thousand lines long – it depends on what you’ve been doing before the error occurs – but in the first 50 lines you should find something like this:
----- Current SQL Statement for this session (sql_id=gxt04puvaugw5) -----
create table t2 as select t1.* from t1, (select rownum from dual connect by level <= 10)
The important bit to search for is “Current SQL Statement” or, if you’re on 23c, the SQL_ID that was reported in the original error message.
Don’t forget to disable the trace once you’ve identified the source of the problem.
alter system set events '1652 trace name errorstack off';
Oracle Evolution
I responded to a problem on one of the Oracle forums recently about getting rid of some temporary segments that wouldn’t disappear after an “alter table move …” command had failed and, in searching for a note I’d written about how smon handled this situation, I came across an article I wrote in the days of version 7.3.3.5 describing its actions. So, in an idle moment, I decided to check whether the note was still correct in the days of 19c.
It wasn’t – so here’s a tiny (and probably pointless) comment about an interesting little change that probably caters for ever increasing sizes of database and volume of activity.
In Oracle 7 (and probably some later versions) smon would wake up every 300 seconds (5 minutes) to check if there were any free extents in the fet$ table (remember – we’re back in the days of freelist management and ASSM didn’t exist) and if it found any adjacent free extents it would coalesce a few of them, i.e. replace a few rows in fet$ describing small extents with a single row describing a larger extent. This hasn’t changed, although the newer versions of Oracle select a couple more columns from fet$ than they used to when doing this check.
Every 25th wakeup (which means every 2 hours and 5 minutes) smon would also look for “lost” segments, i.e. the ones listed in seg$ that were marked as temporary (type# = 3), and do the work needed to clean them out – which entailed deleting the row from seg$, deleting any related rows from fet$ and, possibly, doing a couple of other deletes and updates (e.g. tablespace quotas: tsq$). This has changed.
I enabled tracing on smon in an instance of Oracle 19.11 – and then enabled system-wide tracing on the query of seg$ that was appearing in v$sql but wasn’t appearing in the smon trace. It turned out that the query was now being executed every 2 hours (7,200 seconds) by one of the worker processes (Wnnn) controlled by the space management coordinator (SMCO).
I referred to these segments as “lost” rather than temporary – this was just to emphasise the point that I wasn’t talking about segments in the TEMPORARY tablespaces, but segments in a PERMANENT tablespace that had been created as (for example) the space needed during a call to “create index”. Occasionally, as with the forum question, large scale tasks like this fail unexpectedly and don’t always clean up their own messes – and that’s why there’s a process doing a little housekeeping.
You’ll note, of course, that I haven’t said anything about what happens after the check for these type 3 temporary segments has found a target. Having found that it was the Wnnn processes that searched for the segments, I enabled sql_trace system-wide for all the Wnnn processes, then started a session to execute a “create massive table” statement but used a kill -9 on its shadow process from the operating system. This left me with a suitable temporary segment in my schema that disappeared a couple of minutes later, leaving the evidence I needed in the trace file from process W007 in the form of these three (cosmetically edited) consecutive statements:
select file#, block#, ts#, blocks from seg$ where type# = 3
update seg$ set
type#=:4,blocks=:5,extents=:6,minexts=:7,maxexts=:8,extsize=:9,extpct=:10,user#=:11,iniexts=:12,
lists=decode(:13, 65535, NULL, :13),groups=decode(:14, 65535, NULL, :14), cachehint=:15,
hwmincr=:16, spare1=DECODE(:17,0,NULL,:17),scanhint=:18, bitmapranges=:19
where
ts#=:1 and file#=:2 and block#=:3
delete from seg$ where ts#=:1 and file#=:2 and block#=:3
The entire “search and destroy” sequence has migrated to the Wnnn process.
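If you want to repeat the test, here’s one possible way to switch on the tracing (a sketch, not necessarily the exact commands used for this experiment) using dbms_monitor against whichever Wnnn worker sessions currently exist. The obvious caveat is that worker processes come and go, so you may need something more persistent to be sure of catching the one that eventually does the work:
begin
        -- run from a suitably privileged account (e.g. sys)
        for r in (
                select  s.sid, s.serial#
                from    v$session s, v$process p
                where   p.addr  = s.paddr
                and     p.pname like 'W0%'      -- the Wnnn space management workers
        ) loop
                dbms_monitor.session_trace_enable(
                        session_id => r.sid,
                        serial_num => r.serial#,
                        waits      => true,
                        binds      => false
                );
        end loop;
end;
/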
Virtual Nuisance
Here’s a note that’s been sitting as a draft for the last 7 years – finally dusted down, retested, and published. Following my new publishing paradigm, I’m going to tell you the conclusion at the start, then show the working that demonstrates the point.
Summary
When cloning a table, or a subset of a table, the simplest method in the “good old days” was to execute “create table cloned_table as select * from source_table where …” (CTAS), then create any indexes that you needed on the clone. Unfortunately if you’ve created any (visible) virtual columns the CTAS will generate real columns holding the calculated values.
There are various workarounds to this, of course, though not all of them are viable for all systems.
- The “obvious and sensible” option used to be to identify just the real columns in the select list. But then you’d have to remember to check and add all the virtual bits afterwards (including the invisible bits).
- A “lazy” option that I’ve seen suggested a few times is to drop the virtual columns, clone the table, then recreate the virtual columns – a slightly better bet might be to make them invisible before cloning then make them visible (on the source) afterwards. Either way, though, this isn’t a viable strategy for a production/shared developer table, there are cases where it simply can’t be used and, again, you have to recreate the virtual columns on the clone.
A fairly nice strategy appeared in the 12.2 feature “create table XX for exchange …” which doesn’t have to be aimed at a partitioned table. This will create the virtual, including invisible, columns. You still have to insert the data, however, which means you still have to create a select list that covers all the “real” columns in the right order; and you still have to recreate the indexes and some constraints.
Discussion and demonstration
Quoting from the SQL Language Reference manual 23c for the Create Table statement, you will find under the subtitle “for exchange with table” the statement:
This clause lets you create a table that matches the structure of an existing partitioned table.
This is true – but not the whole truth. It will do what it says, but it can also be used to create a table that matches the structure of an existing non-partitioned table. This means that when you want to clone a table (or subset of a table) that includes virtual columns and hidden columns, it’s a little easier to get it right. You still have to get the data into the table, though, and recreate indexes and some constraints.
The following script shows two attempts at cloning – first using a simple “create as select * ..”, the second using a “create for exchange” followed by steps to populate the copy – first the wrong way then the right way.
We start by creating a source table (a subset of the view all_objects) then add a couple of indexes, a primary key (which also means not null) constraint, and a couple of virtual columns (one invisible).
rem
rem Script: virtual_nuisance.sql
rem Author: Jonathan Lewis
rem Dated: Feb 2018
rem Purpose:
rem
rem Last tested
rem 23.4.0.0 (23ai)
rem 19.11.0.0
rem 12.2.0.1
rem 11.2.0.4
prompt ====================
prompt Basic Table Creation
prompt ====================
create table source_table nologging
as
select *
from sys.all_objects
where rownum <= 10000
;
-- alter table source_table modify object_id not null;
alter table source_table add constraint st_pk primary key(object_id);
create unique index st_u1 on source_table(object_type, object_id);
create index st_i1 on source_table(object_name);
prompt ===================
prompt Add virtual columns
prompt ===================
alter table source_table
add (
new_name_visible
generated always as ( initcap(object_name) ) virtual,
new_name_invisible invisible
generated always as ( lower(object_name) ) virtual
)
;
Now we create two copies of the table, one with a simple “create as select *”, the other as a “create for exchange” which means we then have to populate it, which we try in two ways – first using “insert select *” which fails, leading us to insert a list of named columns:
prompt ======================================================
prompt Clone table - which instantiates the visible virtual column
prompt ======================================================
create table clone_table as select * from source_table;
prompt ====================================
prompt Create for exchange (12.2+)
prompt ====================================
create table exchange_table for exchange with table source_table;
insert into exchange_table select * from source_table;
-- *
-- ERROR at line 1:
-- ORA-54013: INSERT operation disallowed on virtual columns
-- Help: https://docs.oracle.com/error-help/db/ora-54013/
--
insert into exchange_table (
owner, object_name, subobject_name, object_id, data_object_id, object_type,
created, last_ddl_time, timestamp, status, temporary, generated,
secondary, namespace, edition_name, sharing, editionable, oracle_maintained,
application, default_collation, duplicated, sharded, created_appid,
created_vsnid, modified_appid, modified_vsnid
)
select
owner, object_name, subobject_name, object_id, data_object_id, object_type,
created, last_ddl_time, timestamp, status, temporary, generated,
secondary, namespace, edition_name, sharing, editionable, oracle_maintained,
application, default_collation, duplicated, sharded, created_appid,
created_vsnid, modified_appid, modified_vsnid
from
source_table
/
-- 10000 rows created.
As a check we can select a few rows in the exchange table to see that the virtual columns exist and hold the expected data. (Remember that the invisible columns can be selected by name, even though they are invisible in a describe or a “select *”.)
column object_name format a24
column new_name_visible format a24
column new_name_invisible format a24
select
object_name, object_type, new_name_visible, new_name_invisible
from
exchange_table
where
rownum <= 5
/
OBJECT_NAME OBJECT_TYPE NEW_NAME_VISIBLE NEW_NAME_INVISIBLE
------------------------ ----------------------- ------------------------ ------------------------
ALL_OBJECT_TABLES VIEW All_Object_Tables all_object_tables
ALL_OBJECT_TABLES SYNONYM All_Object_Tables all_object_tables
ALL_ALL_TABLES VIEW All_All_Tables all_all_tables
ALL_ALL_TABLES SYNONYM All_All_Tables all_all_tables
DBA_TABLES VIEW Dba_Tables dba_tables
5 rows selected.
Now check which of the three interesting columns exist in the three tables, and what their state is:
break on table_name skip 1
select
table_name, column_name,
virtual_column "Virtual",
hidden_column "Hidden"
from user_tab_cols
where table_name in ( 'SOURCE_TABLE', 'CLONE_TABLE', 'EXCHANGE_TABLE')
and column_name in ('OBJECT_NAME', 'NEW_NAME_VISIBLE','NEW_NAME_INVISIBLE')
order by
table_name desc, column_name desc
/
TABLE_NAME COLUMN_NAME Vir Hid
------------------------- -------------------- --- ---
SOURCE_TABLE OBJECT_NAME NO NO
NEW_NAME_VISIBLE YES NO
NEW_NAME_INVISIBLE YES YES
EXCHANGE_TABLE OBJECT_NAME NO NO
NEW_NAME_VISIBLE YES NO
NEW_NAME_INVISIBLE YES YES
CLONE_TABLE OBJECT_NAME NO NO
NEW_NAME_VISIBLE NO NO
8 rows selected.
Critically, the clone (CTAS) table doesn’t have the invisible virtual column, and the visible virtual column is no longer virtual. Both the virtual columns have appeared in the exchange table, both virtual, and the invisible column is still invisible (hidden).
But what of the indexes (including the implicit primary key index):
select
table_name, index_name, uniqueness
from
user_indexes
where table_name in ( 'SOURCE_TABLE', 'CLONE_TABLE', 'EXCHANGE_TABLE')
order by
table_name desc, index_name
;
TABLE_NAME INDEX_NAME UNIQUENES
------------------------- -------------------- ---------
SOURCE_TABLE ST_I1 NONUNIQUE
ST_PK UNIQUE
ST_U1 UNIQUE
3 rows selected.
Neither of the copies has any indexes – which is not a surprise for the clone (CTAS) table but is a little disappointing for a table we have (nominally) created in anticipation of an exchange with partition. But that is exactly what the manual would have told us would happen – if we’d looked.
There is one other detail to be careful of when cloning tables – will columns that report as NOT NULL in the source table be declared NOT NULL in the clone/exchange table? The answer depends on why they show as NOT NULL.
In my example the object_id in the source table shows up as NOT NULL when described, but doesn’t in either of the copies because it is a side-effect of the primary key constraint and that constraint has not been recreated in the copying process.
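Whichever method you choose, the last step of the clone is to put back the constraints and indexes by hand. A minimal sketch for the exchange_table copy (the index and constraint names here are my own inventions) might be:
alter table exchange_table add constraint et_pk primary key(object_id);
create unique index st_u1_copy on exchange_table(object_type, object_id);
create index st_i1_copy on exchange_table(object_name);
Once the primary key constraint is in place the object_id column will, of course, report as NOT NULL again when you describe the table.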
ASSM states
When the dbms_space package (through its space_usage procedure) came out, reporting the “percentage free” space in ASSM blocks with ranges like 0-25%, 25%-50%, etc. with flags FS1 to FS4, I had great trouble remembering whether 0-25% was FS1 or FS4 and it was a strangely long time before I managed to get the obvious association to stick in my mind.
I’ve just found a new way of getting confused again. I’ve been looking at some redo log entries relating to bitmap space management, trying to work out why, and when, Oracle changes the bits. Each “bit” is actually a nybble (half a byte, if you didn’t know), which allows for 16 distinct values, and when Oracle changes the state of a “bit” in an L1 bitmap the redo dump reports OP code 13.22 and lines like:
L1 bitmap change to state 3
So here (E & OE, and mostly for my own benefit) is a table of the flags reported by dbms_space.space_usage and the corresponding state.
- 7 = FS5 -> 100% empty
- 6 = not used ?
- 5 = FS4 -> 75% – 100%
- 4 = FS3 -> 50% – 75%
- 3 = FS2 -> 25% – 50%
- 2 = FS1 -> 0% – 25%
- 1 = FULL
- 0 = unformatted
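If you want to go looking for these entries yourself, the following sketch shows the sort of thing I mean – the logfile name is just a placeholder, you need a suitably privileged session to do the dump, and the trace file will be your session’s trace file:
alter system dump logfile '/u01/app/oracle/oradata/ORCL/redo01.log';
--
--      then, from the shell, search the trace file for the bitmap
--      space management change vectors with something like:
--              egrep -A2 "OP:13.22" {tracefile}
--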
Everything Changes
I saw a recent tweet (on Bluesky) from SQLDaily highlighting a blog note that Lukas Eder wrote in 2016 with the title: “Avoid using COUNT() in SQL when you could use EXISTS()”. This is a good guideline because it probably maximises the options the optimizer has for producing a good execution plan, but it’s not always necessary in recent versions of Oracle. It depends on what you’re trying to achieve and how you’re using count(), of course, but there are cases where the optimizer is capable of turning a count() subquery into an exists() subquery. Here’s a little demo running on a PDB in 19.11.0.0:
rem
rem Script: count_to_exists.sql
rem Author: Jonathan Lewis
rem Dated: Dec 2024
rem Purpose:
rem
rem Last tested
rem 23.4(ai)
rem 19.11.0.0
rem 12.2.0.1
rem
create table t1 as
select *
from all_objects
where rownum <= 5e4
/
create table t2 as
select *
from t1
where mod(object_id,50) = 0
/
create index t2_i1 on t2(owner);
set linesize 120
set pagesize 60
set trimspool on
set tab off
column object_name format a32
column owner format a32
select owner, count(*)
from t1
group by
owner
order by
count(*)
/
select owner from t1
minus
select owner from t2
/
set feedback only
set serveroutput off
select
owner, object_id, object_name
from
t1
where
t1.owner like 'D%'
and 0 < (
select count(*)
from t2
where t2.owner = t1.owner
)
/
set feedback on
select * from table(dbms_xplan.display_cursor)
/
I’ve included a couple of queries to show the numbers of rows for each owner in t1 and to highlight the owner values that appear in t1 but don’t appear in t2. (I used the results of those queries to pick the ‘D%’ predicate in the final query.) Here’s the execution plan for the final query in 19.11.0.0:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID cwj9t6wzhczgh, child number 0
-------------------------------------
select owner, object_id, object_name from t1 where t1.owner like
'D%' and 0 < ( select count(*) from t2 where t2.owner = t1.owner
)
Plan hash value: 82129279
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 278 (100)| |
|* 1 | HASH JOIN RIGHT SEMI| | 3244 | 167K| 278 (1)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T2_I1 | 144 | 720 | 2 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL | T1 | 6025 | 282K| 276 (1)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("T2"."OWNER"="T1"."OWNER")
2 - access("T2"."OWNER" LIKE 'D%')
filter("T2"."OWNER" LIKE 'D%')
3 - filter("T1"."OWNER" LIKE 'D%')
There’s no sign of any aggregation in the plan – so the count(*) can’t be happening. Unfortunately the optimizer has transformed the subquery out of existence and executed the query as a semi-join. So let’s make it easier to see what’s happened to the count(*) by stopping the optimizer from unnesting the subquery, and let’s use explain plan so that we can see what the complete Predicate Information looks like when the plan operates with a subquery.
explain plan for
select
owner, object_id, object_name
from
t1
where
t1.owner like 'D%'
and 0 < (
select /*+ no_unnest */
count(*)
from t2
where t2.owner = t1.owner
)
/
select * from table(dbms_xplan.display)
/
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3988898734
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 463 | 22224 | 278 (1)| 00:00:01 |
|* 1 | FILTER | | | | | |
|* 2 | TABLE ACCESS FULL| T1 | 6025 | 282K| 276 (1)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T2_I1 | 2 | 10 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "T2" WHERE
"T2"."OWNER"=:B1))
2 - filter("T1"."OWNER" LIKE 'D%')
3 - access("T2"."OWNER"=:B1)
As you can see, the optimizer has worked out that “select count(*) > 0” is equivalent to “exists select” and transformed the subquery accordingly; which means it will stop after finding the first matching row.
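For comparison, here’s the explicitly rewritten exists() version of the query that the 2016 advice would lead you to write – I’d expect it to end up with the same execution plans (the semi-join by default, or the filter subquery if you add a no_unnest hint to it):
select
        owner, object_id, object_name
from
        t1
where
        t1.owner like 'D%'
and     exists (
                select  null
                from    t2
                where   t2.owner = t1.owner
        )
/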
Conclusion
I started by pointing out that a recent (Bluesky) tweet was echoing a blog note published in 2016. I also agreed that the general principle of using exists() to find the first occurrence rather than using count(*) was a good idea when you didn’t need to know “how many” and just needed to know “are there any”.
However I then demonstrated that the (Oracle) optimizer is so sophisticated that it can recognise when a count() subquery can be transformed into an exists() subquery – presumably because it gets a “clue” from the comparison with zero.
The point of this blog note, and the moral of the story is this: when trying to solve a problem by searching the internet, an article published 8 years ago may not give you the most up to date information or the best possible solution.
Consistent?
Here’s a cut-n-paste from a simple SQL*Plus session running under 19.11.0.0 (23.4 produces the same effects):
SQL> drop table t purge;
Table dropped.
SQL> create table t (id number, val number) rowdependencies;
Table created.
SQL> insert into t values(1,0);
1 row created.
SQL> commit;
Commit complete.
SQL> insert into t values(2,0);
1 row created.
SQL> update t set val = 1;
2 rows updated.
SQL> select * from t;
ID VAL
---------- ----------
1 1
2 1
2 rows selected.
Open a second terminal and start an SQL*Plus session that does the following:
SQL> update t set val = val + 1;
This statement is going to hang because the first (and only) row it ought to see at this moment in time is locked by session 1. If session 1 commits then session 2 should detect that it’s trying to update a row that has changed since the start of the transaction, and roll back and restart so that it can do an update that is self-consistent as at a specific point in time.
Question – how many rows will session 2 update?
Here’s the cut-n-paste from session 2 that appeared as soon as session 1 committed:
1 row updated.
SQL> select * from t;
ID VAL
---------- ----------
1 2
2 1
2 rows selected.
This doesn’t look right. At one point in time the committed data in the table was the single row (1,0), and at another point in time the committed data in the table was the pair of rows (1,1) and (2,1). How can a single, self-consistent, update statement operate as at a single point in time and leave the table holding the rows you see above? Surely the statement should either fail, leaving val = 1 in both rows, or it should succeed consistently, leaving val = 2 in both rows.
As a further detail, note how I’ve enabled rowdependencies on the table. (This is something I usually do whenever I want to take advantage of the flashback version query feature, though it’s not a necessity in this case.) Here’s what I see when I commit in session 2 then query the entire versioning history of the table:
column versions_startscn format 99,999,999,999,999
column versions_endscn format 99,999,999,999,999
column versions_starttime format a24
column versions_endtime format a24
select
versions_starttime,
versions_startscn,
versions_endtime,
versions_endscn,
versions_operation,
versions_xid,
id, val
from t versions between scn minvalue and maxvalue
/
VERSIONS_STARTTIME VERSIONS_STARTSCN VERSIONS_ENDTIME VERSIONS_ENDSCN V VERSIONS_XID ID VAL
------------------------ ------------------- ------------------------ ------------------- - ---------------- ---------- ----------
09-DEC-24 02.38.22 PM 12,670,482,437,540 U 04001B004F600000 1 2
09-DEC-24 02.35.52 PM 12,670,482,437,468 I 06000000DA740000 2 1
09-DEC-24 02.35.52 PM 12,670,482,437,468 09-DEC-24 02.38.22 PM 12,670,482,437,540 U 06000000DA740000 1 1
09-DEC-24 02.35.52 PM 12,670,482,437,468 1 0
4 rows selected.
Surprisingly there is no “insert transaction” visible for (1,0), but, reading from the bottom upwards we can see that (1,0) ceased to exist at SCN 12,670,482,437,468 and that transaction 06000000DA740000 updated that row to (1,1) and inserted row (2,1). What we’re seeing as “versions” are the “versions as at commit”, of course, rather than the results of all the individual modifications (which is what you can get from Log Miner).
If we look at the (1,1) row again we can see that that version ceased to exist at SCN 12,670,482,437,540 and the top line of the output tells us that that’s thanks to transaction 04001B004F600000 changing it to (1,2). That’s our update from the second session that we believe should have updated every existing row in the table by adding 1 to the val column.
Finally, when we look at the row for (2,1) we see that it came into existence at the same time (SCN 12,670,482,437,468) as row (1,1) but was not changed by the transaction that “happened” a little later and changed (1,1) to (1,2).
This behaviour first came to my attention in a comment to a note I wrote in 2011 about an upgrade from 11.1.0.6 to 11.2.0.2. The sample I’ve shown above is taken from that comment, modified slightly to separate the table creation from the first insert (just in case create as select introduced some peculiar special effects) and there are other examples and observations in the comments following the one I’ve linked to. The anomaly described in the actual blog note was addressed in 11.2.0.3 and 12.1.0.2 but, as you can see, there are other anomalies still waiting to be addressed.
Fetch size
Part of my effort to clear my long backlog of drafts
This is a trivial script I wrote many years ago – the date on the draft (one of the oldest on my blog) was August 2009 – but the script itself must have started life in 8i, if not earlier.
All the script does is execute a simple-minded test to find SQL that may be wasting resources and causing contention by using “small” array fetches to return “large” amounts of data. It does this through a scan of v$sql or (for later versions of Oracle) v$sqlstats or v$sqlstats_plan_hash to compare execution counts, number of fetch calls, and total rows returned.
The original code – probably written for 8i – used v$sql and reported the first_load_time for each child cursor but an update to the script took advantage of the introduction in 10g of the view v$sqlstats and started reporting the last_active_time – which means you could choose to ignore (temporarily, perhaps) any statements that had not been active in the very recent past.
I’ve tended to avoid the view v$sqlarea because, in its earliest form, it was simply an aggregate view of the x$ structure behind v$sql. When v$sqlstats appeared its contents were similar to those of v$sqlarea but its published implementation was a simple query against a different x$, so that’s the one I started using (having failed to notice that the implementation for v$sqlarea had changed to a simple query against yet another x$ at about the same time). Both x$ structures are created on the fly in the PGA, so they’re probably doing similar amounts of aggregation under the covers, although with many differences in the details they now report – including a number of “deltas since last AWR snapshot” in the newer view.
rem
rem Script: small_fetch.sql
rem Author: Jonathan Lewis
rem Dated: (pre) Aug 2009
rem
set linesize 156
set pagesize 40
set trimspool on
set tab off
break on row skip 1
column first_load_time format a20
column last_active_time format a20
column sql_text wrap word
spool small_fetch
select
--
-- v$sql.parsing_user_id,
-- v$sql.parsing_schema_id,
-- v$sql.first_load_time,
--
to_char(last_active_time,'yyyy-mm-dd/hh24:mi:ss') last_active_time,
sql_id,
plan_hash_value,
executions,
end_of_fetch_count,
fetches,
rows_processed,
round(rows_processed/nullif(executions,0),0) rows_per_exec,
round(rows_processed/nullif(fetches,0),0) fetchsize,
sql_text
from
v$sqlstats
-- v$sql
-- v$sqlstats_plan_hash
where
plan_hash_value != 0 -- pl/sql block
-- and (v$sql.parsing_user_id != 0 or v$sql.parsing_schema_id != 0)
and fetches > 25000
and round(rows_processed/nullif(executions,0),0) > 5000
and round(rows_processed/nullif(fetches,0),0) < 50
order by
fetches desc
;
spool off
The numbers I’ve used in the SQL above are fairly arbitrary; your choices for “large” and “small” will almost certainly be different.
The original script is probably an ad hoc statement that I wrote in a few minutes on a client site to poke around the SQL being executed by a system that was showing network problems and surprisingly large numbers of waits for “SQL*Net message to client”, so I’ve published it more as an example of the type of thing you could do than an example of “best practice”. It may be helpful, but your specific circumstances and requirements may make you want to change the choice of view you use, the column you report, and the predicates you use to filter the results.
You’ll notice from the code that v$sql includes the parsing_user_id and parsing_schema_id, but v$sqlstats doesn’t. A current convenience of using v$sql is that you can eliminate any statements parsed by SYS simply by adding the predicate “parsing_user_id != 0”. The reason why this is convenient is that a lot of the sys-recursive (largely dictionary cache) queries do single row fetches and you don’t want to clutter your report with things you can’t change.
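Here, for reference, is a minimal sketch of that variation – a cut-down version of the main script switched to v$sql with the extra predicate in place:
select
        sql_id,
        executions,
        end_of_fetch_count,
        fetches,
        rows_processed,
        round(rows_processed/nullif(executions,0),0)    rows_per_exec,
        round(rows_processed/nullif(fetches,0),0)       fetchsize
from
        v$sql
where
        parsing_user_id != 0
and     plan_hash_value != 0
and     fetches > 25000
and     round(rows_processed/nullif(executions,0),0) > 5000
and     round(rows_processed/nullif(fetches,0),0) < 50
order by
        fetches desc
;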
A reason why v$sqlstats might be your preferred choice is that it has a large number of “delta_statistic” columns – which give you counts and times since the last AWR snapshot (though how that works if you’re not licenced to run the AWR, I haven’t checked).
Caution
The header on the script suggests that the last system I used (and edited) it on was 11.1.0.7 – so before container databases came into existence. There are all sorts of columns in the views I’ve mentioned that might be very helpful, but there’s also the potential for traps relating to which schema can see what data, and from which PDB. Here, for example, is a cut-n-paste showing something that looks like a bug when running as SYS in a PDB on an instance running 19.11.0.0: note particularly how I’ve changed the view in the three queries from v$sql to v$sqlstats then back to v$sql.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sql group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1282
1 row selected.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sqlstats group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1395
1 row selected.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sql group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1282
1 row selected.
What can v$sqlstats see that v$sql can’t see?
Optimizer Environment 2
Many years ago I wrote a short note to introduce the view v$sql_optimizer_env (and the session and system equivalents) which report the current values of optimizer related parameters (some of which are not found in the startup file) for a cursor child, or a session, or the system. I also mentioned the underlying x$kqlfsqce (for v$sql_optimizer_env) that held far more parameters than were reported in the v$ view.
Recently I commented on a question about different plans (i.e. child cursors) for the same query for different sessions and noted that changing the optimizer_features_enable might not change any of the visible (v$) parameters – other than optimizer_features_enable itself – so it might be necessary to check the x$ to find out what had changed. Here’s a possible query to do that, followed by a few lines of output:
rem
rem Script: optimizer_env_b.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem Purpose:
rem
-- select count(*) from all_users;
define m_sql_id = 'b2zqhgr5tzbpk'
spool optimizer_env_b.lst
select
kqlfsqce_pname,
kqlfsqce_chno,
kqlfsqce_pvalue,
kqlfsqce_flags
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
order by
kqlfsqce_pname,
kqlfsqce_chno
/
spool off
KQLFSQCE_PNAME KQLFSQCE_CHNO KQLFSQCE_PVALUE KQLFSQCE_FLAGS
-------------------------------------------------- ------------- ------------------------- --------------
BlockChain_ledger_infrastructure 0 0 10
BlockChain_ledger_infrastructure 1 0 10
CLI_internal_cursor 0 0 10
CLI_internal_cursor 1 0 10
PMO_altidx_rebuild 0 0 2
PMO_altidx_rebuild 1 0 2
_adaptive_window_consolidator_enabled 0 true 7
_adaptive_window_consolidator_enabled 1 true 7
_add_stale_mv_to_dependency_list 0 true 6
_add_stale_mv_to_dependency_list 1 true 6
_aggregation_optimization_settings 0 0 7
_aggregation_optimization_settings 1 0 7
_allow_level_without_connect_by 0 false 6
_allow_level_without_connect_by 1 false 6
...
uniaud_internal_cursor 0 0 10
uniaud_internal_cursor 1 0 10
workarea_size_policy 0 auto 2
workarea_size_policy 1 auto 2
1226 rows selected.
The script allows you to set an sql_id to report – the value I’ve used is for the statement “select count(*) from all_users”. The query as it stands has to be run by sys, but you could create a view (say x_$kqlfsqce) in the sys schema, grant select on that view to public, then create a public synonym for it so that everyone else can use the same query as sys:
create or replace view x_$kqlfsqce as select * from x$kqlfsqce;
grant select on x_$kqlfsqce to public;
create public synonym x$kqlfsqce for x_$kqlfsqce;
The drawback to this query is clear in the final lines of the output – in 19.11 it returns 613 rows per child cursor, leaving you to analyze (by eye?) the entire list to find the parameter names where the parameter values differ. A better query would report only those parameters where the values differed (ideally listing the values side by side, but that’s another story). For example:
rem
rem Script: optimizer_env_2.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem
-- select count(*) from all_users;
define m_sql_id = 'b2zqhgr5tzbpk'
spool optimizer_env_2.lst
with child_count as (
select
/*+ materialize */
count(distinct kqlfsqce_chno) child_count
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
)
select
kqlfsqce_chno,
kqlfsqce_pname,
kqlfsqce_pvalue,
kqlfsqce_flags
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
and (kqlfsqce_pnum, kqlfsqce_pvalue) in (
select
kqlfsqce_pnum, kqlfsqce_pvalue
from x$kqlfsqce
where kqlfsqce_sqlid = '&m_sql_id'
group by
kqlfsqce_pnum, kqlfsqce_pvalue
having count(*) != (select child_count from child_count)
)
order by
kqlfsqce_pname,
kqlfsqce_chno
/
spool off
KQLFSQCE_CHNO KQLFSQCE_PNAME KQLFSQCE_PVALUE KQLFSQCE_FLAGS
------------- -------------------------------------------------- ------------------------- --------------
0 _bloom_filter_setops_enabled true 7
1 _bloom_filter_setops_enabled false 7
0 _bloom_pruning_setops_enabled true 7
1 _bloom_pruning_setops_enabled false 7
0 _cell_offload_grand_total true 7
1 _cell_offload_grand_total false 7
0 _cell_offload_vector_groupby_fact_key true 7
1 _cell_offload_vector_groupby_fact_key false 7
...
0 _sqlexec_reorder_wif_enabled true 7
1 _sqlexec_reorder_wif_enabled false 7
0 optimizer_features_enable 19.1.0 2
1 optimizer_features_enable 12.2.0.1 0
58 rows selected.
For this example I ran my query against all_users twice, but changed the optimizer_features_enable from 19.1.0 (the default for 19.11) to 12.2.0.1 for the second execution, so I got two child cursors. As you can see from the output the number of optimizer parameters that changed was 29 (half of 58) and the only parameter that would have been visible in v$sql_optimizer_env was optimizer_features_enable itself.
The SQL identifies the differences by initially counting the number of child cursors for the sql_id – because you might have more than just two, and knowing the number of child cursors is the only way of identifying which parameters have different values for different child cursors.
The fact that (in general) you won’t necessarily know how many child cursors you have to compare and report is what makes it a little harder to write a query that reports variations in values across the page – so it’s left as an exercise for the reader. Personally I’d be inclined to add a clause “kqlfsqce_chno in (N, M)”, picking child cursors with different execution plans, to limit the number to just two interesting child cursors and stick with reporting down the page.
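As a simple illustration of that suggestion, here’s a minimal sketch that assumes the two interesting child cursors are numbers 0 and 1 – with exactly two child cursors in scope, any (parameter number, value) combination that appears only once must be a value that differs between them:
define m_sql_id = 'b2zqhgr5tzbpk'
select
        kqlfsqce_chno,
        kqlfsqce_pname,
        kqlfsqce_pvalue
from
        x$kqlfsqce
where
        kqlfsqce_sqlid = '&m_sql_id'
and     kqlfsqce_chno in (0, 1)
and     (kqlfsqce_pnum, kqlfsqce_pvalue) in (
                select  kqlfsqce_pnum, kqlfsqce_pvalue
                from    x$kqlfsqce
                where   kqlfsqce_sqlid = '&m_sql_id'
                and     kqlfsqce_chno in (0, 1)
                group by
                        kqlfsqce_pnum, kqlfsqce_pvalue
                having  count(*) = 1
        )
order by
        kqlfsqce_pname,
        kqlfsqce_chno
/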
Addendum
In case you’re wondering about the optimizer parameters that aren’t startup parameters, here’s a quick query to run as SYS. (The output is from 19.11 again):
select
distinct kqlfsqce_pname
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
minus
select
ksppinm
from
x$ksppi
/
KQLFSQCE_PNAME
--------------------------------------------------------------------------------
BlockChain_ledger_infrastructure
CLI_internal_cursor
PMO_altidx_rebuild
_enable_pmo_ctas
_fix_control_key
_is_lock_table_for_ddl_wait_lock
_is_lock_table_for_split_merge
_long_varchar_allow_IOT
_multi_commit_global_index_maint
_nls_forced_binary
_suppress_scn_chk_for_cqn
_update_bji_ipdml_enabled
_update_global_index_from_part_exchange
_use_hidden_partitions
advanced_queuing_internal_cursor
flashback_data_archive_internal_cursor
flashback_table_rpi
ilm_access_tracking
ilm_dml_timestamp
ilm_filter
iot_internal_cursor
is_recur_flags
kkb_drop_empty_segments
only_move_row
optimizer_features_hinted
optimizer_mode_hinted
parallel_autodop
parallel_dblink
parallel_ddl_forced_degree
parallel_ddl_forced_instances
parallel_ddl_mode
parallel_ddldml
parallel_degree
parallel_dml_forced_dop
parallel_dop_doubled
parallel_execution_enabled
parallel_hinted
parallel_max_degree
parallel_query_default_dop
parallel_query_forced_dop
parallel_query_mode
partition_pruning_internal_cursor
rim_node_exist
sql_from_coordinator
sqlstat_enabled
total_cpu_count
total_processor_group_count
transaction_isolation_level
uniaud_internal_cursor
49 rows selected.
Following up my announcement of this note on Bluesky someone asked if this information appeared in the 10053 (CBO) trace file. The answer is yes. Under a heading “PARAMETERS USED BY THE OPTIMIZER” there are three sub-headings (also in capitals, but made more readable below):
- Parameters with altered values
- Parameters with default values
- Parameters in opt_param hint
The first two of these sub-headings share the full set of optimizer parameters, with no duplicates across the two lists. The last sub-heading introduces duplication, listing the parameters where the opt_param() hint has over-ridden the default (system) or altered (session) settings.
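If you want to check that section of the trace file for yourself, the traditional way to generate a CBO trace is still the 10053 event – remember that the trace is written only when the statement is hard parsed, and remember to switch the event off afterwards:
alter session set events '10053 trace name context forever, level 1';
-- re-run (and force a hard parse of) the query you're interested in, then:
alter session set events '10053 trace name context off';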
WWSS
What would Shakespeare say?
- Q: So, Mr. Shakespeare, which RDBMS do you use to look up your “famous quotes”?
- A: My lords, I refer me to the Oracle.
- Q: And how would you feel about migrating to SQL Server or DB2?
- A: A plague a’ both your houses.
- Q: So what is your basic requirement for finding a good quote to use?
- A: If it were done when ’tis done, then ’twere well it were done quickly.
- Q: Do you think that using multiple block sizes in your database would improve performance?
- A: It is a tale told by an idiot full of sound and fury signifying nothing.
- Q: The CBO has a big part to play in performance – what makes it so difficult to understand?
- A: Her infinite variety.
- Q: And what would you say to a DBA who claims to understand the CBO?
- A: There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.
- Q: Do you think it’s a good idea to use hints when the CBO misbehaves?
- A: Though this be madness, yet there is method in ‘t.
- Q: And how much faith do you have in the execution plans that the CBO generates?
- A: Oft expectation fails, and most oft there Where most it promises.
- Q: NULLs can, of course, make things hard for the CBO – what’s your view on them?
- A: Nothing can come of nothing.
- Q: And how do you interpret NULLs?
- A: To be or not to be, that is the question.
- Q: Do you document the SQL statements you write?
- A: It is a custom more honor’d in the breach than the observance.
- Q: What is your experience with Oracle support?
- A: The course of true love never did run smooth.
- Q: And how you do think support will be affected by the introduction of AI?
- A: More matter with less art.
- Q: Are we going to see 23ai installed at Anne Hathaway’s cottage any time soon?
- A: Expectation is the root of all heartache.
- Q: And finally, what do you make of Sym42 and the MASH program?
- A: O brave new world that has such people in it.
4096 Columns
If you thought 1,000 columns was over the top for a table in Oracle, then you’ll love 23ai which allows you to stretch that to 4,096 columns if you’ve set the parameter max_columns = extended in the spfile (or the pdb equivalent).
Naturally I was curious to see if there had been any significant structural changes to the way Oracle stored a row when the number of columns became very large, so I took the code I had written to examine the case of 1,000 columns, and hacked it around a little bit.
rem
rem Script: wide_table_4096.sql
rem Author: Jonathan Lewis
rem Dated: 7th April 2023
rem
rem Last tested
rem 23.4.25.4 (23ai)
rem
rem See also
rem https://jonathanlewis.wordpress.com/2015/02/19/255-columns/
rem https://jonathanlewis.wordpress.com/2015/05/18/migrated-rows/
rem
set pagesize 0
set feedback off
drop table if exists t1 purge;
spool temp.sql
prompt create table t1(
select
'col' || to_char(rownum,'fm0000') || ' varchar2(10),'
from
all_objects
where rownum <= 4095
;
prompt col4096 varchar2(10)
prompt )
prompt /
spool off
@temp
insert /*+ append */ into t1 (col0001, col0002, col0236,col0491, col0746, col4096)
values ('0001','0002', '0236','0491','0746','4096')
;
commit;
set pagesize 40 linesize 132 feedback on
-- column ref_file_no new_value m_fileno
column abs_file_no new_value m_fileno
column block_no new_value m_blockno
column value new_value m_trace_file
spool wide_table_4096
select
dbms_rowid.rowid_to_absolute_fno(rowid, user, 'T1') abs_file_no,
dbms_rowid.rowid_relative_fno(rowid,'BIGFILE') rel_file_no,
dbms_rowid.rowid_relative_fno(rowid,'BIGFILE') big_rel_file_no,
dbms_rowid.rowid_relative_fno(rowid,'SMALLFILE') small_rel_file_no,
dbms_rowid.rowid_relative_fno(rowid) def_rel_file_no,
dbms_rowid.rowid_block_number(rowid,'BIGFILE') block_no,
dbms_rowid.rowid_block_number(rowid,'SMALLFILE') small_block_no,
dbms_rowid.rowid_block_number(rowid) def_block_no
from
t1
/
alter system flush buffer_cache;
alter system dump datafile &m_fileno block &m_blockno;
set heading off
-- select payload from my_trace_file;
select payload
from v$diag_trace_file_contents
where
adr_home = (select value from V$diag_info where name = 'ADR Home')
and trace_filename = (
select substr(
value,
2 + ( select length(value)
from v$diag_info
where name = 'Diag Trace'
)
)
from v$diag_info
where name = 'Default Trace File'
)
/
spool off
The code is simple – I’ve used SQL to generate the SQL to create a table of 4,096 columns, then I’ve inserted a single row into the table, populating the first column, the last column, and a few in between. Then I’ve used a few calls to functions in the dbms_rowid package to convert the rowid of that row into a file and block number. When I installed 23ai and created my “user” tablespace it defaulted to a bigfile tablespace, but I’ve left a few variations of the calls in place to suit different configurations.
After finding the file and block number where the row starts I’ve dumped the entire block to my trace file and then read the tracefile back and spooled it to a flat file. My account is a “normal” account, but I’ve given it privileges to read various dynamic performance views – you might prefer to create a view and public synonym in the sys schema and grant select to public on the view. (See: https://jonathanlewis.wordpress.com/2019/10/03/trace-files-2/)
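For reference, here’s a quick-and-dirty sketch of the direct grants (run as SYS, and assuming the account is called test_user) that would allow a normal account to flush the cache, dump the block and read the result back through v$diag_trace_file_contents – bearing in mind that granting alter system is a fairly big hammer, so you may prefer the view/synonym approach described in the linked note:
grant alter system                          to test_user;
grant select on v_$diag_info                to test_user;
grant select on v_$diag_trace_file_contents to test_user;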
I’m not going to show you the symbolic dump of the block, instead I’ll show you a few egrep commands and their output:
******
****** Number of rows in the block:
******
[jonathan@localhost working]$ egrep nro wide_table_4096.lst
nrow=17
0xe:pti[0] nrow=17 offs=0
******
****** Number of columns in each row piece, and the row number of each row piece:
******
tab 0, row 0, @0x1e4a
tl: 262 fb: -----L-- lb: 0x1 cc: 255
--
tab 0, row 1, @0x1d42
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 2, @0x1c3a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 3, @0x1b32
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 4, @0x1a2a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 5, @0x1922
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 6, @0x181a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 7, @0x1712
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 9, @0x1502
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 10, @0x13fa
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 11, @0x12f2
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 12, @0x11ea
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 13, @0x10de
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 14, @0xfd2
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 15, @0xec6
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 16, @0xea5
tl: 33 fb: --H-F--- lb: 0x1 cc: 16
******
****** First few columns of row 16 (17th row in the block)
******
[jonathan@localhost working]$ egrep -A+16 "row 16" wide_table_4096.lst
tab 0, row 16, @0xea5
tl: 33 fb: --H-F--- lb: 0x1 cc: 16
nrid: 0x00002275.f
col 0: [ 4] 30 30 30 31
col 1: [ 4] 30 30 30 32
col 2: *NULL*
col 3: *NULL*
The row has been stored as 17 separate row pieces with a maximum of 255 columns per row piece. (Note: 16 * 255 + 16 = 4096). The final egrep output shows you that the starting columns of the row are stored as the 17th row. The values of col 0 and col 1 are one clue, but the --H-F--- flags (header, first) are the technical indicator. Each row piece points backwards to the next rowpiece in the chain, ending with row 0 where the -----L-- flag tells us it’s the last rowpiece of the chain.
So, no great difference in the treatment of tables with very large numbers of columns – it’s just lots of row pieces of up to 255 columns, where the row is (initially) split up into 255 column chunks working from the end of the row, potentially leaving you with a very small number of columns in the first row piece.
There was one significant difference in the block dump that I haven’t shown yet. The ITL (interested transaction list) ended up with 5 entries as a result of the single insert.
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0009.01c.0000033c 0x00000103.0102.2a --U- 17 fsc 0x0000.0128d402
0x02 0x0000.000.00000000 0x00000000.0000.00 ---- 0 fsc 0x0000.00000000
0x03 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
0x04 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
0x05 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
I didn’t know if this is new, or something I had missed when the limit on the number of columns changed from 255 to 1,000, but the number of ITL entries varied according to the number of row pieces created during my single row insert. Repeating the test but changing the “last column populated” I got the following results:
- One row piece (col0255) – itc = 2, Lck = 1
- Two row pieces (col0256) – itc = 3, Lck = 2
- Three row pieces (col0512) – itc = 4, Lck = 3
- Four row pieces (col0768) – itc = 5, Lck = 4
- Five row pieces (col1024) – itc = 5, Lck = 5 — i.e. same itc as 4096 columns/17 row pieces
Of course the “coincidence” that the ITL stopped growing at the fourth row piece made me decide to check back to an older version of Oracle (19.11.0.0) to see what happened where the limit was 1,000 columns – and it turned out that 19c also created extra ITL entries in the same way, up to the same limit. (I can’t help wondering whether this limit was “accidentally” carried forward to 23ai or whether there was a deliberate decision that 4 “spare” ITL slots was the best compromise between ITL waits and unnecessary space use.)
Footnote
There are probably several more questions to ask about features that might (or might not) adapt to the raising of the column limit. For example 11g had a limit of 20 (or ceil(column_count / 10)) for the number of extended stats definitions you could create for a single table when the maximum number of columns allowed was 1,000. Might this have changed for 23ai? (Initial answer: No. Longer answer: to be investigated – could you really get 370 sets of extended stats if you started with 3,700 columns in a table, or would something break at (say) 256 sets?)
Audit Cleaning #JoelKallmanday
This note is an observation of an odd little detail that’s nice to know about the way that Unified Audit (see part 8 of Neil Chandler’s series on Fundamental Security) handles deleting old data. I don’t have a lot of background experience with Unified Audit, but I was surprised by some comments on a recent thread on the MOSC database forum (needs an account) about problems of chained rows and the difficulty of reclaiming space from the “audit tablespace”, so I spent a couple of hours looking at the tables, views and procedures to see if there were any pre-emptive strategies that would avoid the space management issue. All the comments below are based on a single pdb running under 19.11.
The unified audit is kept (in recent versions of Oracle) in an interval-partitioned table called aud$unified, owned by user audsys. The partitioning is on a column called event_timestamp – declared as timestamp(6) not null. The default interval is one month, but you can change this by calling the procedure dbms_audit_mgmt.alter_partition_interval – and for some people it would be a very good idea to do so.
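If, for example, you wanted daily partitions to match a daily purge schedule, the call is a one-liner – this is just a sketch, so check the parameter names against the package specification in your version:
begin
        dbms_audit_mgmt.alter_partition_interval(
                interval_number    => 1,
                interval_frequency => 'DAY'
        );
end;
/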
The forum thread included the complaint that the segment space advisor has found a lot of chained rows, and that a rebuild could remove 49 percent of them. Since the sum of the declared lengths of the table’s columns is around 70KB and there are 3 CLOB columns declared as “enable storage in row”, I don’t think anyone should be surprised to see a lot of chained rows. (And I wouldn’t trust the advisor to do a very good job estimating what might happen with the CLOBs.)
Here’s a simple query to report the partitions, with high_value for the audit table:
select
partition_name, blocks, num_rows, high_value
from
dba_tab_partitions
where
table_name = 'AUD$UNIFIED'
order by
partition_position desc
/
There are all sorts of things you’re not allowed to do with it – even if you’re connected as sys you should only be using the management package to handle it. For example:
delete from audsys.aud$unified where event_timestamp < timestamp'2022-06-14 00:00:00'
*
ERROR at line 1:
ORA-46385: DML and DDL operations are not allowed on table "AUDSYS"."AUD$UNIFIED".
The thinking behind the audit table seems to be that you should be checking it on a regular basis then archiving (or just deleting) the data you’ve checked so that the audit table doesn’t become enormous. To this end Oracle allows you to set a “last archived” timestamp, which you can use in a call to the procedure dbms_audit_mgmt.clean_audit_trail. You can check the current value of this timestamp by querying view dba_audit_mgmt_last_arch_ts.
SQL> select last_archive_ts
2 from dba_audit_mgmt_last_arch_ts
3 /
LAST_ARCHIVE_TS
---------------------------------------------------------------------------
01-MAY-22 12.00.01.000000 AM +00:00
1 row selected.
There is also a function dbms_audit_mgmt.get_last_archive_timestamp() to return the timestamp, but this didn’t seem to work on my 19.11 instance. (The procedure set_last_archive_timestamp() does work.)
There are procedures in the package that let you automate this cleaning, but the points I want to make are things I discovered by doing some simple tests, calling each step of the process in turn. The oldest few rows in my audit table currently look like this:
PARTITION_NAME BLOCKS NUM_ROWS HIGH_VALUE
---------------------- ---------- ---------- ----------------------------------------
...
SYS_P28453 50 701 TIMESTAMP' 2022-11-01 00:00:00'
SYS_P27945 43 683 TIMESTAMP' 2022-10-01 00:00:00'
SYS_P20893 88 1328 TIMESTAMP' 2022-09-01 00:00:00'
SYS_P20176 110 1766 TIMESTAMP' 2022-08-01 00:00:00'
SYS_P20025 244 3318 TIMESTAMP' 2022-07-01 00:00:00'
SYS_P17365 103 1684 TIMESTAMP' 2022-06-01 00:00:00'
Assuming I want to delete everything from the last 4 partitions, I could execute the following anonymous block:
declare
ts timestamp;
begin
dbms_audit_mgmt.set_last_archive_timestamp(
dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
timestamp '2022-09-01 00:00:00'
);
select last_archive_ts
into ts
from dba_audit_mgmt_last_arch_ts
;
dbms_output.put_line('Set: ' || ts);
dbms_audit_mgmt.clean_audit_trail(
audit_trail_type => dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
use_last_arch_timestamp => true -- default
);
end;
/
Set: 01-SEP-22 12.00.00.000000 AM
PL/SQL procedure successfully completed.
When I check dba_tab_partitions again, the end of the list is now as follows:
PARTITION_NAME BLOCKS NUM_ROWS HIGH_VALUE
---------------------- ---------- ---------- ----------------------------------------
...
SYS_P28453 50 701 TIMESTAMP' 2022-11-01 00:00:00'
SYS_P27945 43 683 TIMESTAMP' 2022-10-01 00:00:00'
SYS_P20893 88 1328 TIMESTAMP' 2022-09-01 00:00:00'
Oracle has dropped three of the 4 partitions – although it has kept the partition with the high_value of ‘2022-09-01’, now empty, which is what I was expecting when I gave Oracle that timestamp. Looking at the trace file I can see (using a suitable egrep command) three calls (driven by calls to dbms_pdb_exec_sql()) to drop an explicitly named partition, followed by this delete statement:
delete from audsys.aud$unified where event_timestamp < :1 and (dbid = :2 or dbid = 0)
So Oracle has been very clever, but not quite clever enough, when removing data from the audit table; it drops any partition whose high_value is (strictly) less than the timestamp you supply and then, if necessary, uses a delete on just the one partition that has become the oldest.
So, a couple of suggestions for cleaning the audit trail (subject to you checking that I haven't missed any important details, and allowing for variations due to your version of Oracle):
- If you plan to set up a regular call to clean the audit table it might be a very good idea to match the partitioning interval with the schedule so that you never have to remove part of the data from a partition (i.e. you want to avoid running a big delete)
- Given what goes on with the last_archive_timestamp in my tests, it looks like a good idea to ensure that the value you pass to the cleaning procedure is a "high_value plus one (micro)second" (did you spot the "spare" second in my query result on dba_audit_mgmt_last_arch_ts?) so that you drop every relevant partition rather than deleting every single row from the new "oldest" partition – see the sketch below.
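Reworking my earlier anonymous block along those lines gives a minimal sketch – the timestamp is just the high_value from my example data plus one microsecond:
begin
        dbms_audit_mgmt.set_last_archive_timestamp(
                dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
                timestamp '2022-09-01 00:00:00.000001'
        );
        dbms_audit_mgmt.clean_audit_trail(
                audit_trail_type        => dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
                use_last_arch_timestamp => true
        );
end;
/
With that timestamp the SYS_P20893 partition (high_value ‘2022-09-01’) should be dropped along with the older partitions rather than being emptied by a row-by-row delete.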
The delete statement is a little puzzling – how can you drop one partition but think you need to check the dbid before deleting the data from another partition? Maybe the apparent anomaly is a sign that Oracle is reusing a delete statement that would have to replace the drop commands if you had a setup where multiple PDBs were writing to the same (CDB) audit table.
You’ll notice that I’ve passed a (symbolic) parameter value of dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED to the calls to the packaged procedures. If you haven’t moved all the different audit trails (the “legacy” audit, the FGA audit, and any others) into the unified audit there are other parameter values for cleaning the other trails.
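For example – and this is only a sketch using the constants I'm aware of, so check the full list of audit_trail_* constants in the dbms_audit_mgmt documentation for your version – cleaning the fine-grained audit trail rather than the unified trail just means supplying a different constant:
begin
        dbms_audit_mgmt.clean_audit_trail(
                audit_trail_type        => dbms_audit_mgmt.audit_trail_fga_std,
                use_last_arch_timestamp => true
        );
end;
/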
Questions appear quite frequently about how to do arithmetic or comparisons with the high_value column from dba_tab_partitions – the problem has disappeared in very recent versions of Oracle, and dbms_audit_mgmt has an interesting utility function: get_part_highval_as_char(). I spoke too soon – the only table that this function will work for is audsys.aud$unified, so unless you're on 23c (with CLOB and JSON versions of the column) you'll have to fall back to something like the method I showed in a post I wrote earlier this year.
SQL> select dbms_audit_mgmt.get_part_highval_as_char('SYS_P31737') from dual;
DBMS_AUDIT_MGMT.GET_PART_HIGHVAL_AS_CHAR('SYS_P31737')
--------------------------------------------------------------------------------
2023-04-01 00:00:00
1 row selected.
This is a note about a topic that I’m not really familiar with, written rather quickly, and not yet reviewed for grammar and typing errors.
IN / NOT IN
A question appeared on the Oracle developer forum recently (Oct 2024) that could be answered by a blog note that I started drafting about 18 months ago after answering a very similar question that had appeared in the same forum. Take a look at the following query, with results, about a status column in a fairly large dataset:
select status, count(*)
from my_table
group by status
order by 2 desc;
STATUS COUNT(*)
-------------------------
STATUS_7 10580209
STATUS_6 23902
STATUS_5 1504
STATUS_4 48
STATUS_3 17
STATUS_2 8
STATUS_1 5
As you can see, almost all the data is at ‘STATUS_7’ or ‘STATUS_6’. The user wanted some code to report the rows that were not in these two statuses, which leads to one of two possible constructs:
select message_id
from my_table
where status in ('STATUS_1', 'STATUS_2', 'STATUS_3', 'STATUS_4', 'STATUS_5')
;
select message_id
from my_table
where status not in ('STATUS_6', 'STATUS_7')
;
Thanks to an index on status the first query returned its results in 0.05 seconds, while the second had to do a full table scan that took 6 seconds to complete. The problem the user faced was that other status values might appear in the future, so the slow query was the safe one because it wouldn't need to be edited later. So how do you get that "not in" query to run as quickly as the "in" query?
The first thing to check, of course, is how much freedom you have to modify the database and the code; it's also important to check that any suggestion you make will produce the same result set. (It's also a reasonably safe bet (on any forum) that the user has simplified the requirement in a way that means any suggested solution will need refinement to fit the real problem.)
In this case the answer is (or seems to be) simple – change the question by taking advantage of the right Oracle feature and you can produce a solution that may even eliminate some of the current application overhead. "Add" a column to the table that holds a value only for rows that are in a status of interest, then index that column – and the best way to "add" that column in all supported (and many unsupported) versions of Oracle is probably to create it as a virtual column:
alter table my_table
add rare_status /* invisible */
generated always as
(case when status in ('STATUS_6','STATUS_7') then null else status end)
virtual
/
create index mt_i1 on my_table(rare_status);
execute dbms_stats.gather_table_stats(user,'my_table', method_opt=>'for columns rare_status size 1')
select rare_status, message_id from my_table where rare_status is not null;
You'll notice, of course, that I've included the invisible option only as a comment when adding the virtual column. Making the column invisible is generally a good idea, though it may be a little confusing for people to see that they can query a table using a column that doesn't (apparently) exist. The plus point for this approach is that (naughty) code that executes an "insert as select" statement without specifying the list of target columns will still succeed, rather than raising the error "ORA-00947: not enough values" for no apparent reason.
The fact that I’ve used the predicate “is not null” might prompt you to ask if there are any traps in this approach if status is allowed to be null, and that’s something to check if you extend the strategy to handle a more complicated requirement. In this case the “not in (6,7)” and “in (1,2,3,4,5)” return the same results even if there are rows where the status is null.
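If you want to convince yourself of the NULL behaviour, here's a trivial check (not part of the original example): a NULL never satisfies either form of the predicate because each comparison evaluates to unknown.
select count(*) hits from dual where null not in ('STATUS_6', 'STATUS_7');
select count(*) hits from dual where null in ('STATUS_1', 'STATUS_2', 'STATUS_3', 'STATUS_4', 'STATUS_5');
Both queries report zero hits, so rows with a NULL status are excluded by both the "in" and the "not in" version of the query.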
Here’s the execution plan (ignore the numbers since there’s no data in the table) for the “is not null” query:
Plan hash value: 3640753122
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 38 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE | 2 | 38 | 2 (0)| 00:00:01 |
|* 2 | INDEX FULL SCAN | MT_I1 | 1 | | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("RARE_STATUS" IS NOT NULL)
If we return to the user’s original query – it selects message_id and their plan shows the need to visit the table to collect that value. Since the index we’ve now created on the “interesting” rows is so small we might as well extend it to include message_id – remembering to use the same type of case expression so that we include values only for the necessary small number of rows in the index.
alter table my_table
add rare_message_id invisible
generated always as
(case when status in ('STATUS_6','STATUS_7') then null else message_id end)
virtual
/
drop index mt_i1;
create index mt_i1 on my_table(rare_status, rare_message_id);
execute dbms_stats.gather_table_stats(user,'my_table', method_opt=>'for columns rare_message_id size 1')
select rare_status, rare_message_id from my_table where rare_status is not null;
Execution Plan
----------------------------------------------------------
Plan hash value: 3308683209
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 18 | 1 (0)| 00:00:01 |
|* 1 | INDEX FULL SCAN | MT_I1 | 2 | 18 | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("RARE_STATUS" IS NOT NULL)
I've mentioned it before, and I keep wishing for it to appear in Oracle, but I'd really like to see the "filtered index" mechanism that can be used in SQL Server to handle this type of requirement without editing the code; something like:
create index mt_ss1 on my_table (status)
carrying (message_id)
where status not in ('STATUS_6','STATUS_7')
/
I’ve invented a syntax to say that this index should also include the message_id – again a feature (though I don’t remember the syntax) in SQL Server. Of course Oracle can carry “spare” columns, but only by adding them as if they belong in the index – which means their presence affects the selectivity and clustering_factor of the index and can, therefore, have serious side-effects on execution plans.
NOT IN bug
Some years ago I wrote a note describing how, for a query using identical literal values in the predicate, the optimizer could produce exactly the same plan body with exactly the same Predicate Information section while producing different cardinality estimates, depending on how you expressed the requirement for "column not in list of values".
I won't go into details of the anomaly since it disappeared in 11.2.0.4, but what I hadn't paid any attention to was that the arithmetic (for both variations) was wrong, and it was only a recent question on the Oracle database forum that made me revisit the article, spot the error, and test 23.3 to see if the arithmetic was still wrong. It is.
To demonstrate the problem, here’s a modified version of the code from the previous article. I usually try to write code that will create reproducible data across all versions of Oracle, but I didn’t do that in the previous post, and there was a significant difference between the data generated by 11.2.0.3 and the data generated by 23.3 so, among other details, I’ve changed the code to produce a fixed number of nulls.
rem
rem Script: not_in_anomaly_2.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2024
rem
rem Last tested
rem 23.3.0.0
rem 19.11.0.0
rem
create table t1
as
select
ao.*
from
all_objects ao
where
rownum <= 10000
/
update t1 set
object_type = null
where
rownum <= 1000
-- object_type = 'TABLE'
/
commit;
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname =>'T1',
method_opt => 'for all columns size 1'
);
end;
/
break on report
compute sum of ct on report
select
object_type, count(*) ct
from
t1
group by
object_type
order by
count(*)
/
select
sample_size, num_nulls, num_distinct, histogram
from
user_tab_columns
where
table_name = 'T1'
and column_name = 'OBJECT_TYPE'
;
This code (in my 23.3 database) produces a table of 10,000 rows, of which 1,000 have object_type set to NULL, with 13 distinct (non-null) values across the rest of the rows. There was a huge skew in the distribution of the data, but I’ve avoided that complication by collecting stats without histograms. Here are the results from the two queries against the data dictionary to confirm those details.
OBJECT_TYPE CT
----------------------- ----------
CONTEXT 2
PROCEDURE 2
DIRECTORY 2
OPERATOR 3
SEQUENCE 3
FUNCTION 14
PACKAGE 22
DOMAIN 22
INDEX 48
TABLE 71
TYPE 805
1000 <<--- NULL
SYNONYM 4002
VIEW 4004
----------
sum 10000
14 rows selected.
SAMPLE_SIZE NUM_NULLS NUM_DISTINCT HISTOGRAM
----------- ---------- ------------ ---------------
9000 1000 13 NONE
After generating this data I checked the execution plans – in particular the predicate information and cardinality estimates – of several different queries involving variations of “not in” predicates; but I just want to show one query, with its execution plan from autotrace, that demonstrates the error very clearly:
select
count(*)
from t1
where object_type not in ('INDEX', 'SYNONYM', 'VIEW')
union all
select
count(*)
from t1
where object_type in ('INDEX', 'SYNONYM', 'VIEW')
;
Execution Plan
----------------------------------------------------------
Plan hash value: 575959041
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 12 | 52 (4)| 00:00:01 |
| 1 | UNION-ALL | | 2 | 12 | 52 (4)| 00:00:01 |
| 2 | SORT AGGREGATE | | 1 | 6 | | |
|* 3 | TABLE ACCESS FULL| T1 | 7079 | 42474 | 26 (4)| 00:00:01 |
| 4 | SORT AGGREGATE | | 1 | 6 | | |
|* 5 | TABLE ACCESS FULL| T1 | 2077 | 12462 | 26 (4)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("OBJECT_TYPE"<>'INDEX' AND "OBJECT_TYPE"<>'SYNONYM' AND
"OBJECT_TYPE"<>'VIEW')
5 - filter("OBJECT_TYPE"='INDEX' OR "OBJECT_TYPE"='SYNONYM' OR
"OBJECT_TYPE"='VIEW')
As you can see, the optimizer has rewritten the “not in()” as a conjunct of inequalities and the “in” as a disjunct of equalities. If I had written the query in exactly that form I would still have got the same results – provided I didn’t go all the way back to 11.2.0.3 in my testing.
The critical point is this: check the Rows column for operations 3 and 5 (the two tablescans), and add the two values together. The result is 9,156. Between them the two branches of the query cover all the rows where object_type is not null – and that's 9,000 rows. Whatever the optimizer is doing, it ought (for such a simple query) to match the total row count.
The "in" branch gives the right estimate: for my 9,000 rows with a non-null object_type the number of distinct values is 13, so the selectivity of "column = constant" is 1/13; adding the selectivities for the three values in the list gives us a selectivity of 3/13 and a cardinality of 2,076.9 (q.e.d.)
The "not in" branch ought, therefore, to give us a selectivity of 10/13 (i.e. 1 – 3/13) for a cardinality of 6,923(.08). Oracle's estimate of 7,079 is out by 156, which is only a little over 2% in this case, so maybe the error is hardly ever noticeable to most people.
Where does the error come from? There are a couple of different ways of thinking about how the numbers could be produced. Here's one viewpoint, which shifts us from addition/subtraction to multiplication/division (which is a little more like the way the optimizer runs the arithmetic, I think).
To eliminate the first value we scale the rows down by one thirteenth; to eliminate the second value we scale that result by one twelfth (there are only 12 distinct values left), to eliminate the final value we scale the result by a further one eleventh – so the total is: 9000 * 12/13 * 11/12 * 10/11 = 6923.07 (q.e.d.)
BUT it looks as if Oracle is not allowing for the reduction in the number of distinct values as it eliminates each value in the list – so the optimizer's arithmetic becomes: 9000 * 12/13 * 12/13 * 12/13 = 7078.7 (q.e.d.)
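You can check the two versions of the arithmetic with a trivial query (this is just arithmetic on the figures above, not anything the optimizer reports directly):
select
        trunc(9000 * 10/13, 2)                  correct_cardinality,    -- 6923.07
        trunc(9000 * 12/13 * 12/13 * 12/13, 2)  observed_cardinality    -- 7078.74
from
        dual
/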
ConclusionThere is a flaw in the way that the optimizer calculates the selectivity of "not in {list of values}" – the error may be fairly small in most cases, but due to the nature of the error (and with the assistance of histograms and/or large numbers of rows with NULLs in the relevant column) there may be cases where a bad enough cardinality estimate appears and makes the optimizer choose a poor execution plan. The nature of the error is such that the optimizer will over-estimate the cardinality.
Index puzzle
What could make a “create index” statement fail with an error about a table having too many columns (ORA-01792)? And why would this suddenly happen when the code that raised the error had been dropping and recreating the index every night for the best part of 3 years with no problems?
This is a question that came up on the Oracle developer forum recently, and was finally answered by the original poster after getting an important clue from another participant in the thread.
The clue (supplied by Solomon Yakobson) was that when you create a “function-based” index Oracle will create a hidden virtual column on the table but if the table has already reached the limit on the number of columns (1,000 until very recently) the attempt to create the column will fail with error ORA-01792, leading to the create index statement failing.
But this drop/create has been working perfectly for years – and a query for the table definition reported only 40 columns – so how did an extra 960 columns sneak in to break things? It’s a bug that can appear when you mix virtual columns with function-based indexes – Oracle bypasses a section of the code path that is supposed to clean up after the “drop index”. Here’s a very simple example to demonstrate.
rem
rem Script: fbi_puzzle_4.sql
rem Author: Jonathan Lewis / Marco Antonio Romito
rem Dated: Sep 2024
rem Purpose:
rem
rem Last tested
rem 19.11.0.0
drop table t1;
create table t1(
n1 number(6,0),
n2 number(6,0) as (n1 + 1) virtual,
n3 number(6,0)
)
/
prompt =============================
prompt An index that causes problems
prompt =============================
create index t1_i1 on t1(n1, nvl(n3,0), n2);
column column_name format a32
column data_default format a32
select
column_id, column_name,
hidden_column, virtual_column, user_generated, internal_column_id,
data_default
from
user_tab_cols
where
table_name = 'T1'
order by
internal_column_id
/
I’ve created a table of three columns, where the second column is a virtual column. Then I’ve created an index on the table which includes references to all three columns, but one of those references includes a call to the nvl() function – which means Oracle will create a fourth (hidden, virtual) column for that column expression. Here are the results of the subsequent query against user_tab_cols:
COLUMN_ID COLUMN_NAME HID VIR USE INTERNAL_COLUMN_ID DATA_DEFAULT
---------- -------------------------------- --- --- --- ------------------ --------------------------------
1 N1 NO NO YES 1
2 N2 NO YES YES 2 "N1"+1
3 N3 NO NO YES 3
SYS_NC00004$ YES YES NO 4 NVL("N3",0)
4 rows selected.
As you can see, n2 is flagged as virtual (but not hidden), generated as "N1"+1; and there's a fourth – system-generated, hidden, virtual – column generated as NVL("N3",0).
As ever, things are more likely to break when you mix features. Here are a few more lines of code to run after creating the index and reporting the columns:
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
select
column_id, column_name,
hidden_column, virtual_column, user_generated, internal_column_id,
data_default
from
user_tab_cols
where
table_name = 'T1'
order by
internal_column_id
/
You might expect Oracle to drop the system-generated column as it drops the index – but here are the results we now get from querying user_tab_cols:
COLUMN_ID COLUMN_NAME HID VIR USE INTERNAL_COLUMN_ID DATA_DEFAULT
---------- -------------------------------- --- --- --- ------------------ --------------------------------
1 N1 NO NO YES 1
2 N2 NO YES YES 2 "N1"+1
3 N3 NO NO YES 3
SYS_NC00004$ YES YES NO 4 NVL("N3",0)
SYS_NC00005$ YES YES NO 5 NVL("N3",0)
SYS_NC00006$ YES YES NO 6 NVL("N3",0)
SYS_NC00007$ YES YES NO 7 NVL("N3",0)
SYS_NC00008$ YES YES NO 8 NVL("N3",0)
SYS_NC00009$ YES YES NO 9 NVL("N3",0)
9 rows selected.
Oracle “forgets” to drop the system-generated virtual column, and generates a new virtual column every time the index is recreated. Since the columns are hidden columns you won’t notice that something has gone wrong if you query the (more commonly used) view user_tab_columns – so there may well be a number of sites which have tables with huge numbers of hidden virtual columns, all quietly working their way towards unexpected ORA-01792 errors.
This “failure to drop” isn’t consistent behaviour – if the index had been defined as (n1, n2, nvl(n3,0)) then the system-generated column would have been dropped every time the index was dropped. I haven’t tried to work out exactly what conditions have to be met for the error to appear but I think it may be something to do with a user-defined virtual column appearing in an index after an expression that has to be supported by a system-generated virtual column.
Bear in mind that if I create two indexes (n1, nvl(n3,0)) and (n2, nvl(n3,0)) then Oracle will generate just one virtual column to support the expression nvl(n3,0), so when I drop one of these indexes Oracle will have to say: “does dropping this index mean I should drop a virtual column?” and I’m guessing that somewhere in that area of code Oracle is bypassing some of the checks it should be doing.
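Here's a minimal sketch of that scenario – it assumes you start from a freshly created copy of the t1 table (before any of the problem indexes exist), and the index names t1_i2 and t1_i3 are just my own choices:
create index t1_i2 on t1(n1, nvl(n3,0));
create index t1_i3 on t1(n2, nvl(n3,0));

select column_name, hidden_column, virtual_column, data_default
from   user_tab_cols
where  table_name = 'T1'
and    hidden_column = 'YES'
/
You would expect to see just one SYS_NC...$ column, with data_default NVL("N3",0), supporting both indexes; and if you drop one of the two indexes the column has to survive because the other index still depends on it.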
SummaryIf you have any indexes that contain both virtual columns and column expressions, and if you drop and recreate those indexes from time to time, check user_tab_cols to make sure that you don’t have multiple identical system-generated, hidden, virtual columns that support the expression(s) in those indexes.
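As a starting point for that check, here's a sketch of a query that flags tables with more than one system-generated hidden virtual column (having several isn't automatically a problem – a table may legitimately have one per distinct indexed expression – but an unexpectedly large count is worth a closer look):
select
        table_name, count(*) hidden_virtual_cols
from
        user_tab_cols
where
        hidden_column  = 'YES'
and     virtual_column = 'YES'
group by
        table_name
having
        count(*) > 1
order by
        count(*) desc
/
Comparing the expressions themselves is a little more awkward because data_default is a LONG column.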
WorkaroundIn my case I found the following steps were sufficient to clear up the excess occurrences (there's a SQL sketch after the list):
- drop the problem index (n1, nvl(n3,0), n2)
- create an index on just the problem expression (nvl(n3,0))
- drop that index (which got rid of all the copies of the generated virtual columns)
- recreate the problem index.
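In terms of the t1 example those steps translate into something like the following sketch (the temporary index name t1_tmp is my own invention):
drop index t1_i1;
create index t1_tmp on t1(nvl(n3,0));
drop index t1_tmp;
create index t1_i1 on t1(n1, nvl(n3,0), n2);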
In more complicated cases (e.g. several indexes following this pattern, multiple indexes that include the same expression etc.) you may have to develop a more sophisticated approach.



