Jonathan Lewis
Coalesce SQ bug
The “coalesce subquery” transformation introduced in the 11gR2 timeline can produce wrong results, and you will find bug reports and patches on MOS spread over versions from 12cR1 onwards. If you find that you are hitting a bug for which there appears to be no patch, there are several options for disabling the feature, with different levels of granularity.
- Set the hidden parameter _optimizer_coalesce_subqueries to false to disable all subquery coalescing
- Set fix_control ‘10216738:0’ to disable a particular (ANY/ALL) class of subquery coalescing
- Use the no_coalesce_sq(@query_block) hint to stop specific subqueries from being coalesced
- Set the optimizer_features_enable to 11.2.0.4 (not advised) to stop the optimizer from trying so hard.
Here’s an example of a problem with subquery coalescing that’s resurfaced in various forms, with patches that had varying degrees of success, since Oracle 12c. This example has been tested up to Oracle 19.11 and 23.7. It appeared as a question on the Oracle customer forum a couple of weeks ago, where the OP had modelled their production problem on a copy of the sample scott.emp table (see footnote 1 for a script to recreate the table).
rem
rem Script: coalesce_sq_bug.sql
rem Author: Adrián Gómez Brandón / jpl
rem Dated: 13th March 2025
rem
rem Last tested
rem 23.7.0.0
rem 19.11.0.0
rem
select
/*+
-- optimizer_features_enable('11.2.0.4')
-- opt_param('_optimizer_coalesce_subqueries','false')
-- opt_param('_fix_control','10216738:0')
-- no_coalesce_sq(@em)
-- no_coalesce_sq(@me)
*/
e1.empno,
e2.empno
from emp e1,
emp e2
where
e1.empno != e2.empno
and (e1.empno, e2.empno) not in (select /*+ qb_name(em) */ e3.empno, e3.mgr from emp e3 where e3.mgr is not null)
--
-- predicate for correct result (156 rows)
--
and (e2.empno, e1.empno) not in (select /*+ qb_name(me) */ e4.empno, e4.mgr from emp e4 where e4.mgr is not null)
--
-- predicate for wrong results (169 rows)
--
-- and (e1.empno, e2.empno) not in (select /*+ qb_name(me) */ e4.mgr, e4.empno from emp e4 where e4.mgr is not null)
--
/
There are several lines commented out in the query – mostly comments or hints – but the lines of interest are the two versions of the second subquery predicate: the active one comparing (e2.empno, e1.empno), and the alternative comparing (e1.empno, e2.empno) that is currently commented out a few lines further down. If you examine the two predicates you will see that they say the same thing in two different ways. Simplifying, one says (a, b) not in (x, y) while the other says (b, a) not in (y, x). Unfortunately, though, one of them produces the wrong result.
Here are the execution plans – each preceded by the predicate that produced the plan:
Right result
(e2.empno, e1.empno) not in (select /*+ qb_name(me) */ e4.empno, e4.mgr from emp e4 where e4.mgr is not null)
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 182 | 14196 | 13 (0)| 00:00:01 |
|* 1 | HASH JOIN RIGHT ANTI | | 182 | 14196 | 13 (0)| 00:00:01 |
|* 2 | TABLE ACCESS FULL | EMP | 13 | 338 | 3 (0)| 00:00:01 |
|* 3 | HASH JOIN ANTI | | 182 | 9464 | 10 (0)| 00:00:01 |
| 4 | NESTED LOOPS | | 182 | 4732 | 7 (0)| 00:00:01 |
| 5 | INDEX FULL SCAN | EMP_PK | 14 | 182 | 1 (0)| 00:00:01 |
|* 6 | INDEX FAST FULL SCAN| EMP_PK | 13 | 169 | 0 (0)| 00:00:01 |
|* 7 | TABLE ACCESS FULL | EMP | 13 | 338 | 3 (0)| 00:00:01 |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("E1"."EMPNO"="E3"."EMPNO" AND "E2"."EMPNO"="E3"."MGR")
2 - filter("E3"."MGR" IS NOT NULL)
3 - access("E2"."EMPNO"="E4"."EMPNO" AND "E1"."EMPNO"="E4"."MGR")
6 - filter("E1"."EMPNO"<>"E2"."EMPNO")
7 - filter("E4"."MGR" IS NOT NULL)
Wrong result
(e1.empno, e2.empno) not in (select /*+ qb_name(me) */ e4.mgr, e4.empno from emp e4 where e4.mgr is not null)
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 182 | 9464 | 10 (0)| 00:00:01 |
|* 1 | HASH JOIN ANTI | | 182 | 9464 | 10 (0)| 00:00:01 |
| 2 | NESTED LOOPS | | 182 | 4732 | 7 (0)| 00:00:01 |
| 3 | INDEX FULL SCAN | EMP_PK | 14 | 182 | 1 (0)| 00:00:01 |
|* 4 | INDEX FAST FULL SCAN| EMP_PK | 13 | 169 | 0 (0)| 00:00:01 |
|* 5 | TABLE ACCESS FULL | EMP | 13 | 338 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("E1"."EMPNO"="E3"."EMPNO" AND "E2"."EMPNO"="E3"."MGR")
4 - filter("E1"."EMPNO"<>"E2"."EMPNO")
5 - filter("E3"."MGR" IS NOT NULL)
The first (correct) plan shows us two hash anti-joins (i.e. “not in first SQ” and “not in second SQ”). The second plan shows us only one hash anti-join, and the predicate information tells us it’s referencing the occurrence of emp with the alias e3. It looks as if the optimizer has “lost” the second subquery – but how could that happen?
The biggest clue is in the Outline Data section (which I didn’t show). For the second (wrong) query it includes the hints: COALESCE_SQ(@”EM”) COALESCE_SQ(@”ME”). The optimizer thinks the two subqueries can be collapsed into a single subquery that will still get the correct answer. The optimizer was wrong, and the logic it used in the transformation has (effectively) left us with just the first of the two subqueries – and too many rows in the result set.
Workarounds
After seeing the coalesce_sq() hints in the Outline Data my first thought was simply to add the two no_coalesce_sq() hints you see in the SQL I showed above. This gave me the right answer, with the same plan as the working alternative.
It then occurred to me to check whether I actually needed both no_coalesce_sq() hints – and I didn’t, either one would do just as well on its own. This isn’t really surprising: the hint (in both its positive and negative versions) operates at the query block level and names a query block that should/should not be coalesced; since I have just two subqueries that could be considered for coalescing, if I tell the optimizer that one of them must not be coalesced then there is nothing left that the optimizer could use to coalesce the other subquery with. I would have had to be more fussy if there had been 3 or more subqueries for the optimizer to choose from.
Another option, of course, is to disable the subquery coalescing feature completely – which could be done at the system level with a startup parameter or through an alter system call, or it could be done at the session level with an alter session call, or, as I’ve shown here, it’s a parameter that can be set for the query with the opt_param(‘_optimizer_coalesce_subqueries’,’false’) hint.
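If you can’t (or don’t want to) hint the SQL, the session-level equivalents of the two opt_param() hints are simple alter session calls. A minimal sketch – with the usual caveat that you should check with Oracle support before changing hidden parameters on a production system:

alter session set "_optimizer_coalesce_subqueries" = false;

-- or the narrower option discussed below
alter session set "_fix_control" = '10216738:0';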
I spent a little time searching MOS for any appearance of this problem, and did find a handful of bug reports about similar problems, patch notes about fixes, and patches that superseded patches because of errors that still appeared when the first patch was in place. So I passed the test case on to Oracle and it’s now in the system as an (unpublished) bug, but I got a very rapid reply that a workaround mentioned in some of the other MOS bug notes was appropriate for this example – a less aggressive blocking action, disabling only part of the subquery coalescing code: set fix control 10216738 to zero (hence the second opt_param() hint above).
If you check the v$system_fix_control view you’ll see that 10216738 has the description: “Toggels [sic] subquery coalescing for ANY and ALL subqueries”. Remembering that “not in ()” is equivalent to “!= ALL()” we can recognize that this is the area of code where we’re having a problem, so this is probably the minimal blocking that we can do to avoid the problem.
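A quick check of the current setting (and the optimizer version it is tied to) only needs a simple query against the view:

select  bugno, value, optimizer_feature_enable, description
from    v$system_fix_control
where   bugno = 10216738
;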
There was one other hack that I’d tried before getting the information about the fix control – I had re-run the query several times taking the optimizer_features_enable backwards through terminal versions of Oracle until it stopped producing the wrong results – hence the optimizer_features_enable(‘11.2.0.4’) hint, which you probably won’t want to use in a production system. One of the side effects of going back to the 11.2.0.4 optimizer features was that fix control 10216738 dropped to zero, so the fix control is clearly the best option until Oracle comes up with a working patch. My preference would be to do this through the fix_control startup parameter.
Footnote 1 (“scott.emp”)
rem
rem Script: build_emp.sql
rem Dated: Mar 2004
rem
drop table emp;
create table emp (
	empno    number(4) not null,
	ename    varchar2(10),
	job      varchar2(9),
	mgr      number(4),
	hiredate date,
	sal      number(7, 2),
	comm     number(7, 2),
	deptno   number(2)
);
insert into emp values (7369, 'SMITH', 'CLERK', 7902, to_date('17-dec-1980', 'dd-mon-yyyy'), 800, null, 20);
insert into emp values (7499, 'ALLEN', 'SALESMAN', 7698, to_date('20-feb-1981', 'dd-mon-yyyy'), 1600, 300, 30);
insert into emp values (7521, 'WARD', 'SALESMAN', 7698, to_date('22-feb-1981', 'dd-mon-yyyy'), 1250, 500, 30);
insert into emp values (7566, 'JONES', 'MANAGER', 7839, to_date('2-apr-1981', 'dd-mon-yyyy'), 2975, null, 20);
insert into emp values (7654, 'MARTIN', 'SALESMAN', 7698, to_date('28-sep-1981', 'dd-mon-yyyy'), 1250, 1400, 30);
insert into emp values (7698, 'BLAKE', 'MANAGER', 7839, to_date('1-may-1981', 'dd-mon-yyyy'), 2850, null, 30);
insert into emp values (7782, 'CLARK', 'MANAGER', 7839, to_date('9-jun-1981', 'dd-mon-yyyy'), 2450, null, 10);
insert into emp values (7788, 'SCOTT', 'ANALYST', 7566, to_date('09-dec-1982', 'dd-mon-yyyy'), 3000, null, 20);
insert into emp values (7839, 'KING', 'PRESIDENT', NULL, to_date('17-nov-1981', 'dd-mon-yyyy'), 5000, null, 10);
insert into emp values (7844, 'TURNER', 'SALESMAN', 7698, to_date('8-sep-1981', 'dd-mon-yyyy'), 1500, 0, 30);
insert into emp values (7876, 'ADAMS', 'CLERK', 7788, to_date('12-jan-1983', 'dd-mon-yyyy'), 1100, null, 20);
insert into emp values (7900, 'JAMES', 'CLERK', 7698, to_date('3-dec-1981', 'dd-mon-yyyy'), 950, null, 30);
insert into emp values (7902, 'FORD', 'ANALYST', 7566, to_date('3-dec-1981', 'dd-mon-yyyy'), 3000, null, 20);
insert into emp values (7934, 'MILLER', 'CLERK', 7782, to_date('23-jan-1982', 'dd-mon-yyyy'), 1300, null, 10);
--
-- Move emp to pack the data into the first few blocks
-- (and gather statistics)
--
alter table emp move online;
alter table emp add constraint emp_pk primary key(empno);
ORA-01652
This is just a reminder and a quick note about the Oracle error:
ORA-1652: unable to extend temp segment by 128 in tablespace {not the TEMP tablespace}
Since the named tablespace is not a temporary tablespace, the error has occurred as a data segment was being created, rebuilt, moved, or was trying to extend – and the error message doesn’t tell you which segment. But (as with many other error messages) things have changed with the most recent version(s) of Oracle. Here’s the same error message from 23.6 when I forced it to happen with a simple “create table as select”:
ORA-01652: unable to grow table TEST_USER.T2 in tablespace TEST_8K_ASSM by 1MB
during operation with SQL ID : gxt04puvaugw5,
temp space used by session : 0 (MB)
Help: https://docs.oracle.com/error-help/db/ora-01652/
The alert log has an additional line:
ORA-01652 addlInfo tsn:7 objn:76831 objd:76831 user:TEST_USER obj:T2 subobj: type:2
So the reporting is much better – and you even get the sql_id of the statement that caused the error – which you might be able to find in the library cache.
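If the statement is still in the library cache, a quick check against v$sql using the reported sql_id is all it takes:

select  sql_id, child_number, sql_text
from    v$sql
where   sql_id = 'gxt04puvaugw5'
;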
If you’re not on the latest version, though, or don’t get to see the error message in the alert log until hours later, or if there’s nothing in the library cache when you go searching, how do you discover the SQL that caused the error (or even identify the object if you’re on an older version of Oracle)?
Unless you are subject to very frequent recurrence of the error you could just dump an errorstack whenever the error occurred:
alter system set events '1652 trace name errorstack level 1';
Now, when the error occurs the guilty session will dump a trace file – which will be reported with the ORA-01652 error message in the alert log. The file will probably be several thousand lines long – it depends on what you’ve been doing before the error occurs – but in the first 50 lines you should find something like this:
----- Current SQL Statement for this session (sql_id=gxt04puvaugw5) -----
create table t2 as select t1.* from t1, (select rownum from dual connect by level <= 10)
The important bit to search for is “Current SQL Statement” or, if you’re on 23c, the SQL_ID that was reported in the original error message.
Don’t forget to disable the trace once you’ve identified the source of the problem.
alter system set events '1652 trace name errorstack off';
Oracle Evolution
I responded to a problem on one of the Oracle forums recently about getting rid of some temporary segments that wouldn’t disappear after an “alter table move …” command had failed. While searching for a note I’d written about how smon handled this situation I came across an article I wrote in the days of version 7.3.3.5 describing its actions so, in an idle moment, I decided to check whether the note was still correct in the days of 19c.
It wasn’t – so here’s a tiny (and probably pointless) comment about an interesting little change that probably caters for ever increasing sizes of database and volume of activity.
In Oracle 7 (and probably some later versions) smon would wake up every 300 seconds (5 minutes) to check if there were any free extents in the fet$ table (remember – we’re back in the days of freelist management and ASSM didn’t exist) and if it found any adjacent free extents it would coalesce a few of them, i.e. replace a few rows in fet$ describing small extents with a single row describing a larger extent. This hasn’t changed, although the newer versions of Oracle select a couple more columns from fet$ than they used to when doing this check.
Every 25th wakeup (which means every 2 hours and 5 minutes) smon would also look for “lost” segments, i.e. the ones listed in seg$ that were marked as temporary (type# = 3), and do the work needed to clean them out – which entailed deleting the row from seg$, deleting any related rows from fet$ and, possibly, doing a couple of other deletes and updates (e.g. tablespace quotas: tsq$). This has changed.
I enabled tracing on smon in an instance of Oracle 19.11 – and then enabled system-wide tracing on the query of seg$ that was appearing in v$sql but wasn’t appearing in the smon trace. It turned out that the query was now being executed every 2 hours (7,200 seconds) by one of the worker processes (Wnnn) controlled by the space management coordinator (SMCO).
I referred to these segments as “lost” rather than temporary – this was just to emphasise the point that I wasn’t talking about segments in the TEMPORARY tablespaces, but segments in a PERMANENT tablespace that had been created as (for example) the space needed during a call to “create index”. Occasionally, as with the forum question, large scale tasks like this fail unexpectedly and don’t always clean up their own messes – and that’s why there’s a process doing a little housekeeping.
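If you want to check whether you currently have any of these stray segments you don’t need to query seg$ as SYS; they should show up in dba_segments with a segment_type of ‘TEMPORARY’ (and a segment name built from the file and block numbers). A minimal sketch:

select  owner, segment_name, tablespace_name, extents, blocks
from    dba_segments
where   segment_type = 'TEMPORARY'
order by
        owner, tablespace_name, segment_name
;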
You’ll note, of course, that I haven’t said anything about what happens after the check for these type 3 temporary segments has found a target. Having found that it was the Wnnn processes that searched for the segments I enabled sql_trace system-wide for all the Wnnn processes then started a session to execute a “create massive table” statement, but used a kill -9 on its shadow process from the operating system. This left me with a suitable temporary segment in my schema that disappeared a couple of minutes later, leaving the evidence I needed in the trace file from process W007 in the form of these three (cosmetically edited) consecutive statements:
select file#, block#, ts#, blocks from seg$ where type# = 3
update seg$ set
type#=:4,blocks=:5,extents=:6,minexts=:7,maxexts=:8,extsize=:9,extpct=:10,user#=:11,iniexts=:12,
lists=decode(:13, 65535, NULL, :13),groups=decode(:14, 65535, NULL, :14), cachehint=:15,
hwmincr=:16, spare1=DECODE(:17,0,NULL,:17),scanhint=:18, bitmapranges=:19
where
ts#=:1 and file#=:2 and block#=:3
delete from seg$ where ts#=:1 and file#=:2 and block#=:3
The entire “search and destroy” sequence has migrated to the Wnnn process.
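If you want to repeat the experiment, here’s one possible way of identifying and tracing an individual space-management worker using dbms_monitor – only a sketch, and remember that the Wnnn processes are transient, which is why I simply enabled tracing for all of them:

select  s.sid, s.serial#, p.pname
from    v$session s, v$process p
where   s.paddr = p.addr
and     p.pname like 'W0%'
order by
        p.pname
;

-- pick a sid and serial# from the output (the values below are hypothetical)
execute dbms_monitor.session_trace_enable(session_id => 123, serial_num => 456, waits => false, binds => false)
-- ... wait for the next space management check to take place ...
execute dbms_monitor.session_trace_disable(session_id => 123, serial_num => 456)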
Virtual Nuisance
Here’s a note that’s been sitting as a draft for the last 7 years – finally dusted down, retested, and published. Following my new publishing paradigm, I’m going to tell you the conclusion at the start, then show the working that demonstrates the point.
Summary
When cloning a table, or subset of a table, the simplest method in the “good old days” was to execute a call to: “create table cloned_table as select * from source_table where …” (CTAS) then create any indexes that you needed on the clone. Unfortunately if you’ve created any (visible) virtual columns the CTAS will generate real columns holding the calculated values.
There are various workarounds to this, of course, though not all viable for all systems.
- The “obvious and sensible” option is (or used to be) to identify just the real columns in the select list. But then you’d have to remember to check and add all the virtual bits afterwards (including the invisible bits).
- A “lazy” option that I’ve seen suggested a few times is to drop the virtual columns, clone the table, then recreate the virtual columns – a slightly better bet might be to make them invisible before cloning then make them visible (on the source) afterwards. Either way, though, this isn’t a viable strategy for a production/shared developer table and there are cases where it simply can’t be used and, again, you have to recreate the virtual columns on the clone.
A fairly nice strategy appeared in the 12.2 feature “create table XX for exchange …” which doesn’t have to be aimed at a partitioned table. This will create the virtual, including invisible, columns. You still have to insert the data, however, which means you still have to create a select list that covers all the “real” columns in the right order; and you still have to recreate the indexes and some constraints.
Discussion and demonstration
Quoting from the SQL Language Reference manual 23c for the Create Table statement, you will find under the subtitle “for exchange with table” the statement:
This clause lets you create a table that matches the structure of an existing partitioned table.
This is true – but not the whole truth. It will do what it says, but it can also be used to create a table that matches the structure of an existing non-partitioned table. This means that when you want to clone a table (or subset of a table) that includes virtual columns and hidden columns, it’s a little easier to get it right. You still have to get the data into the table, though, and recreate indexes and some constraints.
The following script shows two attempts at cloning – first using a simple “create as select * ..”, the second using a “create for exchange” followed by steps to populate the copy – first the wrong way then the right way.
We start by creating a source table (a subset of the view all_objects) then add a couple of indexes, a primary key (which also means not null) constraint, and a couple of virtual columns (one invisible).
rem
rem Script: virtual_nuisance.sql
rem Author: Jonathan Lewis
rem Dated: Feb 2018
rem Purpose:
rem
rem Last tested
rem 23.4.0.0 (23ai)
rem 19.11.0.0
rem 12.2.0.1
rem 11.2.0.4
prompt ====================
prompt Basic Table Creation
prompt ====================
create table source_table nologging
as
select *
from sys.all_objects
where rownum <= 10000
;
-- alter table source_table modify object_id not null;
alter table source_table add constraint st_pk primary key(object_id);
create unique index st_u1 on source_table(object_type, object_id);
create index st_i1 on source_table(object_name);
prompt ===================
prompt Add virtual columns
prompt ===================
alter table source_table
add (
new_name_visible
generated always as ( initcap(object_name) ) virtual,
new_name_invisible invisible
generated always as ( lower(object_name) ) virtual
)
;
Now we create two copies of the table, one with a simple “create as select *”, the other as a “create for exchange” which means we then have to populate it, which we try in two ways – first using “insert select *” which fails, leading us to insert a list of named columns:
prompt ======================================================
prompt Clone table - which instantiates the visible virtual column
prompt ======================================================
create table clone_table as select * from source_table;
prompt ====================================
prompt Create for exchange (12.2+)
prompt ====================================
create table exchange_table for exchange with table source_table;
insert into exchange_table select * from source_table;
-- *
-- ERROR at line 1:
-- ORA-54013: INSERT operation disallowed on virtual columns
-- Help: https://docs.oracle.com/error-help/db/ora-54013/
--
insert into exchange_table (
owner, object_name, subobject_name, object_id, data_object_id, object_type,
created, last_ddl_time, timestamp, status, temporary, generated,
secondary, namespace, edition_name, sharing, editionable, oracle_maintained,
application, default_collation, duplicated, sharded, created_appid,
created_vsnid, modified_appid, modified_vsnid
)
select
owner, object_name, subobject_name, object_id, data_object_id, object_type,
created, last_ddl_time, timestamp, status, temporary, generated,
secondary, namespace, edition_name, sharing, editionable, oracle_maintained,
application, default_collation, duplicated, sharded, created_appid,
created_vsnid, modified_appid, modified_vsnid
from
source_table
/
-- 10000 rows created.
As a check we can select a few rows in the exchange table to see that the virtual columns exist and hold the expected data. (Remember that the invisible columns can be selected by name, even though they are invisible in a describe or a “select *”.)
column object_name format a24
column new_name_visible format a24
column new_name_invisible format a24
select
object_name, object_type, new_name_visible, new_name_invisible
from
exchange_table
where
rownum <= 5
/
OBJECT_NAME OBJECT_TYPE NEW_NAME_VISIBLE NEW_NAME_INVISIBLE
------------------------ ----------------------- ------------------------ ------------------------
ALL_OBJECT_TABLES VIEW All_Object_Tables all_object_tables
ALL_OBJECT_TABLES SYNONYM All_Object_Tables all_object_tables
ALL_ALL_TABLES VIEW All_All_Tables all_all_tables
ALL_ALL_TABLES SYNONYM All_All_Tables all_all_tables
DBA_TABLES VIEW Dba_Tables dba_tables
5 rows selected.
Now check which of the three interesting columns exist in the three tables, and what their state is:
break on table_name skip 1
select
table_name, column_name,
virtual_column "Virtual",
hidden_column "Hidden"
from user_tab_cols
where table_name in ( 'SOURCE_TABLE', 'CLONE_TABLE', 'EXCHANGE_TABLE')
and column_name in ('OBJECT_NAME', 'NEW_NAME_VISIBLE','NEW_NAME_INVISIBLE')
order by
table_name desc, column_name desc
/
TABLE_NAME COLUMN_NAME Vir Hid
------------------------- -------------------- --- ---
SOURCE_TABLE OBJECT_NAME NO NO
NEW_NAME_VISIBLE YES NO
NEW_NAME_INVISIBLE YES YES
EXCHANGE_TABLE OBJECT_NAME NO NO
NEW_NAME_VISIBLE YES NO
NEW_NAME_INVISIBLE YES YES
CLONE_TABLE OBJECT_NAME NO NO
NEW_NAME_VISIBLE NO NO
8 rows selected.
Critically, the clone (CTAS) table doesn’t have the invisible virtual column, and the visible virtual column is no longer virtual. Both the virtual columns have appeared in the exchange table, both virtual, and the invisible column is still invisible (hidden).
But what of the indexes (including the implicit primary key index):
select
table_name, index_name, uniqueness
from
user_indexes
where table_name in ( 'SOURCE_TABLE', 'CLONE_TABLE', 'EXCHANGE_TABLE')
order by
table_name desc, index_name
;
TABLE_NAME INDEX_NAME UNIQUENES
------------------------- -------------------- ---------
SOURCE_TABLE ST_I1 NONUNIQUE
ST_PK UNIQUE
ST_U1 UNIQUE
3 rows selected.
Neither of the copies has any indexes – which is not a surprise for the clone (CTAS) table but is a little disappointing for a table we have (nominally) created in anticipation of an exchange with partition. But that is exactly what the manual would have told us would happen – if we’d looked.
There is one other detail to be careful of when cloning tables – will columns that report as NOT NULL in the source table be declared NOT NULL in the clone/exchange table? The answer depends on why they show as NOT NULL.
In my example the object_id in the source table shows up as NOT NULL when described, but doesn’t in either of the copies because it is a side-effect of the primary key constraint and that constraint has not been recreated in the copying process.
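So, whichever method you use, there is still some catching up to do after populating the copy. A sketch of the sort of thing needed for the exchange table – the constraint and index names here are my own invention, mirroring the originals:

alter table exchange_table add constraint et_pk primary key(object_id);
create unique index et_u1 on exchange_table(object_type, object_id);
create index et_i1 on exchange_table(object_name);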
ASSM states
When the space_usage procedure in the dbms_space package came out, reporting the “percentage free” space in ASSM blocks with ranges like 0-25%, 25%-50%, etc. and flags FS1 to FS4, I had great trouble remembering whether 0-25% was FS1 or FS4, and it was a strangely long time before I managed to get the obvious association to stick in my mind.
I’ve just found a new way of getting confused again. I’ve been looking at some redo log entries relating to bitmap space management, trying to work out why, and when, Oracle changes the bits. Each “bit” is actually a nybble (half a byte, if you didn’t know), which allows for 16 distinct values although only 8 states seem to be used, and when Oracle changes the state of a “bit” in an L1 bitmap the redo dump reports OP code 13.22 and lines like:
L1 bitmap change to state 3
So here (E & OE, and mostly for my own benefit) is a table of the flags reported by the dbms_space_usage package and the corresponding state.
- 7 = FS5 -> 100% empty
- 6 = not used ?
- 5 = FS4 -> 75% – 100%
- 4 = FS3 -> 50% – 75%
- 3 = FS2 -> 25% – 50%
- 2 = FS1 -> 0% – 25%
- 1 = FULL
- 0 = unformatted
Everything Changes
I saw a recent tweet (on Bluesky) from SQLDaily highlighting a blog note that Lukas Eder wrote in 2016 with the title: “Avoid using COUNT() in SQL when you could use EXISTS()”. This is a good guideline because it probably maximises the options the optimizer has for producing a good execution plan, but it’s not always necessary in recent versions of Oracle. It depends on what you’re trying to achieve and how you’re using count(), of course, but there are cases where the optimizer is capable of turning a count() subquery into an exists() subquery. Here’s a little demo running on a PDB in 19.11.0.0:
rem
rem Script: count_to_exists.sql
rem Author: Jonathan Lewis
rem Dated: Dec 2024
rem Purpose:
rem
rem Last tested
rem 23.4(ai)
rem 19.11.0.0
rem 12.2.0.1
rem
create table t1 as
select *
from all_objects
where rownum <= 5e4
/
create table t2 as
select *
from t1
where mod(object_id,50) = 0
/
create index t2_i1 on t2(owner);
set linesize 120
set pagesize 60
set trimspool on
set tab off
column object_name format a32
column owner format a32
select owner, count(*)
from t1
group by
owner
order by
count(*)
/
select owner from t1
minus
select owner from t2
/
set feedback only
set serveroutput off
select
owner, object_id, object_name
from
t1
where
t1.owner like 'D%'
and 0 < (
select count(*)
from t2
where t2.owner = t1.owner
)
/
set feedback on
select * from table(dbms_xplan.display_cursor)
/
I’ve included a couple of queries to show the numbers of rows for each owner in t1 and to highlight the owner values that appear in t1 but don’t appear in t2. (I used the results of those queries to pick the ‘D%’ predicate in the final query.) Here’s the execution plan for the final query in 19.11.0.0:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID cwj9t6wzhczgh, child number 0
-------------------------------------
select owner, object_id, object_name from t1 where t1.owner like
'D%' and 0 < ( select count(*) from t2 where t2.owner = t1.owner
)
Plan hash value: 82129279
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 278 (100)| |
|* 1 | HASH JOIN RIGHT SEMI| | 3244 | 167K| 278 (1)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T2_I1 | 144 | 720 | 2 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL | T1 | 6025 | 282K| 276 (1)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("T2"."OWNER"="T1"."OWNER")
2 - access("T2"."OWNER" LIKE 'D%')
filter("T2"."OWNER" LIKE 'D%')
3 - filter("T1"."OWNER" LIKE 'D%')
There’s no sign of any aggregation in the plan – so the count(*) can’t be happening. Unfortunately the optimizer has transformed the subquery out of existence and executed the query as a semi-join. So let’s make it easier to see what’s happened to the count(*) by stopping the optimizer from unnesting the subquery, and let’s use explain plan so that we can see what the complete Predicate Information looks like when the plan operates with a subquery.
explain plan for
select
owner, object_id, object_name
from
t1
where
t1.owner like 'D%'
and 0 < (
select /*+ no_unnest */
count(*)
from t2
where t2.owner = t1.owner
)
/
select * from table(dbms_xplan.display)
/
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3988898734
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 463 | 22224 | 278 (1)| 00:00:01 |
|* 1 | FILTER | | | | | |
|* 2 | TABLE ACCESS FULL| T1 | 6025 | 282K| 276 (1)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T2_I1 | 2 | 10 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "T2" WHERE
"T2"."OWNER"=:B1))
2 - filter("T1"."OWNER" LIKE 'D%')
3 - access("T2"."OWNER"=:B1)
As you can see, the optimizer has worked out that “select count(*) > 0” is equivalent to “exists select” and transformed the subquery accordingly; which means it will stop after finding the first matching row.
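For comparison, this is roughly the query the 2016 advice would have you write by hand – in this case the optimizer simply gets there on its own:

select
        owner, object_id, object_name
from
        t1
where
        t1.owner like 'D%'
and     exists (
                select  null
                from    t2
                where   t2.owner = t1.owner
        )
/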
Conclusion
I started by pointing out that a recent (Bluesky) tweet was echoing a blog note published in 2016. I also agreed that the general principle of using exists() to find the first occurrence rather than using count(*) was a good idea when you didn’t need to know “how many” and just needed to know “are there any”.
However I then demonstrated that the (Oracle) optimizer is so sophisticated that it can recognise when a count() subquery can be transformed into an exists() subquery – presumably because it gets a “clue” from the comparison with zero.
The point of this blog note, and the moral of the story is this: when trying to solve a problem by searching the internet, an article published 8 years ago may not give you the most up to date information or the best possible solution.
Consistent?
Here’s a cut-n-paste from a simple SQL*Plus session running under 19.11.0.0 (23.4 produces the same effects):
SQL> drop table t purge;
Table dropped.
SQL> create table t (id number, val number) rowdependencies;
Table created.
SQL> insert into t values(1,0);
1 row created.
SQL> commit;
Commit complete.
SQL> insert into t values(2,0);
1 row created.
SQL> update t set val = 1;
2 rows updated.
SQL> select * from t;
ID VAL
---------- ----------
1 1
2 1
2 rows selected.
Open a second terminal and start an SQL*Plus session that does the following:
SQL> update t set val = val + 1;
This statement is going to hang because the first (and only) row it ought to see at this moment in time is locked by session 1. If session 1 commits then session 2 should detect that it’s trying to update a row that has changed since the start of the transaction and rollback to restart so that it can do an update that is self-consistent as at a specific point in time.
Question – how many rows will session 2 update?
Here’s the cut-n-paste from session 2 that appeared as soon as session 1 committed:
1 row updated.
SQL> select * from t;
ID VAL
---------- ----------
1 2
2 1
2 rows selected.
This doesn’t look right. At one point in time the committed data in the table was the single row (1,0), and at another point in time the committed data in the table was the pair of rows (1,1) and (2,1). How can a single, self-consistent, update statement operate as at a single point in time and leave the table holding the rows you see above? Surely the statement should either fail leaving val = 1 in both rows, or it should succeed consistently leaving val = 2 in both rows.
As a further detail, note how I’ve enabled rowdependencies on the table. (This is something I usually do whenever I want to take advantage of the flashback version query feature though it’s not a necessity in this case). Here’s what I see when I commit in session 2 then query the entire versioning history of the table:
column versions_startscn format 99,999,999,999,999
column versions_endscn format 99,999,999,999,999
column versions_starttime format a24
column versions_endtime format a24
select
versions_starttime,
versions_startscn,
versions_endtime,
versions_endscn,
versions_operation,
versions_xid,
id, val
from t versions between scn minvalue and maxvalue
/
VERSIONS_STARTTIME VERSIONS_STARTSCN VERSIONS_ENDTIME VERSIONS_ENDSCN V VERSIONS_XID ID VAL
------------------------ ------------------- ------------------------ ------------------- - ---------------- ---------- ----------
09-DEC-24 02.38.22 PM 12,670,482,437,540 U 04001B004F600000 1 2
09-DEC-24 02.35.52 PM 12,670,482,437,468 I 06000000DA740000 2 1
09-DEC-24 02.35.52 PM 12,670,482,437,468 09-DEC-24 02.38.22 PM 12,670,482,437,540 U 06000000DA740000 1 1
09-DEC-24 02.35.52 PM 12,670,482,437,468 1 0
4 rows selected.
Surprisingly there is no “insert transaction” visible for (1,0), but, reading from the bottom upwards we can see that (1,0) ceased to exist at SCN 12,670,482,437,468 and that transaction 06000000DA740000 updated that row to (1,1) and inserted row (2,1). What we’re seeing as “versions” are the “versions as at commit”, of course, rather than the results of all the individual modifications (which is what you can get from Log Miner).
If we look at the (1,1) row again we can see that that version ceased to exist at SCN 12,670,482,437,540 and the top line of the output tells us that that’s thanks to transaction 04001B004F600000 changing it to (1,2). That’s our update from the second session that we believe should have updated every existing row in the table by adding 1 to the val column.
Finally, when we look at the row for (2,1) we see that it came into existence at the same time (SCN 12,670,482,437,468) as row (1,1) but was not changed by the transaction that “happened” a little later and changed (1,1) to (1,2).
This behaviour first came to my attention in a comment to a note I wrote in 2011 about an upgrade from 11.1.0.6 to 11.2.0.2. The sample I’ve shown above is taken from that comment, modified slightly to separate the table creation from the first insert (just in case create as select introduced some peculiar special effects) and there are other examples and observations in the comments following the one I’ve linked to. The anomaly described in the actual blog note was addressed in 11.2.0.3 and 12.1.0.2 but, as you can see, there are other anomalies still waiting to be addressed.
Fetch size
Part of my effort to clear my long backlog of drafts
This is a trivial script I wrote many years ago – the date on the draft (one of the oldest on my blog) was August 2009 – but the script itself must have started life in 8i, if not earlier.
All the script does is execute a simple-minded test to find SQL that may be wasting resources and causing contention by using “small” array fetches to return “large” amounts of data. It does this through a scan of v$sql or (for later versions of Oracle) v$sqlstats or v$sqlstats_plan_hash to compare execution counts, number of fetch calls, and total rows returned.
The original code – probably written for 8i – used v$sql and reported the first_load_time for each child cursor but an update to the script took advantage of the introduction in 10g of the view v$sqlstats and started reporting the last_active_time – which means you could choose to ignore (temporarily, perhaps) any statements that had not been active in the very recent past.
I’ve tended to avoid the view v$sqlarea because, in its earliest form, it was simply an aggregate view of the x$ structure behind v$sql. When v$sqlstats appeared its contents were similar to those of v$sqlarea, but its published implementation was a simple query against a different x$ so that’s the one I started using (having failed to notice that the implementation for v$sqlarea had changed to a simple query against yet another x$ at about the same time). Both x$ structures are created on the fly in the PGA so they’re probably doing similar amounts of aggregation under the covers, although with many differences in the details they now report – including a number of “deltas since last AWR snapshot” in the newer view.
rem
rem Script: small_fetch.sql
rem Author: Jonathan Lewis
rem Dated: (pre) Aug 2009
rem
set linesize 156
set pagesize 40
set trimspool on
set tab off
break on row skip 1
column first_load_time format a20
column last_active_time format a20
column sql_text wrap word
spool small_fetch
select
--
-- v$sql.parsing_user_id,
-- v$sql.parsing_schema_id,
-- v$sql.first_load_time,
--
to_char(last_active_time,'yyyy-mm-dd/hh24:mi:ss') last_active_time,
sql_id,
plan_hash_value,
executions,
end_of_fetch_count,
fetches,
rows_processed,
round(rows_processed/nullif(executions,0),0) rows_per_exec,
round(rows_processed/nullif(fetches,0),0) fetchsize,
sql_text
from
v$sqlstats
-- v$sql
-- v$sqlstats_plan_hash
where
plan_hash_value != 0 -- pl/sql block
-- and (v$sql.parsing_user_id != 0 or v$sql.parsing_schema_id != 0)
and fetches > 25000
and round(rows_processed/nullif(executions,0),0) > 5000
and round(rows_processed/nullif(fetches,0),0) < 50
order by
fetches desc
;
spool off
The numbers I’ve used in the SQL above are fairly arbitrary; your choices for “large” and “small” will almost certainly be different.
The original script is probably an ad hoc statement that I wrote in a few minutes on a client site to poke around the SQL being executed by a system that was showing network problems and surprisingly large numbers of waits for “SQL*Net message to client”, so I’ve published it more as an example of the type of thing you could do than an example of “best practice”. It may be helpful, but your specific circumstances and requirements may make you want to change the choice of view you use, the column you report, and the predicates you use to filter the results.
You’ll notice from the code that v$sql includes the parsing_user_id and parsing_schema_id, but v$sqlstats doesn’t. A convenience of using v$sql is that you can eliminate any statements parsed by SYS simply by adding the predicate “parsing_user_id != 0”. The reason why this is convenient is that a lot of the sys-recursive (largely dictionary cache) queries do single row fetches and you don’t want to clutter your report with things you can’t change.
A reason why v$sqlstats might be your preferred choice is that it has a large number of “delta_” statistic columns – which give you counts and times since the last AWR snapshot (though how that works if you’re not licensed to run the AWR, I haven’t checked).
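As a sketch of how that might be used – and the delta_ column names below are from memory, so check them against your version – the report could be restricted to activity since the last snapshot:

select
        sql_id,
        delta_execution_count,
        delta_fetch_count,
        delta_rows_processed,
        round(delta_rows_processed/nullif(delta_fetch_count,0),0)  recent_fetchsize
from
        v$sqlstats
where
        delta_fetch_count > 1000
and     round(delta_rows_processed/nullif(delta_fetch_count,0),0) < 50
order by
        delta_fetch_count desc
;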
Caution
The header on the script suggests that the last system I used (and edited) it on was 11.1.0.7 – so before container databases came into existence. There are all sorts of columns in the views I’ve mentioned that might be very helpful, but there’s also the potential for traps relating to which schema can see what data, and from which PDB. Here, for example, is a cut-n-paste showing something that looks like a bug when running as SYS in a PDB on an instance running 19.11.0.0: note particularly how I’ve changed the view in the three queries from v$sql to v$sqlstats then back to v$sql.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sql group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1282
1 row selected.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sqlstats group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1395
1 row selected.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sql group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1282
1 row selected.
What can v$sqlstats see that v$sql can’t see?
Optimizer Environment 2
Many years ago I wrote a short note to introduce the view v$sql_optimizer_env (and the session and system equivalents) which report the current values of optimizer related parameters (some of which are not found in the startup file) for a cursor child, or a session, or the system. I also mentioned the underlying x$kqlfsqce (for v$sql_optimizer_env) that held far more parameters than were reported in the v$ view.
Recently I commented on a question about different plans (i.e. child cursors) for the same query for different sessions and noted that changing the optimizer_features_enable might not change any of the visible (v$) parameters – other than optimizer_features_enable itself – so it might be necessary to check the x$ to find out what had changed. Here’s a possible query to do that, followed by a few lines of output:
rem
rem Script: optimizer_env_b.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem Purpose:
rem
-- select count(*) from all_users;
define m_sql_id = 'b2zqhgr5tzbpk'
spool optimizer_env_b.lst
select
kqlfsqce_pname,
kqlfsqce_chno,
kqlfsqce_pvalue,
kqlfsqce_flags
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
order by
kqlfsqce_pname,
kqlfsqce_chno
/
spool off
KQLFSQCE_PNAME KQLFSQCE_CHNO KQLFSQCE_PVALUE KQLFSQCE_FLAGS
-------------------------------------------------- ------------- ------------------------- --------------
BlockChain_ledger_infrastructure 0 0 10
BlockChain_ledger_infrastructure 1 0 10
CLI_internal_cursor 0 0 10
CLI_internal_cursor 1 0 10
PMO_altidx_rebuild 0 0 2
PMO_altidx_rebuild 1 0 2
_adaptive_window_consolidator_enabled 0 true 7
_adaptive_window_consolidator_enabled 1 true 7
_add_stale_mv_to_dependency_list 0 true 6
_add_stale_mv_to_dependency_list 1 true 6
_aggregation_optimization_settings 0 0 7
_aggregation_optimization_settings 1 0 7
_allow_level_without_connect_by 0 false 6
_allow_level_without_connect_by 1 false 6
...
uniaud_internal_cursor 0 0 10
uniaud_internal_cursor 1 0 10
workarea_size_policy 0 auto 2
workarea_size_policy 1 auto 2
1226 rows selected.
The script allows you to set an sql_id to report – the value I’ve used is for the statement select count(*) from all_users; The query as it stands has to be run by sys, but you could create a view (say x_$kqlfsqce) in the sys schema, grant select on that to public to allow everyone to query the view then create a public synonym for it that allowed everyone else to use the same query as sys:
create or replace view x_$kqlfsqce as select * from x$kqlfsqce;
grant select on x_$kqlfsqce to public;
create public synonym x$kqlfsqce for x_$kqlfsqce;
The drawback to this query is clear in the final lines of the output – in 19.11 it returns 613 rows per child cursor, leaving you to analyze (by eye?) the entire list to find the parameter names where the parameter values differ. A better query would report only those parameters where the values differed (ideally listing the values side by side, but that’s another story). For example:
rem
rem Script: optimizer_env_2.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem
-- select count(*) from all_users;
define m_sql_id = 'b2zqhgr5tzbpk'
spool optimizer_env_2.lst
with child_count as (
select
/*+ materialize */
count(distinct kqlfsqce_chno) child_count
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
)
select
kqlfsqce_chno,
kqlfsqce_pname,
kqlfsqce_pvalue,
kqlfsqce_flags
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
and (kqlfsqce_pnum, kqlfsqce_pvalue) in (
select
kqlfsqce_pnum, kqlfsqce_pvalue
from x$kqlfsqce
where kqlfsqce_sqlid = '&m_sql_id'
group by
kqlfsqce_pnum, kqlfsqce_pvalue
having count(*) != (select child_count from child_count)
)
order by
kqlfsqce_pname,
kqlfsqce_chno
/
spool off
KQLFSQCE_CHNO KQLFSQCE_PNAME KQLFSQCE_PVALUE KQLFSQCE_FLAGS
------------- -------------------------------------------------- ------------------------- --------------
0 _bloom_filter_setops_enabled true 7
1 _bloom_filter_setops_enabled false 7
0 _bloom_pruning_setops_enabled true 7
1 _bloom_pruning_setops_enabled false 7
0 _cell_offload_grand_total true 7
1 _cell_offload_grand_total false 7
0 _cell_offload_vector_groupby_fact_key true 7
1 _cell_offload_vector_groupby_fact_key false 7
...
0 _sqlexec_reorder_wif_enabled true 7
1 _sqlexec_reorder_wif_enabled false 7
0 optimizer_features_enable 19.1.0 2
1 optimizer_features_enable 12.2.0.1 0
58 rows selected.
For this example I ran my query against all_users twice, but changed the optimizer_features_enable from 19.1.0 (the default for 19.11) to 12.2.0.1 for the second execution, so I got two child cursors. As you can see from the output the number of optimizer parameters that changed was 29 (half of 58) and the only parameter that would have been visible in v$sql_optimizer_env was optimizer_features_enable itself.
The SQL identifies the duplicates by initially counting the number of child cursors for the sql_id – because you might have more than just two, and knowing the number of child cursors is the only way of identifying which parameters have different values for different child cursors.
The fact that (in general) you won’t necessarily know how many child cursors you have to compare and report is what makes it a little harder to write a query that reports variations in values across the page – so it’s left as an exercise for the reader. Personally I’d be inclined to add a clause “kqlfsqce_chno in (N, M)”, picking child cursors with different execution plans, to limit the number to just two interesting child cursors and stick with reporting down the page.
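As a sketch, assuming the two interesting child cursors were 0 and 1, the earlier query simplifies to the following – the materialized count of child cursors is no longer needed because we know we are comparing exactly two:

select
        kqlfsqce_chno,
        kqlfsqce_pname,
        kqlfsqce_pvalue,
        kqlfsqce_flags
from
        x$kqlfsqce
where
        kqlfsqce_sqlid = '&m_sql_id'
and     kqlfsqce_chno in (0, 1)
and     (kqlfsqce_pnum, kqlfsqce_pvalue) in (
                select  kqlfsqce_pnum, kqlfsqce_pvalue
                from    x$kqlfsqce
                where   kqlfsqce_sqlid = '&m_sql_id'
                and     kqlfsqce_chno in (0, 1)
                group by
                        kqlfsqce_pnum, kqlfsqce_pvalue
                having  count(*) != 2
        )
order by
        kqlfsqce_pname,
        kqlfsqce_chno
/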
Addendum
In case you’re wondering about the optimizer parameters that aren’t startup parameters, here’s a quick query to run as SYS. (The output is from 19.11 again):
select
distinct kqlfsqce_pname
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
minus
select
ksppinm
from
x$ksppi
/
KQLFSQCE_PNAME
--------------------------------------------------------------------------------
BlockChain_ledger_infrastructure
CLI_internal_cursor
PMO_altidx_rebuild
_enable_pmo_ctas
_fix_control_key
_is_lock_table_for_ddl_wait_lock
_is_lock_table_for_split_merge
_long_varchar_allow_IOT
_multi_commit_global_index_maint
_nls_forced_binary
_suppress_scn_chk_for_cqn
_update_bji_ipdml_enabled
_update_global_index_from_part_exchange
_use_hidden_partitions
advanced_queuing_internal_cursor
flashback_data_archive_internal_cursor
flashback_table_rpi
ilm_access_tracking
ilm_dml_timestamp
ilm_filter
iot_internal_cursor
is_recur_flags
kkb_drop_empty_segments
only_move_row
optimizer_features_hinted
optimizer_mode_hinted
parallel_autodop
parallel_dblink
parallel_ddl_forced_degree
parallel_ddl_forced_instances
parallel_ddl_mode
parallel_ddldml
parallel_degree
parallel_dml_forced_dop
parallel_dop_doubled
parallel_execution_enabled
parallel_hinted
parallel_max_degree
parallel_query_default_dop
parallel_query_forced_dop
parallel_query_mode
partition_pruning_internal_cursor
rim_node_exist
sql_from_coordinator
sqlstat_enabled
total_cpu_count
total_processor_group_count
transaction_isolation_level
uniaud_internal_cursor
49 rows selected.
Following up my announcement of this note on Bluesky someone asked if this information appeared in the 10053 (CBO) trace file. The answer is yes. Under a heading “PARAMETERS USED BY THE OPTIMIZER” there are three sub-headings (also in capitals, but made more readable below):
- Parameters with altered values
- Parameters with default values
- Parameters in opt_param hint
The first two of these sub-headings share the full set of optimizer parameters, with no duplicates across the two lists. The last sub-heading introduces duplication, listing the parameters where an opt_param() hint has over-ridden the default (system) or altered (session) settings.
WWSS
What would Shakespeare say?
- Q: So, Mr. Shakespeare, which RDBMS do you use to look up your “famous quotes”?
- A: My lords, I refer me to the Oracle.
- Q: And how would you feel about migrating to SQL Server or DB2?
- A: A plague a’ both your houses.
- Q: So what is your basic requirement for finding a good quote to use?
- A: If it were done when ’tis done, then ’twere well it were done quickly.
- Q: Do you think that using multiple block sizes in your database would improve performance?
- A: It is a tale told by an idiot full of sound and fury signifying nothing.
- Q: The CBO has a big part to play in performance – what makes it so difficult to understand?
- A: Her infinite variety.
- Q: And what would you say to a DBA who claims to understand the CBO?
- A: There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.
- Q: Do you think it’s a good idea to use hints when the CBO misbehaves?
- A: Though this be madness, yet there is method in ‘t.
- Q: And how much faith do you have in the execution plans that the CBO generates?
- A: Oft expectation fails, and most oft there Where most it promises.
- Q: NULLs can, of course, make things hard for the CBO – what’s your view on them?
- A: Nothing can come of nothing.
- Q: And how do you interpret NULLs?
- A: To be or not to be, that is the question.
- Q: Do you document the SQL statements you write?
- A: It is a custom more honor’d in the breach than the observance.
- Q: What is your experience with Oracle support?
- A: The course of true love never did run smooth.
- Q: And how you do think support will be affected by the introduction of AI?
- A: More matter with less art.
- Q: Are we going to see 23ai installed at Anne Hathaway’s cottage any time soon?
- A: Expectation is the root of all heartache.
- Q: And finally, what do you make of Sym42 and the MASH program?
- A: O brave new world that has such people in it.
4096 Columns
If you thought 1,000 columns was over the top for a table in Oracle, then you’ll love 23ai which allows you to stretch that to 4,096 columns if you’ve set the parameter max_columns = extended in the spfile (or the pdb equivalent).
Naturally I was curious to see if there had been any significant structural changes to the way Oracle stored a row when the number of columns became very large, so I took the code I had written to examine the case of 1,000 columns, and hacked it around a little bit.
rem
rem Script: wide_table_4096.sql
rem Author: Jonathan Lewis
rem Dated: 7th April 2023
rem
rem Last tested
rem 23.4.25.4 (23ai)
rem
rem See also
rem https://jonathanlewis.wordpress.com/2015/02/19/255-columns/
rem https://jonathanlewis.wordpress.com/2015/05/18/migrated-rows/
rem
set pagesize 0
set feedback off
drop table if exists t1 purge;
spool temp.sql
prompt create table t1(
select
'col' || to_char(rownum,'fm0000') || ' varchar2(10),'
from
all_objects
where rownum <= 4095
;
prompt col4096 varchar2(10)
prompt )
prompt /
spool off
@temp
insert /*+ append */ into t1 (col0001, col0002, col0236,col0491, col0746, col4096)
values ('0001','0002', '0236','0491','0746','4096')
;
commit;
set pagesize 40 linesize 132 feedback on
-- column ref_file_no new_value m_fileno
column abs_file_no new_value m_fileno
column block_no new_value m_blockno
column value new_value m_trace_file
spool wide_table_4096
select
dbms_rowid.rowid_to_absolute_fno(rowid, user, 'T1') abs_file_no,
dbms_rowid.rowid_relative_fno(rowid,'BIGFILE') rel_file_no,
dbms_rowid.rowid_relative_fno(rowid,'BIGFILE') big_rel_file_no,
dbms_rowid.rowid_relative_fno(rowid,'SMALLFILE') small_rel_file_no,
dbms_rowid.rowid_relative_fno(rowid) def_rel_file_no,
dbms_rowid.rowid_block_number(rowid,'BIGFILE') block_no,
dbms_rowid.rowid_block_number(rowid,'SMALLFILE') small_block_no,
dbms_rowid.rowid_block_number(rowid) def_block_no
from
t1
/
alter system flush buffer_cache;
alter system dump datafile &m_fileno block &m_blockno;
set heading off
-- select payload from my_trace_file;
select payload
from v$diag_trace_file_contents
where
adr_home = (select value from V$diag_info where name = 'ADR Home')
and trace_filename = (
select substr(
value,
2 + ( select length(value)
from v$diag_info
where name = 'Diag Trace'
)
)
from v$diag_info
where name = 'Default Trace File'
)
/
spool off
The code is simple – I’ve used SQL to generate the SQL to create a table of 4,096 columns, then I’ve inserted a single row into the table, populating the first column, the last column, and a few in between. Then I’ve used a few calls to procedures in the dbms_rowid package to convert the rowid of that row into a file and block number. When I installed 23ai and created my “user” tablespace it defaulted to a bigfile tablespace, but I’ve left a few variations of the calls in place to suit different configurations.
After finding the file and block number where the row starts I’ve dumped the entire block to my trace file and then read the tracefile back and spooled it to a flat file. My account is a “normal” account, but I’ve given it privileges to read various dynamic performance views – you might prefer to create a view and public synonym in the sys schema and grant select to public on the view. (See: https://jonathanlewis.wordpress.com/2019/10/03/trace-files-2/)
I’m not going to show you the symbolic dump of the block, instead I’ll show you a few egrep commands and their output:
******
****** Number of rows in the block:
******
jonathan@localhost working]$ egrep nro wide_table_4096.lst
nrow=17
0xe:pti[0] nrow=17 offs=0
******
****** Number of rows in each column, and the row number of each row:
******
tab 0, row 0, @0x1e4a
tl: 262 fb: -----L-- lb: 0x1 cc: 255
--
tab 0, row 1, @0x1d42
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 2, @0x1c3a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 3, @0x1b32
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 4, @0x1a2a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 5, @0x1922
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 6, @0x181a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 7, @0x1712
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 9, @0x1502
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 10, @0x13fa
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 11, @0x12f2
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 12, @0x11ea
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 13, @0x10de
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 14, @0xfd2
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 15, @0xec6
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 16, @0xea5
tl: 33 fb: --H-F--- lb: 0x1 cc: 16
******
****** First few columns of row 16 (17th row in the block)
******
[jonathan@localhost working]$ egrep -A+16 "row 16" wide_table_4096.lst
tab 0, row 16, @0xea5
tl: 33 fb: --H-F--- lb: 0x1 cc: 16
nrid: 0x00002275.f
col 0: [ 4] 30 30 30 31
col 1: [ 4] 30 30 30 32
col 2: *NULL*
col 3: *NULL*
The row has been stored as 17 separate row pieces with a maximum of 255 columns per row piece. (Note: 16 * 255 + 16 = 4096). The final egrep output shows you that the starting columns of the row are stored as the 17th row. The values of col 0 and col 1 are one clue, but the --H-F--- flags (header, first) are the technical indicator. Each row piece points (backwards) to the next row piece in the chain, ending with row 0 where the -----L-- flag tells us it’s the last row piece of the chain.
So, no great difference in the treatment of tables with very large numbers of columns – it’s just lots of row pieces of up to 255 columns, where the row is (initially) split up into 255 column chunks working from the end of the row, potentially leaving you with a very small number of columns in the first row piece.
There was one significant difference in the block dump that I haven’t shown yet. The ITL (interested transaction list) ended up with 5 entries as a result of the single insert.
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0009.01c.0000033c 0x00000103.0102.2a --U- 17 fsc 0x0000.0128d402
0x02 0x0000.000.00000000 0x00000000.0000.00 ---- 0 fsc 0x0000.00000000
0x03 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
0x04 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
0x05 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
I didn’t know whether this was new, or something I had missed when the limit on the number of columns changed from 255 to 1,000, but the number of ITL entries varied according to the number of row pieces created during my single row insert. Repeating the test but changing the “last column populated” I got the following results (a sketch of one variation follows the list):
- One row piece (col0255) – itc = 2, Lck = 1
- Two row pieces (col0256) – itc = 3, Lck = 2
- Three row pieces (col0512) – itc = 4, Lck = 3
- Four row pieces (col0768) – itc = 5, Lck = 4
- Five row pieces (col1024) – itc = 5, Lck = 5 — i.e. same itc as 4096 columns/17 row pieces
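For anyone repeating the test, one variation looks like the following sketch – the table and column names follow the colNNNN pattern used in the list above, but the real table (and its name) comes from the generated script, so treat the names as placeholders:

-- last populated column is col0256, so two row pieces:
-- the block dump should show itc = 3 and Lck = 2
insert into wide_tab (col0001, col0256) values ('0001', '0256');
commit;

-- then repeat the dbms_rowid calls and the block dump, and check the Itl section of the trace file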
Of course the “coincidence” that the ITL stopped growing at the fourth row piece made me decide to check back to an older version of Oracle (19.11.0.0) to see what happened when the limit was 1,000 columns – and it turned out that 19c also created extra ITL entries in the same way, up to the same limit. (I can’t help wondering whether this limit was “accidentally” carried forward to 23ai or whether there was a deliberate decision that 4 “spare” ITL slots was the best compromise between ITL waits and unnecessary space use.)
Footnote
There are probably several more questions to ask about features that might (or might not) adapt to the raising of the column limit. For example 11g had a limit of 20 (or ceil(column_count / 10)) for the number of extended stats definitions you could create for a single table when the maximum number of columns allowed was 1,000. Might this have changed for 23ai? (Initial answer: No. Longer answer: to be investigated – could you really get 370 sets of extended stats if you started with 3,700 columns in a table, or would something break at (say) 256 sets?)
Audit Cleaning #JoelKallmanday
This note is an observation of an odd little detail that’s nice to know about the way that Unified Audit (see part 8 of Neil Chandler’s series on Fundamental Security) handles deleting old data. I don’t have a lot of background experience with Unified Audit, but I was surprised by some comments on a recent thread on the MOSC database forum (needs an account) about problems of chained rows and the difficulty of reclaiming space from the “audit tablespace”, so I spent a couple of hours looking at the tables, views and procedures to see if there were any pre-emptive strategies that would avoid the space management issue. All the comments below are based on a single pdb running under 19.11.
The unified audit trail is kept (in recent versions of Oracle) in an interval-partitioned table called aud$unified, owned by user audsys. The partitioning is on a column called event_timestamp – declared as timestamp(6) not null; the default interval is one month but you can change this by calling the procedure dbms_audit_mgmt.alter_partition_interval – and for some people it would be a very good idea to do so.
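For example, if you decided that daily partitions suited your cleaning schedule better than monthly ones, a call like the following should do the job – this is just a sketch, assuming the simple two-parameter (number, frequency) form of the call:

begin
        dbms_audit_mgmt.alter_partition_interval(
                interval_number         => 1,
                interval_frequency      => 'DAY'
        );
end;
/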
The forum thread included the complaint that the segment space advisor had found a lot of chained rows, and that a rebuild could remove 49 percent of them. Since the sum of the declared lengths of the table’s columns is around 70KB and there are 3 CLOB columns declared as “enable storage in row”, I don’t think anyone should be surprised to see a lot of chained rows. (And I wouldn’t trust the advisor to do a very good job estimating what might happen with the CLOBs.)
Here’s a simple query to report the partitions, with high_value for the audit table:
select
partition_name, blocks, num_rows, high_value
from
dba_tab_partitions
where
table_name = 'AUD$UNIFIED'
order by
partition_position desc
/
There are all sorts of things you’re not allowed to do with the audit table – even if you’re connected as sys you should only be using the management package to handle it.
delete from audsys.aud$unified where event_timestamp < timestamp'2022-06-14 00:00:00'
*
ERROR at line 1:
ORA-46385: DML and DDL operations are not allowed on table "AUDSYS"."AUD$UNIFIED".
The thinking behind the audit table seems to be that you should be checking it on a regular basis then archiving (or just deleting) the data you’ve checked so that the audit table doesn’t become enormous. To this end Oracle allows you to set a “last archived” timestamp, which you can use in a call to the procedure dbms_audit_mgmt.clean_audit_trail. You can check the current value of this timestamp by querying view dba_audit_mgmt_last_arch_ts.
SQL> select last_archive_ts
2 from dba_audit_mgmt_last_arch_ts
3 /
LAST_ARCHIVE_TS
---------------------------------------------------------------------------
01-MAY-22 12.00.01.000000 AM +00:00
1 row selected.
There is also a function dbms_audit_mgmt.get_last_archive_timestamp() to return the timestamp, but this didn’t seem to work on my 19.11 instance. (The procedure set_last_archive_timestamp() does work.)
There are procedures in the package that let you automate this cleaning, but the point(s) I want to make are things I got to by doing some simple tests calling each step of the process in turn. The oldest few rows in my audit table currently look like this:
PARTITION_NAME BLOCKS NUM_ROWS HIGH_VALUE
---------------------- ---------- ---------- ----------------------------------------
...
SYS_P28453 50 701 TIMESTAMP' 2022-11-01 00:00:00'
SYS_P27945 43 683 TIMESTAMP' 2022-10-01 00:00:00'
SYS_P20893 88 1328 TIMESTAMP' 2022-09-01 00:00:00'
SYS_P20176 110 1766 TIMESTAMP' 2022-08-01 00:00:00'
SYS_P20025 244 3318 TIMESTAMP' 2022-07-01 00:00:00'
SYS_P17365 103 1684 TIMESTAMP' 2022-06-01 00:00:00'
Assuming I want to delete everything from the oldest 4 partitions, I could execute the following anonymous block:
declare
ts timestamp;
begin
dbms_audit_mgmt.set_last_archive_timestamp(
dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
timestamp '2022-09-01 00:00:00'
);
select last_archive_ts
into ts
from dba_audit_mgmt_last_arch_ts
;
dbms_output.put_line('Set: ' || ts);
dbms_audit_mgmt.clean_audit_trail(
audit_trail_type => dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
use_last_arch_timestamp => true -- default
);
end;
/
Set: 01-SEP-22 12.00.00.000000 AM
PL/SQL procedure successfully completed.
When I check dba_tab_partitions again, the end of the list is now as follows:
PARTITION_NAME BLOCKS NUM_ROWS HIGH_VALUE
---------------------- ---------- ---------- ----------------------------------------
...
SYS_P28453 50 701 TIMESTAMP' 2022-11-01 00:00:00'
SYS_P27945 43 683 TIMESTAMP' 2022-10-01 00:00:00'
SYS_P20893 88 1328 TIMESTAMP' 2022-09-01 00:00:00'
Oracle has dropped three of the 4 partitions – although it has kept the partition with the high_value of ‘2022-09-01’, which is now empty because everything earlier than the timestamp I supplied has (as I expected) been removed. Looking at the trace file I can see (using a suitable egrep command) three calls (driven by calls to dbms_pdb_exec_sql()) to drop an explicitly named partition, followed by this delete statement:
delete from audsys.aud$unified where event_timestamp < :1 and (dbid = :2 or dbid = 0)
So Oracle has been very clever, but not quite clever enough, when removing data from the audit table: it drops any partition whose high_value is (strictly) less than the timestamp you supply and then, if necessary, runs a delete against just the one partition that has become the oldest remaining partition.
So, a couple of suggestions for cleaning the audit trail (subject to you checking that I haven’t missed any important details, and any variations due to version of Oracle).
- If you plan to set up a regular call to clean the audit table it might be a very good idea to match the partitioning interval with the schedule so that you never have to remove part of the data from a partition (i.e. you want to avoid running a big delete)
- Given what goes on with the last_archive_timestamp in my tests, it looks like a good idea to ensure that the value you pass to the cleaning procedure is “high_value plus one (micro)second” (did you spot the “spare” second in my query result on dba_audit_mgmt_last_arch_ts?) so that you drop every relevant partition rather than deleting every single row from the new “oldest” partition (see the sketch after this list).
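As a sketch of that second point – re-using the dates from my test, and not something I ran as part of the original experiment – passing one second past the oldest high_value you want to remove should mean the ‘2022-09-01’ partition gets dropped as well rather than being emptied by a delete:

begin
        dbms_audit_mgmt.set_last_archive_timestamp(
                dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
                timestamp '2022-09-01 00:00:01'
        );
        dbms_audit_mgmt.clean_audit_trail(
                audit_trail_type        => dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
                use_last_arch_timestamp => true
        );
end;
/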
The delete statement is a little puzzling – how can you drop one partition but think you need to check the dbid before deleting the data from another partition? Maybe the apparent anomaly is a sign that Oracle is reusing a delete statement that would have to replace the drop commands if you had a setup where multiple PDBs were writing to the same (CDB) audit table.
You’ll notice that I’ve passed a (symbolic) parameter value of dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED to the calls to the packaged procedures. If you haven’t moved all the different audit trails (the “legacy” audit, the FGA audit, and any others) into the unified audit there are other parameter values for cleaning the other trails.
Questions appear quite frequently about how to do arithmetic or comparisons with the high_value column from dba_tab_partitions – the problem has disappeared in very recent versions of Oracle, but dbms_audit_mgmt has an interesting utility procedure: get_part_highval_as_char().
I spoke too soon: the only table that this procedure will work for is audsys.aud$unified. Unless you’re on 23c (with CLOB and JSON versions of the column) you’ll have to fall back to something like the method I showed in a post I wrote earlier this year.
SQL> select dbms_audit_mgmt.get_part_highval_as_char('SYS_P31737') from dual;
DBMS_AUDIT_MGMT.GET_PART_HIGHVAL_AS_CHAR('SYS_P31737')
--------------------------------------------------------------------------------
2023-04-01 00:00:00
1 row selected.
This is a note about a topic that I’m not really familiar with, written rather quickly, and not yet reviewed for grammar and typing errors.
IN / NOT IN
A question appeared on the Oracle developer forum recently (Oct 2024) that could be answered by a blog note that I started drafting about 18 months ago after answering a very similar question that had appeared in the same forum. Take a look at the following query, with results, about a status column in a fairly large dataset:
select status, count(*)
from my_table
group by status
order by 2 desc;
STATUS COUNT(*)
-------------------------
STATUS_7 10580209
STATUS_6 23902
STATUS_5 1504
STATUS_4 48
STATUS_3 17
STATUS_2 8
STATUS_1 5
As you can see, almost all the data is at ‘STATUS_7’ or ‘STATUS_6’. The user wanted some code to report the rows that were not in these two statuses, which leads to one of two possible constructs:
select message_id
from my_table
where status in ('STATUS_1', 'STATUS_2', 'STATUS_3', 'STATUS_4', 'STATUS_5')
;
select message_id
from my_table
where status not in ('STATUS_6', 'STATUS_7')
;
Thanks to an index on status the first query returned its results in 0.05 seconds while the second query had to do a full table scan which took 6 seconds to complete. The problem the user faced was that other status values might appear in the future, so the slow query was the safe one because it wouldn’t need to be edited whenever a new status appeared. So how do you get that “not in” query to run as quickly as the “in” query?
The first thing to check, of course, is how much freedom you have to modify the database and the code; it’s also important to check that any suggestion you make will produce the same result set. (It’s also a reasonably safe bet (on any forum) that the user has simplified the requirement in a way that means any suggested solution will need some refinement to fit the real problem.)
In this case the answer is (seems to be) simple – change the question by taking advantage of the right Oracle feature and you can produce a solution that may even eliminate some of the current application overhead. “Add” a column to the table that holds a value only for rows that are in a status of interest then index that column – and the best way to “add” that column in all supported (and many unsupported) versions of Oracle is probably to create it as a virtual column:
alter table my_table
add rare_status /* invisible */
generated always as
(case when status in ('STATUS_6','STATUS_7') then null else status end)
virtual
/
create index mt_i1 on my_table(rare_status);
execute dbms_stats.gather_table_stats(user,'my_table', method_opt=>'for columns rare_status size 1')
select rare_status, message_id from my_table where rare_status is not null;
You’ll notice, of course, that I’ve included, as a comment, the invisible option when adding the virtual column. This is generally a good idea, though it may be a little confusing for people to see that they can query a table using a column that doesn’t (apparently) exist. The plus point of this approach is that (naughty) code that executes an “insert as select” statement without specifying the list of target columns will still succeed, rather than raising the error “ORA-00947: not enough values” for no apparent reason.
The fact that I’ve used the predicate “is not null” might prompt you to ask if there are any traps in this approach if status is allowed to be null, and that’s something to check if you extend the strategy to handle a more complicated requirement. In this case the “not in (6,7)” and “in (1,2,3,4,5)” return the same results even if there are rows where the status is null.
Here’s the execution plan (ignore the numbers since there’s no data in the table) for the “is not null” query:
Plan hash value: 3640753122
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 38 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE | 2 | 38 | 2 (0)| 00:00:01 |
|* 2 | INDEX FULL SCAN | MT_I1 | 1 | | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("RARE_STATUS" IS NOT NULL)
If we return to the user’s original query – it selects message_id and their plan shows the need to visit the table to collect that value. Since the index we’ve now created on the “interesting” rows is so small we might as well extend it to include message_id – remembering to use the same type of case expression so that we include values only for the necessary small number of rows in the index.
alter table my_table
add rare_message_id invisible
generated always as
(case when status in ('STATUS_6','STATUS_7') then null else message_id end)
virtual
/
drop index mt_i1;
create index mt_i1 on my_table(rare_status, rare_message_id);
execute dbms_stats.gather_table_stats(user,'my_table', method_opt=>'for columns rare_message_id size 1')
select rare_status, rare_message_id from my_table where rare_status is not null;
Execution Plan
----------------------------------------------------------
Plan hash value: 3308683209
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 18 | 1 (0)| 00:00:01 |
|* 1 | INDEX FULL SCAN | MT_I1 | 2 | 18 | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("RARE_STATUS" IS NOT NULL)
I’ve mentioned it before, and I keep wishing for it to appear in Oracle, but I’d really like to see the “filtered index” mechanism that can be used in SQL Server to handle this type of requirement without editing the code; something like:
create index mt_ss1 on my_table (status)
carrying (message_id)
where status not in ('STATUS_6','STATUS_7')
/
I’ve invented a syntax to say that this index should also include the message_id – again a feature (though I don’t remember the syntax) in SQL Server. Of course Oracle can carry “spare” columns, but only by adding them as if they belong in the index – which means their presence affects the selectivity and clustering_factor of the index and can, therefore, have serious side-effects on execution plans.
NOT IN bug
Some years ago I wrote a note describing how, with a query using identical literal values in the predicate, the optimizer was able to produce exactly the same plan body with exactly the same Predicate Information section while producing different cardinality estimates, depending on how you expressed the requirement for “column not in (list of values)”.
I won’t go into details of the anomaly since it disappeared in 11.2.0.4, but what I hadn’t paid any attention to was that the arithmetic (for both variations) was wrong, and it was only a recent question on the Oracle database forum that made me revisit the article, spot the error, and test 23.3 to see if the arithmetic was still wrong. It is.
To demonstrate the problem, here’s a modified version of the code from the previous article. I usually try to write code that will create reproducible data across all versions of Oracle, but I didn’t do that in the previous post, and there was a significant difference between the data generated by 11.2.0.3 and the data generated by 23.3 so, among other details, I’ve changed the code to produce a fixed number of nulls.
rem
rem Script: not_in_anomaly_2.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2024
rem
rem Last tested
rem 23.3.0.0
rem 19.11.0.0
rem
create table t1
as
select
ao.*
from
all_objects ao
where
rownum <= 10000
/
update t1 set
object_type = null
where
rownum <= 1000
-- object_type = 'TABLE'
/
commit;
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname =>'T1',
method_opt => 'for all columns size 1'
);
end;
/
break on report
compute sum of ct on report
select
object_type, count(*) ct
from
t1
group by
object_type
order by
count(*)
/
select
sample_size, num_nulls, num_distinct, histogram
from
user_tab_columns
where
table_name = 'T1'
and column_name = 'OBJECT_TYPE'
;
This code (in my 23.3 database) produces a table of 10,000 rows, of which 1,000 have object_type set to NULL, with 13 distinct (non-null) values across the rest of the rows. There was a huge skew in the distribution of the data, but I’ve avoided that complication by collecting stats without histograms. Here are the results from the two queries against the data dictionary to confirm those details.
OBJECT_TYPE CT
----------------------- ----------
CONTEXT 2
PROCEDURE 2
DIRECTORY 2
OPERATOR 3
SEQUENCE 3
FUNCTION 14
PACKAGE 22
DOMAIN 22
INDEX 48
TABLE 71
TYPE 805
1000 <<--- NULL
SYNONYM 4002
VIEW 4004
----------
sum 10000
14 rows selected.
SAMPLE_SIZE NUM_NULLS NUM_DISTINCT HISTOGRAM
----------- ---------- ------------ ---------------
9000 1000 13 NONE
After generating this data I checked the execution plans – in particular the predicate information and cardinality estimates – of several different queries involving variations of “not in” predicates; but I just want to show one query, with its execution plan from autotrace, that demonstrates the error very clearly:
select
count(*)
from t1
where object_type not in ('INDEX', 'SYNONYM', 'VIEW')
union all
select
count(*)
from t1
where object_type in ('INDEX', 'SYNONYM', 'VIEW')
;
Execution Plan
----------------------------------------------------------
Plan hash value: 575959041
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 12 | 52 (4)| 00:00:01 |
| 1 | UNION-ALL | | 2 | 12 | 52 (4)| 00:00:01 |
| 2 | SORT AGGREGATE | | 1 | 6 | | |
|* 3 | TABLE ACCESS FULL| T1 | 7079 | 42474 | 26 (4)| 00:00:01 |
| 4 | SORT AGGREGATE | | 1 | 6 | | |
|* 5 | TABLE ACCESS FULL| T1 | 2077 | 12462 | 26 (4)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("OBJECT_TYPE"<>'INDEX' AND "OBJECT_TYPE"<>'SYNONYM' AND
"OBJECT_TYPE"<>'VIEW')
5 - filter("OBJECT_TYPE"='INDEX' OR "OBJECT_TYPE"='SYNONYM' OR
"OBJECT_TYPE"='VIEW')
As you can see, the optimizer has rewritten the “not in()” as a conjunct of inequalities and the “in” as a disjunct of equalities. If I had written the query in exactly that form I would still have got the same results – provided I didn’t go all the way back to 11.2.0.3 in my testing.
The critical point is this: check the Rows column for operations 3 and 5 (the two tablescans), and add the two values together. The result is 9,156. Between them the two branches of the query cover all the rows where object_type is not null – and that’s 9,000 rows. Whatever the optimizer is doing, for such a simple query the two estimates ought to sum to the total row count.
The “in” branch gives the right estimate: for my 9,000 rows with a non-null object_type the number of distinct values is 13 so the selectivity of “column = constant” is 1/13; adding the values for the three selected values gives us a selectivity of 3/13 and a cardinality of 2,076.9 (q.e.d.)
The “not in” branch ought, therefore, to give us a selectivity of 10/13 (i.e. 1 – 3/13) for a cardinality of 6,923(.08). Oracle’s estimate of 7,079 is out by 156, which is only a little over 2% in this case, so maybe the error is hardly ever noticeable to most people.
So where does the error come from? There are a couple of different ways of thinking about how the numbers can be produced. Here’s one viewpoint which shifts us from addition/subtraction to multiplication/division (which is a little more like the way the optimizer runs the arithmetic, I think).
To eliminate the first value we scale the rows down by one thirteenth; to eliminate the second value we scale that result by one twelfth (there are only 12 distinct values left), to eliminate the final value we scale the result by a further one eleventh – so the total is: 9000 * 12/13 * 11/12 * 10/11 = 6923.07 (q.e.d.)
BUT it looks as if Oracle is not allowing for the reduction in the number of distinct values as it eliminates each value in the list – so the optimizer’s arithmetic becomes: 9000 * 12/13 * 12/13 * 12/13 = 7078.7 (q.e.d).
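If you want to check the two versions of the arithmetic from SQL*Plus (the first allows for the reducing number of distinct values, the second holds it at 13 throughout):

select
        round(9000 * 12/13 * 11/12 * 10/11, 2)  reducing_ndv,   -- 6923.08
        round(9000 * 12/13 * 12/13 * 12/13, 2)  constant_ndv    -- 7078.74
from
        dual
/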
Conclusion
There is a flaw in the way that the optimizer calculates the selectivity of “not in {list of values}”. The error may be fairly small in most cases but, due to its nature (and with the assistance of histograms and/or large numbers of rows with NULLs in the relevant column), there may be cases where the cardinality estimate is bad enough to make the optimizer choose a poor execution plan. The direction of the error is such that the optimizer will over-estimate the cardinality.
Index puzzle
What could make a “create index” statement fail with an error about a table having too many columns (ORA-01792)? And why would this suddenly happen when the code that raised the error had been dropping and recreating the index every night for the best part of 3 years with no problems?
This is a question that came up on the Oracle developer forum recently, and was finally answered by the original poster after getting an important clue from another participant in the thread.
The clue (supplied by Solomon Yakobson) was that when you create a “function-based” index Oracle will create a hidden virtual column on the table but if the table has already reached the limit on the number of columns (1,000 until very recently) the attempt to create the column will fail with error ORA-01792, leading to the create index statement failing.
But this drop/create has been working perfectly for years – and a query for the table definition reported only 40 columns – so how did an extra 960 columns sneak in to break things? It’s a bug that can appear when you mix virtual columns with function-based indexes – Oracle bypasses a section of the code path that is supposed to clean up after the “drop index”. Here’s a very simple example to demonstrate.
rem
rem Script: fbi_puzzle_4.sql
rem Author: Jonathan Lewis / Marco Antonio Romito
rem Dated: Sep 2024
rem Purpose:
rem
rem Last tested
rem 19.11.0.0
drop table t1;
create table t1(
n1 number(6,0),
n2 number(6,0) as (n1 + 1) virtual,
n3 number(6,0)
)
/
prompt =============================
prompt An index that causes problems
prompt =============================
create index t1_i1 on t1(n1, nvl(n3,0), n2);
column column_name format a32
column data_default format a32
select
column_id, column_name,
hidden_column, virtual_column, user_generated, internal_column_id,
data_default
from
user_tab_cols
where
table_name = 'T1'
order by
internal_column_id
/
I’ve created a table of three columns, where the second column is a virtual column. Then I’ve created an index on the table which includes references to all three columns, but one of those references includes a call to the nvl() function – which means Oracle will create a fourth (hidden, virtual) column for that column expression. Here are the results of the subsequent query against user_tab_cols:
COLUMN_ID COLUMN_NAME HID VIR USE INTERNAL_COLUMN_ID DATA_DEFAULT
---------- -------------------------------- --- --- --- ------------------ --------------------------------
1 N1 NO NO YES 1
2 N2 NO YES YES 2 "N1"+1
3 N3 NO NO YES 3
SYS_NC00004$ YES YES NO 4 NVL("N3",0)
4 rows selected.
As you can see, n2 is flagged as virtual (but not hidden), generated as “N1”+1; and there’s a fourth, system-generated, hidden, virtual column generated as NVL(“N3”,0).
As ever, things are more likely to break when you mix features. Here are a few more lines of code to run after creating the index and reporting the columns:
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
select
column_id, column_name,
hidden_column, virtual_column, user_generated, internal_column_id,
data_default
from
user_tab_cols
where
table_name = 'T1'
order by
internal_column_id
/
You might expect Oracle to drop the system-generated column as it drops the index – but here are the results we now get from querying user_tab_cols:
COLUMN_ID COLUMN_NAME HID VIR USE INTERNAL_COLUMN_ID DATA_DEFAULT
---------- -------------------------------- --- --- --- ------------------ --------------------------------
1 N1 NO NO YES 1
2 N2 NO YES YES 2 "N1"+1
3 N3 NO NO YES 3
SYS_NC00004$ YES YES NO 4 NVL("N3",0)
SYS_NC00005$ YES YES NO 5 NVL("N3",0)
SYS_NC00006$ YES YES NO 6 NVL("N3",0)
SYS_NC00007$ YES YES NO 7 NVL("N3",0)
SYS_NC00008$ YES YES NO 8 NVL("N3",0)
SYS_NC00009$ YES YES NO 9 NVL("N3",0)
9 rows selected.
Oracle “forgets” to drop the system-generated virtual column, and generates a new virtual column every time the index is recreated. Since the columns are hidden columns you won’t notice that something has gone wrong if you query the (more commonly used) view user_tab_columns – so there may well be a number of sites which have tables with huge numbers of hidden virtual columns, all quietly working their way towards unexpected ORA-01792 errors.
This “failure to drop” isn’t consistent behaviour – if the index had been defined as (n1, n2, nvl(n3,0)) then the system-generated column would have been dropped every time the index was dropped. I haven’t tried to work out exactly what conditions have to be met for the error to appear but I think it may be something to do with a user-defined virtual column appearing in an index after an expression that has to be supported by a system-generated virtual column.
Bear in mind that if I create two indexes (n1, nvl(n3,0)) and (n2, nvl(n3,0)) then Oracle will generate just one virtual column to support the expression nvl(n3,0), so when I drop one of these indexes Oracle will have to say: “does dropping this index mean I should drop a virtual column?” and I’m guessing that somewhere in that area of code Oracle is bypassing some of the checks it should be doing.
Summary
If you have any indexes that contain both virtual columns and column expressions, and if you drop and recreate those indexes from time to time, check user_tab_cols to make sure that you don’t have multiple identical system-generated, hidden, virtual columns that support the expression(s) in those indexes.
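As a rough first-pass check – this query isn’t from the original thread, it’s just a sketch – you could count the system-generated, hidden, virtual columns per table and then examine the data_default values (unfortunately a LONG column) of any table that reports a surprisingly large number:

select
        table_name, count(*)    sys_virtual_cols
from
        user_tab_cols
where
        hidden_column   = 'YES'
and     virtual_column  = 'YES'
and     column_name     like 'SYS\_NC%' escape '\'
group by
        table_name
order by
        count(*) desc
/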
Workaround
In my case I found the following steps were sufficient to clear up the excess occurrences (see the SQL sketch after the list):
- drop the problem index (n1, nvl(n3,0), n2)
- create an index on just the problem expression (nvl(n3,0))
- drop that index (which got rid of all the copies of the generated virtual columns)
- recreate the problem index.
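Using the t1 demo above, that sequence translates into something like this (t1_fix is just an arbitrary name for the temporary index):

drop index t1_i1;                               -- drop the problem index
create index t1_fix on t1(nvl(n3,0));           -- index just the problem expression
drop index t1_fix;                              -- this drop cleared out the copies of the virtual column
create index t1_i1 on t1(n1, nvl(n3,0), n2);    -- recreate the problem index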
In more complicated cases (e.g. several indexes following this pattern, multiple indexes that include the same expression etc.) you may have to develop a more sophisticated approach.
num_index_keys – 2
A recent MOS Community forum posting (needs an account) raised the problem of a query with a specific index hint switching between two plans: one using “inlist iteration of a unique index scan”, the other using a simple “index range scan with filter predicate”. Here’s an example of the pattern of the query, with the two execution plans:
select /*+ index(INDEXES_A_LOCAL_IDX) */
*
from INDEXES_A
WHERE ID in (
:1 ,:2 ,:3 ,:4 ,:5 ,:6 ,:7 ,:8 ,:9 ,:10 ,:11 ,:12 ,:13 ,:14 ,:15 ,:16 ,:17 ,:18 ,:19 ,:20 ,
:21 ,:22 ,:23 ,:24 ,:25 ,:26 ,:27 ,:28 ,:29 ,:30 ,:31 ,:32 ,:33 ,:34 ,:35 ,:36 ,:37 ,:38 ,:39 ,:40 ,
:41 ,:42 ,:43 ,:44 ,:45 ,:46 ,:47 ,:48 ,:49 ,:50 ,:51 ,:52 ,:53 ,:54 ,:55 ,:56 ,:57 ,:58 ,:59 ,:60 ,
:61 ,:62 ,:63 ,:64 ,:65 ,:66 ,:67 ,:68 ,:69 ,:70 ,:71 ,:72 ,:73 ,:74 ,:75 ,:76 ,:77 ,:78 ,:79 ,:80 ,
:81 ,:82 ,:83 ,:84 ,:85 ,:86 ,:87 ,:88 ,:89 ,:90 ,:91 ,:92 ,:93 ,:94 ,:95 ,:96 ,:97 ,:98 ,:99, :100
)
AND CLIENT_ID = :101
AND PRODUCT_ID = :102
;
Plan hash value: 743077671
---------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 110 (100)| | | |
| 1 | PARTITION LIST SINGLE | | 8 | 608 | 110 (0)| 00:00:01 | KEY | KEY |
| 2 | INLIST ITERATOR | | | | | | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID| INDEXES_A | 8 | 608 | 110 (0)| 00:00:01 | KEY | KEY |
|* 4 | INDEX UNIQUE SCAN | INDEXES_A_LOCAL_IDX | 8 | | 102 (0)| 00:00:01 | KEY | KEY |
---------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1011922169
----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 88 (100)| | | |
| 1 | PARTITION LIST SINGLE | | 1 | 79 | 88 (3)| 00:00:01 | KEY | KEY |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| INDEXES_A | 1 | 79 | 88 (3)| 00:00:01 | KEY | KEY |
|* 3 | INDEX RANGE SCAN | INDEXES_A_LOCAL_IDX | 1 | | 87 (3)| 00:00:01 | KEY | KEY |
----------------------------------------------------------------------------------------------------------------------------------
The index in the hint is the primary key index, defined as (client_id, product_id, id) and, as you can see, the query specifies exactly those three columns with equality predicates – albeit with an “in-list” for the final column of the index. You’ll also notice that the table is list-partitioned (the partition key is the client_id, one value per partition) and the index is a local index.
It may not be immediately obvious how the two plans can achieve the same result, until you check the Predicate Information sections of the two plans which, in the same order and with several lines of “OR” predicates removed, look like this:
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("CLIENT_ID"=:101 AND "PRODUCT_ID"=:102 AND (("ID"=:1 OR "ID"=:2 OR "ID"=:3 OR "ID"=:4 OR
"ID"=:5 OR "ID"=:6 OR "ID"=:7 OR "ID"=:8 OR "ID"=:9 OR "ID"=:10 OR "ID"=:11 OR "ID"=:12 OR "ID"=:13 OR "ID"=:14 OR
...
"ID"=:95 OR "ID"=:96 OR "ID"=:97 OR "ID"=:98 OR "ID"=:99 OR "ID"=:100)))
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("CLIENT_ID"=:101 AND "PRODUCT_ID"=:102)
filter(("ID"=:1 OR "ID"=:2 OR "ID"=:3 OR "ID"=:4 OR "ID"=:5 OR "ID"=:6 OR "ID"=:7 OR "ID"=:8 OR "ID"=:9 OR "ID"=:10 OR
...
"ID"=:95 OR "ID"=:96 OR "ID"=:97 OR "ID"=:98 OR "ID"=:99 OR "ID"=:100))
The plan with the inlist iterator operation uses the list as an access predicate – it does 100 precision probes of the index to find rowids and visits the table for each rowid in turn.
The plan using the index range scan uses the list as a filter predicate – it does a single range scan through the section of the index covering the given client_id and product_id and, for every index entry in that range, checks the id against the list of values, visiting the table only for the cases where it finds a match. (There are probably some cunning little tricks to minimise the work done checking the id – sorting the list into order once at the start of execution would be a reasonable one, starting the range scan at the lowest listed id and stopping at the highest would be another.)
Why can the optimizer produce two different plans? It’s just the result of the costing algorithm. In this case the optimizer compares the cost of doing a unique access 100 times (which is roughly 100 times the cost of doing it once) with the cost of a single large range scan (plus the CPU cost of lots of comparisons).
If you want to jump to it, here’s the summary and workaround.
I’ve blogged about this behaviour in a previous post, though the context in that case was an upgrade from 10g to 12c and the example and consequences were slightly different, so I decided to create a model of this case to demonstrate the point – and then discovered an oddity in the optimizer that probably made the switch appear to be rather random. I’m going to start by creating some data in a list-partitioned table with a local primary key index.
rem
rem Script: num_index_keys_3.sql
rem Author: Jonathan Lewis
rem Dated: Mar 2024
rem Purpose:
rem
rem Last tested
rem 19.11.0.0
rem 23.3.0.0
rem
execute dbms_random.seed(0)
create table t1(
client_id not null,
product_id not null,
id not null,
v1,
padding
)
partition by list (client_id) (
partition p0 values ('Client0'),
partition p1 values ('Client1'),
partition p2 values ('Client2'),
partition p3 values ('Client3'),
partition p4 values ('Client4'),
partition p5 values ('Client5'),
partition p6 values ('Client6'),
partition p7 values ('Client7'),
partition p8 values ('Client8'),
partition p9 values ('Client9')
)
as
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
cast('Client' || least(trunc(abs(3 * dbms_random.normal)),9) as varchar2(254)) client_id,
cast(trunc(abs(400 * dbms_random.normal)) as number(6,0)) product_id,
cast('Id' || lpad(rownum,7,'0') as varchar2(254)) id,
lpad(rownum,10,'0') v1,
lpad('x',250,'x') padding
from
generator v1,
generator v2
where
rownum <= 1e6 -- > comment to avoid WordPress format issue
order by
dbms_random.value
/
execute dbms_stats.gather_table_stats(user, 't1', method_opt=>'for all columns size 1')
alter table t1 add constraint t1_pk primary key(client_id, product_id, id) using index local
-- in a production system this index (for this data pattern) ought to be defined "compress 2"
/
alter index t1_pk rebuild partition p3 pctfree 92
/
I’ve created a list-partitioned table with one client_id per partition. I’ve used the dbms_random.normal() function to generate an uneven distribution of client_id values and product_id values, and rownum to generate the final id column; then I’ve randomised the order of data before inserting it so that the index (partitions) will have large values for the clustering_factor. I’ve also rebuilt partition p3 of the index with a pctfree of 92 because I wanted to have a physically large index partition without generating a huge amount of table data. (The OP has partition sizes varying from a few hundred rows to a couple of hundred million rows – which is partly why the switch in plan can seem fairly random, and why the performance impact can be catastrophic.)
With my hacking factors I’ve got 10 values for client_id, 1,603 values for product_id, and the distribution of data across partitions is as follows:
select
client_id, count(distinct product_id), count(*), min(id) min_id, max(id) max_id
from t1
group by client_id
order by client_id
/
CLIENT_ID COUNT(DISTINCTPRODUCT_ID) COUNT(*) MIN_ID MAX_ID
------------------------ ------------------------- ---------- ------------ ------------
Client0 1474 261218 Id0000002 Id0999997
Client1 1455 233591 Id0000001 Id1000000
Client2 1448 187657 Id0000006 Id0999998
Client3 1390 134710 Id0000003 Id0999996
Client4 1339 87095 Id0000034 Id0999981
Client5 1270 50005 Id0000056 Id0999902
Client6 1198 25894 Id0000124 Id0999975
Client7 1068 12083 Id0000033 Id0999804
Client8 914 4970 Id0000185 Id0999811
Client9 827 2777 Id0000764 Id0999515
After running a couple more queries to examine the data I picked partition p3 for testing, and 284 as a suitable product_id (249 rows in that partition), and created an anonymous PL/SQL block to execute a query to select 100 of those rows (using a couple of queries to generate most of the text).
Here’s a version of the code – cut back to select only 7 rows with an option for an 8th row – followed by a call to generate the SQL Monitor report (available thanks to the hint in the SQL) for the most recent SQL statement execution. The code counts the rows returned just to confirm that every id in the list really does correspond to a row in the partition:
set serveroutput on
<<anon>>
declare
ct number := 0;
m_pad t1.padding%type;
client_id varchar2(32) := 'Client3';
product_id number := 284;
b1 varchar2(32) := 'Id0000628';
b2 varchar2(32) := 'Id0002350';
b3 varchar2(32) := 'Id0002472';
b4 varchar2(32) := 'Id0007921';
b5 varchar2(32) := 'Id0008073';
b6 varchar2(32) := 'Id0008398';
b7 varchar2(32) := 'Id0012196';
b8 varchar2(32) := 'Id0013212';
begin
for c1 in (
select /*+ monitor index(t1 t1_pk) */
*
from t1 t1
where t1.client_id = anon.client_id
and t1.product_id = anon.product_id
and t1.id in (
b1, b2, b3, b4, b5, b6, b7
-- b1, b2, b3, b4, b5, b6, b7, b8
)
) loop
m_pad := c1.padding;
ct := ct + 1;
end loop;
dbms_output.put_line(ct);
end;
/
set heading off
set pagesize 0 linesize 255
set tab off trimspool on
set long 250000
set longchunksize 250000
column text_Line format a254
select
dbms_sql_monitor.report_sql_monitor(
type =>'TEXT',
report_level => 'all'
) text_line
from dual
/
set pagesize 40 linesize 132
The code above produced the following SQL Monitor plan:
============================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
============================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 7 | | |
| 1 | PARTITION LIST SINGLE | | 7 | 16 | 1 | +0 | 1 | 7 | | |
| 2 | INLIST ITERATOR | | | | 1 | +0 | 1 | 7 | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID | T1 | 7 | 16 | 1 | +0 | 7 | 7 | | |
| 4 | INDEX UNIQUE SCAN | T1_PK | 7 | 9 | 1 | +0 | 7 | 7 | | |
============================================================================================================================================
As you can see, we’ve used an inlist iterator and done 7 unique probes of the index. The cost of the indexing operation alone is 9: i.e. 2 for the root and branch blocks (which get pinned), and one for each of the 7 leaf blocks that Oracle expects to access.
Change the code so that we include the eighth variable in the list and the SQL Monitor plan shows:
===================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
===================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 8 | | |
| 1 | PARTITION LIST SINGLE | | 8 | 17 | 1 | +0 | 1 | 8 | | |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED | T1 | 8 | 17 | 1 | +0 | 1 | 8 | | |
| 3 | INDEX RANGE SCAN | T1_PK | 8 | 9 | 1 | +0 | 1 | 8 | | |
===================================================================================================================================================
The optimizer has decided to do an index range scan and examine every index entry for ‘Client3’, product 284 to find the 8 requested IDs – and it’s just about arithmetic. Tell the optimizer to use all three columns as access columns by adding the hint /*+ num_index_keys(t1 t1_pk 3) */ to the SQL and the plan changes to the inlist iterator version – with a higher cost:
============================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
============================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 8 | | |
| 1 | PARTITION LIST SINGLE | | 8 | 18 | 1 | +0 | 1 | 8 | | |
| 2 | INLIST ITERATOR | | | | 1 | +0 | 1 | 8 | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID | T1 | 8 | 18 | 1 | +0 | 8 | 8 | | |
| 4 | INDEX UNIQUE SCAN | T1_PK | 8 | 10 | 1 | +0 | 8 | 8 | | |
============================================================================================================================================
Using the inlist iterator the total cost is 18: that’s 2 for the root and branch blocks, 8 for leaf blocks, and 8 for table blocks. Using the range scan the total cost is 17: that’s 9 for the index scan plus the extra CPU, plus 8 for visiting 8 table blocks. (If you’re wondering why the index range scan is 9, there are 9,324 leaf blocks in the index partition and 1,390 product ids in the partition (and one client_id): 9,324/1,390 = 6.7; round up, add one for the root and one for the branch to get a total of 9. The optimizer is using averages to estimate the size of the range scan needed.)
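Putting that arithmetic into a query – these are just the numbers quoted above, not anything the optimizer exposes directly:

select
        ceil(9324/1390) + 2             index_range_scan_cost,  -- 9:  leaf block estimate + root + branch
        ceil(9324/1390) + 2 + 8         range_scan_plan_cost,   -- 17: index cost + 8 table blocks
        2 + 8 + 8                       inlist_iterator_cost    -- 18: root/branch + 8 leaf probes + 8 table blocks
from
        dual
/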
The switch happens – it’s just arithmetic, and depends on how many leaf blocks Oracle thinks will be needed by the range scan compared to the number of leaf blocks that will be accessed by the iterated unique probe. The larger the list the more likely it is that Oracle will switch to using a range scan and filter.
The anomaly/bug
When I first wrote the PL/SQL block I thought I might need about 100 values to demonstrate the problem. In fact I would have needed a much larger data set to need that many values; nevertheless my first attempt introduced a puzzle that turned out to be an unpleasant glitch in the optimizer arithmetic.
If I set the PL/SQL block up to query using 96 variables I get the following plan:
===================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
===================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 96 | | |
| 1 | PARTITION LIST SINGLE | | 96 | 105 | 1 | +0 | 1 | 96 | | |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED | T1 | 96 | 105 | 1 | +0 | 1 | 96 | | |
| 3 | INDEX RANGE SCAN | T1_PK | 96 | 9 | 1 | +0 | 1 | 96 | | |
===================================================================================================================================================
It shouldn’t be a surprise to see the range scan/filter plan, with a total cost of 105: 9 for the index range scan + 96 for the table accesses.
Increase the number of variables to 97 and watch the cost (the plan doesn’t change):
===================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
===================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 97 | | |
| 1 | PARTITION LIST SINGLE | | 97 | 10 | 1 | +0 | 1 | 97 | | |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED | T1 | 97 | 10 | 1 | +0 | 1 | 97 | | |
| 3 | INDEX RANGE SCAN | T1_PK | 1 | 9 | 1 | +0 | 1 | 97 | | |
===================================================================================================================================================
Suddenly the Rows (estim) for the index is 1 (even though the row estimate for the table is 97) with the result that the total cost is now 10 – i.e. 9 for the index range scan plus 1 for the use of the single “predicted” rowid to visit the table.
Given very large data sets, and products with very large numbers of rows (which means big range scans and the appearance of the inlist iterator strategy for relatively long lists) you can imagine how baffling it must be when the optimizer suddenly says – “Hey, one extra value in the list, that means the range scan is really cheap” – and that’s probably why the OP was able to say: “The unique scan operation runs sub second, the range scan takes approx 10 secs to complete.”
You may recall that I pointed out in the code that my index really ought to be defined with “compress 2” – that’s true, but if you’ve got a system like this one already and haven’t used basic index compression, don’t rebuild the index to compress it just yet. If you halve the size of the index then (roughly) you will probably halve the size of the list where the optimizer switches from inlist iteration to the range scan/filter strategy.
Why, you might ask, does this error appear at 97 rows? The answer is in the statistics about the partitions / clients that I listed above. This partition/client has 134,710 rows and 1,390 distinct products; 134,710 / 1,390 = 96.91. In my case (which had no histograms), when the table row estimate exceeded the average number of rows per product the optimizer hit a glitch in the code path and used the wrong numbers in the arithmetic.
Let’s do one more experiment – take the code with 96 variables (which does most of the arithmetic “correctly”) and use the num_index_keys() hint to force Oracle into the iteration strategy; here’s the plan from SQL Monitor:
============================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
============================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 96 | | |
| 1 | PARTITION LIST SINGLE | | 96 | 194 | 1 | +0 | 1 | 96 | | |
| 2 | INLIST ITERATOR | | | | 1 | +0 | 1 | 96 | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID | T1 | 96 | 194 | 1 | +0 | 96 | 96 | | |
| 4 | INDEX UNIQUE SCAN | T1_PK | 96 | 98 | 1 | +0 | 96 | 96 | | |
============================================================================================================================================
The total cost is 194, compared to the cost of 105 for the range scan/filter plan. Despite the fact that Oracle “thinks” that the relevant section of index is only 9 leaf blocks (and I know it’s 17), it has assumed that every single one of the 96 probes it expects to make will have to do a physical read of a leaf block. The optimizer has not allowed for the “self-caching” that is going to happen as the lists get longer and longer – and that’s a fundamental reason behind this change from inlist iteration to range scanning with filters.
Summary
There may be all sorts of variations behind the switch between “iterate/unique” and “range/filter”: it’s likely that the presence of histograms would have some effect on the break point, and there may be some arithmetic relating the low/high values of the table/partition to the least/greatest values used in the list.
The bottom line, though, is that there’s some instability in the choice of execution plan. If you start seeing it and it’s causing performance problems your best bet is probably to use the num_index_keys() hint to ensure that you get the same plan every time.
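For reference, a hinted version of my test query would look something like the following sketch (the bind variable names are invented to match the PL/SQL block above):

select  /*+
                monitor
                index(t1 t1_pk)
                num_index_keys(t1 t1_pk 3)
        */
        *
from    t1 t1
where   t1.client_id  = :client_id
and     t1.product_id = :product_id
and     t1.id in (:b1, :b2, :b3, :b4, :b5, :b6, :b7, :b8)
/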
Footnote
When I re-ran the test on an instance of 23.3 Free it behaved the same way, with one (slight?) difference. My 19.11 instance had been estimating the number of rows correctly, so the sudden change appeared when I used 97 variables in the list; in 23.3 I had to use 100 variables in the list before the error appeared – but that was exactly the point where the table row estimate hit 97.
Indexing
This is a small case study from one of the MOS Community Fora (needs a MOS account) that I drafted about a year ago. It started out as a question about a two-node RAC system running Oracle 19.17 where an SQL Monitor report showed time spent in a wait for resmgr:cpu quantum. Here’s the report (with a little cosmetic editing) that was included in the posting:
SELECT COUNT (DISTINCT (ITEM_KEY))
FROM WF_NOTIFICATIONS
WHERE
ITEM_KEY = :B2
AND TO_USER = :B1
AND MESSAGE_TYPE = 'APINVAPR'
;
===================================================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Read | Read | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes | (%) | (# samples) |
===================================================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +162 | 1 | 1 | | | | |
| 1 | SORT AGGREGATE | | 1 | | 1 | +162 | 1 | 1 | | | | |
| 2 | VIEW | VW_DAG_0 | 1 | 102K | 1 | +162 | 1 | 1 | | | | |
| 3 | SORT GROUP BY NOSORT | | 1 | 102K | 1 | +162 | 1 | 1 | | | | |
| 4 | TABLE ACCESS BY INDEX ROWID | WF_NOTIFICATIONS | 1 | 102K | 162 | +1 | 1 | 1 | 408K | 3GB | 96.91 | Cpu (15) |
| | | | | | | | | | | | | resmgr:cpu quantum (1) |
| | | | | | | | | | | | | db file sequential read (141) |
| 5 | INDEX RANGE SCAN | WF_NOTIFICATIONS_N5 | 740K | 4836 | 160 | +3 | 1 | 3M | 17679 | 138MB | 3.09 | Cpu (3) |
| | | | | | | | | | | | | db file sequential read (2) |
===================================================================================================================================================================================
As far as the original question is concerned there are two key points to note – first that the wait in question appears once in the ASH / Activity data out of a total of 162 samples, so it’s not “important”. (That’s okay, the OP didn’t say it was a problem, only that they wanted to understand what it meant.)
Secondly, from a performance perspective, the index range scan produced 3 million rowids but the table access discarded almost all the rows, returning just one row for the distinct aggregation. Clearly the query could be made a lot more efficient (and, yes, you could consider the option for rewriting it with an “exists” subquery).
One of the other forum users addressed the reasons behind the resource manager wait so I followed up on the basic performance characteristics, commenting on the apparent absence of (or failure to use) a suitable index. The OP responded with details from the data dictionary covering index definitions, data volumes, and (rather usefully) counts of distinct values. I did a little cosmetic editing to the information supplied to produce the following details:
Table Name NUM_ROWS BLOCKS
-------------------- ---------- ----------
WF_NOTIFICATIONS 8151040 406969
Table Distinct Num. Sample
Table Name Attribute Values Density Nulls Size
-------------------- ------------- ---------- ------- ---------- ----------
WF_NOTIFICATIONS MESSAGE_TYPE 11 .0909 0 815104
WF_NOTIFICATIONS TO_USER 1026 .0010 60 815098
WF_NOTIFICATIONS ITEM_KEY 3538701 .0000 1032240 711880
CREATE UNIQUE INDEX "APPLSYS"."WF_NOTIFICATIONS_PK" ON "APPLSYS"."WF_NOTIFICATIONS" ("NOTIFICATION_ID")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N1" ON "APPLSYS"."WF_NOTIFICATIONS" ("RECIPIENT_ROLE", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N2" ON "APPLSYS"."WF_NOTIFICATIONS" ("GROUP_ID", "NOTIFICATION_ID")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N3" ON "APPLSYS"."WF_NOTIFICATIONS" ("MAIL_STATUS", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N4" ON "APPLSYS"."WF_NOTIFICATIONS" ("ORIGINAL_RECIPIENT")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N5" ON "APPLSYS"."WF_NOTIFICATIONS" ("MESSAGE_TYPE", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N6" ON "APPLSYS"."WF_NOTIFICATIONS" ("MORE_INFO_ROLE", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N7" ON "APPLSYS"."WF_NOTIFICATIONS" ("FROM_ROLE", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N8" ON "APPLSYS"."WF_NOTIFICATIONS" ("RESPONDER")
INDEX_NAME BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ----------- -----------------
WF_NOTIFICATIONS_PK 2 27130 1914910
WF_NOTIFICATIONS_N1 2 49400 2485980
WF_NOTIFICATIONS_N2 2 37620 1917700
WF_NOTIFICATIONS_N3 2 47060 816790
WF_NOTIFICATIONS_N4 2 36760 2490300
WF_NOTIFICATIONS_N5 2 49200 962650
WF_NOTIFICATIONS_N6 2 50830 727210
WF_NOTIFICATIONS_N7 3 57020 1034680
WF_NOTIFICATIONS_N8 2 9271 455861
There are many details we could pick up from these stats but to start with I’ll just point out that only one of the three columns in the query’s where clause appears in an index, and that column (message_type) has only 11 distinct values. The index is wf_notifications_n5(message_type, status) and for some reason the optimizer has decided to use that index for this query.
The choice is surprising since the reported cost for the query is 102,000 when the table is (apparently) only 406,969 blocks, which means we should expect the tablescan cost (using the default value of 8 for the “_db_file_optimizer_read_count”) to be in the region of 50,000; maybe someone has set the db_file_multiblock_read_count to 4 (or maybe the value for processes is so high that Oracle’s internal algorithm has made a similar adjustment behind the scenes).
The cost for the index range scan alone is 4,836 – which is approximately the number of leaf blocks divided by the number of distinct values of message_type: 49,200/11 = 4,472.
While we’re looking at the statistics you might note that the stats gathering seems to be using estimate_percent => 10 rather than auto_sample_size.
You’ll notice that the Rows (Estim) for the index range scan is 740,000 while the Rows (Actual) is 3 million. This tells you that there’s no histogram on the column (740,000 is very close to 8.15M rows / 11 distinct values) and that a histogram would be a good idea – a correct estimate might have been enough to tip the optimizer into a tablescan. In fact we can also note in passing that the to_user column might benefit from a histogram given that 12c and later can create frequency histograms up to 2,048 buckets.
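If you decided the histograms were worth having, a call like the following would be a starting point – this is only a sketch, and in a real system you’d fold the requirement into the existing stats-gathering strategy (and table preferences) rather than running a one-off call:

begin
        dbms_stats.gather_table_stats(
                ownname         => 'APPLSYS',
                tabname         => 'WF_NOTIFICATIONS',
                method_opt      => 'for columns size 2048 message_type to_user'
        );
end;
/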
A tablescan would, of course, still be expensive and do far more work than needed. Ideally we need an index to be able to find the rows that match all three equality predicates. However, we can see that there are roughly 2 rows per value for item_key, so an index on just (item_key) might be good enough. Averages, of course, can be very misleading: we might have one row for each of 3.5M values of item_key and one value of item_key with 3.5M rows in our table, so we do need to know more about the data distribution before we can make any solid suggestions.
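A quick (though not necessarily cheap on an 8M row table) way of checking that distribution is a “histogram of counts” along the following lines – possibly run against a sample if the full scan is too expensive:

select
        cnt, count(*) num_keys
from    (
        select  item_key, count(*) cnt
        from    applsys.wf_notifications
        where   item_key is not null
        group by
                item_key
        )
group by
        cnt
order by
        cnt
/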
You always have to be careful when adding indexes to tables to address specific queries – they may have adverse side-effects on other queries (and on the DML workload); and you should always look for options to maximise the benefit of any index you add. The “obvious” choice in this case is (message_type, to_user, item_key) compress 2, which would allow the query to execute without visiting the table at all, and probably do very little work. A better alternative might be (message_type, item_key) compress 1, because it could be good enough for the supplied query and very helpful if there is a similar query of the form:
SELECT COUNT (DISTINCT (ITEM_KEY))
FROM WF_NOTIFICATIONS
WHERE
ITEM_KEY = :B2
AND FROM_USER = :B1
AND MESSAGE_TYPE = 'APINVAPR'
(Note how I’ve changed to_user to from_user at line 5)
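Purely for illustration – the index names here are invented, and you would want to assess the extra DML overhead and the effect on other statements before creating anything – the two candidate indexes described above would look something like this:

create index applsys.wf_notifications_mti on
        applsys.wf_notifications(message_type, to_user, item_key)
        compress 2
/

create index applsys.wf_notifications_mi on
        applsys.wf_notifications(message_type, item_key)
        compress 1
/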
A Rewrite
Even with a “perfect” index the query could still do quite a lot of unnecessary work. If we have a few values for item_key with a large number of rows, Oracle will walk through all the relevant index entries before doing the “distinct aggregate” (even though there’s at most one possible value of item_key being returned and the final result has to be zero or one). So, with the best index in place, it might still be nice to rewrite the query in the following form:
select count(*)
from dual
where exists (
select null
from wf_notifications
where
item_key = :B2
and to_user = :B1
and message_type = 'APINVAPR'
)
/
You’ll notice, by the way, that the original SQL is all in capitals with bind variables of the form :Bnnn (nnn numeric). This suggests it’s code that is embedded in PL/SQL – so maybe it’s from a “3rd party” package (the schema name APPLSYS that appears in the “create index” statements looks familiar) that can’t be rewritten.
ORA-00942
Oracle error 942 translates, for most people, into: “table or view does not exist”. This note is based on a conversation I had with Mikhail Velikikh on the Oracle-l list server over the last couple of days while looking at the question: “How do I find out which table (or view) does not exist?”.
There are several possibilities – none of them ideal – but one suggestion that came up was to take advantage of dumping the errorstack when error 942 is signalled:
alter session set events '942 trace name errorstack forever, level [N]';
Mikhail suggested using level 2 and then picking up various details from the resulting trace file to allow further information to be extracted from the SGA.
This seemed a little bit like hard work and didn’t match a distant memory I had of solving the problem, so I ran up a little test (based on an extremely simplified model of the task the OP had been working on) using level 1:
create or replace package pack1 as
procedure failure;
end;
/
alter session set tracefile_identifier='level1';
alter session set events '942 trace name errorstack forever, level 1';
create or replace package body pack1 as
procedure failure is
v1 varchar2(64);
begin
select name into v1 from not_a_table where rownum = 1;
dbms_output.put_line(v1);
end;
end;
/
alter session set events '942 trace name errorstack off';
show errors
This produced the output:
Package created.
Session altered.
Session altered.
Warning: Package Body created with compilation errors.
Session altered.
Errors for PACKAGE BODY PACK1:
LINE/COL ERROR
-------- -----------------------------------------------------------------
6/2 PL/SQL: SQL Statement ignored
6/27 PL/SQL: ORA-00942: table or view does not exist
A quick search through the resulting trace file showed me the following pair of consecutive lines, about half way through a 10,000 line file:
5109 :about to signal 942
5110 -Name: NOT_A_TABLE
It’s a shame that the schema name was not included, but it might be sufficient. In fact this pair of lines would have appeared (without the extra 10,000 lines) if I had used “942 trace name context forever”.
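In other words (a sketch only – I haven’t repeated the whole test with it) the lighter-weight command would simply have been:

alter session set events '942 trace name context forever';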
Unfortunately Mikhail’s trace file didn’t echo these two lines! This was a surprise, since I was using 19.11 for my test and Mikhail was using 19.23 – details don’t (often) disappear in the upgrade. It didn’t take Mikhail long to find the answer to the puzzle, which I can demonstrate by changing one line of my code. Instead of selecting from a completely non-existent table I’ll change the select statement in the code above to select from a table that does exist, but that I don’t have privileges for:
select name into v1 from sys.obj$ where rownum = 1;
Oracle reports the same error message – but doesn’t write the critical two lines into the trace file.
Moral
I’ve said it many times in the past – it’s hard to test properly; sometimes it doesn’t even occur to you that a test you’ve done is a special case, which means it’s easy to jump from a specific case to a general conclusion that isn’t going to work for everyone. If you want to publish some interesting observation it’s a good idea to publish the model you used to derive that conclusion.
Footnote
The more recent versions of Oracle have seen enhancements to several error messages to improve the information they report. In particular 23ai will now report the schema and object name of the table or view that “does not exist”. (I haven’t checked yet if it distinguishes between “really doesn’t exist” and “exists, but you’re not allowed to access it”.)
Rownum quiz
Here’s a silly little puzzle that baffled me for a few moments until I spotted my typing error. It starts with a small table I’d created to hold a few rows, and then deletes most of them. Here’s a statement to create and populate the table:
create table t1 (id number , c1 clob)
lob(c1) store as basicfile text_lob (
retention disable storage in row
);
insert into t1
select rownum, rpad(rownum,200,'0')
from all_objects
where rownum <= 1000
;
commit;
Here’s what I meant to type to delete most of the data – followed by the response from SQL*Plus:
SQL> delete from t1 where mod(id,20) != 0;
950 rows deleted.
Here’s what I actually typed, with the response, that gave me a “What?!” moment:
SQL> delete from t1 where mod(rownum,20) != 0;
19 rows deleted.
I don’t think it will take long for you to work out why the result is so different; but I think it’s a nice warning about what can happen if you get a bit casual about using rownum.
AWR Snap ID
What to do when you hit a problem (possibly after an incomplete recovery) that reports an “ORA-00001 unique key violation” on sys.wrm$_snapshot_pk – as reported recently in this thread on the MOSC SQL Performance forum (needs a MOS account).
Snapshot ids are carefully sequenced, without gaps, so somehow the thing that controls the “current” sequence number has gone backwards and is trying to generate a value that is lower than the current highest value in wrm$_snapshot. The thread I referenced above does point to an article dated 2017 on Alibaba discussing methods of checking for corruption and clearing up messes; but as an extra option you could simply try hacking the control table to set the “last used” snapshot id to something higher than the highest value currently in wrm$_snapshot. The table you need to hack is wrm$_wr_control and here’s an example of its contents from an instance of 19.11 (preceded by a check of the current maximum snap_id in wrm$_snapshot):
SQL> select max(snap_id) max_snap_id, max(end_interval_time) max_snap_time from wrm$_snapshot;
MAX_SNAP_ID MAX_SNAP_TIME
----------- ---------------------------------------------------------------------------
7304 09-APR-24 07.00.14.180 PM
SQL> execute print_table('select * from wrm$_wr_control')
DBID : 3158514872
SNAP_INTERVAL : +00000 01:00:00.0
SNAPINT_NUM : 3600
RETENTION : +00008 00:00:00.0
RETENTION_NUM : 691200
MOST_RECENT_SNAP_ID : 7304
MOST_RECENT_SNAP_TIME : 09-APR-24 07.00.15.169 PM
MRCT_SNAP_TIME_NUM : 1712685600
STATUS_FLAG : 2
MOST_RECENT_PURGE_TIME : 09-APR-24 08.35.57.430 AM
MRCT_PURGE_TIME_NUM : 1712648156
MOST_RECENT_SPLIT_ID : 7295
MOST_RECENT_SPLIT_TIME : 1712648156
SWRF_VERSION : 30
REGISTRATION_STATUS : 0
MRCT_BASELINE_ID : 0
TOPNSQL : 2000000000
MRCT_BLTMPL_ID : 0
SRC_DBID : 3158514872
SRC_DBNAME : CDB$ROOT
T2S_DBLINK :
FLUSH_TYPE : 0
SNAP_ALIGN : 0
MRCT_SNAP_STEP_TM : 1712685613
MRCT_SNAP_STEP_ID : 0
TABLESPACE_NAME : SYSAUX
-----------------
1 row(s) selected
PL/SQL procedure successfully completed.
I have to state that I would not try this on a production system without getting permission – possibly written in blood – from Oracle support: but if I were really desperate to keep existing snapshots and to make it possible for Oracle to generate new snapshots as soon as possible I might try updating the most_recent_snap_id to the value shown in wrm$_snapshot.
Of course you ought to update the most_recent_snap_time as well, and the mrct_snap_time_num – which, on my instance, looks like the number of seconds since 1st Jan 1970 (a Unix-style epoch time).
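Purely as a sketch of what that hack might look like – untested, using the dbid reported above, and only ever with Oracle Support’s blessing – the update might be something like:

update  wrm$_wr_control        wrc
set     (
                wrc.most_recent_snap_id,
                wrc.most_recent_snap_time,
                wrc.mrct_snap_time_num
        ) = (
                select
                        max(wrs.snap_id),
                        max(wrs.end_interval_time),
                        -- assuming the "_num" column really is seconds since 1st Jan 1970,
                        -- and ignoring any time zone adjustment
                        round((cast(max(wrs.end_interval_time) as date) - date '1970-01-01') * 86400)
                from    wrm$_snapshot   wrs
                where   wrs.dbid = wrc.dbid
        )
where   wrc.dbid = 3158514872
/

commit;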
Then there’s the interpretation and sanity checking of the other “most recent / mrct” columns to worry about, and the possibility of PDBs vs. CDBs – but those are topics that I’m going to leave to someone else to worry about.