Jonathan Lewis
Everything Changes
I saw a recent tweet (on Bluesky) from SQLDaily highlighting a blog note that Lukas Eder wrote in 2016 with the title: “Avoid using COUNT() in SQL when you could use EXISTS()”. This is a good guideline because it probably maximises the options the optimizer has for producing a good execution plan, but it’s not always necessary in recent versions of Oracle. It depends on what you’re trying to achieve and how you’re using count(), of course, but there are cases where the optimizer is capable of turning a count() subquery into an exists() subquery. Here’s a little demo running on a PDB in 19.11.0.0:
rem
rem Script: count_to_exists.sql
rem Author: Jonathan Lewis
rem Dated: Dec 2024
rem Purpose:
rem
rem Last tested
rem 23.4(ai)
rem 19.11.0.0
rem 12.2.0.1
rem
create table t1 as
select *
from all_objects
where rownum <= 5e4
/
create table t2 as
select *
from t1
where mod(object_id,50) = 0
/
create index t2_i1 on t2(owner);
set linesize 120
set pagesize 60
set trimspool on
set tab off
column object_name format a32
column owner format a32
select owner, count(*)
from t1
group by
owner
order by
count(*)
/
select owner from t1
minus
select owner from t2
/
set feedback only
set serveroutput off
select
owner, object_id, object_name
from
t1
where
t1.owner like 'D%'
and 0 < (
select count(*)
from t2
where t2.owner = t1.owner
)
/
set feedback on
select * from table(dbms_xplan.display_cursor)
/
I’ve included a couple of queries to show the numbers of rows for each owner in t1 and to highlight the owner values that appear in t1 but don’t appear in t2. (I used the results of those queries to pick the ‘D%’ predicate in the final query.) Here’s the execution plan for the final query in 19.11.0.0:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID cwj9t6wzhczgh, child number 0
-------------------------------------
select owner, object_id, object_name from t1 where t1.owner like
'D%' and 0 < ( select count(*) from t2 where t2.owner = t1.owner
)
Plan hash value: 82129279
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 278 (100)| |
|* 1 | HASH JOIN RIGHT SEMI| | 3244 | 167K| 278 (1)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T2_I1 | 144 | 720 | 2 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL | T1 | 6025 | 282K| 276 (1)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("T2"."OWNER"="T1"."OWNER")
2 - access("T2"."OWNER" LIKE 'D%')
filter("T2"."OWNER" LIKE 'D%')
3 - filter("T1"."OWNER" LIKE 'D%')
There’s no sign of any aggregation in the plan – so the count(*) can’t be happening. Unfortunately the optimizer has transformed the subquery out of existence and executed the query as a semi-join. So let’s make it easier to see what’s happened to the count(*) by stopping the optimizer from unnesting the subquery, and let’s use explain plan so that we can see what the complete Predicate Information looks like when the plan operates with a subquery.
explain plan for
select
owner, object_id, object_name
from
t1
where
t1.owner like 'D%'
and 0 < (
select /*+ no_unnest */
count(*)
from t2
where t2.owner = t1.owner
)
/
select * from table(dbms_xplan.display)
/
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3988898734
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 463 | 22224 | 278 (1)| 00:00:01 |
|* 1 | FILTER | | | | | |
|* 2 | TABLE ACCESS FULL| T1 | 6025 | 282K| 276 (1)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T2_I1 | 2 | 10 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "T2" WHERE
"T2"."OWNER"=:B1))
2 - filter("T1"."OWNER" LIKE 'D%')
3 - access("T2"."OWNER"=:B1)
As you can see, the optimizer has worked out that “select count(*) > 0” is equivalent to “exists select” and transformed the subquery accordingly; which means it will stop after finding the first matching row.
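For comparison, here’s the explicit exists() version of the subquery – a sketch of the rewrite the optimizer is effectively doing for us (the Predicate Information above shows the transformed text it actually used):
select
owner, object_id, object_name
from
t1
where
t1.owner like 'D%'
and exists (
select /*+ no_unnest */
null
from t2
where t2.owner = t1.owner
)
/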
Conclusion
I started by pointing out that a recent (Bluesky) tweet was echoing a blog note published in 2016. I also agreed that the general principle of using exists() to find the first occurrence rather than using count(*) was a good idea when you didn’t need to know “how many” and just needed to know “are there any”.
However, I then demonstrated that the (Oracle) optimizer is so sophisticated that it can recognise when a count() subquery can be transformed into an exists() subquery – presumably because it gets a “clue” from the comparison with zero.
The point of this blog note, and the moral of the story, is this: when trying to solve a problem by searching the internet, an article published 8 years ago may not give you the most up to date information or the best possible solution.
Consistent?
Here’s a cut-n-paste from a simple SQL*Plus session running under 19.11.0.0 (23.4 produces the same effects):
SQL> drop table t purge;
Table dropped.
SQL> create table t (id number, val number) rowdependencies;
Table created.
SQL> insert into t values(1,0);
1 row created.
SQL> commit;
Commit complete.
SQL> insert into t values(2,0);
1 row created.
SQL> update t set val = 1;
2 rows updated.
SQL> select * from t;
ID VAL
---------- ----------
1 1
2 1
2 rows selected.
Open a second terminal and start an SQL*Plus session that does the following:
SQL> update t set val = val + 1;
This statement is going to hang because the first (and only) row it ought to see at this moment in time is locked by session 1. If session 1 commits then session 2 should detect that it’s trying to update a row that has changed since the start of the transaction and rollback to restart so that it can do an update that is self-consistent as at a specific point in time.
Question – how many rows will session 2 update?
Here’s the cut-n-paste from session 2 that appeared as soon as session 1 committed:
1 row updated.
SQL> select * from t;
ID VAL
---------- ----------
1 2
2 1
2 rows selected.
This doesn’t look right. At one point in time the committed data in the table was the single row (1,0), and at another point in time the committed data in the table was the pair of rows (1,1) and (2,1). How can a single, self-consistent update statement operate as at a single point in time and leave the table holding the rows you see above? Surely the statement should either fail leaving val = 1 in both rows, or it should succeed consistently leaving val = 2 in both rows.
As a further detail, note how I’ve enabled rowdependencies on the table. (This is something I usually do whenever I want to take advantage of the flashback version query feature though it’s not a necessity in this case). Here’s what I see when I commit in session 2 then query the entire versioning history of the table:
column versions_startscn format 99,999,999,999,999
column versions_endscn format 99,999,999,999,999
column versions_starttime format a24
column versions_endtime format a24
select
versions_starttime,
versions_startscn,
versions_endtime,
versions_endscn,
versions_operation,
versions_xid,
id, val
from t versions between scn minvalue and maxvalue
/
VERSIONS_STARTTIME VERSIONS_STARTSCN VERSIONS_ENDTIME VERSIONS_ENDSCN V VERSIONS_XID ID VAL
------------------------ ------------------- ------------------------ ------------------- - ---------------- ---------- ----------
09-DEC-24 02.38.22 PM 12,670,482,437,540 U 04001B004F600000 1 2
09-DEC-24 02.35.52 PM 12,670,482,437,468 I 06000000DA740000 2 1
09-DEC-24 02.35.52 PM 12,670,482,437,468 09-DEC-24 02.38.22 PM 12,670,482,437,540 U 06000000DA740000 1 1
09-DEC-24 02.35.52 PM 12,670,482,437,468 1 0
4 rows selected.
Surprisingly there is no “insert transaction” visible for (1,0), but, reading from the bottom upwards we can see that (1,0) ceased to exist at SCN 12,670,482,437,468 and that transaction 06000000DA740000 updated that row to (1,1) and inserted row (2,1). What we’re seeing as “versions” are the “versions as at commit”, of course, rather than the results of all the individual modifications (which is what you can get from Log Miner).
If we look at the (1,1) row again we can see that that version ceased to exist at SCN 12,670,482,437,540 and the top line of the output tells us that that’s thanks to transaction 04001B004F600000 changing it to (1,2). That’s our update from the second session that we believe should have updated every existing row in the table by adding 1 to the val column.
Finally, when we look at the row for (2,1) we see that it came into existence at the same time (SCN 12,670,482,437,468) as row (1,1) but was not changed by the transaction that “happened” a little later and changed (1,1) to (1,2).
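If you want to double-check any of the intermediate images, a flashback “as of scn” query using one of the SCNs reported above should reproduce the committed content of the table at that moment (assuming the undo is still available) – a minimal sketch:
select id, val
from t as of scn 12670482437540
order by id
/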
This behaviour first came to my attention in a comment to a note I wrote in 2011 about an upgrade from 11.1.0.6 to 11.2.0.2. The sample I’ve shown above is taken from that comment, modified slightly to separate the table creation from the first insert (just in case create as select introduced some peculiar special effects) and there are other examples and observations in the comments following the one I’ve linked to. The anomaly described in the actual blog note was addressed in 11.2.0.3 and 12.1.0.2 but, as you can see, there are other anomalies still waiting to be addressed.
Fetch size
Part of my effort to clear my long backlog of drafts
This is a trivial script I wrote many years ago – the date on the draft (one of the oldest on my blog) was August 2009 – but the script itself must have started life in 8i, if not earlier.
All the script does is execute a simple-minded test to find SQL that may be wasting resources and causing contention by using “small” array fetches to return “large” amounts of data. It does this through a scan of v$sql or (for later versions of Oracle) v$sqlstats or v$sqlstats_plan_hash to compare execution counts, number of fetch calls, and total rows returned.
The original code – probably written for 8i – used v$sql and reported the first_load_time for each child cursor but an update to the script took advantage of the introduction in 10g of the view v$sqlstats and started reporting the last_active_time – which means you could choose to ignore (temporarily, perhaps) any statements that had not been active in the very recent past.
I’ve tended to avoid the view v$sqlarea because, in its earliest form, it was simply an aggregate view of the x$ structure behind v$sql. When v$sqlstats appeared its contents were similar to those of v$sqlarea but its published implementation was a simple query against a different x$ so that’s the one I started using (having failed to notice that the implementation for v$sqlarea had changed to a simple query against yet another x$ at about the same time). Both x$ structures are created on the fly into the PGA so they’re probably doing similar amounts of aggregation under the covers, although with many differences in the details they now report – including a number of “deltas since last AWR snapshot” in the newer view.
rem
rem Script: small_fetch.sql
rem Author: Jonathan Lewis
rem Dated: (pre) Aug 2009
rem
set linesize 156
set pagesize 40
set trimspool on
set tab off
break on row skip 1
column first_load_time format a20
column last_active_time format a20
column sql_text wrap word
spool small_fetch
select
--
-- v$sql.parsing_user_id,
-- v$sql.parsing_schema_id,
-- v$sql.first_load_time,
--
to_char(last_active_time,'yyyy-mm-dd/hh24:mi:ss') last_active_time,
sql_id,
plan_hash_value,
executions,
end_of_fetch_count,
fetches,
rows_processed,
round(rows_processed/nullif(executions,0),0) rows_per_exec,
round(rows_processed/nullif(fetches,0),0) fetchsize,
sql_text
from
v$sqlstats
-- v$sql
-- v$sqlstats_plan_hash
where
plan_hash_value != 0 -- pl/sql block
-- and (v$sql.parsing_user_id != 0 or v$sql.parsing_schema_id != 0)
and fetches > 25000
and round(rows_processed/nullif(executions,0),0) > 5000
and round(rows_processed/nullif(fetches,0),0) < 50
order by
fetches desc
;
spool off
The numbers I’ve used in the SQL above are fairly arbitrary; your choices for “large” and “small” will almost certainly be different.
The original script is probably an ad hoc statement that I wrote in a few minutes on a client site to poke around the SQL being executed by a system that was showing network problems and surprisingly large numbers of waits for “SQL*Net message to client”, so I’ve published it more as an example of the type of thing you could do than an example of “best practice”. It may be helpful, but your specific circumstances and requirements may make you want to change the choice of view you use, the column you report, and the predicates you use to filter the results.
You’ll notice from the code that v$sql includes the parsing_user_id and parsing_schema_id, but v$sqlstats doesn’t. A convenience of using v$sql is that you can eliminate any statements parsed by SYS simply by adding the predicate “parsing_user_id != 0”. The reason why this is convenient is that a lot of the sys-recursive (largely dictionary cache) queries do single row fetches and you don’t want to clutter your report with things you can’t change.
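Here, as a sketch rather than a tested production script, is the sort of variation you might run against v$sql if you wanted to exclude the sys-recursive statements – the thresholds are just as arbitrary as before:
select
to_char(last_active_time,'yyyy-mm-dd/hh24:mi:ss') last_active_time,
sql_id,
plan_hash_value,
executions,
fetches,
rows_processed,
round(rows_processed/nullif(fetches,0),0) fetchsize,
sql_text
from
v$sql
where
plan_hash_value != 0 -- pl/sql block
and (parsing_user_id != 0 or parsing_schema_id != 0)
and fetches > 25000
and round(rows_processed/nullif(executions,0),0) > 5000
and round(rows_processed/nullif(fetches,0),0) < 50
order by
fetches desc
;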
A reason why v$sqlstats might be your preferred choice is that it has a large number of “delta_statistic” columns – which give you counts and times since the last AWR snapshot (though how that works if you’re not licenced to run the AWR, I haven’t checked).
Caution
The header on the script suggests that the last system I used (and edited) it on was 11.1.0.7 – so before container databases came into existence. There are all sorts of columns in the views I’ve mentioned that might be very helpful, but there’s also the potential for traps relating to which schema can see what data, and from which PDB. Here, for example, is a cut-n-paste showing something that looks like a bug when running as SYS in a PDB on an instance running 19.11.0.0: note particularly how I’ve changed the view in the three queries from v$sql to v$sqlstats then back to v$sql.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sql group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1282
1 row selected.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sqlstats group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1395
1 row selected.
SQL> select con_id, count(*) from (select sql_id, con_id, count(*) ct from V$sql group by sql_id, con_id ) group by con_id;
CON_ID COUNT(*)
---------- ----------
3 1282
1 row selected.
What can v$sqlstats see that v$sql can’t see?
Optimizer Environment 2
Many years ago I wrote a short note to introduce the view v$sql_optimizer_env (and the session and system equivalents) which report the current values of optimizer related parameters (some of which are not found in the startup file) for a cursor child, or a session, or the system. I also mentioned the underlying x$kqlfsqce (for v$sql_optimizer_env) that held far more parameters than were reported in the v$ view.
Recently I commented on a question about different plans (i.e. child cursors) for the same query for different sessions and noted that changing the optimizer_features_enable might not change any of the visible (v$) parameters – other than optimizer_features_enable itself – so it might be necessary to check the x$ to find out what had changed. Here’s a possible query to do that, followed by a few lines of output:
rem
rem Script: optimizer_env_b.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem Purpose:
rem
-- select count(*) from all_users;
define m_sql_id = 'b2zqhgr5tzbpk'
spool optimizer_env_b.lst
select
kqlfsqce_pname,
kqlfsqce_chno,
kqlfsqce_pvalue,
kqlfsqce_flags
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
order by
kqlfsqce_pname,
kqlfsqce_chno
/
spool off
KQLFSQCE_PNAME KQLFSQCE_CHNO KQLFSQCE_PVALUE KQLFSQCE_FLAGS
-------------------------------------------------- ------------- ------------------------- --------------
BlockChain_ledger_infrastructure 0 0 10
BlockChain_ledger_infrastructure 1 0 10
CLI_internal_cursor 0 0 10
CLI_internal_cursor 1 0 10
PMO_altidx_rebuild 0 0 2
PMO_altidx_rebuild 1 0 2
_adaptive_window_consolidator_enabled 0 true 7
_adaptive_window_consolidator_enabled 1 true 7
_add_stale_mv_to_dependency_list 0 true 6
_add_stale_mv_to_dependency_list 1 true 6
_aggregation_optimization_settings 0 0 7
_aggregation_optimization_settings 1 0 7
_allow_level_without_connect_by 0 false 6
_allow_level_without_connect_by 1 false 6
...
uniaud_internal_cursor 0 0 10
uniaud_internal_cursor 1 0 10
workarea_size_policy 0 auto 2
workarea_size_policy 1 auto 2
1226 rows selected.
The script allows you to set an sql_id to report – the value I’ve used is for the statement “select count(*) from all_users”. The query as it stands has to be run by sys, but you could create a view (say x_$kqlfsqce) in the sys schema, grant select on that view to public, then create a public synonym for it so that everyone else can use the same query as sys:
create or replace view x_$kqlfsqce as select * from x$kqlfsqce;
grant select on x_$kqlfsqce to public;
create public synonym x$kqlfsqce for x_$kqlfsqce;
The drawback to this query is clear in the final lines of the output – in 19.11 it returns 613 rows per child cursor, leaving you to analyze (by eye?) the entire list to find the parameter names where the parameter values differ. A better query would report only those parameters where the values differed (ideally listing the values side by side, but that’s another story). For example:
rem
rem Script: optimizer_env_2.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem
-- select count(*) from all_users;
define m_sql_id = 'b2zqhgr5tzbpk'
spool optimizer_env_2.lst
with child_count as (
select
/*+ materialize */
count(distinct kqlfsqce_chno) child_count
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
)
select
kqlfsqce_chno,
kqlfsqce_pname,
kqlfsqce_pvalue,
kqlfsqce_flags
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
and (kqlfsqce_pnum, kqlfsqce_pvalue) in (
select
kqlfsqce_pnum, kqlfsqce_pvalue
from x$kqlfsqce
where kqlfsqce_sqlid = '&m_sql_id'
group by
kqlfsqce_pnum, kqlfsqce_pvalue
having count(*) != (select child_count from child_count)
)
order by
kqlfsqce_pname,
kqlfsqce_chno
/
spool off
KQLFSQCE_CHNO KQLFSQCE_PNAME KQLFSQCE_PVALUE KQLFSQCE_FLAGS
------------- -------------------------------------------------- ------------------------- --------------
0 _bloom_filter_setops_enabled true 7
1 _bloom_filter_setops_enabled false 7
0 _bloom_pruning_setops_enabled true 7
1 _bloom_pruning_setops_enabled false 7
0 _cell_offload_grand_total true 7
1 _cell_offload_grand_total false 7
0 _cell_offload_vector_groupby_fact_key true 7
1 _cell_offload_vector_groupby_fact_key false 7
...
0 _sqlexec_reorder_wif_enabled true 7
1 _sqlexec_reorder_wif_enabled false 7
0 optimizer_features_enable 19.1.0 2
1 optimizer_features_enable 12.2.0.1 0
58 rows selected.
For this example I ran my query against all_users twice, but changed the optimizer_features_enable from 19.1.0 (the default for 19.11) to 12.2.0.1 for the second execution, so I got two child cursors. As you can see from the output the number of optimizer parameters that changed was 29 (half of 58) and the only parameter that would have been visible in v$sql_optimizer_env was optimizer_features_enable itself.
The SQL identifies the duplicates by initially counting the number of child cursors for the sql_id – because you might have more than just two, and knowing the number of child cursors is the only way of identifying which parameters have different values for different child cursors.
The fact that (in general) you won’t necessarily know how many child cursors you have to compare and report is what makes it a little harder to write a query that reports variations in values across the page – so it’s left as an exercise for the reader. Personally I’d be inclined to add a clause “kqlfsqce_chno in (N, M)”, picking child cursors with different execution plans, to limit the number to just two interesting child cursors and stick with reporting down the page.
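As a sketch of that idea (child numbers 0 and 1 here are just placeholders for the pair you pick), the restriction might look like this:
select
kqlfsqce_chno,
kqlfsqce_pname,
kqlfsqce_pvalue
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
and kqlfsqce_chno in (0, 1)
and (kqlfsqce_pnum, kqlfsqce_pvalue) in (
select kqlfsqce_pnum, kqlfsqce_pvalue
from x$kqlfsqce
where kqlfsqce_sqlid = '&m_sql_id'
and kqlfsqce_chno in (0, 1)
group by kqlfsqce_pnum, kqlfsqce_pvalue
having count(*) != 2
)
order by
kqlfsqce_pname,
kqlfsqce_chno
/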
Addendum
In case you’re wondering about the optimizer parameters that aren’t startup parameters, here’s a quick query to run as SYS. (The output is from 19.11 again):
select
distinct kqlfsqce_pname
from
x$kqlfsqce
where
kqlfsqce_sqlid = '&m_sql_id'
minus
select
ksppinm
from
x$ksppi
/
KQLFSQCE_PNAME
--------------------------------------------------------------------------------
BlockChain_ledger_infrastructure
CLI_internal_cursor
PMO_altidx_rebuild
_enable_pmo_ctas
_fix_control_key
_is_lock_table_for_ddl_wait_lock
_is_lock_table_for_split_merge
_long_varchar_allow_IOT
_multi_commit_global_index_maint
_nls_forced_binary
_suppress_scn_chk_for_cqn
_update_bji_ipdml_enabled
_update_global_index_from_part_exchange
_use_hidden_partitions
advanced_queuing_internal_cursor
flashback_data_archive_internal_cursor
flashback_table_rpi
ilm_access_tracking
ilm_dml_timestamp
ilm_filter
iot_internal_cursor
is_recur_flags
kkb_drop_empty_segments
only_move_row
optimizer_features_hinted
optimizer_mode_hinted
parallel_autodop
parallel_dblink
parallel_ddl_forced_degree
parallel_ddl_forced_instances
parallel_ddl_mode
parallel_ddldml
parallel_degree
parallel_dml_forced_dop
parallel_dop_doubled
parallel_execution_enabled
parallel_hinted
parallel_max_degree
parallel_query_default_dop
parallel_query_forced_dop
parallel_query_mode
partition_pruning_internal_cursor
rim_node_exist
sql_from_coordinator
sqlstat_enabled
total_cpu_count
total_processor_group_count
transaction_isolation_level
uniaud_internal_cursor
49 rows selected.
Following up my announcement of this note on Bluesky someone asked if this information appeared in the 10053 (CBO) trace file. The answer is yes. Under a heading “PARAMETERS USED BY THE OPTIMIZER” there are three sub-headings (also in capitals, but made more readable below):
- Parameters with altered values
- Parameters with default values
- Parameters in opt_param hint
The first two of these sub-headings share the full set of optimizer parameters, with no duplicates across the two lists. The last sub-heading introduces duplication, listing the parameters where an opt_param() hint has over-ridden the default (system) or altered (session) settings.
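If you want to generate that trace file for a statement, the traditional approach is the 10053 event – a minimal sketch (remember the statement has to be hard parsed while the event is set, so you may need to modify the text slightly or flush it from the shared pool first):
alter session set events '10053 trace name context forever, level 1';
-- hard parse the statement of interest
select count(*) from all_users;
alter session set events '10053 trace name context off';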
WWSS
What would Shakespeare say?
- Q: So, Mr. Shakespeare, which RDBMS do you use to look up your “famous quotes”?
- A: My lords, I refer me to the Oracle.
- Q: And how would you feel about migrating to SQL Server or DB2?
- A: A plague a’ both your houses.
- Q: So what is your basic requirement for finding a good quote to use?
- A: If it were done when ’tis done, then ’twere well it were done quickly.
- Q: Do you think that using multiple block sizes in your database would improve performance?
- A: It is a tale told by an idiot full of sound and fury signifying nothing.
- Q: The CBO has a big part to play in performance – what makes it so difficult to understand?
- A: Her infinite variety.
- Q: And what would you say to a DBA who claims to understand the CBO?
- A: There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.
- Q: Do you think it’s a good idea to use hints when the CBO misbehaves?
- A: Though this be madness, yet there is method in ‘t.
- Q: And how much faith do you have in the execution plans that the CBO generates?
- A: Oft expectation fails, and most oft there Where most it promises.
- Q: NULLs can, of course, make things hard for the CBO – what’s your view on them?
- A: Nothing can come of nothing.
- Q: And how do you interpret NULLs?
- A: To be or not to be, that is the question.
- Q: Do you document the SQL statements you write?
- A: It is a custom more honor’d in the breach than the observance.
- Q: What is your experience with Oracle support?
- A: The course of true love never did run smooth.
- Q: And how do you think support will be affected by the introduction of AI?
- A: More matter with less art.
- Q: Are we going to see 23ai installed at Anne Hathaway’s cottage any time soon?
- A: Expectation is the root of all heartache.
- Q: And finally, what do you make of Sym42 and the MASH program?
- A: O brave new world that has such people in it.
4096 Columns
If you thought 1,000 columns was over the top for a table in Oracle, then you’ll love 23ai which allows you to stretch that to 4,096 columns if you’ve set the parameter max_columns = extended in the spfile (or the pdb equivalent).
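Based on that description, enabling the feature might look something like the following sketch – it’s a static parameter, so it needs an spfile change and an instance restart, and you should check the documentation for the exact pdb-level requirements:
alter system set max_columns = extended scope = spfile;
shutdown immediate
startup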
Naturally I was curious to see if there had been any significant structural changes to the way Oracle stored a row when the number of columns became very large, so I took the code I had written to examine the case of 1,000 columns, and hacked it around a little bit.
rem
rem Script: wide_table_4096.sql
rem Author: Jonathan Lewis
rem Dated: 7th April 2023
rem
rem Last tested
rem 23.4.25.4 (23ai)
rem
rem See also
rem https://jonathanlewis.wordpress.com/2015/02/19/255-columns/
rem https://jonathanlewis.wordpress.com/2015/05/18/migrated-rows/
rem
set pagesize 0
set feedback off
drop table if exists t1 purge;
spool temp.sql
prompt create table t1(
select
'col' || to_char(rownum,'fm0000') || ' varchar2(10),'
from
all_objects
where rownum <= 4095
;
prompt col4096 varchar2(10)
prompt )
prompt /
spool off
@temp
insert /*+ append */ into t1 (col0001, col0002, col0236,col0491, col0746, col4096)
values ('0001','0002', '0236','0491','0746','4096')
;
commit;
set pagesize 40 linesize 132 feedback on
-- column ref_file_no new_value m_fileno
column abs_file_no new_value m_fileno
column block_no new_value m_blockno
column value new_value m_trace_file
spool wide_table_4096
select
dbms_rowid.rowid_to_absolute_fno(rowid, user, 'T1') abs_file_no,
dbms_rowid.rowid_relative_fno(rowid,'BIGFILE') rel_file_no,
dbms_rowid.rowid_relative_fno(rowid,'BIGFILE') big_rel_file_no,
dbms_rowid.rowid_relative_fno(rowid,'SMALLFILE') small_rel_file_no,
dbms_rowid.rowid_relative_fno(rowid) def_rel_file_no,
dbms_rowid.rowid_block_number(rowid,'BIGFILE') block_no,
dbms_rowid.rowid_block_number(rowid,'SMALLFILE') small_block_no,
dbms_rowid.rowid_block_number(rowid) def_block_no
from
t1
/
alter system flush buffer_cache;
alter system dump datafile &m_fileno block &m_blockno;
set heading off
-- select payload from my_trace_file;
select payload
from v$diag_trace_file_contents
where
adr_home = (select value from V$diag_info where name = 'ADR Home')
and trace_filename = (
select substr(
value,
2 + ( select length(value)
from v$diag_info
where name = 'Diag Trace'
)
)
from v$diag_info
where name = 'Default Trace File'
)
/
spool off
The code is simple – I’ve used SQL to generate the SQL to create a table of 4,096 columns, then I’ve inserted a single row into the table, populating the first column, the last column, and a few in between. Then I’ve used a few calls to functions in the dbms_rowid package to convert the rowid of that row into a file and block number. When I installed 23ai and created my “user” tablespace it defaulted to a bigfile tablespace, but I’ve left a few variations of the calls in place to suit different configurations.
After finding the file and block number where the row starts I’ve dumped the entire block to my trace file and then read the tracefile back and spooled it to a flat file. My account is a “normal” account, but I’ve given it privileges to read various dynamic performance views – you might prefer to create a view and public synonym in the sys schema and grant select to public on the view. (See: https://jonathanlewis.wordpress.com/2019/10/03/trace-files-2/)
I’m not going to show you the symbolic dump of the block, instead I’ll show you a few egrep commands and their output:
******
****** Number of rows in the block:
******
[jonathan@localhost working]$ egrep nro wide_table_4096.lst
nrow=17
0xe:pti[0] nrow=17 offs=0
******
****** Number of columns in each row piece, and the row number of each row piece:
******
tab 0, row 0, @0x1e4a
tl: 262 fb: -----L-- lb: 0x1 cc: 255
--
tab 0, row 1, @0x1d42
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 2, @0x1c3a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 3, @0x1b32
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 4, @0x1a2a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 5, @0x1922
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 6, @0x181a
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 7, @0x1712
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 9, @0x1502
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 10, @0x13fa
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 11, @0x12f2
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 12, @0x11ea
tl: 264 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 13, @0x10de
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 14, @0xfd2
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 15, @0xec6
tl: 268 fb: -------- lb: 0x1 cc: 255
--
tab 0, row 16, @0xea5
tl: 33 fb: --H-F--- lb: 0x1 cc: 16
******
****** First few columns of row 16 (17th row in the block)
******
[jonathan@localhost working]$ egrep -A+16 "row 16" wide_table_4096.lst
tab 0, row 16, @0xea5
tl: 33 fb: --H-F--- lb: 0x1 cc: 16
nrid: 0x00002275.f
col 0: [ 4] 30 30 30 31
col 1: [ 4] 30 30 30 32
col 2: *NULL*
col 3: *NULL*
The row has been stored as 17 separate row pieces with a maximum of 255 columns per row piece. (Note: 16 * 255 + 16 = 4096.) The final egrep output shows you that the starting columns of the row are stored as the 17th row. The values of col 0 and col 1 are one clue, but the --H-F--- flags (header, first) are the technical indicator. Each row piece points backwards to the next row piece in the chain, ending with row 0 where the -----L-- flag tells us it’s the last row piece of the chain.
So, no great difference in the treatment of tables with very large numbers of columns – it’s just lots of row pieces of up to 255 columns, where the row is (initially) split up into 255 column chunks working from the end of the row, potentially leaving you with a very small number of columns in the first row piece.
There was one significant difference in the block dump that I haven’t shown yet. The ITL (interested transaction list) ended up with 5 entries as a result of the single insert.
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0009.01c.0000033c 0x00000103.0102.2a --U- 17 fsc 0x0000.0128d402
0x02 0x0000.000.00000000 0x00000000.0000.00 ---- 0 fsc 0x0000.00000000
0x03 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
0x04 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
0x05 0x0000.000.00000000 0x00000000.0000.00 C--- 0 scn 0x0000000000000000
I didn’t know if this was new, or something I had missed when the limit on the number of columns changed from 255 to 1,000, but the number of ITL entries varied according to the number of row pieces created during my single row insert. Repeating the test but changing the “last column populated” I got the following results:
- One row piece (col0255) – itc = 2, Lck = 1
- Two row pieces (col0256) – itc = 3, Lck = 2
- Three row pieces (col0512) – itc = 4, Lck = 3
- Four row pieces (col0768) – itc = 5, Lck = 4
- Five row pieces (col1024) – itc = 5, Lck = 5 — i.e. same itc as 4096 columns/17 row pieces
Of course the “coincidence” that the ITL stopped growing at the fourth row piece made me decide to check back to an older version of Oracle (19.11.0.0) to see what happened where the limit was 1,000 columns – and it turned out that 19c also created extra ITL entries in the same way, up to the same limit. (I can’t help wondering whether this limit was “accidentally” carried forward to 23ai or whether there was a deliberate decision that 4 “spare” ITL slots was the best compromise between ITL waits and unnecessary space use.)
Footnote
There are probably several more questions to ask about features that might (or might not) adapt to the raising of the column limit. For example 11g had a limit of 20 (or ceil(column_count / 10)) for the number of extended stats definitions you could create for a single table when the maximum number of columns allowed was 1,000. Might this have changed for 23ai? (Initial answer: No. Longer answer: to be investigated – could you really get 370 sets of extended stats if you started with 3,700 columns in a table, or would something break at (say) 256 sets?)
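For reference, a quick way to start poking at that limit is to create a column group on the wide table and count the definitions – a sketch only, using columns from the table created above:
select dbms_stats.create_extended_stats(user, 'T1', '(COL0001,COL0002)') from dual;
select count(*) from user_stat_extensions where table_name = 'T1';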
Audit Cleaning #JoelKallmanday
This note is an observation of an odd little detail that’s nice to know about the way that Unified Audit (see part 8 of Neil Chandler’s series on Fundamental Security) handles deleting old data. I don’t have a lot of background experience with Unified Audit, but I was surprised by some comments on a recent thread on the MOSC database forum (needs an account) about problems of chained rows and the difficulty of reclaiming space from the “audit tablespace”, so I spent a couple of hours looking at the tables, views and procedures to see if there were any pre-emptive strategies that would avoid the space management issue. All the comments below are based on a single pdb running under 19.11.
The unified audit data is kept (in recent versions of Oracle) in an interval-partitioned table called aud$unified, owned by user audsys. The partitioning is on a column called event_timestamp – declared as timestamp(6) not null. The default interval is one month but you can change this by calling the procedure dbms_audit_mgmt.alter_partition_interval – and for some people it would be a very good idea to do so.
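If, for example, you clean the audit trail daily then a daily partitioning interval might be more appropriate – a sketch of the call (check the documentation for the allowed frequency values):
begin
dbms_audit_mgmt.alter_partition_interval(
interval_number => 1,
interval_frequency => 'DAY'
);
end;
/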
The forum thread included the complaint that the segment space advisor had found a lot of chained rows, and that a rebuild could remove 49 percent of them. Since the sum of the declared lengths of the table’s columns is around 70KB and there are 3 CLOB columns declared as “enable storage in row”, I don’t think anyone should be surprised to see a lot of chained rows. (And I wouldn’t trust the advisor to do a very good job estimating what might happen with the CLOBs.)
Here’s a simple query to report the partitions, with high_value for the audit table:
select
partition_name, blocks, num_rows, high_value
from
dba_tab_partitions
where
table_name = 'AUD$UNIFIED'
order by
partition_position desc
/
There are all sorts of things you’re not allowed to do with it – even if you’re connected as sys you should only be using the management package to handle it.
delete from audsys.aud$unified where event_timestamp < timestamp'2022-06-14 00:00:00'
*
ERROR at line 1:
ORA-46385: DML and DDL operations are not allowed on table "AUDSYS"."AUD$UNIFIED".
The thinking behind the audit table seems to be that you should be checking it on a regular basis then archiving (or just deleting) the data you’ve checked so that the audit table doesn’t become enormous. To this end Oracle allows you to set a “last archived” timestamp, which you can use in a call to the procedure dbms_audit_mgmt.clean_audit_trail. You can check the current value of this timestamp by querying view dba_audit_mgmt_last_arch_ts.
SQL> select last_archive_ts
2 from dba_audit_mgmt_last_arch_ts
3 /
LAST_ARCHIVE_TS
---------------------------------------------------------------------------
01-MAY-22 12.00.01.000000 AM +00:00
1 row selected.
There is also a function dbms_audit_mgmt.get_last_archive_timestamp() to return the timestamp, but this didn’t seem to work on my 19.11 instance. (The procedure set_last_archive_timestamp() does work.)
There are procedures in the package that let you automate this cleaning, but the point(s) I want to make are things I got to by doing some simple tests calling each step of the process in turn. The oldest few rows in my audit table currently look like this:
PARTITION_NAME BLOCKS NUM_ROWS HIGH_VALUE
---------------------- ---------- ---------- ----------------------------------------
...
SYS_P28453 50 701 TIMESTAMP' 2022-11-01 00:00:00'
SYS_P27945 43 683 TIMESTAMP' 2022-10-01 00:00:00'
SYS_P20893 88 1328 TIMESTAMP' 2022-09-01 00:00:00'
SYS_P20176 110 1766 TIMESTAMP' 2022-08-01 00:00:00'
SYS_P20025 244 3318 TIMESTAMP' 2022-07-01 00:00:00'
SYS_P17365 103 1684 TIMESTAMP' 2022-06-01 00:00:00'
Assume I want to delete everything from the last 4 partitions; I could execute the following anonymous block:
declare
ts timestamp;
begin
dbms_audit_mgmt.set_last_archive_timestamp(
dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
timestamp '2022-09-01 00:00:00'
);
select last_archive_ts
into ts
from dba_audit_mgmt_last_arch_ts
;
dbms_output.put_line('Set: ' || ts);
dbms_audit_mgmt.clean_audit_trail(
audit_trail_type => dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
use_last_arch_timestamp => true -- default
);
end;
/
Set: 01-SEP-22 12.00.00.000000 AM
PL/SQL procedure successfully completed.
When I check dba_tab_partitions again, the end of the list is now as follows:
PARTITION_NAME BLOCKS NUM_ROWS HIGH_VALUE
---------------------- ---------- ---------- ----------------------------------------
...
SYS_P28453 50 701 TIMESTAMP' 2022-11-01 00:00:00'
SYS_P27945 43 683 TIMESTAMP' 2022-10-01 00:00:00'
SYS_P20893 88 1328 TIMESTAMP' 2022-09-01 00:00:00'
Oracle has dropped three of the 4 partitions – although it has kept the partition with the high_value of ‘2022-09-01’, which is now empty; that, after all, is what I was asking for when I gave Oracle that timestamp. Looking at the trace file – I can see (using a suitable egrep command) three calls (driven by calls to dbms_pdb_exec_sql()) to drop an explicitly named partition followed by this delete statement:
delete from audsys.aud$unified where event_timestamp < :1 and (dbid = :2 or dbid = 0)
So Oracle has been very clever, but not quite clever enough, when removing data from the audit table; it drops any partition whose high_value is (strictly) less than the timestamp you supply and then, if necessary, uses a delete on just the one partition that has become the oldest partition.
So, a couple of suggestions for cleaning the audit trail (subject to you checking that I haven’t missed any important details, and any variations due to version of Oracle).
- If you plan to set up a regular call to clean the audit table it might be a very good idea to match the partitioning interval with the schedule so that you never have to remove part of the data from a partition (i.e. you want to avoid running a big delete)
- When you see what goes on with the last_archive_timestamp in my tests, it looks like a good idea to ensure that the value you pass to the cleaning procedure is a “high_value plus one (micro)second” (did you spot the “spare” second in my query result on dba_audit_mgmt_last_arch_ts?) so that you drop every relevant partition rather than deleting every single row from the new “oldest” partition – see the sketch after this list.
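To illustrate that second point, here’s a variation of the earlier anonymous block (a sketch, not carefully tested) that adds one microsecond to the partition high_value so that the ‘2022-09-01’ partition would also be dropped rather than emptied by a delete:
begin
dbms_audit_mgmt.set_last_archive_timestamp(
dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
timestamp '2022-09-01 00:00:00.000001'
);
dbms_audit_mgmt.clean_audit_trail(
audit_trail_type => dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED,
use_last_arch_timestamp => true
);
end;
/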
The delete statement is a little puzzling – how can you drop one partition but think you need to check the dbid before deleting the data from another partition? Maybe the apparent anomaly is a sign that Oracle is reusing a delete statement that would have to replace the drop commands if you had a setup where multiple PDBs were writing to the same (CDB) audit table.
You’ll notice that I’ve passed a (symbolic) parameter value of dbms_audit_mgmt.AUDIT_TRAIL_UNIFIED to the calls to the packaged procedures. If you haven’t moved all the different audit trails (the “legacy” audit, the FGA audit, and any others) into the unified audit there are other parameter values for cleaning the other trails.
Questions appear quite frequently about how to do arithmetic or comparisons with the high_value column from dba_tab_partitions – the problem has disappeared in very recent versions of Oracle, but dbms_audit_mgmt has an interesting utility function: get_part_highval_as_char().
I spoke too soon – the only table that this function will work for is audsys.aud$unified. Unless you’re on 23c (with CLOB and JSON versions of the column) you’ll have to fall back to something like the method I showed in a post I wrote earlier this year.
SQL> select dbms_audit_mgmt.get_part_highval_as_char('SYS_P31737') from dual;
DBMS_AUDIT_MGMT.GET_PART_HIGHVAL_AS_CHAR('SYS_P31737')
--------------------------------------------------------------------------------
2023-04-01 00:00:00
1 row selected.
This is a note about a topic that I’m not really familiar with, written rather quickly, and not yet reviewed for grammar and typing errors.
IN / NOT IN
A question appeared on the Oracle developer forum recently (Oct 2024) that could be answered by a blog note that I started drafting about 18 months ago after answering a very similar question that had appeared in the same forum. Take a look at the following query, with results, about a status column in a fairly large dataset:
select status, count(*)
from my_table
group by status
order by 2 desc;
STATUS COUNT(*)
-------------------------
STATUS_7 10580209
STATUS_6 23902
STATUS_5 1504
STATUS_4 48
STATUS_3 17
STATUS_2 8
STATUS_1 5
As you can see, almost all the data is at ‘STATUS_7’ or ‘STATUS_6’. The user wanted some code to report the rows that were not in these two statuses, which leads to one of two possible constructs:
select message_id
from my_table
where status in ('STATUS_1', 'STATUS_2', 'STATUS_3', 'STATUS_4', 'STATUS_5')
;
select message_id
from my_table
where status not in ('STATUS_6', 'STATUS_7')
;
Thanks to an index on status the first query returned its results in 0.05 seconds, while the second query had to do a full table scan which took 6 seconds to complete. The problem the user faced was that other status values might appear in the future, so the slow query was the safe one because it would never need to be edited. So how do you get that “not in” query to run as quickly as the “in” query?
The first thing to check, of course, is how much freedom you have to modify the database and the code; it’s also important to check that any suggestion you make will produce the same result set. (It’s also a reasonably safe bet (on any forum) that the user has simplified the requirement in a way that means any suggested solution will need some refinement to fit the real problem.)
In this case the answer is (or seems to be) simple – change the question by taking advantage of the right Oracle feature and you can produce a solution that may even eliminate some of the current application overhead. “Add” a column to the table that holds a value only for rows that are in a status of interest, then index that column – and the best way to “add” that column in all supported (and many unsupported) versions of Oracle is probably to create it as a virtual column:
alter table my_table
add rare_status /* invisible */
generated always as
(case when status in ('STATUS_6','STATUS_7') then null else status end)
virtual
/
create index mt_i1 on my_table(rare_status);
execute dbms_stats.gather_table_stats(user,'my_table', method_opt=>'for columns rare_status size 1')
select rare_status, message_id from my_table where rare_status is not null;
You’ll notice, of course, that I’ve included, as a comment, the invisible option when adding the virtual column. This is generally a good idea, though it may be a little confusing for people to see that they can query a table using a column that doesn’t (apparently) exist. The plus point for this approach is that (naughty) code that executes an “insert as select” statement without specifying the list of target columns will still succeed, rather than raising the error “ORA-00947: not enough values” for no apparent reason.
The fact that I’ve used the predicate “is not null” might prompt you to ask if there are any traps in this approach if status is allowed to be null, and that’s something to check if you extend the strategy to handle a more complicated requirement. In this case the “not in (6,7)” and “in (1,2,3,4,5)” return the same results even if there are rows where the status is null.
Here’s the execution plan (ignore the numbers since there’s no data in the table) for the “is not null” query:
Plan hash value: 3640753122
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 38 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE | 2 | 38 | 2 (0)| 00:00:01 |
|* 2 | INDEX FULL SCAN | MT_I1 | 1 | | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("RARE_STATUS" IS NOT NULL)
If we return to the user’s original query – it selects message_id, and the plan shows the need to visit the table to collect that value. Since the index we’ve now created on the “interesting” rows is so small we might as well extend it to include message_id – remembering to use the same type of case expression so that we include values only for the necessary small number of rows in the index.
alter table my_table
add rare_message_id invisible
generated always as
(case when status in ('STATUS_6','STATUS_7') then null else message_id end)
virtual
/
drop index mt_i1;
create index mt_i1 on my_table(rare_status, rare_message_id);
execute dbms_stats.gather_table_stats(user,'my_table', method_opt=>'for columns rare_message_id size 1')
select rare_status, rare_message_id from my_table where rare_status is not null;
Execution Plan
----------------------------------------------------------
Plan hash value: 3308683209
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 18 | 1 (0)| 00:00:01 |
|* 1 | INDEX FULL SCAN | MT_I1 | 2 | 18 | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("RARE_STATUS" IS NOT NULL)
I’ve mentioned it before, and I keep wishing for it to appear in Oracle, but I’d really like to see the “filtered index” mechanism that can be used in SQL Server to handle this type of requirement without editing the code; something like:
create index mt_ss1 on my_table (status)
carrying (message_id)
where status not in ('STATUS_6','STATUS_7')
/
I’ve invented a syntax to say that this index should also include the message_id – again a feature (though I don’t remember the syntax) in SQL Server. Of course Oracle can carry “spare” columns, but only by adding them as if they belong in the index – which means their presence affects the selectivity and clustering_factor of the index and can, therefore, have serious side-effects on execution plans.
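For completeness, the nearest native equivalent in Oracle is simply to add message_id as a trailing column of a real index on status (the index name mt_ss2 here is just an arbitrary choice) – with the caveats just mentioned about its effect on selectivity and clustering_factor:
create index mt_ss2 on my_table(status, message_id);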
NOT IN bug
Some years ago I wrote a note describing how, with a query using identical literal values in the predicate, the optimizer was able to produce exactly the same plan body with exactly the same Predicate Information section while producing different cardinality estimates, depending on how you expressed the requirement for “column not in list of values”.
I won’t go into details of the anomaly since it disappeared in 11.2.0.4, but what I hadn’t paid any attention to was that the arithmetic (for both variations) was wrong, and it was only a recent question on the Oracle database forum that made me revisit the article, allowing me to spot the error and test 23.3 to see if the arithmetic was still wrong. It is.
To demonstrate the problem, here’s a modified version of the code from the previous article. I usually try to write code that will create reproducible data across all versions of Oracle, but I didn’t do that in the previous post, and there was a significant difference between the data generated by 11.2.0.3 and the data generated by 23.3 so, among other details, I’ve changed the code to produce a fixed number of nulls.
rem
rem Script: not_in_anomaly_2.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2024
rem
rem Last tested
rem 23.3.0.0
rem 19.11.0.0
rem
create table t1
as
select
ao.*
from
all_objects ao
where
rownum <= 10000
/
update t1 set
object_type = null
where
rownum <= 1000
-- object_type = 'TABLE'
/
commit;
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname =>'T1',
method_opt => 'for all columns size 1'
);
end;
/
break on report
compute sum of ct on report
select
object_type, count(*) ct
from
t1
group by
object_type
order by
count(*)
/
select
sample_size, num_nulls, num_distinct, histogram
from
user_tab_columns
where
table_name = 'T1'
and column_name = 'OBJECT_TYPE'
;
This code (in my 23.3 database) produces a table of 10,000 rows, of which 1,000 have object_type set to NULL, with 13 distinct (non-null) values across the rest of the rows. There was a huge skew in the distribution of the data, but I’ve avoided that complication by collecting stats without histograms. Here are the results from the two queries against the data dictionary to confirm those details.
OBJECT_TYPE CT
----------------------- ----------
CONTEXT 2
PROCEDURE 2
DIRECTORY 2
OPERATOR 3
SEQUENCE 3
FUNCTION 14
PACKAGE 22
DOMAIN 22
INDEX 48
TABLE 71
TYPE 805
1000 <<--- NULL
SYNONYM 4002
VIEW 4004
----------
sum 10000
14 rows selected.
SAMPLE_SIZE NUM_NULLS NUM_DISTINCT HISTOGRAM
----------- ---------- ------------ ---------------
9000 1000 13 NONE
After generating this data I checked the execution plans – in particular the predicate information and cardinality estimates – of several different queries involving variations of “not in” predicates; but I just want to show one query, with its execution plan from autotrace, that demonstrates the error very clearly:
select
count(*)
from t1
where object_type not in ('INDEX', 'SYNONYM', 'VIEW')
union all
select
count(*)
from t1
where object_type in ('INDEX', 'SYNONYM', 'VIEW')
;
Execution Plan
----------------------------------------------------------
Plan hash value: 575959041
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 12 | 52 (4)| 00:00:01 |
| 1 | UNION-ALL | | 2 | 12 | 52 (4)| 00:00:01 |
| 2 | SORT AGGREGATE | | 1 | 6 | | |
|* 3 | TABLE ACCESS FULL| T1 | 7079 | 42474 | 26 (4)| 00:00:01 |
| 4 | SORT AGGREGATE | | 1 | 6 | | |
|* 5 | TABLE ACCESS FULL| T1 | 2077 | 12462 | 26 (4)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("OBJECT_TYPE"<>'INDEX' AND "OBJECT_TYPE"<>'SYNONYM' AND
"OBJECT_TYPE"<>'VIEW')
5 - filter("OBJECT_TYPE"='INDEX' OR "OBJECT_TYPE"='SYNONYM' OR
"OBJECT_TYPE"='VIEW')
As you can see, the optimizer has rewritten the “not in()” as a conjunct of inequalities and the “in” as a disjunct of equalities. If I had written the query in exactly that form I would still have got the same results – provided I didn’t go all the way back to 11.2.0.3 in my testing.
The critical point is this: check the Rows column for operations 3 and 5 (the two tablescans), and add the two values together. The result is 9,156. Between them the two branches of the query cover all the rows where object_type is not null – and that’s 9,000 rows. Whatever the optimizer is doing it ought (for such a simple query) to match the total row count.
The “in” branch gives the right estimate: for my 9,000 rows with a non-null object_type the number of distinct values is 13, so the selectivity of “column = constant” is 1/13; adding the selectivities for the three listed values gives us a selectivity of 3/13 and a cardinality of 2076.9 (q.e.d.)
The “not in” branch ought, therefore, to give us a selectivity of 10/13 (i.e. 1 – 3/13) for a cardinality of 6,923(.08). Oracle’s estimate of 7,079 is out by 156, which is only a little over 2% in this case, so maybe the error is hardly ever noticeable to most people.
Where does the error come from? There are a couple of different ways of thinking about how the numbers can be produced. Here’s one viewpoint, which shifts us from addition/subtraction to multiplication/division (which is a little more like the way the optimizer runs the arithmetic, I think).
To eliminate the first value we scale the rows down by one thirteenth; to eliminate the second value we scale that result by one twelfth (there are only 12 distinct values left), to eliminate the final value we scale the result by a further one eleventh – so the total is: 9000 * 12/13 * 11/12 * 10/11 = 6923.07 (q.e.d.)
BUT it looks as if Oracle is not allowing for the reduction in the number of distinct values as it eliminates each value in the list – so the optimizer’s arithmetic becomes: 9000 * 12/13 * 12/13 * 12/13 = 7078.7 (q.e.d).
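You can check both versions of the arithmetic with a trivial query (nothing optimizer-related, just the numbers above):
select
round(9000 * 12/13 * 11/12 * 10/11, 2) expected_cardinality,
round(9000 * power(12/13, 3), 2) optimizer_cardinality
from dual
/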
Conclusion
There is a flaw in the way that the optimizer calculates the selectivity of “not in {list of values}” – the error may be fairly small in most cases, but due to the nature of the error (and with the assistance of histograms and/or large numbers of rows with NULLs in the relevant column) there may be cases where a bad enough cardinality estimate appears and makes the optimizer choose a poor execution plan. The nature of the error is such that the optimizer will over-estimate the cardinality.
Index puzzle
What could make a “create index” statement fail with an error about a table having too many columns (ORA-01792)? And why would this suddenly happen when the code that raised the error had been dropping and recreating the index every night for the best part of 3 years with no problems?
This is a question that came up on the Oracle developer forum recently, and was finally answered by the original poster after getting an important clue from another participant in the thread.
The clue (supplied by Solomon Yakobson) was that when you create a “function-based” index Oracle will create a hidden virtual column on the table but if the table has already reached the limit on the number of columns (1,000 until very recently) the attempt to create the column will fail with error ORA-01792, leading to the create index statement failing.
But this drop/create has been working perfectly for years – and a query for the table definition reported only 40 columns – so how did an extra 960 columns sneak in to break things? It’s a bug that can appear when you mix virtual columns with function-based indexes – Oracle bypasses a section of the code path that is supposed to clean up after the “drop index”. Here’s a very simple example to demonstrate.
rem
rem Script: fbi_puzzle_4.sql
rem Author: Jonathan Lewis / Marco Antonio Romito
rem Dated: Sep 2024
rem Purpose:
rem
rem Last tested
rem 19.11.0.0
drop table t1;
create table t1(
n1 number(6,0),
n2 number(6,0) as (n1 + 1) virtual,
n3 number(6,0)
)
/
prompt =============================
prompt An index that causes problems
prompt =============================
create index t1_i1 on t1(n1, nvl(n3,0), n2);
column column_name format a32
column data_default format a32
select
column_id, column_name,
hidden_column, virtual_column, user_generated, internal_column_id,
data_default
from
user_tab_cols
where
table_name = 'T1'
order by
internal_column_id
/
I’ve created a table of three columns, where the second column is a virtual column. Then I’ve created an index on the table which includes references to all three columns, but one of those references includes a call to the nvl() function – which means Oracle will create a fourth (hidden, virtual) column for that column expression. Here are the results of the subsequent query against user_tab_cols:
COLUMN_ID COLUMN_NAME HID VIR USE INTERNAL_COLUMN_ID DATA_DEFAULT
---------- -------------------------------- --- --- --- ------------------ --------------------------------
1 N1 NO NO YES 1
2 N2 NO YES YES 2 "N1"+1
3 N3 NO NO YES 3
SYS_NC00004$ YES YES NO 4 NVL("N3",0)
4 rows selected.
As you can see n2 is flagged as virtual (but not hidden) generated as “N1” + 1; and there’s a fourth, system-generated, hidden, virtual column generated as NVL(“N3”,0).
As ever, things are more likely to break when you mix features. Here are a few more lines of code to run after creating the index and reporting the columns:
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
drop index t1_i1;
create index t1_i1 on t1(n1, nvl(n3,0), n2);
select
column_id, column_name,
hidden_column, virtual_column, user_generated, internal_column_id,
data_default
from
user_tab_cols
where
table_name = 'T1'
order by
internal_column_id
/
You might expect Oracle to drop the system-generated column as it drops the index – but here are the results we now get from querying user_tab_cols:
COLUMN_ID COLUMN_NAME HID VIR USE INTERNAL_COLUMN_ID DATA_DEFAULT
---------- -------------------------------- --- --- --- ------------------ --------------------------------
1 N1 NO NO YES 1
2 N2 NO YES YES 2 "N1"+1
3 N3 NO NO YES 3
SYS_NC00004$ YES YES NO 4 NVL("N3",0)
SYS_NC00005$ YES YES NO 5 NVL("N3",0)
SYS_NC00006$ YES YES NO 6 NVL("N3",0)
SYS_NC00007$ YES YES NO 7 NVL("N3",0)
SYS_NC00008$ YES YES NO 8 NVL("N3",0)
SYS_NC00009$ YES YES NO 9 NVL("N3",0)
9 rows selected.
Oracle “forgets” to drop the system-generated virtual column, and generates a new virtual column every time the index is recreated. Since the columns are hidden columns you won’t notice that something has gone wrong if you query the (more commonly used) view user_tab_columns – so there may well be a number of sites which have tables with huge numbers of hidden virtual columns, all quietly working their way towards unexpected ORA-01792 errors.
This “failure to drop” isn’t consistent behaviour – if the index had been defined as (n1, n2, nvl(n3,0)) then the system-generated column would have been dropped every time the index was dropped. I haven’t tried to work out exactly what conditions have to be met for the error to appear but I think it may be something to do with a user-defined virtual column appearing in an index after an expression that has to be supported by a system-generated virtual column.
Bear in mind that if I create two indexes (n1, nvl(n3,0)) and (n2, nvl(n3,0)) then Oracle will generate just one virtual column to support the expression nvl(n3,0), so when I drop one of these indexes Oracle will have to say: “does dropping this index mean I should drop a virtual column?” and I’m guessing that somewhere in that area of code Oracle is bypassing some of the checks it should be doing.
Summary
If you have any indexes that contain both virtual columns and column expressions, and if you drop and recreate those indexes from time to time, check user_tab_cols to make sure that you don’t have multiple identical system-generated, hidden, virtual columns that support the expression(s) in those indexes.
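As a starting point for that check, something like the following query may help – it’s only a sketch, since data_default is a LONG column you still have to examine the expressions for each table it reports, and a table may legitimately have several different hidden virtual columns (e.g. for extended stats):
select  table_name, count(*) hidden_virtual_cols
from    user_tab_cols
where   hidden_column  = 'YES'
and     virtual_column = 'YES'
group by
        table_name
having  count(*) > 1
order by
        count(*) desc
/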
Workaround
In my case I found the following steps were sufficient to clear up the excess occurrences:
- drop the problem index (n1, nvl(n3,0), n2)
- create an index on just the problem expression (nvl(n3,0))
- drop that index (which got rid of all the copies of the generated virtual columns)
- recreate the problem index.
In more complicated cases (e.g. several indexes following this pattern, multiple indexes that include the same expression etc.) you may have to develop a more sophisticated approach.
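Translating the list above into SQL against the demo table, the steps would look something like this (t1_temp is just a name I’ve invented for the intermediate index):
drop index t1_i1;                               -- drop the problem index
create index t1_temp on t1(nvl(n3,0));          -- index on just the problem expression
drop index t1_temp;                             -- this drop cleared out all the copies of the generated virtual column
create index t1_i1 on t1(n1, nvl(n3,0), n2);    -- recreate the problem index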
num_index_keys – 2
A recent MOS Community forum posting (needs an account) raised the problem of a query with a specific index hint switching between two plans, one using “inlist iteration of a unique index scan”, the other using a simple “index range scan with filter predicate”. Here’s an example of the pattern of the query, with the two execution plans:
select /*+ index(INDEXES_A_LOCAL_IDX) */
*
from INDEXES_A
WHERE ID in (
:1 ,:2 ,:3 ,:4 ,:5 ,:6 ,:7 ,:8 ,:9 ,:10 ,:11 ,:12 ,:13 ,:14 ,:15 ,:16 ,:17 ,:18 ,:19 ,:20 ,
:21 ,:22 ,:23 ,:24 ,:25 ,:26 ,:27 ,:28 ,:29 ,:30 ,:31 ,:32 ,:33 ,:34 ,:35 ,:36 ,:37 ,:38 ,:39 ,:40 ,
:41 ,:42 ,:43 ,:44 ,:45 ,:46 ,:47 ,:48 ,:49 ,:50 ,:51 ,:52 ,:53 ,:54 ,:55 ,:56 ,:57 ,:58 ,:59 ,:60 ,
:61 ,:62 ,:63 ,:64 ,:65 ,:66 ,:67 ,:68 ,:69 ,:70 ,:71 ,:72 ,:73 ,:74 ,:75 ,:76 ,:77 ,:78 ,:79 ,:80 ,
:81 ,:82 ,:83 ,:84 ,:85 ,:86 ,:87 ,:88 ,:89 ,:90 ,:91 ,:92 ,:93 ,:94 ,:95 ,:96 ,:97 ,:98 ,:99, :100
)
AND CLIENT_ID = :101
AND PRODUCT_ID = :102
;
Plan hash value: 743077671
---------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 110 (100)| | | |
| 1 | PARTITION LIST SINGLE | | 8 | 608 | 110 (0)| 00:00:01 | KEY | KEY |
| 2 | INLIST ITERATOR | | | | | | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID| INDEXES_A | 8 | 608 | 110 (0)| 00:00:01 | KEY | KEY |
|* 4 | INDEX UNIQUE SCAN | INDEXES_A_LOCAL_IDX | 8 | | 102 (0)| 00:00:01 | KEY | KEY |
---------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1011922169
----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 88 (100)| | | |
| 1 | PARTITION LIST SINGLE | | 1 | 79 | 88 (3)| 00:00:01 | KEY | KEY |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| INDEXES_A | 1 | 79 | 88 (3)| 00:00:01 | KEY | KEY |
|* 3 | INDEX RANGE SCAN | INDEXES_A_LOCAL_IDX | 1 | | 87 (3)| 00:00:01 | KEY | KEY |
----------------------------------------------------------------------------------------------------------------------------------
The index in the hint is the primary key index, defined as (client_id, product_id, id) and, as you can see, the query specifies exactly those three columns with equality predicates – albeit with an “in-list” for the final column of the index. You’ll also notice that the table is list-partitioned (the partition key is the client_id, one value per partition) and the index is a local index.
It may not be immediately obvious how the two plans can achieve the same result, until you check the Predicate Information sections of the two plans which, in the same order and with several lines of “OR” predicates removed, look like this:
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("CLIENT_ID"=:101 AND "PRODUCT_ID"=:102 AND (("ID"=:1 OR "ID"=:2 OR "ID"=:3 OR "ID"=:4 OR
"ID"=:5 OR "ID"=:6 OR "ID"=:7 OR "ID"=:8 OR "ID"=:9 OR "ID"=:10 OR "ID"=:11 OR "ID"=:12 OR "ID"=:13 OR "ID"=:14 OR
...
"ID"=:95 OR "ID"=:96 OR "ID"=:97 OR "ID"=:98 OR "ID"=:99 OR "ID"=:100)))
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("CLIENT_ID"=:101 AND "PRODUCT_ID"=:102)
filter(("ID"=:1 OR "ID"=:2 OR "ID"=:3 OR "ID"=:4 OR "ID"=:5 OR "ID"=:6 OR "ID"=:7 OR "ID"=:8 OR "ID"=:9 OR "ID"=:10 OR
...
"ID"=:95 OR "ID"=:96 OR "ID"=:97 OR "ID"=:98 OR "ID"=:99 OR "ID"=:100))
The plan with the inlist iterator operation uses the list as an access predicate – it does 100 precision probes of the index to find rowids and visits the table for each rowid in turn.
The plan using the index range scan uses the list as a filter predicate – it does a single range scan through the section of the index covering the given client_id and product_id and, for every index entry in that range, checks the id against the list of values, visiting the table only for the cases where it finds a match. (There are probably some cunning little tricks to minimise the work done checking the id – sorting the list into order once at the start of execution would be a reasonable one; starting the range scan at the lowest listed id and stopping at the highest would be another.)
Why can the optimizer produce two different plans? It’s just the result of the costing algorithm. In this case the optimizer compares the cost of doing a unique access 100 times (which is roughly 100 times the cost of doing it once) with the cost of a single large range scan (plus the CPU cost of lots of comparisons).
If you want to jump to it, here’s the summary and workaround.
I’ve blogged about this behaviour in a previous post, though the context in that case was an upgrade from 10g to 12c and the example and consequences were slightly different, so I decided to create a model of this case to demonstrate the point – and then discovered an oddity in the optimizer that probably made the switch appear to be rather random. I’m going to start by creating some data in a list-partitioned table with a local primary key index.
rem
rem Script: num_index_keys_3.sql
rem Author: Jonathan Lewis
rem Dated: Mar 2024
rem Purpose:
rem
rem Last tested
rem 19.11.0.0
rem 23.3.0.0
rem
execute dbms_random.seed(0)
create table t1(
client_id not null,
product_id not null,
id not null,
v1,
padding
)
partition by list (client_id) (
partition p0 values ('Client0'),
partition p1 values ('Client1'),
partition p2 values ('Client2'),
partition p3 values ('Client3'),
partition p4 values ('Client4'),
partition p5 values ('Client5'),
partition p6 values ('Client6'),
partition p7 values ('Client7'),
partition p8 values ('Client8'),
partition p9 values ('Client9')
)
as
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
cast('Client' || least(trunc(abs(3 * dbms_random.normal)),9) as varchar2(254)) client_id,
cast(trunc(abs(400 * dbms_random.normal)) as number(6,0)) product_id,
cast('Id' || lpad(rownum,7,'0') as varchar2(254)) id,
lpad(rownum,10,'0') v1,
lpad('x',250,'x') padding
from
generator v1,
generator v2
where
rownum <= 1e6 -- > comment to avoid WordPress format issue
order by
dbms_random.value
/
execute dbms_stats.gather_table_stats(user, 't1', method_opt=>'for all columns size 1')
alter table t1 add constraint t1_pk primary key(client_id, product_id, id) using index local
-- in a production system this index (for this data pattern) ought to be defined "compress 2"
/
alter index t1_pk rebuild partition p3 pctfree 92
/
I’ve created a list-partitioned table with one client_id per partition. I’ve used the dbms_random.normal() function to generate an uneven distribution of client_id values and product_id values, and rownum to generate the final id column; then I’ve randomised the order of data before inserting it so that the index (partitions) will have large values for the clustering_factor. I’ve also rebuilt partition p3 of the index with a pctfree of 92 because I wanted to have a physically large index partition without generating a huge amount of table data. (The OP has partition sizes varying from a few hundred rows to a couple of hundred million rows – which is partly why the switch in plan can seem fairly random, and why the performance impact can be catastrophic.)
With my hacking factors I’ve got 10 values for client_id, 1,603 values for product_id, and the distribution of data across partitions is as follows:
select
client_id, count(distinct product_id), count(*), min(id) min_id, max(id) max_id
from t1
group by client_id
order by client_id
/
CLIENT_ID COUNT(DISTINCTPRODUCT_ID) COUNT(*) MIN_ID MAX_ID
------------------------ ------------------------- ---------- ------------ ------------
Client0 1474 261218 Id0000002 Id0999997
Client1 1455 233591 Id0000001 Id1000000
Client2 1448 187657 Id0000006 Id0999998
Client3 1390 134710 Id0000003 Id0999996
Client4 1339 87095 Id0000034 Id0999981
Client5 1270 50005 Id0000056 Id0999902
Client6 1198 25894 Id0000124 Id0999975
Client7 1068 12083 Id0000033 Id0999804
Client8 914 4970 Id0000185 Id0999811
Client9 827 2777 Id0000764 Id0999515
After running a couple more queries to examine the data I picked partition p3 for testing, and 284 as a suitable product_id (249 rows in that partition), and created an anonymous PL/SQL block to execute a query to select 100 of those rows (using a couple of queries to generate most of the text).
Here’s a version of the code – cut back to select only 7 rows with an option for an 8th row – followed by a call to generate the SQL Monitor report (available thanks to the hint in the SQL) for the most recent SQL statement execution. The code counts the rows returned just to confirm that every id in the list really does correspond to a row in the partition:
set serveroutput on
<<anon>>
declare
ct number := 0;
m_pad t1.padding%type;
client_id varchar2(32) := 'Client3';
product_id number := 284;
b1 varchar2(32) := 'Id0000628';
b2 varchar2(32) := 'Id0002350';
b3 varchar2(32) := 'Id0002472';
b4 varchar2(32) := 'Id0007921';
b5 varchar2(32) := 'Id0008073';
b6 varchar2(32) := 'Id0008398';
b7 varchar2(32) := 'Id0012196';
b8 varchar2(32) := 'Id0013212';
begin
for c1 in (
select /*+ monitor index(t1 t1_pk) */
*
from t1 t1
where t1.client_id = anon.client_id
and t1.product_id = anon.product_id
and t1.id in (
b1, b2, b3, b4, b5, b6, b7
-- b1, b2, b3, b4, b5, b6, b7, b8
)
) loop
m_pad := c1.padding;
ct := ct + 1;
end loop;
dbms_output.put_line(ct);
end;
/
set heading off
set pagesize 0 linesize 255
set tab off trimspool on
set long 250000
set longchunksize 250000
column text_Line format a254
select
dbms_sql_monitor.report_sql_monitor(
type =>'TEXT',
report_level => 'all'
) text_line
from dual
/
set pagesize 40 linesize 132
The code above produced the following SQL Monitor plan:
============================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
============================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 7 | | |
| 1 | PARTITION LIST SINGLE | | 7 | 16 | 1 | +0 | 1 | 7 | | |
| 2 | INLIST ITERATOR | | | | 1 | +0 | 1 | 7 | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID | T1 | 7 | 16 | 1 | +0 | 7 | 7 | | |
| 4 | INDEX UNIQUE SCAN | T1_PK | 7 | 9 | 1 | +0 | 7 | 7 | | |
============================================================================================================================================
As you can see, we’ve used an inlist iterator and done 7 unique probes of the index. The cost of the indexing operation alone is 9: i.e. 2 for the root and branch blocks (which get pinned), and one for each of the 7 leaf blocks that Oracle expects to access.
Change the code so that we include the eighth variable in the list and the SQL Monitor plan shows:
===================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
===================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 8 | | |
| 1 | PARTITION LIST SINGLE | | 8 | 17 | 1 | +0 | 1 | 8 | | |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED | T1 | 8 | 17 | 1 | +0 | 1 | 8 | | |
| 3 | INDEX RANGE SCAN | T1_PK | 8 | 9 | 1 | +0 | 1 | 8 | | |
===================================================================================================================================================
The optimizer has decided to do an index range scan and examine every index entry for ‘Client3’, product 284 to find the 8 requested IDs – and it’s just about arithmetic. Tell the optimizer to use all three columns as access columns by adding the hint /*+ num_index_keys(t1 t1_pk 3) */ to the SQL and the plan changes to the inlist iterator version – with a higher cost:
============================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
============================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 8 | | |
| 1 | PARTITION LIST SINGLE | | 8 | 18 | 1 | +0 | 1 | 8 | | |
| 2 | INLIST ITERATOR | | | | 1 | +0 | 1 | 8 | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID | T1 | 8 | 18 | 1 | +0 | 8 | 8 | | |
| 4 | INDEX UNIQUE SCAN | T1_PK | 8 | 10 | 1 | +0 | 8 | 8 | | |
============================================================================================================================================
Using the inlist iterator the total cost is 18: that’s 2 for the root and branch blocks, 8 for leaf blocks, 8 for table blocks. Using the range scan the total cost is 17: that’s 9 for the index scan plus the extra CPU, plus 8 for visiting 8 table blocks. (If you’re wondering why the cost of the index range scan is 9: there are 9,324 leaf blocks in the index partition and 1,390 product ids in the partition (and one client_id): 9,324/1,390 = 6.7; round up, add one for the root, one for the branch –> total 9. The optimizer is using averages to estimate the size of the range scan needed.)
The switch happens – it’s just arithmetic, and depends on how many leaf blocks Oracle thinks will be needed by the range scan compared to the number of leaf blocks that will be accessed by the iterated unique probe. The larger the list the more likely it is that Oracle will switch to using a range scan and filter.
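If you want to see that arithmetic in one place, here’s a little PL/SQL sketch that reproduces the two costs for the 8-value example – the formulae are my simplification of the behaviour described above, not a statement of Oracle’s internal code:
set serveroutput on
declare
        -- figures taken from the example above
        list_size    number := 8;      -- values in the in-list
        leaf_blocks  number := 9324;   -- leaf blocks in the index partition
        product_ndv  number := 1390;   -- distinct product_ids in the partition
        root_branch  number := 2;      -- root + branch block visits
        range_cost   number := root_branch + ceil(leaf_blocks / product_ndv);   -- = 9
begin
        dbms_output.put_line('inlist iterator   : ' || (root_branch + list_size + list_size));  -- 2 + 8 leaf + 8 table = 18
        dbms_output.put_line('range scan/filter : ' || (range_cost  + list_size));              -- 9 + 8 table         = 17
end;
/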
The anomaly/bug
When I first wrote the PL/SQL block I thought I might need about 100 values to demonstrate the problem. In fact I would have needed a much larger data set to need that many values; nevertheless my first attempt introduced a puzzle that turned out to be an unpleasant glitch in the optimizer arithmetic.
If I set the PL/SQL block up to query using 96 variables I get the following plan:
===================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
===================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 96 | | |
| 1 | PARTITION LIST SINGLE | | 96 | 105 | 1 | +0 | 1 | 96 | | |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED | T1 | 96 | 105 | 1 | +0 | 1 | 96 | | |
| 3 | INDEX RANGE SCAN | T1_PK | 96 | 9 | 1 | +0 | 1 | 96 | | |
===================================================================================================================================================
It shouldn’t be a surprise to see the range scan/filter plan, with a total cost of 105: 9 for the index range scan + 96 for the table accesses.
Increase the number of variables to 97 and watch the cost (the plan doesn’t change):
===================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
===================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 97 | | |
| 1 | PARTITION LIST SINGLE | | 97 | 10 | 1 | +0 | 1 | 97 | | |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED | T1 | 97 | 10 | 1 | +0 | 1 | 97 | | |
| 3 | INDEX RANGE SCAN | T1_PK | 1 | 9 | 1 | +0 | 1 | 97 | | |
===================================================================================================================================================
Suddenly the Rows (estim) for the index is 1 (even though the row estimate for the table is 97) with the result that the total cost is now 10 – i.e. 9 for the index range scan plus 1 for the use of the single “predicted” rowid to visit the table.
Given very large data sets, and products with very large numbers of rows (which means big range scans and the appearance of the inlist iterator strategy for relatively long lists) you can imagine how baffling it must be when the optimizer suddenly says – “Hey, one extra value in the list, that means the range scan is really cheap” – and that’s probably why the OP was able to say: “The unique scan operation runs sub second, the range scan takes approx 10 secs to complete.”
You may recall that I pointed out in the code that my index really ought to be defined with “compress 2” – that’s true, but if you’ve got a system like this one already and haven’t used basic index compression, don’t rebuild the index to compress it just yet. If you halve the size of the index then (roughly) you will probably halve the size of the list where the optimizer switches from inlist iteration to the range scan/filter strategy.
Why, you might ask, does this error appear at 97 rows? The answer is in the statistics about the partitions / clients that I listed above. This partition/client has 134,710 rows and 1,390 distinct products; 134,710 / 1,390 = 96.91. In my case (which had no histograms) when the table row estimate exceeded the average number of rows per product the optimizer hit a glitch in the code path and used the wrong numbers in the arithmetic.
Let’s do one more experiment – take the code with 96 variables (which does most of the arithmetic “correctly”) and use the num_index_keys() hint to force Oracle into the iteration strategy. Here’s the plan from SQL Monitor:
============================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
============================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +0 | 1 | 96 | | |
| 1 | PARTITION LIST SINGLE | | 96 | 194 | 1 | +0 | 1 | 96 | | |
| 2 | INLIST ITERATOR | | | | 1 | +0 | 1 | 96 | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID | T1 | 96 | 194 | 1 | +0 | 96 | 96 | | |
| 4 | INDEX UNIQUE SCAN | T1_PK | 96 | 98 | 1 | +0 | 96 | 96 | | |
============================================================================================================================================
The total cost is 194, compared to the cost of 105 for the range scan/filter plan. Despite the fact that Oracle “thinks” that the relevant section of index is only 9 leaf blocks (and I know it’s 17), it has assumed that every single one of the 96 probes it expects to make will have to do a physical read of a leaf block. The optimizer has not allowed for the “self-caching” that is going to happen as the lists get longer and longer – and that’s a fundamental reason behind this change from inlist iteration to range scanning with filters.
Summary
There may be all sorts of variations behind the switch between “iterate/unique” and “range/filter”, and it’s likely that the presence of histograms would have some effect on the break point, and there may be some arithmetic relating the low/high values in the table/partition compared with the least/greatest values used in the list.
The bottom line, though, is that there’s some instability in the choice of execution plan. If you start seeing it and it’s causing performance problems your best bet is probably to use the num_index_keys() hint to ensure that you get the same plan every time.
Footnote
When I re-ran the test on an instance of 23.3 Free it behaved the same way, with one (slight?) difference. My 19.11 instance had been estimating the number of rows correctly, so the sudden change appeared when I used 97 variables in the list; in 23.3 I had to use 100 variables in the list before the error appeared – but that was exactly the point where the table row estimate hit 97.
Indexing
This is a small case study from one of the MOS Community Fora (needs a MOS account) that I drafted about a year ago. It started out as a question about a two-node RAC system running Oracle 19.17 where an SQL Monitor report showed time spent in a wait for resmgr:cpu quantum. Here’s the report (with a little cosmetic editing) that was included in the posting:
SELECT COUNT (DISTINCT (ITEM_KEY))
FROM WF_NOTIFICATIONS
WHERE
ITEM_KEY = :B2
AND TO_USER = :B1
AND MESSAGE_TYPE = 'APINVAPR'
;
===================================================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Read | Read | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes | (%) | (# samples) |
===================================================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +162 | 1 | 1 | | | | |
| 1 | SORT AGGREGATE | | 1 | | 1 | +162 | 1 | 1 | | | | |
| 2 | VIEW | VW_DAG_0 | 1 | 102K | 1 | +162 | 1 | 1 | | | | |
| 3 | SORT GROUP BY NOSORT | | 1 | 102K | 1 | +162 | 1 | 1 | | | | |
| 4 | TABLE ACCESS BY INDEX ROWID | WF_NOTIFICATIONS | 1 | 102K | 162 | +1 | 1 | 1 | 408K | 3GB | 96.91 | Cpu (15) |
| | | | | | | | | | | | | resmgr:cpu quantum (1) |
| | | | | | | | | | | | | db file sequential read (141) |
| 5 | INDEX RANGE SCAN | WF_NOTIFICATIONS_N5 | 740K | 4836 | 160 | +3 | 1 | 3M | 17679 | 138MB | 3.09 | Cpu (3) |
| | | | | | | | | | | | | db file sequential read (2) |
===================================================================================================================================================================================
As far as the original question is concerned there are two key points to note – first that the wait in question appears once in the ASH / Activity data out of a total of 162 samples, so it’s not “important”. (That’s okay, the OP didn’t say it was a problem, only that they wanted to understand what it meant.)
Secondly, from a performance perspective, the index range scan produced 3 million rowids but the table access discarded almost all the rows, returning just one row for the distinct aggregation. Clearly the query could be made a lot more efficient (and, yes, you could consider the option for rewriting it with an “exists” subquery).
One of the other forum users addressed the reasons behind the resource manager wait so I followed up on the basic performance characteristics, commenting on the apparent absence of (or failure to use) a suitable index. The OP responded with details from the data dictionary covering index definitions, data volumes, and (rather usefully) counts of distinct values. I did a little cosmetic editing to the information supplied to produce the following details:
Table Name NUM_ROWS BLOCKS
-------------------- ---------- ----------
WF_NOTIFICATIONS 8151040 406969
Table Distinct Num. Sample
Table Name Attribute Values Density Nulls Size
-------------------- ------------- ---------- ------- ---------- ----------
WF_NOTIFICATIONS MESSAGE_TYPE 11 .0909 0 815104
WF_NOTIFICATIONS TO_USER 1026 .0010 60 815098
WF_NOTIFICATIONS ITEM_KEY 3538701 .0000 1032240 711880
CREATE UNIQUE INDEX "APPLSYS"."WF_NOTIFICATIONS_PK" ON "APPLSYS"."WF_NOTIFICATIONS" ("NOTIFICATION_ID")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N1" ON "APPLSYS"."WF_NOTIFICATIONS" ("RECIPIENT_ROLE", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N2" ON "APPLSYS"."WF_NOTIFICATIONS" ("GROUP_ID", "NOTIFICATION_ID")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N3" ON "APPLSYS"."WF_NOTIFICATIONS" ("MAIL_STATUS", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N4" ON "APPLSYS"."WF_NOTIFICATIONS" ("ORIGINAL_RECIPIENT")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N5" ON "APPLSYS"."WF_NOTIFICATIONS" ("MESSAGE_TYPE", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N6" ON "APPLSYS"."WF_NOTIFICATIONS" ("MORE_INFO_ROLE", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N7" ON "APPLSYS"."WF_NOTIFICATIONS" ("FROM_ROLE", "STATUS")
CREATE INDEX "APPLSYS"."WF_NOTIFICATIONS_N8" ON "APPLSYS"."WF_NOTIFICATIONS" ("RESPONDER")
INDEX_NAME BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ----------- -----------------
WF_NOTIFICATIONS_PK 2 27130 1914910
WF_NOTIFICATIONS_N1 2 49400 2485980
WF_NOTIFICATIONS_N2 2 37620 1917700
WF_NOTIFICATIONS_N3 2 47060 816790
WF_NOTIFICATIONS_N4 2 36760 2490300
WF_NOTIFICATIONS_N5 2 49200 962650
WF_NOTIFICATIONS_N6 2 50830 727210
WF_NOTIFICATIONS_N7 3 57020 1034680
WF_NOTIFICATIONS_N8 2 9271 455861
There are many details we could pick up from these stats but to start with I’ll just point out that only one of the three columns in the query’s where clause appears in an index, and that column (message_type) has only 11 distinct values. The index is wf_notifications_n5(message_type, status) and for some reason the optimizer has decided to use that index for this query.
The choice is surprising since the reported cost for the query is 102,000 when the table is (apparently) only 406,969 blocks, which means we should expect the tablescan cost (using the default value of 8 for the “_db_file_optimizer_read_count”) to be in the region of 50,000; maybe someone has set the db_file_multiblock_read_count to 4 (or maybe the value for processes is so high that Oracle’s internal algorithm has made a similar adjustment behind the scenes).
The cost for the index range scan alone is 4,836 – which is approximately the number of leaf blocks divided by the number of distinct values for the message_type: 49,200/11 = 4,472
While we’re looking at the statistics you might note that the stats gathering seems to be using estimate_percent => 10 rather than auto_sample_size.
You’ll notice that the Rows (Estim) for the index range scan is 740,000 while the Rows (Actual) is 3 million. This tells you that there’s no histogram on the column (740,000 is very close to 8.15M rows / 11 distinct values) and that a histogram would be a good idea – a correct estimate might have been enough to tip the optimizer into a tablescan. In fact we can also note in passing that the to_user column might benefit from a histogram given that 12c and later can create frequency histograms up to 2,048 buckets.
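If you wanted to test the histogram idea, the stats call might look something like the following – purely a sketch for illustration (the owner comes from the index DDL above, but the bucket sizes and the restriction of method_opt to just these two columns are my assumptions, not a recommendation for a production system):
begin
        dbms_stats.gather_table_stats(
                ownname          => 'APPLSYS',
                tabname          => 'WF_NOTIFICATIONS',
                estimate_percent => dbms_stats.auto_sample_size,
                method_opt       => 'for columns message_type size 254, to_user size 2048'
        );
end;
/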
A tablescan would, of course, still be expensive and do far more work than needed. Ideally we need an index to be able to find the rows that match all three equality predicates. However, we can see that there are roughly 2 rows per value for item_key, so an index on just (item_key) might be good enough. Averages, of course, can be very misleading: we might have one row for each of 3.5M values of item_key and one value of item_key with 3.5M rows in our table, so we do need to know more about the data distribution before we can make any solid suggestions.
You always have to be careful when adding indexes to tables to address specific queries – they may have adverse side-effects on other queries (and on the DML workload); and you should always look for options to maximise the benefit of any index you add. The “obvious” choice in this case is (message_type, to_user, item_key) compress 2, which would allow the query to execute without visiting the table at all, and probably do very little work. A better alternative might be (message_type, item_key) compress 1, because it could be good enough for the supplied query and very helpful if there is a similar query of the form:
SELECT COUNT (DISTINCT (ITEM_KEY))
FROM WF_NOTIFICATIONS
WHERE
ITEM_KEY = :B2
AND FROM_USER = :B1
AND MESSAGE_TYPE = 'APINVAPR'
(Note how I’ve changed to_user to from_user at line 5)
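For illustration only, the DDL for the two candidates discussed above might look like this – the index names are invented, and you would create only one of them after testing:
create index applsys.wf_notifications_n9
        on applsys.wf_notifications(message_type, to_user, item_key)
        compress 2
;
-- or the smaller alternative
create index applsys.wf_notifications_n10
        on applsys.wf_notifications(message_type, item_key)
        compress 1
;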
A Rewrite
Even with a “perfect” index the query could still do quite a lot of unnecessary work. If we have a few values for item_key with a large number of rows Oracle will walk through all the relevant index entries before doing the “distinct aggregate” (even though there’s at most just one possible value of item_key being returned and the final result has to be zero or one). So, with the best index in place it might still be nice to rewrite the query in the following form:
select count(*)
from dual
where exists (
select null
from wf_notifications
where
item_key = :B2
and to_user = :B1
and message_type = 'APINVAPR'
)
/
You’ll notice, by the way, that the original SQL is all in capitals with bind variables of the form :Bnnn (nnn numeric). This suggests it’s code that is embedded in PL/SQL – so maybe it’s from a “3rd party” package (the schema name APPLSYS that appears in the “create index” statements looks familiar) that can’t be rewritten.
ORA-00942
Oracle error 942 translates, for most people, into: “table or view does not exist”. This note is based on a conversation that I had with Mikhail Velikikh on the Oracle-l list server over the last couple of days while looking at the question: “How do I find out which table (or view) does not exist?”.
There are several possibilities – none of them ideal – but one suggestion that came up was to take advantage of dumping the errorstack when error 942 was signalled:
alter session set events '942 trace name errorstack forever, level [N]';
Mikhail suggested using level 2 and then picking up various details from the resulting trace file to allow further information to be extracted from the SGA.
This seemed a little bit like hard work and didn’t match a distant memory I had of solving the problem, so I ran up a little test (based on an extremely simplified model of the task the OP had been working on) using level 1:
create or replace package pack1 as
procedure failure;
end;
/
alter session set tracefile_identifier='level1';
alter session set events '942 trace name errorstack forever, level 1';
create or replace package body pack1 as
procedure failure is
v1 varchar2(64);
begin
select name into v1 from not_a_table where rownum = 1;
dbms_output.put_line(v1);
end;
end;
/
alter session set events '942 trace name errorstack off';
show errors
This produced the output:
Package created.
Session altered.
Session altered.
Warning: Package Body created with compilation errors.
Session altered.
Errors for PACKAGE BODY PACK1:
LINE/COL ERROR
-------- -----------------------------------------------------------------
6/2 PL/SQL: SQL Statement ignored
6/27 PL/SQL: ORA-00942: table or view does not exist
A quick search through the resulting trace file showed me the following pair of consecutive lines, about half way through a 10,000 line file:
5109 :about to signal 942
5110 -Name: NOT_A_TABLE
It’s a shame that the schema name was not included, but it might be sufficient. In fact this pair of lines would have appeared (without the extra 10,000 lines) if I had used “942 trace name context forever”.
Unfortunately Mikhail’s trace file didn’t echo these two lines! This was a surprise, since I was using 19.11 for my test and Mikhail was using 19.23 – details don’t (often) disappear in the upgrade. It didn’t take Mikhail long to find the answer to the puzzle, which I can demonstrate by changing one line of my code. Instead of selecting from a completely non-existent table I’ll change the select statement in the code above to select from a table that does exist, but that I don’t have privileges for:
select name into v1 from sys.obj$ where rownum = 1;
Oracle reports the same error message – but doesn’t write the critical two lines into the trace file.
Moral
I’ve said it many times in the past – it’s hard to test properly; sometimes it doesn’t even occur to you that a test you’ve done is a special case, which means it’s easy to jump from a specific case to a general conclusion that isn’t going to work for everyone. If you want to publish some interesting observation it’s a good idea to publish the model you used to derive that conclusion.
Footnote
The more recent versions of Oracle have seen enhancements to several error messages to improve the information they report. In particular 23ai will now report the schema and object name of the table or view that “does not exist”. (I haven’t checked yet if it distinguishes between “really doesn’t exist” and “exists, but you’re not allowed to access it”.)
Rownum quiz
Here’s a silly little puzzle that baffled me for a few moments until I spotted my typing error. It starts with a small table I’d created to hold a few rows, and then deletes most of them. Here’s a statement to create and populate the table:
create table t1 (id number , c1 clob)
lob(c1) store as basicfile text_lob (
retention disable storage in row
);
insert into t1
select rownum, rpad(rownum,200,'0')
from all_objects
where rownum <= 1000
;
commit;
Here’s what I meant to type to delete most of the data – followed by the response from SQL*Plus:
SQL> delete from t1 where mod(id,20) != 0;
950 rows deleted.
Here’s what I actually typed, with the response, that gave me a “What?!” moment:
SQL> delete from t1 where mod(rownum,20) != 0;
19 rows deleted.
I don’t think it will take long for you to work out why the result is so different; but I think it’s a nice warning about what can happen if you get a bit casual about using rownum.
AWR Snap ID
What to do when you hit a problem (possibly after an incomplete recovery) that reports an “ORA-00001 unique key violation” on sys.wrm$_snapshot_pk – as reported recently in this thread on the MOSC SQL Performance forum (needs a MOS account.)
Snapshot ids are carefully sequenced, without gaps, so somehow the thing that controls the “current” sequence number has gone backwards and is trying to generate a value that is lower than the current highest value in wrm$_snapshot. The thread I referenced above does point to an article dated 2017 on Alibaba discussing methods of checking for corruption and clearing up messes; but as an extra option you could simply try hacking the control table to set the “last used” snapshot id to something higher than the highest value currently in wrm$_snapshot. The table you need to hack is wrm$_wr_control and here’s an example of its contents from an instance of 19.11 (preceded by a check of the current maximum snap_id in wrm$_snapshot):
SQL> select max(snap_id) max_snap_id, max(end_interval_time) max_snap_time from wrm$_snapshot;
MAX_SNAP_ID MAX_SNAP_TIME
----------- ---------------------------------------------------------------------------
7304 09-APR-24 07.00.14.180 PM
SQL> execute print_table('select * from wrm$_wr_control')
DBID : 3158514872
SNAP_INTERVAL : +00000 01:00:00.0
SNAPINT_NUM : 3600
RETENTION : +00008 00:00:00.0
RETENTION_NUM : 691200
MOST_RECENT_SNAP_ID : 7304
MOST_RECENT_SNAP_TIME : 09-APR-24 07.00.15.169 PM
MRCT_SNAP_TIME_NUM : 1712685600
STATUS_FLAG : 2
MOST_RECENT_PURGE_TIME : 09-APR-24 08.35.57.430 AM
MRCT_PURGE_TIME_NUM : 1712648156
MOST_RECENT_SPLIT_ID : 7295
MOST_RECENT_SPLIT_TIME : 1712648156
SWRF_VERSION : 30
REGISTRATION_STATUS : 0
MRCT_BASELINE_ID : 0
TOPNSQL : 2000000000
MRCT_BLTMPL_ID : 0
SRC_DBID : 3158514872
SRC_DBNAME : CDB$ROOT
T2S_DBLINK :
FLUSH_TYPE : 0
SNAP_ALIGN : 0
MRCT_SNAP_STEP_TM : 1712685613
MRCT_SNAP_STEP_ID : 0
TABLESPACE_NAME : SYSAUX
-----------------
1 row(s) selected
PL/SQL procedure successfully completed.
I have to state that I would not try this on a production system without getting permission – possibly written in blood – from Oracle support: but if I were really desperate to keep existing snapshots and to make it possible for Oracle to generate new snapshots as soon as possible I might try updating the most_recent_snap_id to the value shown in wrm$_snapshot.
Of course you ought to update the most_recent_snap_time as well, and the mrct_snap_time_num (which looks like the number of seconds since 1st Jan 1900 GMT (on my instance)).
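Just to show the shape of the hack (and, again, not something to run without Oracle Support’s agreement) the update might look something like the following sketch – it deliberately leaves mrct_snap_time_num, the other “mrct” columns, and the CDB/PDB complications untouched:
update  sys.wrm$_wr_control c
set     (c.most_recent_snap_id, c.most_recent_snap_time) = (
                select  max(s.snap_id), max(s.end_interval_time)
                from    sys.wrm$_snapshot s
                where   s.dbid = c.dbid
        )
where   exists (
                select  null
                from    sys.wrm$_snapshot s
                where   s.dbid = c.dbid
        )
;
commit;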
Then there’s the interpretation and sanity checking of the other “most recent / mrct” columns to worry about, and the possibility of PDBs vs. CDBs – but those are topics that I’m going to leave to someone else to worry about.
dbms_output
Here’s a detail about dbms_output that is probably overlooked because (in most cases) it’s ignorable, except that it can lead to unexpected response times when you try using it to debug “busy” operations.
A question on the Oracle SQL and PL/SQL forum asked: “Why is a PL/SQL ‘for loop’ so slow on Oracle Autonomous DB?” and provided the following code to demonstrate the issue.
begin
for i in 1..36000 loop
dbms_output.put_line ('i value: '|| i);
end loop;
end;
/
The OP reported the elapsed time for this block as 1 minute 40 seconds (compared to Java taking only 2 seconds), and showed a screen shot to prove the point.
A couple of people added calls to systimestamp to the block to show that the loop actually took less than a second and explained that the time reported was mostly the time spent in displaying the results, not in creating them. Correct, of course, but that still raises the question of why it takes so long to display such a small amount of data.
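Here’s the sort of check they did – a sketch of the original block with a timestamp captured at the start so that the loop time is reported separately from the time spent getting the output to the screen:
set serveroutput on size unlimited

declare
        t0 timestamp := systimestamp;
begin
        for i in 1..36000 loop
                dbms_output.put_line ('i value: '|| i);
        end loop;
        -- this line is buffered last, and reports only the time spent in the loop
        dbms_output.put_line('Loop elapsed: ' || (systimestamp - t0));
end;
/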
If you call dbms_output.put_line() in your code you only get to see the results if something subsequently calls dbms_output.get_lines() to extract the results from the buffer created by dbms_output. (In the case of SQL*Plus this call is built in, so that once serveroutput has been enabled there’s a hidden call to BEGIN DBMS_OUTPUT.GET_LINES(:LINES, :NUMLINES); END; after every call you make to the database until you set serveroutput off.)
You’ll note the plural in get_lines(), and the parameter :numlines – how many lines do you think the call will get by default? In SQL*Plus it’s whatever is set by the last call to set arraysize, which defaults to 15.
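In other words the client is wrapping something like the following around a pair of host bind variables after each call – a minimal sketch of the call shape, with the 15 coming from the default arraysize:
declare
        v_lines    dbms_output.chararr;
        v_numlines integer := 15;       -- "arraysize" lines requested per fetch
begin
        dbms_output.get_lines(v_lines, v_numlines);
        -- v_numlines now reports how many lines were actually available
end;
/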
I set up a little test, connected to a local instance of 19.11, to demonstrate the impact – executing the supplied code from SQL*Plus using an arraysize of 1 (which is actually honoured by dbms_output), 15 (default), and 1000, with the following times:
- set arraysize 1: Time: 8.90 seconds (ca. 36,000 roundtrips)
- set arraysize 15: Time: 1.43 seconds (ca. 2,400 roundtrips)
- set arraysize 1000: Time: 0.41 seconds (ca. 36 roundtrips)
Historically there was a limit of 1MB on the dbms_output buffer but that limit was removed in a fairly recent version of Oracle so it is possible to generate huge amounts of data in a single PL/SQL block. This means that whatever else you do to optimise for network traffic time you may still see time spent in “PGA memory operation” as your shadow process tries to allocate memory chunks for the buffer.
Index Usage – 3
In the second part of this series I described some of the technicalities of Index Usage Tracking and showed an example of what I was doing to test the feature. In this episode I’ll describe some of the index access methods I’ve tested and report the results. I’ve listed the tests I’ve planned so far and will show results as I find time to run the tests – if you can think of more cases add them in the comments and I’ll extend the list. (If you think a test is a very good idea, “upvote” it in the comments and I’ll try to run it sooner.)
Before I’ve done all the tests I’ll add a section on Conclusions and Recommendations. This will be addressing the questions: “Is the feature worth using?” and “What’s the best way to use it?”. This section of the note may develop over time as special cases or boundary conditions show up.
Setup (recap from pt.2)
I’ll be using three sessions to do my testing:
- Session 1 logged in as the test user to execute test statements and query dba_index_usage.
- Session 2 logged in as sys in the same pdb to query x$keiut and x$keiut_info/v$index_usage_info (real time summaries)
- Session 3 logged in as the oracle s/w owner and connected as sys in the cdb$root to call keiutflush to make mmon flush x$keiut to disk. The session may have to “alter session set container = {the test pdb}” to flush the right x$keiut.
The basic cycle for each test is:
- CDB SYS session
  - call keiutFlush twice to clear all active elements from x$keiut_info/v$index_usage_info (check contents)
- PDB SYS session
  - check x$keiut is empty
- User session
  - check dba_index_usage for targeted index(es)
  - execute some statements forcing use of index and check execution plan
- PDB SYS session
  - check contents of x$keiut and x$keiut_info/v$index_usage_info
- CDB SYS session
  - call keiutFlush
- End user session
  - check contents of dba_index_usage (report changes)
For each test I’ll show (in collapsed form) the SQL I used to define the objects being tested, and describe the purpose and method of the test. Then I’ll simply show the critical changes in dba_index_usage and, in some cases, in x$keiut_info/v$index_usage_info and x$keiut, as a result of the test.
Initial list of tests
- Gathering index stats
- “Foreign Key” indexes and primary key indexes during referential integrity
- Single table, single column index, multiple statements
- Unique index (equality and range predicates)
- non-unique index covering unique constraint
- non-unique index generally
- Single table, Multi-column index (access/filter, and skip scan)
- Inlist iterator
- Index-only query (range scan, full scan, fast full scan, index joins)
- Bitmap access (with multiple bitmaps per key value)
- Bitmap and / or / minus
- IOT accessed by primary key
- IOT accessed by secondary key – ensuring “misses on guesses”
- Two-table, nested loop join
- DML – single table access
- DML – with subquery / join methods embedded
- Locally partitioned index (single table partition, multiple table partition)
- Globally partitioned index
I’ve used the following script to generate common data for many of the initial tests in the list above – I may extend this script as I complete more and more of the tests, though I may also introduce separate scripts for some specific cases:
rem
rem Script: iut_02.sql
rem Author: Jonathan Lewis
rem Dated: Mar 2024
rem Purpose:
rem
rem Last tested
rem 19.11.0.0
rem To be tested
rem 23.3.0.0
rem 12.2.0.1
rem
rem Notes:
rem Create some test data to demonstrate index usage tracking
rem
create table t1 as
select
rownum - 1 id,
mod(rownum-1,10000) n1,
trunc((rownum - 1)/20) col1,
trunc((rownum - 1)/10) col2,
rownum - 1 col3,
round(100 * dbms_random.normal()) rand,
cast(rpad(rownum,25) as varchar2(25)) v1,
cast(rpad('x',80,'x') as varchar2(80)) padding
from
all_objects
where
rownum <= 50000
/
create table t2 as select * from t1;
-- single column primary key index (unique)
alter table t1 add constraint t1_pk primary key (id)
using index (
create unique index t1_pk on t1(id)
);
create index t1_col3 on t1(col3);
alter table t1 add constraint t1_uk unique (col3);
create index t1_3col on t1(col1, col2, col3) compress 2;
create index t1_rand on t1(rand);
-- multi-column primary key index (unique)
alter table t2 add constraint t2_pk primary key (col1, id)
using index (
create unique index t2_pk on t2(col1, id) compress
);
-- "foreign key index" (one to many)
alter table t2 add constraint t2_fk_t1 foreign key (col3) references t1(id);
create index t2_col3 on t2(col3);
The SQL used to gather index stats by any of the calls to dbms_stats.gather_xxx_stats() is a simple select statement that is hinted to access the index. In my case the indexes were all fairly small – smaller than the critical number of blocks that triggers the sampling methods – so Oracle examined every block and row in the index, leading to (a check in) index usage stats looking like the following t1 primary key example:
OBJECT_ID : 209180
NAME : T1_PK
OWNER : TEST_USER
TOTAL_ACCESS_COUNT : 1
TOTAL_EXEC_COUNT : 1
TOTAL_ROWS_RETURNED : 50000
BUCKET_1000_PLUS_ACCESS_COUNT : 1
BUCKET_1000_PLUS_ROWS_RETURNED : 50000
LAST_USED : 01-apr-2024 13:48:51
So gathering stats does flag the index as used – but with the enhanced tracking it’s going to be relatively easy to spot cases where a small number of accesses account for a large number of rows – hinting that these accesses may be only for stats collection.
If you’ve read the linked article you’ll see how Oracle’s sampling strategy has changed in recent years, so a check that will be relevant to some of your indexes is whether or not the average rows returned is equivalent to roughly 1,140 leaf blocks. Another simple check that might be appropriate is: “is the very large access happening once per day/week, in line with your expected stats collection strategy?”
Of course, if a large access is happening fairly frequently you can raise the question – does this look like an appropriate result for the table or index, or does it hint at an index that is used when it should not be, or an index that needs to be refined (e.g. by the addition of extra columns to allow better elimination before visiting the table)?
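A rough way of spotting the “probably just stats collection” pattern would be a query like the following against dba_index_usage – only a sketch, and you still have to apply your knowledge of the stats collection schedule to the results:
select  owner, name, total_access_count, total_rows_returned, last_used
from    dba_index_usage
where   total_access_count = bucket_1000_plus_access_count
and     total_access_count > 0
order by
        total_rows_returned desc
/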
Referential Integrity (and DML)
With referential integrity in place Oracle will (in the most commonly used setup):
- check that matching child rows do not exist when you try to delete a parent or modify its key value – will either operation flag a supporting “foreign key” index as used?
- check that the parent row exists if you try to insert a child row or update a child row to associate it with a different parent value – will either operation flag the parent primary key index as used?
In my test script t1 is the parent and t2 is the child. The referential integrity is from t2.col3 to t1.id and the relevant indexes are t2_col3 (the non-unique “foreign key” index) and t1_pk (the unique primary key index) respectively. Test statements are:
-- Delete parent when child exists
-- Delete parent when child deleted
-- --------------------------------
delete from t1 where id = 17000;
-- ORA-02292: integrity constraint (TEST_USER.T2_FK_T1) violated - child record found
delete from t2 where col3 = 17000;
delete from t1 where id = 17000;
commit;
-- Insert into child without parent
-- Insert into child with parent
-- --------------------------------
insert into t2 values(60000, 1,1,1,60000,1,'x','x');
-- ORA-02291: integrity constraint (TEST_USER.T2_FK_T1) violated - parent key not found
insert into t2 values (18000,1,1,1,15000,1,'x','x');
commit;
-- update parent to "abandon" child
-- --------------------------------
update t1 set id = 60000 where id = 25000;
-- ORA-02292: integrity constraint (TEST_USER.T2_FK_T1) violated - child record found
-- update child to change to non-existent parent
-- update child to change to pre-existing parent
-- ---------------------------------------------
update t2 set col3 = 60000 where id = 25000;
-- ORA-02291: integrity constraint (TEST_USER.T2_FK_T1) violated - parent key not found
update t2 set col3 = 30000 where id = 25000;
commit;
After calling keiutFlush and checking that there were no active elements in x$keiut_info/v$index_usage_info, and no rows in x$keiut, I executed all the statements above one after the other (some failed, of course, with their errors shown above). Nothing was captured in x$keiut.
Apart from the implications of “foreign key” indexes not being flagged as used during referential integrity checks, the tests above also show us that updates and deletes driven by index access do not show the driving index flagged as used: t1.id = constant, t2.col3 = constant, t2.id = constant (which used an index skip scan on t2_pk).
Single table, single column index
This set of tests is close to a repeat of the first demonstration in part 2. Here’s the list of statements aimed at index t1_pk. Note that t1_pk(id) is a unique index supporting a primary key constraint, t1_col3(col3) is a non-unique index covering the unique constraint t1_uk, and t2_col3(col3) is a single column non-unique index.
--
-- Out of range, no rows
--
select v1 from t1 where id = -1;
select v1 from t1 where id = 60001;
--
-- Single row
--
select v1 from t1 where id = 1000;
select v1 from t1 where id = 2000;
--
-- multiple rows: 5, 50, 500, 1500
--
select max(v1) from t1 where id between 11 and 15;
select max(v1) from t1 where id between 101 and 150;
select max(v1) from t1 where id between 1001 and 1500;
select max(v1) from t1 where id between 10001 and 11500;
Here are the results after the sequence: call keiutFlush, execute test SQL, report x$keiut, call keiutFlush, report change in dba_index_usage:
SQL> select objname, num_starts, num_execs, rows_returned from x$keiut;
OBJNAME NUM_STARTS NUM_EXECS ROWS_RETURNED
-------------------------------- ---------- ---------- -------------
TEST_USER.T1_PK 8 8 2057
SQL> host expand temp1.lst
OBJECT_ID : 209180
NAME : T1_PK
OWNER : TEST_USER
TOTAL_ACCESS_COUNT : 8
TOTAL_EXEC_COUNT : 8
TOTAL_ROWS_RETURNED : 2057
BUCKET_0_ACCESS_COUNT : 2
BUCKET_1_ACCESS_COUNT : 2
BUCKET_2_10_ACCESS_COUNT : 1
BUCKET_2_10_ROWS_RETURNED : 5
BUCKET_11_100_ACCESS_COUNT : 1
BUCKET_11_100_ROWS_RETURNED : 50
BUCKET_101_1000_ACCESS_COUNT : 1
BUCKET_101_1000_ROWS_RETURNED : 500
BUCKET_1000_PLUS_ACCESS_COUNT : 1
BUCKET_1000_PLUS_ROWS_RETURNED : 1500
LAST_USED : 02-apr-2024 15:29:06
You’ll note that I executed 8 select statements, and expected a total of 2,057 rows (index rowids) being passed to the table access operation, and the changes in stats shown in dba_index_usage are an exact match for the predictions above the listed SQL statements.
Since t1.col3 is an exact match of t1.id, and since t2 is a duplicate of t1, it seems likely that tests that start by cloning the SQL and changing the column or table name as appropriate would give the matching results – and they do, so I won’t bother to print them all up.
There is one final test of a single column index before I move on to simple queries targeting a multi-column index. Here’s the statement I want to test to confirm an important point:
select v1 from t1 where id between 101 and 150 and mod(id,10) = 0;
This query will scan through 50 index entries, discarding all but 5 of them, returning 5 rows from the table. The key question is this – will dba_index_usage report 50 rows accessed or 5 rows accessed? Here’s what the changes in stats looked like after the test:
TOTAL_ACCESS_COUNT : 1
TOTAL_EXEC_COUNT : 1
TOTAL_ROWS_RETURNED : 5
BUCKET_0_ACCESS_COUNT : 0
BUCKET_1_ACCESS_COUNT : 0
BUCKET_2_10_ACCESS_COUNT : 1
BUCKET_2_10_ROWS_RETURNED : 5
BUCKET_11_100_ACCESS_COUNT : 0
BUCKET_11_100_ROWS_RETURNED : 0
BUCKET_101_1000_ACCESS_COUNT : 0
BUCKET_101_1000_ROWS_RETURNED : 0
BUCKET_1000_PLUS_ACCESS_COUNT : 0
BUCKET_1000_PLUS_ROWS_RETURNED : 0
This is a very important point: the stats in dba_index_usage do not tell us how many rows (index entries) we visited in the index, they tell us how many rowids (or, possibly, key values) survived to be passed to the parent operation (typically the table access). So you might look at some stats that say: “25,000 executions, all in the 2 to 10 range – good index” when you’ve got a badly designed index that does 90% of the total work of a query and discards 2,000 index entries for each rowid it uses to access a table.
Inlist Iterators
Here’s a sample query (with its result set, and actual execution plan pulled from memory) accessing the t1 table through the index on column rand. If you try to repeat this example it probably won’t give you exactly the same results because I used Oracle’s random number generator to generate a normal distribution of integer values (with mean zero and standard deviation of 100), but there’s a reasonable chance that you’ll see similar numbers in your output as I’ve been careful to pick three values that should return significantly different numbers of rows:
select
rand, count(*)
from t1
where rand in (100, 200, 300)
and v1 is not null
group by
rand
order by
count(*)
/
RAND COUNT(*)
---------- ----------
300 2
200 25
100 114
3 rows selected.
select * from table(dbms_xplan.display_cursor(format=>'allstats last'));
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------
SQL_ID fbwmnax39jqtq, child number 0
-------------------------------------
select rand, count(*) from t1 where rand in (100, 200,
300) and v1 is not null group by rand order by
count(*)
Plan hash value: 874747030
-------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 3 |00:00:00.01 | 143 | | | |
| 1 | SORT ORDER BY | | 1 | 3 | 3 |00:00:00.01 | 143 | 2048 | 2048 | 2048 (0)|
| 2 | SORT GROUP BY NOSORT | | 1 | 3 | 3 |00:00:00.01 | 143 | | | |
| 3 | INLIST ITERATOR | | 1 | | 141 |00:00:00.01 | 143 | | | |
|* 4 | TABLE ACCESS BY INDEX ROWID| T1 | 3 | 220 | 141 |00:00:00.01 | 143 | | | |
|* 5 | INDEX RANGE SCAN | T1_RAND | 3 | 220 | 141 |00:00:00.01 | 5 | | | |
-------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("V1" IS NOT NULL)
5 - access(("RAND"=100 OR "RAND"=200 OR "RAND"=300))
This plan is probably the simplest demonstration of the difference between total_exec_count and total_access_count. Here’s the output from x$keiut after this test, followed by the details available from dba_index_usage after calling keiutFlush:
OBJNAME NUM_STARTS NUM_EXECS ROWS_RETURNED
-------------------------------- ---------- ---------- -------------
TEST_USER.T1_RAND 3 1 141
Given that x$keiut has reported three “starts”, that the plan shows three “Starts” with an “A-Rows” of 141, and that the output shows three rows totalling 141, we might hope that dba_index_usage is going to show us exactly what happened with this query. Here are the results:
SQL> execute print_table(q'(select * from dba_index_usage where name = 'T1_RAND')')
OBJECT_ID : 209738
NAME : T1_RAND
OWNER : TEST_USER
TOTAL_ACCESS_COUNT : 3
TOTAL_EXEC_COUNT : 1
TOTAL_ROWS_RETURNED : 141
BUCKET_0_ACCESS_COUNT : 0
BUCKET_1_ACCESS_COUNT : 0
BUCKET_2_10_ACCESS_COUNT : 0
BUCKET_2_10_ROWS_RETURNED : 0
BUCKET_11_100_ACCESS_COUNT : 3
BUCKET_11_100_ROWS_RETURNED : 141
BUCKET_101_1000_ACCESS_COUNT : 0
BUCKET_101_1000_ROWS_RETURNED : 0
BUCKET_1000_PLUS_ACCESS_COUNT : 0
BUCKET_1000_PLUS_ROWS_RETURNED : 0
LAST_USED : 02-apr-2024 19:02:03
The three “starts” from x$keiut show up in the total_access_count but the three separate accesses have been accumulated and averaged to appear as 3 accesses in the 11-100 range (141 / 3 = 47). So there’s some loss of precision in the information. It’s better than just the old “yes/no” report, and in many cases it’s likely to give you numbers that are close enough to the right ballpark to be useful, but there are likely to be some cases where the more detailed information would be more beneficial. It’s worth mentioning that the total_access_count (hence the averaging) will include “Starts” where no rows were returned.
Conclusions and Suggestions
Calls to gather index stats execute queries that will be captured by index usage tracking, so you need a strategy to help you ignore “BUCKET_1000_PLUS_ROWS_RETURNED” accesses that are about stats collection, while taking action when they are indications of index usage that could be made more efficient.
Indexes used during referential integrity checks are not flagged as used. This is probably not much of a threat for the relevant (primary/unique key) indexes on the parent end since you are unlikely to drop indexes that enforce uniqueness even if they don’t appear to be used; however it does mean that there is some risk of dropping a “foreign key” index that is needed to avoid locking problems.
Indexes used to drive delete and update statements are not captured by the new mechanism – at least for simple “delete from single_table where …” and “update table where …”. This could result in indexes being dropped that have been created to enhance DML performance. (There are outstanding tests to check whether DML with subqueries using indexes, or DML driven by joins using indexes, will flag the indexes used to access the other tables in the statement.)
The statistics about “rows returned” tell you how many index entries are passed to the parent operation, not about the number of index entries examined; so a frequently used index that consistently reports a very small number of “rows returned” is not necessarily an efficient, well-designed index.
The stats on inlist iterators are a little disappointing: while the number of iterations appears as the total_access_count, the accesses are all allocated to the same bucket on the basis of total_rows_returned / total_access_count and no allowance is made for different values in the list returning significantly different numbers of rows.
Index Usage – 2
In the first part of this series I introduced Index Usage Tracking and the view dba_index_usage – a feature that appeared in 12.2 as a replacement for index monitoring and the view dba_object_usage. In this note I’ll give a quick sketch of the technicalities of the implementation and comments on how to test and use the feature. Actual tests, results and observations will have to wait until part 3.
A not very deep dive
There are three parameters relating to Index Usage Tracking (iut), shown below with their default values:
- _iut_enable [TRUE]
- _iut_max_entries [30000]
- _iut_stat_collection_type [SAMPLED]
The feature is, as you can see, enabled by default; the tracking, however, is “sampled”, which seems to mean that a tiny number of executions end up being tracked. I can’t find any information about how the sampling is done, and having tried a few tests that executed thousands of statements in a couple of minutes without capturing any details of index usage I’ve given up trying and done all my testing with “_iut_stat_collection_type” set to ALL.
SQL> alter session set "_iut_stat_collection_type"=all;
According to a note on MOS (Doc ID 2977302.1) it doesn’t matter whether you set this parameter for the session or the system: the effect is the same. I found that this seemed to be true in my testing on Oracle 19.11 – either way the effect appeared across all sessions connecting to the PDB, though it didn’t seem to persist across a database restart.
The parameter _iut_max_entries probably limits the amount of global memory allowed for collecting stats about indexes. You might ask whether the 30,000 is per PDB or for the entire instance; I suspect it’s for the instance as a whole, but I’m not going to run up a test to scale on that. While I know of several 3rd party applications holding far more indexes than this, the number is probably sufficient for most investigations.
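If you want to check the current values of these hidden parameters the standard x$ksppi / x$ksppcv query (run as SYS, with the usual caveats about poking at x$ structures and changing hidden parameters) will do the job; this is just a sketch of the common pattern:
select
        i.ksppinm       parameter_name,
        v.ksppstvl      parameter_value
from
        x$ksppi         i,
        x$ksppcv        v
where
        v.indx = i.indx
and     i.ksppinm like '\_iut%' escape '\'
order by
        i.ksppinm
/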
There are eight objects visibly related to Index Usage Tracking: three views, one table, three memory structures and one latch:
- dba_index_usage – the user (dba) friendly view of the accumulated statistics of index usage
- cdb_index_usage – the cdb equivalent of the above
- v$index_usage_info – a view (holding one row) summarising the current tracking status
- sys.wri$_index_usage – the main table behind the xxx_index_usage views above; the views join this table to obj$ and user$, so dropped indexes (and users) disappear from the views.
- x$keiut_info – the memory structure (held in the shared pool) behind the v$index_usage_info
- x$keiut – a structure holding a brief summary for each index actively being tracked. This is generated on demand in the session/process memory and my guess is that it’s an extract or summary of a larger memory structure in the shared pool holding the full histogram data for each index.
- htab_keiutsg – a (small) memory allocation reported by v$sgastat in the shared pool. In my 19.11 the memory size was initially 512 bytes, and in a test with 140 indexes showing up in x$keiut the memory reported was still only 512 bytes (so it’s not a simple list of pointers, more likely a set of pointers to pointers/arrays) – see the query sketches after this list.
- “keiut hash table modification” – a single parent latch which I assume protects the htab_keiutsg memory. It’s possible that this latch is used to add an entry to the x$keiut structure (or, rather, the larger structure behind it) when an index is first tracked by the software, and that each entry in that larger structure is then protected by its own mutex to minimise collision time on updates as the stats are updated (or cleared after flushing).
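If you want to check those last two items for yourself, a couple of simple queries (run as SYS, and assuming the names appear exactly as I’ve reported them above) should be enough:
select  pool, name, bytes
from    v$sgastat
where   name = 'htab_keiutsg'
/

select  name, gets, misses
from    v$latch_parent
where   name = 'keiut hash table modification'
/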
Given that there’s a limit of 30,000 for iut_max_entries and only a small memory allocation for the keiut hash table, it does sound as if Oracle could end up walking a fairly long linked list or array to find the correct entry to update, which makes me wonder about two things: first, have I missed something obvious; secondly, will Oracle skip updating the stats if the alternative means waiting for a mutex? There’s also the question of whether Oracle simply stops collecting when the limit is reached or whether there’s some sort of LRU algorithm that allows it to discard entries for rarely used indexes to get maximum benefit from the available limit.
Another thought that goes with the 30,000 limit: I can find the merge statement that Oracle uses to update the wri$_index_usage table when the stats are flushed from memory to table (an activity that takes place every 15 minutes, with no obvious parameter to change the timing). In my 19.11 instance its sql_id is 5cu0x10yu88sw, and it starts with the text:
merge into
sys.wri$_index_usage iu
using
dual
on (iu.obj# = :objn)
when matched then
update set
iu.total_access_count = iu.total_access_count + :ns,
iu.total_rows_returned = iu.total_rows_returned + :rr,
iu.total_exec_count = iu.total_exec_count + :ne,
...
This statement updates the table one row at a time (which you can confirm if you can find it in v$sql and compare rows_processed with executions). This could take a significant amount of time to complete on a system with a very large number of indexes.
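The check is easy if the statement is still in the library cache – something like the following, using the sql_id I reported above (the sql_id should be the same on your system if the statement text is identical, but it may vary across versions):
select
        sql_id, executions, rows_processed,
        round(rows_processed / nullif(executions,0), 2)  rows_per_exec
from
        v$sql
where
        sql_id = '5cu0x10yu88sw'
/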
The other thing that comes with finding the merge statement is that I couldn’t find any indication that there is a delete statement – either in v$sql, or in the Oracle executable. Spreading the search a little further I queried dba_dependencies and found that the package dbms_auto_index_internal references wri$_index_usage and various of the “autotask” packages – so perhaps there’s something a couple of layers further down the PL/SQL stack that generates dynamic SQL to delete tracking data. On the other hand, there are entries in my copy of wri$_index_usage where the last_used column has dates going back to September 2021, and there are a number of rows where the reported index has been dropped.
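For reference, the dependency search was nothing more sophisticated than something like:
select
        owner, name, type
from
        dba_dependencies
where
        referenced_owner = 'SYS'
and     referenced_name  = 'WRI$_INDEX_USAGE'
order by
        owner, name
/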
Testing the feature
The most significant difficulty testing the mechanism is that it flushes the in-memory stats to the table every 15 minutes, and it’s only possible to see the histogram of index usage from the table. Fortunately it is possible to use oradebug to force mmon to trigger a flush, but I found in my Oracle 19.11 PDB I had to have a session logged into the server as the owner of the Oracle executable, and logged into the cdb$root as the SYS user (though a couple of colleagues had different degrees of success on different versions of Oracle and O/S). The following is a cut and paste after logging in showing appropriate calls to oradebug:
SQL> oradebug setorapname mmon
Oracle pid: 31, Unix process pid: 11580, image: oracle@linux19c (MMON)
SQL> oradebug call keiutFlush
Function returned 0
SQL>
Initially I had assumed I could log on as a rather more ordinary O/S user and connect as SYS to the PDB, but this produced an unexpected error when I tried to execute the flush call:
SQL> oradebug call keiutFlush
ORA-32519: insufficient privileges to execute ORADEBUG command: OS debugger privileges required for client
In my testing, then, I’m going to open three sessions:
- End-user session – a session to execute some carefully designed queries.
- cdb$root SYS session – a session to flush stats from memory to disc.
- PDB SYS session – a session to show the effects of the end-user activity (reporting figures from x$keiut_info, x$keiut, and dba_index_usage)
I’ll be running some simple tests, covering select, insert, update, delete and merge statements with single-column indexes, multi-column indexes, locally partitioned indexes, single table queries, nested loop joins, range scans, fast full scans, skip scans, inlist iterators, union views, stats collection and referential integrity. For each test I’ll describe how the index will be used, then show what the stats look like. Given that what we really need to see are the changes in x$keiut and dba_index_usage I’ll only show the complete “before / after” values in one example here. In part 3 of the series you’ll have to trust that I can do the arithmetic and report the changes correctly.
Example
From the end-user session I have a table created with the following code:
rem
rem Script: iut_01.sql
rem Author: Jonathan Lewis
rem Dated: Mar 2024
rem Purpose:
rem
rem Last tested
rem 19.11.0.0
create table t1 as
select
rownum id,
mod(rownum-1,10000) n1,
trunc((rownum - 1)/20) col1,
trunc((rownum - 1)/10) col2,
rownum - 1 col3,
cast(rpad(rownum,25) as varchar2(25)) v1,
cast(rpad('x',80,'x') as varchar2(80)) padding
from
all_objects
where
rownum <= 50000
/
create index t1_pk on t1(id);
create index t1_n1 on t1(n1);
create index t1_i1 on t1(col1, col2, col3);
From the cdb$root logged on as oracle (executable owner) and connected as SYS:
SQL> startup force
ORACLE instance started.
Total System Global Area 1476391568 bytes
Fixed Size 9134736 bytes
Variable Size 822083584 bytes
Database Buffers 637534208 bytes
Redo Buffers 7639040 bytes
Database mounted.
Database opened.
SQL> oradebug setorapname mmon
Oracle pid: 31, Unix process pid: 27738, image: oracle@linux19c (MMON)
SQL> oradebug call keiutFlush
Function returned 0
From an ordinary O/S user, connected to the PDB as SYS:
SQL> select index_stats_collection_type, alloc_elem_count, active_elem_count, flush_count, last_flush_time from x$keiut_info;
INDEX_STATS_COLLECTION_TYPE ALLOC_ELEM_COUNT ACTIVE_ELEM_COUNT FLUSH_COUNT LAST_FLUSH_TIME
--------------------------- ---------------- ----------------- ----------- ----------------------------
1 0 0 2 19-MAR-24 10.53.50.584 PM
1 row selected.
SQL> alter session set "_iut_stat_collection_type"=all;
Session altered.
SQL> select index_stats_collection_type, alloc_elem_count, active_elem_count, flush_count, last_flush_time from x$keiut_info;
INDEX_STATS_COLLECTION_TYPE ALLOC_ELEM_COUNT ACTIVE_ELEM_COUNT FLUSH_COUNT LAST_FLUSH_TIME
--------------------------- ---------------- ----------------- ----------- ----------------------------
0 0 0 2 19-MAR-24 10.53.50.584 PM
1 row selected.
Note how the index_stats_collection_type changes from 1 to 0 after the “alter session”. I don’t know why the flush_count showed up as 2 when I had only flushed once – but perhaps the second flush is a side effect of altering the collection type.
From an ordinary end-user session
SQL> set feedback only
SQL> select n1 from t1 where id between 1 and 5;
5 rows selected.
SQL> select n1 from t1 where id between 1 and 5;
5 rows selected.
SQL> select n1 from t1 where id between 1 and 50;
50 rows selected.
These queries will use the index t1_pk in an index range scan to access the table by rowid.
From the PDB / SYS
SQL> select index_stats_collection_type, alloc_elem_count, active_elem_count, flush_count, last_flush_time from x$keiut_info;
INDEX_STATS_COLLECTION_TYPE ALLOC_ELEM_COUNT ACTIVE_ELEM_COUNT FLUSH_COUNT LAST_FLUSH_TIME
--------------------------- ---------------- ----------------- ----------- ----------------------------
0 1 1 2 19-MAR-24 10.53.50.584 PM
1 row selected.
SQL> select objnum, objname, num_starts, num_execs, rows_returned from x$keiut;
OBJNUM OBJNAME NUM_STARTS NUM_EXECS ROWS_RETURNED
---------- -------------------------------- ---------- ---------- -------------
208077 TEST_USER.T1_PK 3 3 60
1 row selected.
In the x$keiut_info you can see that Oracle has now allocated one “element”, and has one “active” element. Checking x$keiut (which will report some details of each active element) we can see that my t1_pk index has been used in 3 statement executions, starting a scan a total of 3 times (which matches our expectation) with a total of 60 (= 5 + 5 + 50) rows returned. Of course all we could infer from this one row is that we have returned an average of 20 rows per start, and an average of one start per execution.
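A couple of derived columns make that averaging explicit; this is only a sketch against the x$keiut columns I’ve already listed:
select
        objname,
        num_execs,
        num_starts,
        rows_returned,
        round(rows_returned / nullif(num_starts,0), 1)  rows_per_start,
        round(num_starts    / nullif(num_execs,0), 1)   starts_per_exec
from
        x$keiut
/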
From the cdb$root SYS
SQL> oradebug call keiutFlush
Function returned 0
From the PDB SYS (using Tom Kyte’s “print_table”)
SQL> execute print_table(q'[select * from dba_index_usage where name = 'T1_PK' and owner = 'TEST_USER']')
OBJECT_ID : 208077
NAME : T1_PK
OWNER : TEST_USER
TOTAL_ACCESS_COUNT : 3
TOTAL_EXEC_COUNT : 3
TOTAL_ROWS_RETURNED : 60
BUCKET_0_ACCESS_COUNT : 0
BUCKET_1_ACCESS_COUNT : 0
BUCKET_2_10_ACCESS_COUNT : 2
BUCKET_2_10_ROWS_RETURNED : 10
BUCKET_11_100_ACCESS_COUNT : 1
BUCKET_11_100_ROWS_RETURNED : 50
BUCKET_101_1000_ACCESS_COUNT : 0
BUCKET_101_1000_ROWS_RETURNED : 0
BUCKET_1000_PLUS_ACCESS_COUNT : 0
BUCKET_1000_PLUS_ROWS_RETURNED : 0
LAST_USED : 19-mar-2024 23:08:02
From the data saved in the table we can see that we’ve logged 3 accesses, of which 2 accesses returned (individually) something between 2 and 10 rows (rowids) for a total of 10 rows (5 + 5) and one access returned (individually) something between 11 and 100 rows (rowids) for a total of 50 rows.
Of course we can say confidently that the one larger access actually did return 50 rows; but looking at nothing but these figures we can’t infer that the other two accesses returned 5 rows each: it could have been one query returning 2 rows and the other returning 8, or 3 and 7, or 4 and 6. Still, we do get a reasonable indication of the volume of data from the breakdown of 0, 1, 2 – 10, 11 – 100, 101 – 1000, 1000+.
You might note that we can also see our flush time (reported below) reappearing as the last_used date and time – so we know that we are looking at current statistics.
From the PDB / SYS (again)
SQL> select index_stats_collection_type, alloc_elem_count, active_elem_count, flush_count, last_flush_time from x$keiut_info;
INDEX_STATS_COLLECTION_TYPE ALLOC_ELEM_COUNT ACTIVE_ELEM_COUNT FLUSH_COUNT LAST_FLUSH_TIME
--------------------------- ---------------- ----------------- ----------- ----------------------------
0 1 1 3 19-MAR-24 11.08.02.013 PM
1 row selected.
SQL> select objnum, objname, num_starts, num_execs, rows_returned from x$keiut;
OBJNUM OBJNAME NUM_STARTS NUM_EXECS ROWS_RETURNED
---------- -------------------------------- ---------- ---------- -------------
208077 TEST_USER.T1_PK 0 0 0
1 row selected.
The x$keiut_info shows that a third flush has taken place (and any index flushed at that time will have its last_used set very near to that flush time – the merge command uses sysdate, so the last_used could be a tiny bit after the last_flush_time). It still shows an “active” element and when we check x$keiut we find that t1_pk is still listed but the stats have been reset to zero across the board.
If we were to repeat the flush command the active count would drop to zero and the t1_pk entry would disappear from x$keiut. (Oracle doesn’t remove an element until an entire tracking period has passed with no accesses – a typical type of “lazy” strategy aimed at avoiding unnecessary work.)
That’s all for now – if there are any questions put them in the comments and if their answers belong in this note I’ll update the note.
Index Usage – 1
In 12.2 Oracle introduced Index Usage Tracking to replace the previous option for “alter index xxx monitoring usage”. A recent post on the Oracle database discussion forum prompted me to look for articles about this “new” feature and what people had to say about it. There didn’t seem to be much information online – just a handful of articles starting with Tim Hall a few years ago and ending with Maria Colgan a few months ago – so I thought I’d update my notes a little and publish them.
Unfortunately, by the time I’d written the first 6 pages it was starting to feel like very heavy going, so I decided to rewrite it as a mini-series. In part one I’ll just give you some descriptions and explanations that are missing from the manuals; in part two I’ll do a bit of a fairly shallow dive to talk about what’s happening behind the scenes and how you can do some experiments; in part three I’ll describe some of the experiments and show the results that justify the descriptions I’ve given here in part one.
History
In the bad old days you could enable “monitoring” on an index to see if it was being used. The command to do this was:
alter index {index name} monitoring usage;
After executing this statement you would wait for a bit then check the view dba_object_usage:
SQL> desc dba_object_usage
Name Null? Type
----------------------------- -------- --------------------
OWNER NOT NULL VARCHAR2(128)
INDEX_NAME NOT NULL VARCHAR2(128)
TABLE_NAME NOT NULL VARCHAR2(128)
MONITORING VARCHAR2(3)
USED VARCHAR2(3)
START_MONITORING VARCHAR2(19)
END_MONITORING VARCHAR2(19)
SQL> select * from dba_object_usage;
OWNER INDEX_NAME TABLE_NAME MON USE START_MONITORING END_MONITORING
--------------- -------------------- ------------------------- --- --- ------------------- -------------------
TEST_USER T2_I1 T2 YES YES 03/12/2024 15:31:35
1 row selected.
As you can see, this didn’t give you much information – just “yes it has been used” or “no it hasn’t been used” since the moment you started monitoring it; and that’s almost totally useless as an aid to measuring or understanding the effectiveness of the index.
Apart from the almost complete absence of information, there were collateral issues: I think that, initially, gathering stats, index rebuilds and using explain plan would flag an index as used; at the opposite extreme indexes that were actually used to avoid foreign key locking problems were not flagged as used.
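For completeness – and because it’s easy to forget – the matching command to switch the old mechanism off again was:
alter index {index name} nomonitoring usage;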
And now for something completely different
The promise of Index Usage Tracking is clearly visible in the description of the view you use to report the details captured:
SQL> desc dba_index_usage
Name Null? Type
----------------------------------- -------- ------------------------
OBJECT_ID NOT NULL NUMBER
NAME NOT NULL VARCHAR2(128)
OWNER NOT NULL VARCHAR2(128)
TOTAL_ACCESS_COUNT NUMBER
TOTAL_EXEC_COUNT NUMBER
TOTAL_ROWS_RETURNED NUMBER
BUCKET_0_ACCESS_COUNT NUMBER
BUCKET_1_ACCESS_COUNT NUMBER
BUCKET_2_10_ACCESS_COUNT NUMBER
BUCKET_2_10_ROWS_RETURNED NUMBER
BUCKET_11_100_ACCESS_COUNT NUMBER
BUCKET_11_100_ROWS_RETURNED NUMBER
BUCKET_101_1000_ACCESS_COUNT NUMBER
BUCKET_101_1000_ROWS_RETURNED NUMBER
BUCKET_1000_PLUS_ACCESS_COUNT NUMBER
BUCKET_1000_PLUS_ROWS_RETURNED NUMBER
LAST_USED DATE
Though the columns are not very well described in the reference manuals you can see very clearly that there’s a lot more detail than just “yes/no” here. The columns clearly carry information about “how many times” and “how much data”, breaking the numbers down across a small range-based histogram. Here’s an example of output (using Tom Kyte’s print_table() routine to turn columns to rows):
SQL> execute print_table('select * from dba_index_usage where name = ''T1_I1''')
OBJECT_ID : 206312
NAME : T1_I1
OWNER : TEST_USER
TOTAL_ACCESS_COUNT : 889
TOTAL_EXEC_COUNT : 45
TOTAL_ROWS_RETURNED : 17850
BUCKET_0_ACCESS_COUNT : 0
BUCKET_1_ACCESS_COUNT : 0
BUCKET_2_10_ACCESS_COUNT : 0
BUCKET_2_10_ROWS_RETURNED : 0
BUCKET_11_100_ACCESS_COUNT : 878
BUCKET_11_100_ROWS_RETURNED : 13200
BUCKET_101_1000_ACCESS_COUNT : 9
BUCKET_101_1000_ROWS_RETURNED : 1650
BUCKET_1000_PLUS_ACCESS_COUNT : 2
BUCKET_1000_PLUS_ROWS_RETURNED : 3000
LAST_USED : 11-mar-2024 20:26:26
The order of the columns is just a little odd (in my opinion) so I’ve switched two of them around in my descriptions below:
- Total_exec_count: is the total number of executions that have been captured for SQL statements using this index.
- Total_access_count: is the total number of scans of this index that have been observed. If you think of a nested loop join you will appreciate that a single execution of an SQL statement could result in many accesses of an index – viz: an index range scan into the inner (second) table may happen many times, once for each row acquired from the outer (first) table.
- Total_rows_returned: carries a little trap in the word rows, and in the word returned. In this context “rows” means “index entries”, and “returned” means “passed to the parent operation”. (To be confirmed / clarified)
- Bucket_0_access_count: how many index accesses found no rows (and there’s no bucket_0_row_count because it would always be 0).
- Bucket_1_access_count: how many index accesses found just one row (and there’s no bucket_1_row_count because that would always match the access count).
- Bucket_M_N_access_count: how many index accesses found between M and N rows.
- Bucket_M_N_row_count: sum of rows across all the index accesses that returned between M and N rows.
- Last_used: date and time of the last flush that updated this row of the table/view.
The most important omission in the descriptions given in the manuals is the difference between total_exec_count and total_access_count. (It was a comment on Maria Colgan’s blog note asking about the difference that persuaded me that I really had to write this note.) If you don’t know what an “access” is supposed to be you can’t really know how to interpret the rest of the numbers.
Take another look at the sample output above: it shows 45 executions and 889 accesses. I happen to know (because I did the test) that most of the work I’ve done in this interval has been reporting a two-table join that uses a nested loop from t2 into t1 using an index range scan on index t1_i1 to access table t1. I know my data well enough to know that every time I run my query it’s going to find about 20 rows in t2, and that for every row I find in t2 there will be roughly 15 rows that I will access in t1 through the index.
Give or take a little extra activity round the edges that blurs the numbers, I can see that the numbers make sense:
- 45 executions x 20 rows from t2 = 900 index range scans through t1_i1
- 878 index range scans x 15 rows per scan = 13,170
The numbers are in the right ball-park to meet my expectations. But we do have 11 more accesses reported – 9 of them reported an average of 1,650/9 = 183 rows, 2 of them reported an average of 3,000/2 = 1,500 rows. Again, I know what I did, so I can explain why those numbers have appeared, but in real life you may have to do a little work to find a reasonable explanation. (Spoiler: be suspicious about gathering index stats.)
It’s possible, for example, that there are a few rows in the t2 table that have far more than the 15 row average in t1 and the larger numbers are just some examples from the nested loop query that happened to hit a couple of these outliers in t2. (It’s worth highlighting, as a follow-up to this suggestion, that a single execution could end up reporting accesses and row counts in multiple buckets.)
In fact the 9 “medium sized” accesses were the result of single table queries using a “between” clause that ranged through 10 to 15 values of t1 (returning 150 to 225 rows each), and the two “large” accesses were the result of two index-only queries where I forced an index full scan and an index fast full scan that discarded half the rows of an index holding 3,000 entries.
As I said, I’ll be presenting a few examples in part 3, but a guideline that may be helpful when considering the executions, accesses, and rowcounts is this: if you’re familiar with the SQL Monitor report then you’ll know that each call to dbms_sql_monitor.report_sql_monitor() reports one execution – then the Starts column for any index operation will (probably) be the total access count, and the Rows (Actual) column will (probably) be the total rows returned. As noted above, though, any one execution may end up splitting the total Starts and Rows (Actual) across multiple buckets.
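If you want to produce that SQL Monitor report to cross-check against the tracking figures, a call along the following lines will do it – this is just a sketch: you’ll need to supply a suitable sql_id and have the licences (Diagnostics and Tuning packs) required to use SQL Monitor:
select
        dbms_sql_monitor.report_sql_monitor(
                sql_id       => '&m_sql_id',
                type         => 'TEXT',
                report_level => 'ALL'
        )
from
        dual
/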
Some questions to investigate
I hope this has given you enough information to get you interested in Index Usage Tracking, and some idea of what you’re looking at when you start using the view. There are, however, some technical details you will need to know if you want to do some testing before taking any major steps in production. There are also some questions that ought to be addressed before jumping to conclusions about what the numbers mean, so I thought I’d list several questions that came to mind when I first read about the feature:
- Does a call to dbms_stats.gather_index_stats result in an update to the index usage stats, and does it matter?
- Does a call to explain plan result in an update to the index usage stats, and does it matter.
- Do referential integrity checks result in the parent or child indexes being reported in the usage stats. What if there is a parent delete with “on delete cascade” on the child.
- Do inserts, updates, deletes or merges produce any unexpected results (e.g. double / quadruple counting); what if they’re PL/SQL forall bulk processing, what if (e.g.) you update or delete through a join view.
- Does an index skip scan count as a single access, or does Oracle count each skip as a separate access (I’d hope it would be one access).
- If you have an index range scan with a filter predicate applied to each index entry after the access predicate is the “rows returned” the number of index entries examined (accessed), or the number that survive the filter. (I would like it to be the number examined because that’s the real measure of the work done in the index but the name suggests it counts the survivors.)
- Does an index fast full scan get reported correctly.
- Are IOTs accounted differently from ordinary B-tree indexes
- For bitmap indexes what is a “row” and what does the tracking information look like?
- If you have an “Inlist Iterator” operation does this get summed into one access, or is it an access per iteration (which is what I would expect). And how is the logic applied with partitioned table iteration.
- Does a UNION or UNION ALL operation count multiple accesses (I would expect so), and what happens with things like nvl_or_expansion with “conditional” branches.
- Does a “connect by pump” through an index produce any unexpected results
- Can index usage tracking tell us anything about Domain indexes
- Are there any types of indexes that are not tracked (sys indexes, for example)
If you can think of any other questions where “something different” might happen, feel free to add them as comments.
Summary
Index Usage Tracking (and the supporting view dba_index_usage) can give you a good insight into how Oracle is using your indexes. This note explains the meaning of the data reported in the view and offers a couple of ideas about how you may need to interpret the numbers for a single index.
In the next two articles we’ll look at some of the technical aspects of the feature (including how to enable and test it), and the results captured from different patterns of query execution, concluding (possibly in a 4th article) in suggestions of how to use the feature in a production system.
Footnote
At the start of this note I said it had been prompted by a question on one of the Oracle forums. The thread was about identifying indexes that could be dropped and the question was basically: “Is the old index monitoring obsolete?” The answer is “Yes, definitely, and it has been for years.”
Querying LONGs
Update for 23c: If your only need for using LONGs in predicates is to query the partitioning views by high_value you won’t need to read this note as the views now expose columns high_value_clob and high_value_json. (See comment #3 below.)
Despite their continued presence in the Oracle data dictionary, LONG columns are not an option that anyone should choose; you can’t do much with them and they introduce a number of strange space management problems. Nevertheless a recent thread on the Oracle database forum started with the following question: “How do you use LONG columns in a WHERE clause?”. The basic answer is: “You don’t”.
This user wanted to query all_tab_partitions for a given table_name and high_value, and the high_value is (still) a LONG, so attempts to use it resulted in Oracle error “ORA-00997: illegal use of LONG datatype”. A possible, and fairly obvious but undesirable, solution to the requirement is to write a PL/SQL function to read the current row from all_tab_partitions and return the first N characters of the high_value as a varchar2(). Here’s a version (not quite the one I posted) of such a function, with a sample of use:
rem
rem Script: get_high_value.sql
rem Author: Jonathan Lewis
rem Dated: Mar 2024
rem
create or replace function get_high_value (
i_tab_owner varchar2,
i_tab_name varchar2,
i_part_name varchar2,
i_part_posn number
)
return varchar2
is
v1 varchar2(4000);
begin
select atp.high_value
into v1
from all_tab_partitions atp
where atp.table_owner = upper(i_tab_owner)
and atp.table_name = upper(i_tab_name)
and atp.partition_name = upper(i_part_name)
and atp.partition_position = upper(i_part_posn)
;
return v1;
end;
/
select
apt.table_owner, apt.table_name,
apt.tablespace_name,
apt.partition_name, apt.partition_position,
apt.high_value
from
all_tab_partitions apt
where
apt.table_owner = 'TEST_USER'
and apt.table_name = 'PT_RANGE'
and get_high_value(
apt.table_owner,
apt.table_name,
apt.partition_name,
apt.partition_position
) = '200'
/
This seemed to work quite well and sufficiently rapidly – but I only had two partitioned tables in my schema and a total of 12 partitions, so it’s not sensible to look at the clock to see how efficient the query is.
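For anyone who wants to recreate the test, the pt_range table I queried could be created with something like the following – this is a reconstruction based only on the partition names and boundary values that appear in the outputs in this note, not the script I actually used:
create table pt_range (
        id      number(8,0),
        v1      varchar2(10)
)
partition by range (id) (
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
/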
Another possible solution introduced me to a function that has been around for years (and many versions) which I had never come across: sys_dburigen(). PaulZip supplied the following code (which I’ve altered cosmetically and edited to pick up a table in my schema):
select *
from (
select
dbms_lob.substr(
sys_dburigen (
atp.table_owner,
atp.table_name,
atp.partition_name,
atp.partition_position,
atp.high_value,
'text()'
).getclob(), 4000, 1) high_value_str,
atp.table_owner, atp.table_name, atp.partition_name,
atp.tablespace_name, atp.high_value
from all_tab_partitions atp
where atp.table_owner = 'TEST_USER'
and atp.table_name = 'PT_RANGE'
)
where high_value_str = '200'
/
This was so cute, and looked like a much nicer (i.e. “legal”) solution than my PL/SQL hacking, that I had to take a closer look at sys_dburigen() – first to understand what it was supposed to achieve (yes, I do RTFM) then to see how it actually worked.
Something I did first was simply to strip back the layers of the expression used to supply the high_value_str, which took me through the following four combinations (with and without ‘text()’, with and without getclob()). Each expression is followed by the result for the row selected above:
sys_dburigen (atp.table_owner, atp.table_name, atp.partition_name, atp.partition_position, atp.high_value, 'text()').getclob()
200
---
sys_dburigen (atp.table_owner, atp.table_name, atp.partition_name, atp.partition_position, atp.high_value).getclob()
<?xml version="1.0"?><HIGH_VALUE>200</HIGH_VALUE>
--
sys_dburigen (atp.table_owner, atp.table_name, atp.partition_name, atp.partition_position, atp.high_value, 'text()')
DBURITYPE('/PUBLIC/ALL_TAB_PARTITIONS/ROW[TABLE_OWNER=''TEST_USER'' and TABLE_NAME=''PT_RANGE'' and PARTITION_NAME=''P200'' and PARTITION_POSITION=''1'']/HIGH_VALUE/text()', NULL)
--
sys_dburigen (atp.table_owner, atp.table_name, atp.partition_name, atp.partition_position, atp.high_value)
DBURITYPE('/PUBLIC/ALL_TAB_PARTITIONS/ROW[TABLE_OWNER=''TEST_USER'' and TABLE_NAME=''PT_RANGE'' and PARTITION_NAME=''P200'' and PARTITION_POSITION=''1'']/HIGH_VALUE', NULL)
Working from the bottom pair up we see that we start by generating a dburitype which defines the type of thing we want to query and the restriction we want to use while querying. The ‘text()’ option simply adds an extra detail to the dburitype.
The top pair shows us that the get_clob() will then return the value we have requested, either as an XML value, or as the text value described by the XML value if we’ve supplied the ‘text()’ option.
Our call to sys_dburigen() has specified an object we want to access, and 4 columns in that object that will identify a unique row in that object, and a fifth column that we want returned either as an XML value or as a text value.
Tracing
I actually worked through the analysis in the opposite direction to the one I’ve been showing. When I saw the call to sys_dburigen() I suspected that it might be doing the same thing as my PL/SQL function call, so I ran the two queries with SQL tracing enabled to see what activity took place at the database.
Ignoring the driving query against all_tab_partitions, the content of the PL/SQL trace was basically 3 executions (I had 3 partitions in the pt_range table) of:
SELECT ATP.HIGH_VALUE
FROM
ALL_TAB_PARTITIONS ATP WHERE ATP.TABLE_OWNER = UPPER(:B4 ) AND
ATP.TABLE_NAME = UPPER(:B3 ) AND ATP.PARTITION_NAME = UPPER(:B2 ) AND
ATP.PARTITION_POSITION = UPPER(:B1 )
The content of the sys_dburigen() trace was 3 executions of a query like:
SELECT alias000$."HIGH_VALUE" AS HIGH_VALUE
FROM
"ALL_TAB_PARTITIONS" alias000$ WHERE 1 = 1 AND ((((alias000$."TABLE_OWNER"=
'TEST_USER') AND (alias000$."TABLE_NAME"='PT_RANGE')) AND
(alias000$."PARTITION_NAME"='P200')) AND (alias000$."PARTITION_POSITION"=
'1'))
Note particularly the literal values in the predicates (the table_owner, table_name, partition_name and partition_position). This version of the code has to generate and optimise (hard-parse) a new SQL statement for every partition in the table referenced in the driving query. For a table with a large number of partitions, and a system with a large number of partitioned tables, the disruption of the shared pool that this might cause could be severe if (as the user said at one point) “we will be frequently selecting from all_tab_partitions”. [Damage limitation: if the session sets cursor_sharing to FORCE temporarily then the generated SQL will be subject to bind variable substitution; but that’s not an ideal workaround.]
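The damage-limitation idea would look something like this – set the parameter just around the query rather than leaving it changed for the whole session:
alter session set cursor_sharing = force;

-- run the sys_dburigen() query here; the recursive statements generated
-- for each partition should now be subject to bind variable substitution

alter session set cursor_sharing = exact;       -- or whatever your normal setting is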
Summary
Using LONG columns in SQL predicates is not nice – and not likely to be efficient – but there are ways of working around the limitations of LONGs. It’s undesirable to use PL/SQL that calls SQL inside a SQL statement, but we can use a PL/SQL function to return a string from a LONG in the current row – and since that’s pretty much what Oracle seems to be doing with its call to sys_dburigen() it’s hard to insist that the PL/SQL strategy is inappropriate. (But maybe the call to sys_dburigen() in this context would be considered an abuse of a feature anyway – even though it seems much more elegant and flexible once you’ve learned a little about how it works.)
Footnote
As another detail on analysing the cost/benefit of different approaches – it would be possible to avoid creating the PL/SQL function by embedding it in the SQL as a “with function” clause:
with function get_high_value (
i_tab_owner varchar2,
i_tab_name varchar2,
i_part_name varchar2,
i_part_posn number
)
return varchar2
is
v1 varchar2(4000);
begin
select atp.high_value
into v1
from all_tab_partitions atp
where atp.table_owner = upper(i_tab_owner)
and atp.table_name = upper(i_tab_name)
and atp.partition_name = upper(i_part_name)
and atp.partition_position = upper(i_part_posn)
;
return v1;
end;
select
apt.table_owner, apt.table_name,
apt.tablespace_name,
apt.partition_name, apt.partition_position,
apt.high_value
from
all_tab_partitions apt
where
apt.table_owner = 'TEST_USER'
and apt.table_name = 'PT_RANGE'
and get_high_value(
apt.table_owner,
apt.table_name,
apt.partition_name,
apt.partition_position
) = '200'
/
I have asked the user why they want to query all_tab_partitions by high_value since it seems to be a slightly odd thing to do and there may be a better way of doing whatever it is that this query is supposed to support. They haven’t responded to the question, so I’ll take a guess that they want to rename (or move etc.) partitions that they don’t know the name for – perhaps because they are using interval partitioning or automatic list partitioning. If the guess is correct then the solutions offered are irrelevant – you don’t need to know the name of a partition to manipulate it, you need only know some value that is a legal member of the partition:
SQL> select partition_name from user_tab_partitions where table_name = 'PT_RANGE' order by partition_position;
PARTITION_NAME
----------------------
P200
P400
P600
3 rows selected.
SQL> alter table pt_range rename partition for (199) to pt_0200;
Table altered.
SQL> select partition_name from user_tab_partitions where table_name = 'PT_RANGE' order by partition_position;
PARTITION_NAME
----------------------
PT_0200
P400
P600
3 rows selected.