Skip navigation.

Yann Neuhaus

Syndicate content
All blog entries from
Updated: 16 hours 52 min ago

SQL Server: DBCC CHECKDB does not detect corruption

Mon, 2014-08-04 02:12

During my audits at customer places, it still happens very often to find SQL Server databases with page verification option configured to "none". I always alert my customers on this configuration point because it can have an impact on the overall integrity of their databases. One of my customer told me that the integrity task of its maintenance will detect the corruption anyway and alert him by email - but is it really the case?

Of course, my response to the customer is that it depends of the corruption type and that a dbcc checkdb is not an absolute guarantee in this case. I like to show this example to my customers to convince them to change the page verify option to another thing than "none".

Let me show this to you:


use [master]; go   -- create corrupt db if DB_ID('corrupt') is not null        drop database corrupt;        create database corrupt; go   -- force page verify option to none alter database corrupt set page_verify none; go   use corrupt; go   -- create two tables t1 and t2 create table dbo.t1( id int identity primary key, col1 varchar(50)); create table dbo.t2( id int identity primary key, col1 int, col2 int, col3 varchar(50));   -- insert some sample data declare @random int; declare @i int = 1;   while @i begin        set @random = 10000/(rand()* 1000);        if @random > 50 set @random = 50;          insert t1(col1) values (REPLICATE('t', 50));        insert t2(col1, col2, col3) values (@i, @i + 1, REPLICATE('t', @random));          set @i = @i + 1; end   -- add a foreign key on column id to table t2 that references the table t1 on column id alter table dbo.t2 add constraint FK_t2_id foreign key (id) references dbo.t1(id); go   -- create a nonclustered covered index on table t1 create nonclustered index idx_t2_col1 on dbo.t2( col1, col2) include ( col3 ); go


At this point we have two tables named t1 and t2. Table t2 has a foreign key constraint on the id column that references the table t1 on the column with the same name.

Now let’s corrupt a data page in the clustered index on the table t1. First, we will find the first data page in the clustered index of the table t1:


-- get the first data page on the t1 table clustered index dbcc ind('corrupt', 'dbo.t1', 1); go




Then we will find the first row. The first row is stored in slot 0 which is located at offset 0x060.


-- Display dump page id = 15 dbcc traceon(3604); go dbcc page ('corrupt', 1, 15, 3); go




Now it’s time to corrupt the id column (id = 1) located to the offset 0x4 in the row. That means we have to place to the offset 0x60 + 0x4 to corrupt this column.

We will use the DBCC WRITEPAGE undocumented command to corrupt our page (again, a big thanks to Paul Randal for showing us how to use this command for testing purposes).


-- corrupt the concerned page alter database corrupt set single_user; go   dbcc writepage('corrupt', 1, 15, 100, 1, 0x00, 1)   alter database corrupt set multi_user; go


Now if we take a look at the page id=15, we notice that the id column value is now changed from 1 to 0.




Ok, let’s run a DBCC CHECKDB command:


-- perform an integrity check with dbcc checkdb dbcc checkdb('corrupt') with no_infomsgs, all_errormsgs; go




As you can see, the dbcc checkdb command does not detect any corruption! Now, let’s run the following statements:


-- first query select t2.col2, t2.col3 from dbo.t2        join dbo.t1              on = where t2.col1 = 1




Do you notice that reading the corrupted page does not trigger an error in this case?


-- second query select t2.col2, t2.col3, t1.col1 from dbo.t2        join dbo.t1              on = where t2.col1 = 1




As you can notice, adding the t1.col1 column to the query will give it a different result between the both queries. Strange behavior isn’t it? In fact, the two queries above don’t use the same execution plan as the following below:

Query 1:




Query 2:




In the query 1, due to the foreign key constraint, the query execution engine doesn’t need to join t2 to t1 to retrieve data because we need only data already covered by the idx_t2_col1 index on table t2. However the story is not the same with the query 2. Indeed, we want to retrieve an additional value provided by the col1 column from the table t1. In this case SQL Server has to join t1 and t2 because the covered index idx_t2_col1 cannot provide all the data we need. But remember we had corrupt the id column of the primary key of the table t1 by changing the value from 1 to 0. This is why the query 2 doesn’t display any results.

The main question here is: why dbcc checkdb doesn’t detect the corruption? Well, in this case corruption has occurring directly on the data value and dbcc checkdb doesn’t have a verification mechanism to detect a corruption issue. Having a checksum value stored in the page would help dbcc checkdb operation in this case because it could compare a computed checksum while reading the page with the stored checksum stored on it.

Below the output provided by dbcc checkdb command if checksum page verify option was enabled for the database …




… or when we ran the query used earlier:




My conclusion:

Do not hesitate to change your page verify option value when it is configured to "none".

PostgreSQL for Oracle DBAs - an introduction

Sun, 2014-08-03 22:16

Having worked for several years as an Oracle DBA, I decided to have a look at the PostgreSQL database and see how it functions in comparison to the Oracle Database.

The "Enterprise DB" graphical installation of PostgreSQL 9.3 is quite easy and rather fast. Under Linux you run the graphical installer, dialog boxes lead you through the installation process. You enter the specific information of your system and at the end of the PostgreSQL installation, the Stack Builder package is invoked if you need to install applications, drivers, agents or utilities.

You can download the Enterprise DB utility using the following URL:

I have installed PostgreSQL 9.3 using Enterprise DB as described below:




Choose Next.




Specify the installation directory where PostgreSQL 9.3 will be installed.




Select the directory that will store the data.




Provide a password to the PostgreSQL database user.




Select a port number.




Choose the locale for the new database cluster.




PostgreSQL is now ready to be installed.




You can choose to launch or not the Stack Builder - if not, the installation process will begin.

If you encounter any problem during the installation phase, the log files are generated in /tmp.

Under Linux, a shell script named uninstall-postgresql is created in the PostgreSQL home directory to de-install the software.

The installation phase is very quick, your PostgreSQL cluster database is ready to use. Furthermore, the Enterprise DB installation creates the automatic startup file in /etc/init.d/postgresql-9.3 to start PostgreSQL in case of a server reboot.

Once the Enterprise DB installation is processed, a database storage area is initialized on disk (a database cluster). After the installation, this database cluster will contain a database named postgres and will be used by utilities or users:


postgres=# \list                                 List of databases   Name   | Owner   | Encoding | Collate   |   Ctype   |   Access privileges-----------+----------+----------+------------+------------+-------------postgres | postgres | UTF8     | en_US.utf8 | en_US.utf8 |template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres      +           |         |         |           |           | postgres=CTc/postgrestemplate1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | postgres=CTc/postgres+           |         |         |           |           | =c/postgres


By default, a new database is created by cloning the system standard base named template1. The template0 allows you to create a database containing only pre-defined standard objects.

The sqlplus oracle equivalent command in PostgreSQL is psql. As you will see in the document, the PostgreSQL commands begin with the \ sign. The “\?” command lists every possibility.

For example, the following commands connects to the psi database:


-bash-3.2$ psql -d psi

Password:psql.bin (9.3.4)

Type "help" for help.No entry for terminal type "xterm";

using dumb terminal settings.

psi=# \q

If you do not want the system to ask for a password, you simply have to create a .pgpass file in the postgres home directory with the 0600 rights and the following syntax:


-bash-3.2$ more .pgpass



-bash-3.2$ su - postgres


-bash-3.2$ psql -d psi

psql.bin (9.3.4)

Type "help" for help.

No entry for terminal type "xterm";

using dumb terminal settings.


psi-# \q


At first you probably need to create a database. As an Oracle DBA, I was wondering about some typical problems such as character set or default tablespace. With PostgreSQL, it is quite easy to create a database.

As the locale en_US.utf8 has been chosen during the installation phase to be used by the cluster database, every database you will create will use it.

When you create a database you can specify a default tablespace and an owner. At first we create a tablespace:


postgres=# create tablespace psi location '/u01/postgres/data/psi';



The tablespace data is located in /u01/postgres/data/psi:


-bash-3.2$ ls


-bash-3.2$ ls PG_9.3_201306121/


-bash-3.2$ ls PG_9.3_201306121/16526/

12547     12587_vm  12624     12663     12728     12773

12547_fsm 12589     12625     12664     12728_fsm 12774

12664_vm  12730   12774_vm     12627     12666     12731     12776


Then we create the database:


postgres=# create database psi owner postgres tablespace psi;



We can list all databases with the \list command:


postgres=# \list                                

                 List of databases

   Name   | Owner   | Encoding | Collate   |   Ctype   |   Access privileges


postgres | postgres | UTF8     | en_US.utf8 | en_US.utf8 |

psi       | postgres | UTF8     | en_US.utf8 | en_US.utf8 |

template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres        

|         |         |           |           | postgres=CTc/postgres

template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | postgres=CTc/postgres+           |         |         |           |           | =c/postgres


Now, we can connect to the psi database and create objects, the syntax is quite similar to Oracle:


postgres=# \c psi

You are now connected to database "psi" as user "postgres".

We create a table and an index:


psi=# create table employe (name varchar);


psi=# create index employe_ix on employe (name);



We insert values in it:

psi=# insert into employe values ('bill');



We reconnect to the psi database:

-bash-3.2$ psql -d psi


psql.bin (9.3.4)

Type "help" for help.

No entry for terminal type "xterm";

using dumb terminal settings.

The following command lists the tables:

psi=# \dt[+]                    

             List of relations

Schema | Name   | Type | Owner   | Size | Description


public | employe | table | postgres | 16 kB |

(1 row)

psi=# select * from employe;




(1 row)


The \d+ postgreSQL command is the equivalent of the Oracle desc command:

psi=# \d+ employe                            

                Table "public.employe"

Column |       Type       | Modifiers | Storage | Stats target | Description


name   | character varying |          | extended |             |


   "employe_ix" btree (name)

Has OIDs: no

Obviously we also have the possibility to create a schema and create objects in this schema.

Let's create a schema:

psi=# create schema psi;


Let's create a table, insert objects in it and create a view:

psi=# create table psi.salary (val integer);


psi=# insert into psi.salary values (10000);


psi=# select * from psi.salary;




psi=# create view psi.v_employe as select * from psi.salary;



If we list the tables we can only see the public objects:

psi=# \d        

        List of relations

Schema | Name   | Type | Owner  


public | employe | table | postgres

(1 row)

If we modify the search path, all schemas are visible:

psi=# set search_path to psi,public;


psi=# \d 

        List of relations

Schema | Name   | Type | Owner  


psi   | salary | table | postgres

public | employe | table | postgres

Oracle DBA’s are familiar with sql commands - e. g. to get the table list of a schema by typing select table_name, owner from user_tables, etc.

What is the equivalent query in postgreSQL?

PostgreSQL uses a schema named information_schema available in every database. The owner of this schema is the initial database user in the cluster. You can drop this schema, but the space saving is negligible.

You can easily query the tables of this schema to get precious informations about your database objects:

Here is a list of the schemas tables:

psi=# select table_name, table_schema from information_schema.tables where table_schema in ('public','psi');

table_name | table_schema


employe   | public

salary     | psi

We can display the database character set:

psi=# select character_set_name from information_schema.character_sets;





We can display schema views:

psi=# select table_name from information_schema.views where table_schema='psi';




Using the information_schema schema helps us to display information about a lot of different database objects (tables, constraints, sequences, triggers, table_privileges …)

Like in Oracle you can run a query from the SQL or the UNIX prompt. For example, if you want to know the index name of the table employe, you shoud use the index.sql script:


t.relname as table_name,

i.relname as index_name,

a.attname as column_name


pg_class t,pg_class i,

pg_index ix,pg_attribute a

wheret.oid = ix.indrelid

and i.oid = ix.indexrelid

and a.attrelid = t.oid

and a.attnum = ANY(ix.indkey)

and t.relkind = 'r'

and t.relname = 'employe'

order byt.relname,i.relname;

If you want to display the employee index from the SQL prompt, you run:

psi=# \i index.sql

table_name | index_name | column_name


employe   | employe_ix | name

If you want to run the same query from the UNIX prompt:

-bash-3.2$ psql -d psi -a -f index.sql


table_name | index_name | column_name


employe   | employe_ix | name

However, typing an SQL request might be interesting, but - as many Oracle DBA - I like using an administration console because I think it increases efficiency.

I have discovered pgAdmin, an administration tool designed for Unix or Windows systems. pgAdmin is easy to install on a PostgreSQL environment and enables many operations for the administration of a cluster database.

pgAdmin3 is installed in the home directory of the user postgre - in my case in /opt/postgres/9.3.

To successfully enable pgAdmin3, it is necessary to correctly initialize the LD_LIBRARY_PATH variable:


export LD_LIBRARY_PATH=/opt/PostgreSQL/9.3/lib:/opt/PostgreSQL/9.3/pgAdmin3/lib


The pgadmin3 console:




As you can see, you can administer every database object (tables, functions, sequences, triggers, views…).

You can visualize the table creation scripts:




You can edit / change / modify the privileges of an object:




You also have the possibility to create scripst for the database creation:




Or even to backup the database:




This tool seems to be very powerful, even if for the moment, I did not find any performance tool available like in Cloud Control 12c.



Discovering PostgreSQL as an Oracle DBA, I realized how close the two products are. The PostgreSQL database has a lot of advantages such as the easy installation, the general usage and the price (because it’s free!).

For the processing of huge amounts of data, Oracle certainly has advantages, nevertheless the choice of a RDBMS always depends on what your application business needs are.

SQL Saturday in Paris on 12 -13 September

Mon, 2014-07-28 22:35

As you certainly know SQL Saturday events are very popular in SQL Server world community. This is the second time the event takes place in Paris (France), but this time, we have a new format with pre-conferences on Wednesday and classic sessions on Saturday. During pre-conferences, we will talk about a particular subject for a whole day.

This time, I have the opportunity to participate twice by giving two sessions (in French) with the following program:

  • Friday: Inside the SQL Server storage and backups

If you are interested in how the SQL Server storage works and how to deal with corruption as well as backups, this session might be interesting for you.

Be careful: the pre-conferences on Wednesday are fee-paying sessions (but not that expensive). You can still register at this address.

  • Saturday: SQL Server AlwaysOn deep dive

SQL Server AlwaysOn is a new great high-availability and disaster recovery feature provided by Microsoft. You can come take a look at this session if you are concerned by questions like:

  • How to configure my Windows failover cluster and quorum in my situation?
  • What exactly is a read-only secondary replica?
  • What are the built-in tools provided by Microsoft to monitor and troubleshoot this infrastructure?

Good news: the sessions on Saturday are free!

Take a look at the agenda if you want to attend to other interesting sessions. I hope there will be many attendees! Smile

PDB media failure may cause the whole CDB to crash

Mon, 2014-07-28 07:44

Do you remember last year, when 12c arrived with multitenant, David Hueber warned us about the fact that a single PDB can, under certain conditions, generate a complete system downtime? We are beta testers and opened a SR for that. Now one year later the first patchset is out and obviously I checked if the issue was fixed. It's a patchset afterall, which is expected to fix issues before than bringing new features.

So the issue was that when the SYSTEM tablespace is lost in a PDB, then we cannot restore it without shutting down the whole CDB. This is because we cannot take the SYSTEM tablespace offline, and we cannot close the PDB as a checkpoint cannot be done. There is no SHUTDOWN ABORT for a PDB that can force to it. Conclusion: if you loose one SYSTEM tablespace, either you accept to wait for a maintenance window before bring it back online, or you have to stop the whole CDB with a shutdown abort.

When I receive a new release, I like to check new parameters, even the undocumented ones. And in there is a new underscore parameter _enable_pdb_close_abort which has the description 'Enable PDB shutdown abort (close abort)'. Great. It has a default value of false but maybe this is how the bug has been addressed.

Before trying that parameter, let's reproduce the case:

Here are my datafiles:

RMAN> report schema;

using target database control file instead of recovery catalog
Report of database schema for database with db_unique_name CDB_SITE1

List of Permanent Datafiles
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    790      SYSTEM               YES     /u01/app/oracle/oradata/CDB/system01.dbf
3    680      SYSAUX               NO      /u01/app/oracle/oradata/CDB/sysaux01.dbf
4    215      UNDOTBS1             YES     /u01/app/oracle/oradata/CDB/undotbs01.dbf
5    250      PDB$SEED:SYSTEM      NO      /u01/app/oracle/oradata/CDB/pdbseed/system01.dbf
6    5        USERS                NO      /u01/app/oracle/oradata/CDB/users01.dbf
7    540      PDB$SEED:SYSAUX      NO      /u01/app/oracle/oradata/CDB/pdbseed/sysaux01.dbf
8    250      PDB1:SYSTEM          NO      /u01/app/oracle/oradata/CDB/PDB1/system01.dbf
9    570      PDB1:SYSAUX          NO      /u01/app/oracle/oradata/CDB/PDB1/sysaux01.dbf
10   5        PDB1:USERS           NO      /u01/app/oracle/oradata/CDB/PDB1/PDB1_users01.dbf
11   250      PDB2:SYSTEM          NO      /u01/app/oracle/oradata/CDB/PDB2/system01.dbf
12   570      PDB2:SYSAUX          NO      /u01/app/oracle/oradata/CDB/PDB2/sysaux01.dbf
13   5        PDB2:USERS           NO      /u01/app/oracle/oradata/CDB/PDB2/PDB2_users01.dbf

List of Temporary Files
File Size(MB) Tablespace           Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1    197      TEMP                 32767       /u01/app/oracle/oradata/CDB/temp01.dbf
2    100      PDB$SEED:TEMP        32767       /u01/app/oracle/oradata/CDB/pdbseed/pdbseed_temp012014-06-15_09-46-11-PM.dbf
3    20       PDB1:TEMP            32767       /u01/app/oracle/oradata/CDB/PDB1/temp012014-06-15_09-46-11-PM.dbf
4    20       PDB2:TEMP            32767       /u01/app/oracle/oradata/CDB/PDB2/temp012014-06-15_09-46-11-PM.dbf

then I just remove the PDB2 SYSTEM datafile:

rm /u01/app/oracle/oradata/CDB/PDB2/system01.dbf 

And I go to sqlplus in order to check the state of my pdb. Remeber, I want to see if I can restore the datafile without doing a shutdown abort on my CDB instance.

SQL*Plus: Release Production on Mon Jul 27 20:31:45 2014

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> select name,open_mode from v$pdbs;
select name,open_mode from v$pdbs
ERROR at line 1:
ORA-01034: ORACLE not available
Process ID: 0
Session ID: 0 Serial number: 0

Oh... that's bad... Let's look at the alert.log:

Errors in file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_ckpt_21620.trc:
ORA-01243: system tablespace file suffered media failure
ORA-01116: error in opening database file 11
ORA-01110: data file 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
USER (ospid: 21620): terminating the instance due to error 1243
System state dump requested by (instance=1, osid=21620 (CKPT)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_diag_21608_20140727202844.trc
2014-07-27 20:28:49.596000 +02:00
Instance terminated by USER, pid = 21620

The CKPT process has terminated the instance. The whole CDB is down.

That's worse. In we had to bring down the instance, but at least we were able to choose the time and warn the users. Not here. In it crashes immediately when a checkpoint occurs.

I've opened a bug for that (Bug 19001390 - PDB SYSTEM TABLESPACE MEDIA FAILURE CAUSES THE WHOLE CDB TO CRASH) which is expected to be fixed for the next release (12.2).

Ok the good news is that once the CDB is down, recovery is straightforward:

Recovery Manager: Release - Production on Mon Jul 27 21:36:22 2014

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

connected to target database (not started)

RMAN> startup

Oracle instance started
database mounted
database opened

Total System Global Area     838860800 bytes

Fixed Size                     2929936 bytes
Variable Size                616565488 bytes
Database Buffers             213909504 bytes
Redo Buffers                   5455872 bytes

RMAN> list failure;

using target database control file instead of recovery catalog
Database Role: PRIMARY

List of Database Failures

Failure ID Priority Status    Time Detected Summary
---------- -------- --------- ------------- -------
3353       CRITICAL OPEN      27-JUL-14     System datafile 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf' is missing
245        HIGH     OPEN      27-JUL-14     One or more non-system datafiles need media recovery

RMAN> advise failure;

Database Role: PRIMARY

List of Database Failures

Failure ID Priority Status    Time Detected Summary
---------- -------- --------- ------------- -------
3353       CRITICAL OPEN      27-JUL-14     System datafile 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf' is missing
245        HIGH     OPEN      27-JUL-14     One or more non-system datafiles need media recovery

analyzing automatic repair options; this may take some time
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=132 device type=DISK
analyzing automatic repair options complete

Mandatory Manual Actions
no manual actions available

Optional Manual Actions
1. If file /u01/app/oracle/oradata/CDB/PDB2/system01.dbf was unintentionally renamed or moved, restore it
2. Automatic repairs may be available if you shutdown the database and restart it in mount mode
3. If you restored the wrong version of data file /u01/app/oracle/oradata/CDB/PDB2/sysaux01.dbf, then replace it with the correct one
4. If you restored the wrong version of data file /u01/app/oracle/oradata/CDB/PDB2/PDB2_users01.dbf, then replace it with the correct one

Automated Repair Options
Option Repair Description
------ ------------------
1      Restore and recover datafile 11; Recover datafile 12; Recover datafile 13
  Strategy: The repair includes complete media recovery with no data loss
  Repair script: /u01/app/oracle/diag/rdbms/cdb/CDB/hm/

RMAN> repair failure;

Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/cdb/CDB/hm/

contents of repair script:
   # restore and recover datafile
   restore ( datafile 11 );
   recover datafile 11;
   # recover datafile
   recover datafile 12, 13;

Do you really want to execute the above repair (enter YES or NO)? YES
executing repair script

Starting restore at 27-JUL-14
using channel ORA_DISK_1

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00011 to /u01/app/oracle/oradata/CDB/PDB2/system01.dbf
channel ORA_DISK_1: reading from backup piece /u01/app/oracle/fast_recovery_area/CDB/FECFFDC5F6D31F5FE043D74EA8C0715F/backupset/2014_07_28/o1_mf_nnndf_TAG20140728T150921_9xdlw21n_.bkp
channel ORA_DISK_1: piece handle=/u01/app/oracle/fast_recovery_area/CDB/FECFFDC5F6D31F5FE043D74EA8C0715F/backupset/2014_07_28/o1_mf_nnndf_TAG20140728T150921_9xdlw21n_.bkp tag=TAG20140728T150921
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
Finished restore at 27-JUL-14

Starting recover at 27-JUL-14
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:01

Finished recover at 27-JUL-14

Starting recover at 27-JUL-14
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 27-JUL-14
repair failure complete

RMAN> alter pluggable database PDB2 open;

Statement processed

I whish that one day the PDB will have true isolation so that I can give DBA rights to the application owner on his PDB. But that means that, at least:

  • A PDB failure cannot crash the CDB instance.
  • A PDB admin cannot create datafiles anywhere on my server.
  • A PDB admin cannot run anything as the instance owner user (usually oracle)

SQL monitoring shows adaptive plans

Fri, 2014-07-25 14:23

In a previous post, I have described Adaptive Plans. Even if I prefer to show plans with the SQL Monitor active html format, I had to stick with the dbms_xplan for that because SQL Monitoring did not show all information about adaptive plans.

This has been fixed in the Patchset 1 and I have run the same query to show the new feature.

First, an adaptive plan can be in two states: 'resolving' where all alternatives are possible and 'resolved' then the final plan has been choosen. It is resolved once the first execution statistics collector has made the decision about the inflection point. We can see the state in the SQL Monitor header:




Here my plan is resolved because the first execution is finished.

The plan with rowsource statistics show only the current plan, but the 'Plan Note' shows that it is an adaptive plan:




Now we have to go to the 'Plan' tab which show the equivalent of dbms_xplan.display_cursor:




Here the format is equivalent to format=>'adaptive'. It's the 'Full' plan where all branches are shown but inactive part is grayed. We have here the Statistics Collector after reading DEPARTMENTS, and we have the inactive full table scan hash join of EMPLOYEES.

Just choose the 'Final' Plan (or 'Current' if it is not yet resolved) to get only the active part:




I often prefer the tabular format to the graphical one:




We have all information: the 7 rows from DEPARTMENTS have gone through STATISTICS COLLECTOR and NESTED LOOP with index access has been choosen. Note that it is different from the previous post where HASH JOIN with full table scan was choosen because the 7 rows were higher than the inflection point.

In my current example, because I have system statistics that costs full table scan higher:


DP: Found point of inflection for NLJ vs. HJ: card = 8.35


This is higher than ny 7 rows from DEPARTMENTS.

Here is the whole sqlmon report: and how I got it:


alter session set current_schema=HR;
select /*+ monitor */ distinct DEPARTMENT_NAME from DEPARTMENTS
 where DEPARTMENT_NAME like '%ing' and SALARY>20000;

alter session set events='emx_control compress_xml=none';set pagesize 0 linesize 10000 trimspool on serveroutput off long 100000000 longc 100000000 echo off feedback off
spool sqlmon.htm
select dbms_sqltune.report_sql_monitor(report_level=>'all',type=>'html') from dual;
spool off

Note that I used the script exposed here and I used the emx_event to get the uncompressed xml, which I got from Tyler Muth:

@FranckPachot well, I asked @DBAKevlar, she asked the developer that owns it ;)

— tmuth (@tmuth) July 25, 2014

Beyond In-Memory, what's new in ?

Tue, 2014-07-22 09:53

It's just a patchset. The delivery that is there to stabilize a release with all the bug fixes. But it comes with a lot of new features as well. And not only the one that has been advertised as the future of the database. It's a huge release.

Let's have a look at what's new.

First, it seems that it will be the only patchest for 12.1

Then, there is that In-Memory option awaited for a while. There has been some demo done by Larry Ellison on Exadata or even on the Oracle SPARC M6. Of course, if you have 32 TB of memory, we can understand the need for an In-Memory optimized storage. For a more real-life usage of that option, stay tune on our blog. We investigate the features in the context of our customer concerns, to fit their needs. For example, In-Memory addresses cases where some customers use Active Data Guard to offload reporting/real-time analytics to another server. But unfortunately In-Memory is not populated on a physical standby. We probably have to wait 12.2 for that.

In-Memory is an option, so available only in Enterprise Edition.

There are other new features related with large memory. There is a part of buffer cache dedicated to big tables (you just set the percentage) to be cached for In-Memory Parallel Query. And there is also a mode where all the database is in buffer cache. About performance and Parallel Query, a new transformation has been introduced to optimize the group by operation when joining a fact table to dimensions.

Second new feature is the range-partitioned hash cluster. Oracle CLUSTER segments is a very old feature but not widely used. Hash cluster is the fastest way to access to a row because the key can be directly transformed to a rowid. Unfortunately maintenance is not easy, especially when the volume increases. And we have partitioning which is the way to ease maintenance with growing tables but, until today, we can't partition a hash cluster. I mean, not in a supported way because Oracle uses it on SPARC for the TPC benchmarks - applying a specific patch (10374168) for it.

Well, the good news is that we can finally partition hash clusters with the simple syntax:

create cluster democ1 (sample_time timestamp,sample_id number)
hashkeys 3600 hash is sample_id size 8192
partition by range (sample_time) (
partition P12 values less than( timestamp'2014-04-26 12:00:00' )

Another nice feature is Attribute Clustering. Lot of other RDBMS has the ability to arrange rows but Oracle puts any insert anywhere in a heap table, depending only on where some free space is left. The alternative is IOT of course. But it can be good to try to cluster rows on one or several columns. It's better for index access, it's better for cache efficiency, it's better for storage indexes (or in-memory min/max), for ILM compression, etc. We can finally do it and I'll blog soon about that. 

Attribute Clustering is not an option, but available only in Enterprise Edition.


I think those two features are my favorite ones. Because the best optimization we can do, without refactoring the application design, is to place data in the way it will be retreived.


The trend today is to store unstructured data as JSON. XML was nice, but it's verbose. JSON is easier to read and even PostgreSQL can store JSON in its latest version. So Oracle has it in you can store and index it. Once again stay tuned on this blog to see how it works.

Something important was missing in Oracle SQL. How do you grant a read only user? You grant only select privilege? But that's too much because with a select privilege we can lock a table (with LOCK or SELECT FOR UPDATE). So we have now a READ privilege to prevent that. That's my favorite new feature for developers.

Then there are a few improvements on multitenant, such as the possibility to save the state of a pluggable database so that it can be automatically opened when the CDB startup. We already addressed that in in our Database Management Kit. An undocumented parameter, _multiple_char_set_cdb, let us imagine that we will be able to have different characterset for the PDB - probably in the future. Currently it's set to false.

And once again as beta testing partners we have put the pressure to have a fix for what we consider as serious availability bug. The behaviour in beta was even worse about CDB availability and I finally had a bug opened (Bug 19001390 - PDB SYSTEM TABLESPACE MEDIA FAILURE CAUSES THE WHOLE CDB TO CRASH) that should be fixed in 12.1

About fixes, some restrictions are now gone: we can finally use ILM with multitenant and we can have supplemental logging while using a move partition online. And you can have Flashback Data Archive in multitenant as well.

All that is good news, but remember, even if it's only the 4th digit that is increased in the version number, it's a brand new version with lot of new features. So, when do you plan to upgrade ? 11g is supported until January 2015. Extended support is free until January 2016 given that you are in the terminal patchset ( So either you don't want to be in the latestet release and you will have to upgrade to before the end of the year, waiting for 12.2 maybe in 2016. Or you want those new features and will probably go to for 2015.

Talking about upgrade, there's a bad news. We thought that multitenancy can accelarate upgrade time. Because the data dictionary is shared, you just have to plug a PDB into a newer version CDB and it's upgraded. And we show that in our 12c new features workshop by applying a PSU. But we have tested the upgrade to in the same way, and it's not that simple. Plugging is quick when you have only new patches that did not change the dictionary. It's still true for PSU when the dictionary changes are limited to the root container. But when you upgrade to you have to synchronize all the PDB dictionaries (all that magic behind object links and metadata links) and that takes time. It takes the same time as upgrading a non-CDB. Conclusion: you don't save time when you do it by plug/unplug.

But I have good news as well for that because I've tested a 1 minute downtime migration from to Dbvisit replicate, the affordable replication solution, supports multitenant in it's latest version, both as source and target. If your application is compatible (which is easy to check with the 30 days trial) then it's a good way to migrate without stress and with minimal downtime. It's available for Standard Edition as well, but currently the download can install only an Enterprise Edition.

Backup an SQL Server database from On-Premise to Azure

Sun, 2014-07-20 23:46

SQL Server database backup & restore from On-Premise to Azure is a feature introduced with SQL Server 2012 SP1 CU2. In the past, it could be used with these three tools:

  • Transact-SQL (T-SQL)
  • PowerShell
  • SQL Server Management Objects (SMO)

With SQL Server 2014, backup & restore can also be enabled via SQL Server Management Studio (SSMS).

Oracle EM agent 12c thread leak on RAC

Thu, 2014-07-17 22:24

In a previous post about nproc limit, I wrote that I had to investigate the nproc limit with the number of threads because my Oracle 12c EM agent was having thousands of threads. This post is a short feedback about this issue and the way I have found the root cause. It concerns the enterprise manager agent 12c on Grid Infrasctructure >=



The issue was:


ps -o nlwp,pid,lwp,args -u oracle | sort -n
   1  8444  8444 oracleOPRODP3 (LOCAL=NO)
   1  9397  9397 oracleOPRODP3 (LOCAL=NO)
   1  9542  9542 oracleOPRODP3 (LOCAL=NO)
   1  9803  9803 /u00/app/oracle/product/agent12c/core/ /u00/app/oracle/product/agent12c/core/ agent /u00/app/oracle/product/agent12c/agent_inst/sysman/log/emagent.nohup
  19 11966 11966 /u00/app/11.2.0/grid/bin/oraagent.bin
1114  9963  9963 /u00/app/oracle/product/agent12c/core/ ... emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain


By default ps has only one entry per process, but each processes can have several threads - implemented on linux as light-weight process (LWP). Here, the NLWP column shows that I have 1114 threads for my EM 12c agent - and it was increasing every day until it reached the limit and the node failed ('Resource temporarily unavailable').

The first thing to do is to know what those threads are. The ps entries do not have a lot of information, but I discovered jstack which every java developer should know, I presume. You probably know that java has very verbose (lengthy) stack traces. Jstack was able to show me thousands of them in only one command:


$ jstack 9963
2014-06-03 13:29:04
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.14-b01 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007f3368002000 nid=0x4c9b waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CRSeOns" prio=10 tid=0x00007f32c80b6800 nid=0x3863 in Object.wait() [0x00007f31fe11f000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at oracle.eons.impl.NotificationQueue.internalDequeue(
	- locked  (a java.lang.Object)
	at oracle.eons.impl.NotificationQueue.dequeue(
	at oracle.eons.proxy.impl.client.base.SubscriberImpl.receive(
	at oracle.eons.proxy.impl.client.base.SubscriberImpl.receive(
	at oracle.eons.proxy.impl.client.ProxySubscriber.receive(
	at oracle.sysman.db.receivelet.eons.EonsMetric.beginSubscription(
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
	at java.util.concurrent.ThreadPoolExecutor$
	at oracle.sysman.gcagent.util.system.GCAThread$

I don't paste all of them here. We have the 'main', we have a few GCs and 'Gang workers' which are present in all JVMs and we have a few enterprise manager threads. And what was interesting was that I had thousands of "CRSeOns" that seemed to be increasing.

Some guesses: I'm on RAC, and I have a 'ons' resource and the EM agent tries to subscribe to it. Goggle search returned nothing, and that's the reason I put that in a blog post now. Then I searched MOS, and bingo, there is a note: Doc ID 1486626.1. It has nothing to do with my issue, but has an interesting comment in it:

In cluster version and higher, the ora.eons resource functionality has been moved to EVM. Because of this the ora.eons resource no longer exists or is controlled by crsctl.

It also explains how to disable EM agent subscription:

emctl setproperty agent -name disableEonsRcvlet -value true

I'm in and I have thousands of threads related to a functionality that doesn't exist anymore. And that leads to some failures in my 4 nodes cluster.

The solution was simple: disable it.

For a long time I have seen a lot of memory leaks or CPU usage leaks related to the enterprise manager agent. With this new issue, I discovered a thread leak and I also faced a SR leak when trying to get support for the 'Resource temporarily unavailable' error, going back and forth between OS, Database, Cluster and EM support teams...

SQL Server Perfmon does not start automatically

Mon, 2014-07-14 19:28

I have recently used perfmon (performance monitor) at a customer site. I created a Data Collector Set to monitor CPU, Memory, Disk, and Network during one day. Then, I ran the monitor and I received a "beautiful" error message…

Partial Join Evaluation in Oracle 12c

Sun, 2014-07-13 21:21

Do you think that it's better to write semi-join SQL statements with IN(), EXISTS(), or to do a JOIN? Usually, the optimizer will evaluate the cost and do the transformation for you. And in this area, one more transformation has been introduced in 12c which is the Partial Join Evaluation (PJE).

First, let's have a look at the 11g behaviour. For that example, I use the SCOTT schema, but I hire a lot more employees in departement 40:


SQL> alter table EMP modify empno number(10);
Table altered.
SQL> insert into EMP(empno,deptno) select rownum+10000,40 from EMP,(select * from dual connect by level


Why department 40? I'll explain it below, but I let you think about it before. In the default SCOTT schema, there is a department 40 in DEPT table, but which has no employees in EMP. And the new transformation is not useful in that case.


11g behaviour

Now, I'm running the following query to check all the departments that have at least one employee:

I can write it with IN:


SQL_ID  6y71msam9w32r, child number 0
select distinct deptno,dname from dept 
 where deptno in ( select deptno from emp)

Plan hash value: 1754319153

| Id  | Operation          | Name | Starts | E-Rows | A-Rows | Buffers |
|   0 | SELECT STATEMENT   |      |      1 |        |      4 |      15 |
|*  1 |  HASH JOIN SEMI    |      |      1 |      4 |      4 |      15 |
|   2 |   TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   3 |   TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       8 |

Predicate Information (identified by operation id):

   1 - access("DEPTNO"="DEPTNO")


or with EXISTS:


SQL_ID  cbpa3zjtzfzrn, child number 0
select distinct deptno,dname from dept 
 where exists ( select 1 from emp where emp.deptno=dept.deptno)

Plan hash value: 1754319153

| Id  | Operation          | Name | Starts | E-Rows | A-Rows | Buffers |
|   0 | SELECT STATEMENT   |      |      1 |        |      4 |      15 |
|*  1 |  HASH JOIN SEMI    |      |      1 |      4 |      4 |      15 |
|   2 |   TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   3 |   TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       8 |
Predicate Information (identified by operation id):

   1 - access("DEPTNO"="DEPTNO")


Both are good. We didn't have to read the whole EMP table. I have 15000 rows in my table, I do a full scan on it, but look at the A-Rows: only 388 rows were actually read.

The HASH JOIN first read the DEPT table in order to build the hash table. So it already knows that we cannot have more than 4 distinct departments.

Then we do the join to EMP just to check which of those departments have an employee. But we can stop as soon as we find the 4 departments. This is the reason why we have read only 388 rows here. And this is exactly what a Semi Join is: we don't need all the matching rows, we return at most one row per matching pair.

Ok. What if we write the join ourselves?


SQL_ID  2xjj9jybqja87, child number 1
select distinct deptno,dname from dept join emp using(deptno)

Plan hash value: 2962452962

| Id  | Operation           | Name | Starts | E-Rows | A-Rows | Buffers |
|   0 | SELECT STATEMENT    |      |      1 |        |      4 |     129 |
|   1 |  HASH UNIQUE        |      |      1 |  15068 |      4 |     129 |
|*  2 |   HASH JOIN         |      |      1 |  15068 |  14014 |     129 |
|   3 |    TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   4 |    TABLE ACCESS FULL| EMP  |      1 |  15068 |  14014 |     122 |

Predicate Information (identified by operation id):

   2 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")


Bad luck. We have to read all the rows. More rows and more buffers.


12c behaviour

Let's do the same in


SQL_ID  2xjj9jybqja87, child number 0
select distinct deptno,dname from dept join emp using(deptno)

Plan hash value: 1629510749

| Id  | Operation           | Name | Starts | E-Rows | A-Rows | Buffers |
|   0 | SELECT STATEMENT    |      |      1 |        |      4 |      14 |
|   1 |  HASH UNIQUE        |      |      1 |      4 |      4 |      14 |
|*  2 |   HASH JOIN SEMI    |      |      1 |      4 |      4 |      14 |
|   3 |    TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   4 |    TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       7 |

Predicate Information (identified by operation id):

   2 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")


Same plan but less rows have been read. If we check the outlines, we see the new feature:


      PARTIAL_JOIN(@"SEL$58A6D7F6" "EMP"@"SEL$1")


And here is what we see in the optimizer trace:


PJE: Checking validity of partial join eval on query block SEL$58A6D7F6 (#1)
PJE: Passed validity of partial join eval by query block SEL$58A6D7F6 (#1)
PJE: Partial join eval conversion for query block SEL$58A6D7F6 (#1).
PJE: Table marked for partial join eval: EMP[EMP]#1


The hints that controls the feature are PARTIAL_JOIN and NO_PARTIAL_JOIN and there are enabled by _optimizer_partial_join_eval which appeared in 12c.

But of course, the optimization is useful only when we have all the values at the beginning of the table. This is why I added at least one employee in department 40. If there are some rows in DEPT that have no matching row in EMP, then Oracle cannot know the result before reaching the end of the table.

Master Data Services installation for SQL Server 2012

Sun, 2014-07-13 20:32

This posting is a tutorial for installing Master Data Services on your Windows Server 2012. Microsoft SQL Server Master Data Services (MDS) is a Master Management product from Microsoft, code-named Bulldog. It is the rebranding of the Stratature MDM product, titled +EDM and acquired in June 2007 by Microsoft. Initially, it was integrated for the first time in Microsoft SQL Server 2008 as an additional installer. But since SQL Server 2012, Master Data Services is integrated as a feature within the SQL Server installer.



Master Data Services is part of the Enterprise Information Management (EMI) technologies, provided by Microsoft, for managing information in an enterprise.

EMI technologies include:

  • Integration Services
  • Master Data Services
  • Data Quality Services



Master Data Services covers five main components:

  • MDS Configuration Manager tool: used to configure Master Data Services
  • MDS Data Manager Web Application: used essentially to perform administrative tasks
  • MDS Web Service: used to extend or develop custom solutions
  • MDS Add-in for Excel: used to manage data, create new entities or attributes …


SQL Server Editions & Versions

Master Data Services can be installed only with the following SQL Server Editions & Versions:

  • SQL Server 2008 R2 edition: Datacenter or Enterprise versions
  • SQL Server 2012 or SQL Server 2014 editions: Enterprise or BI versions


Master Data Services prerequisites in SQL Server 2012

First, Master Data Services is based on an application web named Master Data Manager Web Application, in order to perform administrative task, for example. This web application is hosted by Internet Information Services (IIS), so it is a necessary prerequisite.

Furthermore, to be able to display the content from the web application, you need Internet Explorer 7 or later (Internet Explorer 6 is not supported) with Silverlight 5.

Moreover, if you planned to use Excel with Master Data Services, you also need to install Visual Studio 2010 Tools for Office Runtime, plus the Master Data Services Add-in for Microsoft Excel.

Finally, often forgotten, but PowerShell 2.0 is required for Master Data Services.

 Let’s resume the requirements for Master Data Services:

  • Internet Information Services (IIS)
  • Internet Explorer 7 or later
  • Silverlight 5
  •  PowerShell 2.0
  • Visual Studio 2010 Tools for Office Runtime and Excel Add-in for Microsoft Excel (only if you plan to use Excel with Master Data Services).


Configuration at the Windows Server level

In the Server Manager, you have to activate the Web Server (IIS) Server Roles to be able to host the Master Data Web Application, as well as the .Net 3.5 feature.

For the Server Roles, you have to select:

  • Web Server (IIS)

For the Server Features, you have to select:

- .NET Framework 3.5 Features
  - .NET Framework 3.5
  - HTTP Activation
- .NET Framework 4.5 features
  - .NET Framework 4.5
  - ASP.NET 4.5
  - WCF Services
    - HTTP Activation
    - TCP Port Sharing




For the IIS features selection, you have to select:

- Web Server
  - Common HTTP Features
    - Default Document
    - Directory Browsing
    - HTTP Errors
    - Static Content
  - Health and Diagnostics
    - HTTP Logging
    - Request Monitor
  - Performance
    - Static Content Compression
  - Security
    - Request Filtering
    - Windows Authentication
  - Application Development
    - .NET Extensibility
    - .NET Extensibility 4.5
    - ASP.NET 3.5
    - ASP.NET 4.5
    - ISAPI Extensions
    - ISAPI Filters
  - Management Tools
    - IIS Management Console





Installation of SQL Server 2012

Master Data Services stores its data on a SQL Server database, so you need a SQL Server Engine installed.

Of course, SQL Server Engine can be installed on a different Windows Server. So the Windows Server with Master Data Services installed is used as a Front Server.

Then, in order to personalize the roles of your Master Data Services, you also need to install Management Tools.

At the features installation step, you have to select:

  • Database Engine Services
  • Management Tools
  • Master Data Services



At this point, Master Data Services should be installed with all the needed prerequisites.



However, Master Data Services cannot be used without configuring it. Three main steps need to be performed through the MDS Configuration Manager:

  • First, you have to create a MDS database
  • Then, you have to create a MDS web application hosted in IIS
  • Finally, you have to link the MDS database with the MDS web application

SQL Server 2014: Are DENY 'SELECT ALL USERS SECURABLES' permissions sufficient for DBAs?

Wed, 2014-07-02 20:09

SQL Server 2014 improves the segregation of duties by implementing new server permissions. The most important is the SELECT ALL USERS SECURABLES permission that will help to restrict database administrators from viewing data in all databases.

My article is a complement to David Barbarin's article 'SQL Server 2014: SELECT ALL USERS SECURABLES & DB admins'.

Java Mission Control 5.2 (7u40) deserves your attention

Mon, 2014-06-30 20:38

Recently, some new versions of java were made available. Most people think Java updates are boring and only security-oriented. But one of the last updates (7u40) includes a feature which deserves attention. I mean Java Mission Control 5.2.


Hotspot incoming

If you know the Oracle JRockit JVM a little bit, you might have heard about JMC, which was called JRockit Mission Control in the past. In fact, it’s a tool suite embedded with the Hotspot JDK since this so called 7u40 release which allows to monitor and profile your JVM.

Previously, it was only available for JRockit JVMs and it has now been ported to the Hotspot. JMC is a way more accurate and complete than JConsole or other embedded tools. It does not affect JVM performances more than 1%.




Mission Completed

JMC gathers low level information thanks to its Flight Recorder tool which listens and waits for internal events. It can then monitor, manage, profile, and eliminate memory leaks from the JVM.

The new version of JMC now has a new browser with subnodes for available server side services with their states. It is supported by Eclipse 3.8.2/4.2.2 and later, it allows deeper management of MBeans, especially setting values directly in the attribute tree. It converges with JRockit event management: All information provided by JRockit is now available in the Hotspot as well.




To enable JMC, you will have to add the following arguments to the JVM:





Understanding JVM Java memory leaks

Thu, 2014-06-26 21:38

Lots of people think that Java is free of memory management as the JVM uses a Garbage Collector. The goal of this collector is to free objects that are no longer used in the program without the developer being forced to declare that the object can be collected. Everything is automatic.

It’s really helpful and it avoids wasting time in managing memory. But, as it is an automatic process, it can produce some issues. They are better known as memory leaks.


What is a memory leak?

A leak appears when an object is no longer used in the program but is still referenced somewhere at a location that is not reachable. Thus, the garbage collector cannot delete it. The memory space used for this object will not be released and the total memory used for the program will grow. This will degrade performances over time and the JVM may run out of memory.


Identifying memory leaks

Memory leaks are difficult to identify, they require a good knowledge of the application. Usually, they are related to an Out Of Memory Error exception (also called OOM). But note that not all Out Of Memory Errors imply memory leaks, and not all memory leaks are due to Out Of Memory Errors.

Having an Out Of Memory Error is a first sign but you must make a difference between a “normal” OOM and a memory leak.

For example, if the program loads data from external files and one time a file is bigger than it has been expected it could result in an OOM. But this one is “normal” as the design of the application was not able to handle such a big file.

In the other hand, if the program is used to process data with the same size or which are similar to each other and you get an OOM it may be a memory leak. In addition a memory leak generally eats free space gradually, if your memory has been fulfilled suddenly it might be a “normal” OOM and you should look at the stack trace to find out the origin.

Another symptom of a memory leak would be an allocation issue. In fact when too much space is taken in the heap and not freed allocations of new objects may be complicated. Mainly if objects are huge. When an object is first created in memory it goes to a part of the heap called Eden, then if the object survives several garbage collections it goes to the tenured space. Usually there are only old objects in the tenured space; objects which are long life based. But when the Eden is fulfilled by leaks the objects directly go to the tenured space and you can see an abnormal stack of objects in the tenured space as it is generally not fulfilled. So if you have an OOM of a tenured space it might be an allocation issue (or simply your heap configuration is too low).

The heap usage can be checked with tools such as VisualVM, jconsole, Java Mission Control (only for 7u40+ JRE) and so on. I will describe the use of this kind of tools in future blogs.


How to prevent memory leaks

As said before you cannot really prevent memory leaks as this is related to the design of the application. If you are not aware of the type of application you are running or the internal design it uses, you cannot prevent leaks.

Many applications, either desktop built or web-based, will use many threads to run. Some threads such as ThreadLocal can store references to objects which reference their classes which reference their class loader. These threads can keep objects references in order to use objects later in the thread as in methods without passing it as argument. This way, when a web app is redeployed (e.g. in tomcat) a new class loader is created to load the new application but LocalThreads might not be dealocated as they are loaded or they use things from the permgen. As you may know the permgen is a part of the JVM heap which is usually not collected by the GC. And this produces memory leaks.

What you have to remember is to reduce the use of such a threads and be careful of the implementation and design of the application you are using. Of course, memory leaks can result from other reasons but the principle is the same: Objects allocated are no longer reachable and they are not released by the garbage collector.

Oracle Parallel Query: Did you use MapReduce for years without knowing it?

Thu, 2014-06-26 06:42

I've read this morning that MapReduce is dead. The first time I heard about MapReduce was when a software architect proposed to stop writing SQL on Oracle Database and replace it with MapReduce processing. Because the project had to deal with a huge amount of data in a small time and they had enough budget to buy as many cores as they need, they wanted the scalability of parallel distributed processing.

The architect explained how you can code filters and aggregations in Map & Reduce functions and then distribute the work over hundreds of CPU cores. Of course, it's very interesting, but it was not actually new. I was doing this for years on Oracle with Parallel Query. And not only filters and aggregations, but joins as well - and without having to rewrite the SQL statements.

I don't know if MapReduce is dead, but for 20 years we are able to just flip a switch (ALTER TABLE ... PARALLEL ...) and bring scalability with parallel processing. Given that we understand how it works.

Reading a parallel query execution plan is not easy. In this post, I'll just show the basics. If you need to go further, you should have a look at some Randolf Geist presentations and read his Understanding Parallel Execution article. My goal is not to go very deep, but only to show that it is not that complex.

I'll explain how Parallel query works by showing an execution plan for a simple join between DEPT and EMP tables where I want to read EMP in parallel - and distribute the join operation as well.

For the fun of it, and maybe because it's easier to read at the first time, I've done the execution plan on an Oracle 7.3.3 database (1997):




Let's start by the end. I want to read the EMP table by several processes (4 processes because I've set the parallel degree to 4 on table EMP). The table is not partitioned. It is a heap table where rows are scattered into the segment without any specific clustering. So each process will process an arbitrary range of blocks and this is why you see an internal query filtering on ROWID between :1 and :2. My session process, which is known as the 'coordinator', and which will be represented in green below, has divided the range of rowid (it's a full table scan, that reads all blocks from start to high water mark) and has mandated 4 'producer' processes to do the full scan on their part. Those producers are represented in dark blue below.

But then there is a join to do. The coordinator could collect all the rows from the 'producer' processes and do the join, but that is expensive and not scalable. We want the join to be distributed as well. Each producer process can read the DEPT table and do the join, which is fine if it is a small table only. But anyway, we don't want the DEPT table to be read in parallel because we have not set a parallel degree on it. So the EMP table will be read by only one process: my session process, which does all the no-parallel (aka the serial) things in addition to its 'coordinator' role.

Then we have a new set of 4 processes that will do the Hash Join. They need some rows from DEPT and they need some rows from EMP. They are the 'consumer' processes that will consume rows from 'producers', and are represented in pink below. And they don't need them randomly. Because it is a join, each 'consumer' process must have the pairs of rows that match the join columns. In the plan above, you see an internal query on internal 'table queue' names. The parallel full scan on EMP distributes its rows: it's a PARALLEL_TO_PARALLEL distribution, the parallel producers sending their rows to parallel consumers. The serial full scan on DEPT distributes its rows as well: it's a PARALLEL_FROM_SERIAL distribution, the parallel consumers receiving their rows from the serial coordinator process. The key for both distributions are given by a hash function on the join column DEPTNO, so that rows are distributed to the 4 consumer processes, but keeping same DEPTNO into the same process.

We have a group by operation that will be done in parallel as well. But the processes that do the join on DEPTNO cannot do the group by which is on others columns (DNAME,JOB). So we have to distribute the rows again, but this time the distribution key is on DNAME and JOB columns. So the join consumer processes are also producers for the group by operation. And we will have a new set of consumer processes that will do the join, in light blue below. That distribution is a PARALLEL_TO_PARALLEL as it distributes from 4 producers arranged by (DEPTNO) to 4 consumers arranged by (DNAME,JOB).

At the end only one process receives the result and sends it to the client. It's the coordinator which is 'serial'. So it's a PARALLEL_TO_SERIAL distribution.

Now let's finish with my Oracle 7.3.3 PLAN_TABLE and upgrade to 12c which can show more detailed and more colorful execution plans. See here on how to get it.

I've added some color boxes to show the four parallel distributions that I've detailed above:

  • :TQ10001 Parallel full scan of EMP distributing its rows to the consumer processes doing the join.
  • :TQ10000 Serial full scan of DEPT distributing its rows to the same processes, with the same hash function on the join column.
  • :TQ10002 The join consumer receiving both, and then becoming the producer to send rows to the consumer processes doing the group by
  • :TQ10003 Those consumer processes doing the group by and sending the rows to the coordinator for the final result.



So what is different here?

First we are in 12c and the optimizer may choose to broadcast all the rows from DEPT instead of the hash distribution. It's the new HYBRID HASH distribution. That decision is done when there are very few rows and this is why they are counted by the STATISTICS COLLECTOR.

We don't see the predicate on rowid ranges, but the BLOCK ITERATOR is there to show that each process reads its range of blocks.

And an important point is illustrated here.

Intra-operation parallelism can have a high degree (here I've set it to 4 meaning that each parallel operation can be distributed among 4 processes). But Inter-operation parallelism is limited to one set of producer sending rows to one set of consumers. We cannot have two consumer operations at the same time. This is why the :TQ0001 and the :TQ10003 have the same color: it's the same processes that act as the EMP producer, and then when finished, then are reused as the GROUP BY consumer.

And there are additional limitations when the coordinator is also involved in a serial operation. For those reasons, in a parallel query plan, some non-blocking operations (those that can send rows above on the fly as they receive rows from below) have to buffer the rows before continuing. Here you see the BUFFER SORT (which buffers but doesn't sort - the name is misleading) which will keep all the rows from DEPT in memory (or tempfiles when it's big).

Besides the plan, SQL Monitoring show the activity from ASH and the time spent in each parallel process:




My parallel degree was 4 so I had 9 processes working on my query: 1 coordinator, two sets of 4 processes. The coordinator started to distribute the work plan to the other processes, then had to read DEPT and distribute its rows, and when completed it started to receive the result and send it to the client. The blue set of processes started to read EMP and distribute its rows, and when completed was able to process the group by. The red set of processes has done the join. The goal is to have the DB time distributed on all the processes running in parallel, so that the response time is equal to the longest one instead of the total. Here, it's the coordinator which has taken 18 milliseconds. The query duration was 15 milliseconds:




This is the point of parallel processing: we can do a 32 ms workload in only 15 ms. Because we had several cpu running at the same time. Of course we need enough resources (CPU, I/O and temp space). It's not new. We don't have to define complex MapReduce functions. Just use plain old SQL and set a parallel degree. You can use all the cores in your server. You can use all the servers in your cluster. If you're I/O bound on the parallel full scans, you can even use your Exadata storage cells to offload some work. And in the near future the CPU processing will be even more efficient, thanks to in-memory columnar storage.

Linux: how to monitor the nofile limit

Wed, 2014-06-18 01:47

In a previous post I explained how to measure the number of processes that are generated when a fork() or clone() call checks the nproc limit. There is another limit in /etc/limits.conf - or in /etc/limits.d - that is displayed by 'ulimit -n'. It's the number of open files - 'nofile' - and here again we need to know what kind of files are counted.



'nofile' is another limit that may not be easy to monitor, because if you just count the 'lsof' output you will include a lot of lines which are not file descriptors. So how can we count the number of files descriptors in a process?



'lsof' is a utility that show all the open files. Let's take an example:

I get the pid of my pmon process:

[oracle@VM211 ulimit]$ ps -edf | grep pmon
oracle   10586     1  0 19:21 ?        00:00:02 ora_pmon_DEMO
oracle   15494 15290  0 22:12 pts/1    00:00:00 grep pmon


And I list the open files for that process

[oracle@VM211 ulimit]$ lsof -p 10586
ora_pmon_ 10586 oracle  cwd  DIR  252,0      4096 /app/oracle/product/12.1/dbs
ora_pmon_ 10586 oracle  rtd  DIR  252,0      4096 /
ora_pmon_ 10586 oracle  txt  REG  252,0 322308753 /app/oracle/product/12.1/bin/oracle
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150175744_0
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_0
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_1
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_2
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_3
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_4
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_5
ora_pmon_ 10586 oracle  mem  REG  252,0   1135194 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle  mem  REG  252,0   6776936 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle  mem  REG  252,0     14597 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle    0r CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    1w CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    2w CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    3r CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    4r REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb
ora_pmon_ 10586 oracle    5r DIR    0,3         0 /proc/10586/fd
ora_pmon_ 10586 oracle    6u REG  252,0      1544 /app/oracle/product/12.1/dbs/hc_DEMO.dat
ora_pmon_ 10586 oracle    7u REG  252,0        24 /app/oracle/product/12.1/dbs/lkDEMO_SITE1
ora_pmon_ 10586 oracle    8r REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb

I've removed hundreds of lines with FD=mem and size=4M. I'm in AMM with memory_target=800M and SGA is implemented in /dev/shm granules. With lsof, we see all of them. And with a large memory_target we can have thousands of them (even if granule becomes 16M when memory_target is larger than 1GB). But don't worry, they don't count in the 'nofile' limit. Only 'real' file descriptors are counted - those with a numeric FD.

So, if you want to know the processes that are near the limit, you can use the following:

[oracle@VM211 ulimit]$ lsof | awk '$4 ~ /[0-9]+[rwu -].*/{p[$1"\t"$2"\t"$3]=p[$1"\t"$2"\t"$3]+1}END{for (i in p) print p[i],i}' | sort -n | tail
15 ora_dmon_    10634   oracle
16 ora_dbw0_    10608   oracle
16 ora_mmon_    10626   oracle
16 ora_rsm0_    10722   oracle
16 tnslsnr      9785    oracle
17 automount    1482    root
17 dbus-daem    1363    dbus
20 rpc.mount    1525    root
21 ora_lgwr_    10610   oracle
89 master       1811    root


The idea is to filter the output of lsof and use awk to keep only the numeric file descriptors, and aggregate per process. Then, we sort them and show the highest counts. Here the Postfix master process has 89 files open. Then log writer follows.

You can get the same information from /proc filesystem where files handles are in /proc//fd:

for p in /proc/[0-9]* ; do echo $(ls $p/fd | wc -l) $(cat $p/cmdline) ; done | sort -n | tail
15 ora_dmon_DEMO
16 ora_dbw0_DEMO
16 ora_mmon_DEMO
16 ora_rsm0_DEMO
16 /app/oracle/product/12.1/bin/tnslsnrLISTENER-inherit
17 automount--pid-file/var/run/
17 dbus-daemon--system
20 rpc.mountd
21 ora_lgwr_DEMO
89 /usr/libexec/postfix/master


Same result, much quicker and more information about the process. This is the way I prefer, but remember that if you want to see all processes, you should be logged as root.


The proof

As I did for nproc, I have written a small C program that open files (passed as arguments) for a few seconds, so that I'm sure I'm counting the right things.

And I encourage to do the same on a test system and let me know if your result differs. Here is the source:

First, I set my nofile limit to only 10

ulimit -n 10


Then, let's open 7 files. In addition with stdin, stdout and stderr we will have 10 file handles:

[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp &
open file 1 of 7 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 7 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 7 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 7 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 7 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 7 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 7 getrlimit nofile: soft=10 hard=10 myfile7.tmp


I was able to open those 7 files. Then I check lsof:

[oracle@VM211 ulimit]$ lsof | grep openfiles
openfiles 21853    oracle  cwd       DIR  0,24    380928    9320 /tmp/ulimit
openfiles 21853    oracle  rtd       DIR 252,0      4096       2 /
openfiles 21853    oracle  txt       REG  0,24      7630    9494 /tmp/ulimit/openfiles
openfiles 21853    oracle  mem       REG 252,0    156928 1579400 /lib64/
openfiles 21853    oracle  mem       REG 252,0   1926800 1579401 /lib64/
openfiles 21853    oracle    0u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    1u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    2u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    3r      REG  0,24         0    9487 /tmp/myfile1.tmp
openfiles 21853    oracle    4r      REG  0,24         0    9488 /tmp/myfile2.tmp
openfiles 21853    oracle    5r      REG  0,24         0    9489 /tmp/myfile3.tmp
openfiles 21853    oracle    6r      REG  0,24         0    9490 /tmp/myfile4.tmp
openfiles 21853    oracle    7r      REG  0,24         0    9491 /tmp/myfile5.tmp
openfiles 21853    oracle    8r      REG  0,24         0    9492 /tmp/myfile6.tmp
openfiles 21853    oracle    9r      REG  0,24         0    9493 /tmp/myfile7.tmp


We see our 10 file handles and this proves that only numeric FD are counted when checking the nofile limit of 10. You see stdin, stdout, stderr as FD 0,1,2 and then my 7 files opened in read only.

Let's try to open one more file:

[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp myfile8.tmp
open file 1 of 8 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 8 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 8 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 8 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 8 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 8 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 8 getrlimit nofile: soft=10 hard=10 myfile7.tmp
open file 8 of 8 getrlimit nofile: soft=10 hard=10 myfile8.tmp
fopen() number 8 failed with errno=24


Here the limit is reached and the open() call returns error 24 (ENFILE) because we reached the nofile=10.



When counting the processes for the nproc limit, we have seen that threads must be counted as processes. For the nofile limit we don't need to detail the threads because all threads share the file descriptor table.


Recommended values

Currently this is what is set on Oracle linux 6 for 11gR2 (in /etc/security/limits.conf):

oracle   soft   nofile    1024
oracle   hard   nofile    65536


For 12c, these are set in /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf which overrides /etc/security/limits.conf:

oracle soft nofile 1024
oracle hard nofile 65536


Do you think it's a bit low? Just for information, here is what is set in the ODA X4-2:

oracle soft nofile 131072


In any case, it is a good idea to check if you are reaching the limit and the above scripts on lsof or /proc should help for that.

SQL Server: How to find the default data path?

Sun, 2014-06-15 18:16

I have read a lot of SQL Server blog postings and articles in order to find the default data path. This post covers different SQL Server versions (SQL Server 2012, SQL Server 2014, SQL Server 2008, SQL Server 2005) and provides a generic script with different methods.

SQL Server 2014: sys.dm_exec_query_profiles, a new promising feature

Fri, 2014-06-13 03:52

Among the hidden features provided by SQL Server 2014 there is a very interesting dynamic management view named sys.dm_exec_query_profiles. This feature will help the database administrators and experienced developers to troubleshoot long running queries in real-time. I'm pretty sure you had to deal with the following questions: When will this query finish? What percentage of total workload is performed by this request? Which steps are the longest? Before SQL Server 2014 it was impossible to answer the questions above!

But first, I have to admit that this new DMV has raised of lot of questions during my tests. I will try to share my thoughts and findings with you in this post.

My test consisted of running this long query:


select        YEAR(TransactionDate) AS year_tran,        MONTH(TransactionDate) AS month_tran,        FIRST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS first_product_nb,        LAST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS last_product_nb from AdventureWorks2012.dbo.bigTransactionHistory as a        join AdventureWorks2012.dbo.bigProduct as p              on a.ProductID = p.ProductID


On my computer this query takes 05’’47’ to run. The parallelism is enabled. My SQL Server instance can use up to 8 processors. The first time I was disappointed by viewing the output of the sys.dm_exec_query_profiles DMV during my query execution:


select * from sys.dm_exec_query_profiles




… No output! Ok what are we supposed to do to see something with sys.dm_exec_query_profiles ? Laughing  Keep good habits and go back to the SQL Server BOL that says:

To serialize the requests of sys.dm_exec_query_profiles and return the final results to SHOWPLAN XML, use SET STATISTICS PROFILE ON; or SET STATISTICS XML ON;

In other words, to have a chance to see something with this DMV we must use some session options like STATISTICS PROFILE, STATISTICS XML or force SQL Server to display the execution plan after execution query with SQL Server Management Studio. My first though was: why do we have to use some additional options to produce some output for this new DMV? These constraints severely limit the scenarios where we can use this DMV … After some reflexions and discussions with others French MVPs we can think that is normal because tracking the execution plan information is expensive and we could potentially bring a production server to its knees by enabling this feature for all the workload statements. But using additional session options can be impractical in production environments because it requires to execute itself the concerned query and it is not always possible. Fortunately Microsoft provides the query_post_execution_showplan event which can be used into a SQL Server trace or an extended event session. But the implementation design of this event has a significant impact to the performance of a SQL Server instance. Indeed, even with a short-circuit predicate this event will be triggered each time a SQL statement will be executed because the query duration is not known ahead of time (please refer to this Microsoft connect item). To summarize, using this event in OLTP production environment should be used in a short period for troubleshooting purposes. In OLAP environment the story is not the same because we don’t have to deal with a lot of short queries but only with long running queries issued by cube processing or ETL processes for example.

After discussing the pros and cons of this new feature let’s start with my precedent T-SQL statement and the use of the session option: SET STATISTICS PROFILE ON


SET STATISTICS PROFILE ON;   select        YEAR(TransactionDate) AS year_tran,        MONTH(TransactionDate) AS month_tran,        FIRST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS first_product_nb,        LAST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS last_product_nb from AdventureWorks2012.dbo.bigTransactionHistory as a        join AdventureWorks2012.dbo.bigProduct as p              on a.ProductID = p.ProductID


In other session:


select * from sys.dm_exec_query_profiles




This DMV provides a lot of useful information. These information are more granular than SET STATISTICS IO because  the counters returned are per operator per thread (node_id / physical_operator_name and thread_id columns).  In my case the SQL Server instance can use up to 8 processors and we can notice that for some operators (node_id) we have several threads using in parallel (thread_id). Furthermore two others columns are interesting like estimate_row_count and row_count. The former is the number of estimated rows an operator should address and the latter is the current number of rows addressed by the same operator (remember these counters are per operator per thread …). We can compare these two columns to estimate a completion percentage per operator (or per operator per thread). Finally we can categorize some information provided by this DMV in two parts: information related either to the query execution engine (row_count, estimated_row_count) or the storage execution engine (logical_read_count, write_page_count, lob_logical_read_count etc.)

As you can see, this DMV provide a lot of information to us and you will certainly customize the final output yourself depending on what you want to focus on.

Here is an example of a custom script we can create that uses the sys.dm_exec_query_profiles, sys.objects, sys.indexes, sys.dm_exec_requets, sys.dm_exec_sessions, and sys.dm_os_waiting_tasks to cover both the real-time execution information and the related waits occurring in the same time.

USE AdventureWorks2012; GO   SELECT        qp.node_id,        qp.session_id, AS [object_name], AS index_name,        qp.physical_operator_name + QUOTENAME(CAST(COUNT(qp.thread_id) AS VARCHAR(4))) AS physical_operator_name,        SUM(qp.estimate_row_count) AS estimate_row_count,        SUM(qp.row_count) AS row_count,        CASE              WHEN SUM(qp.row_count) * 1. / SUM(qp.estimate_row_count + .00001) THEN CAST(CAST(SUM(qp.row_count) * 100. / SUM(qp.estimate_row_count + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %'              ELSE '100 % + (' + CAST(CAST((SUM(qp.row_count) * 100. / SUM(qp.estimate_row_count + .00001)) - 100 AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %)'        END AS [completed_%],        -- Query execution engine        CAST(CAST(SUM(qp.elapsed_time_ms) * 100. /(SUM(SUM(qp.elapsed_time_ms)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_elapsed_time_%],        CAST(CAST(SUM(qp.cpu_time_ms) * 100. /(SUM(SUM(qp.cpu_time_ms)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_cpu_%],        -- Storage engine        CAST(CAST(SUM(qp.logical_read_count) * 100. / SUM(SUM(qp.logical_read_count)) OVER() + .00001 AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_logical_read_%],        CAST(CAST(SUM(qp.physical_read_count) * 100. /(SUM(SUM(qp.physical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_physical_read_%],        CAST(CAST(SUM(qp.lob_logical_read_count) * 100. /(SUM(SUM(qp.lob_logical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [lob_logical_read_%],        CAST(CAST(SUM(qp.lob_physical_read_count) * 100. /(SUM(SUM(qp.lob_physical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [lob_physical_read_%],        CAST(CAST(SUM(qp.write_page_count) * 100. /(SUM(SUM(qp.write_page_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_write_%],        SUM(qp.logical_read_count) AS total_logical_read_count,        SUM(qp.lob_logical_read_count) AS total_lob_logical_read_count,        SUM(qp.lob_physical_read_count) AS total_lob_physical_read_count,        SUM(qp.lob_read_ahead_count) AS total_lob_read_ahead_count,        SUM(qp.physical_read_count) AS total_physical_read_count,        SUM(qp.read_ahead_count) AS total_read_ahead_count,        SUM(qp.write_page_count) AS total_write_page_count,        -- Both        SUM(qp.cpu_time_ms) AS total_cpu_time_ms,        SUM(qp.elapsed_time_ms) AS total_elapsed_time_ms,        -- wait info information        COALESCE(wait.wait_info, '') AS wait_info FROM sys.dm_exec_query_profiles AS qp(NOLOCK)        LEFT JOIN sys.objects AS o(NOLOCK)              ON qp.object_id = o.object_id        LEFT JOIN sys.indexes AS i(NOLOCK)              ON qp.index_id = i.index_id                     AND qp.object_id = i.object_id OUTER APPLY (        SELECT        STUFF(                     (SELECT                                   ',' + ws.wait_type + ' ' + QUOTENAME(CAST(SUM(COALESCE(ws.wait_duration_ms, 0)) AS VARCHAR(20)) + ' ms')                            FROM sys.dm_exec_requests AS r(NOLOCK)                                   JOIN sys.dm_exec_sessions AS s(NOLOCK)                                         ON r.session_id = s.session_id                                   JOIN sys.dm_os_waiting_tasks AS ws(NOLOCK)                                         ON ws.session_id = s.session_id                            WHERE s.session_id = qp.session_id                            GROUP BY ws.wait_type                            FOR XML PATH (''), TYPE).value('.', 'varchar(max)') , 1, 1, ''              ) ) AS wait(wait_info) GROUP BY qp.node_id, session_id, qp.physical_operator_name,,, COALESCE(wait.wait_info, '') ORDER BY node_id OPTION(RECOMPILE);


The first section of the above script concerns the general information:




1: This above script groups the information per operator. The information in brackets is the number of thread used for the concerned operator.

2: Completion is provided as a ratio between rows addressed by the corresponding operator so far and the total estimated rows. As we can expect, the estimated rows and the real rows addressed by an operator can differ. The above script highlights this gap by showing the difference in percentage between brackets if row_count is greater than estimated_row_count. This information can be useful to “estimate” the general completion of the query.

However Total_elasped_time_% and total_cpu_% columns are the relative percentage of the elapsed time and cpu time consumed by all the query execution operators.

3: For some physical operators like Clustered Index Scan, Table Scan etc … we will show the related table and index objects


The second section concerns the storage engine statistic for the query execution:




The above information can be useful to highlight the top operators depending on the resource or the operation type. For example, in my case I can notice that the cluster index scan of the bigTransactionHistory table is the main consuming physical operator for logical reads. Furthermore, we can observe that the sort operator is the root cause of spilling. Finally as expected we can notice that by design hash match operator consumes a lot of CPU and it’s the top operator in this case.


The next section shows the absolute values per operator per type of resources and operation:




Finally the last section shows the wait types and aggregated duration by all related threads in brackets during the query execution in a real time:




In my case we retrieve the wait type related to a query execution in parallel  (CXPACKET) and certainly one of the root cause of the CXPACKET wait type (IO_COMPLETION) but here we don't have sufficient information to verify it. We can just notice that the query execution generates CXPACKET and IO_COMPLETION wait types in a real time.

At this point we can ask why we don’t have a remaining time information. In fact, computing an accurate remaining time is very difficult because a lot of factors must be correlated together. We can have a naive approach by computing the remaining time with the row_count, estimated_row_count and elapsed time columns for each operator, but you will be disappointed when testing this… According to my own tests, I can claim that the result is never as accurate as you would expect. Please feel free to share your results or thoughts about computing an accurate remaining time with us!

The sys.dm_exec_query_profiles DMV can be correlated by external tools like XPERF or Perfmon. I will prepare a next blog post (as soon as I have some time) to share my experience with you.

Happy query profiling!

Linux: how to monitor the nproc limit

Tue, 2014-06-10 01:14

You probably know about 'nproc' limits in Linux which are set in /etc/limits.conf and checked with 'ulimit -u'. But do you know how to handle the monitoring and be alerted when you're close the fixed limit?

Nproc and ps

Nproc is defined at OS level to limit the number of processes per user. Oracle documentation recommends the following:

oracle soft nproc 2047
oracle hard nproc 16384

But that is often too low, especially when you have the Enterprise Manager agent or other java programs running.

Do you want to check that you are far from the limit? then you can use 'ps'. But beware, 'ps' by default does not show all processes.
In Linux, when doing multithreading, each thread is implemented as a light-weight process (LWP). And you must use the '-L' to see all of them.

Let's take an example. I have a system where 'ps -u oracle' returns 243 lines. But including LWPs shows a lot more processes which is near the limit:

$ ps h -Led -o user | sort | uniq -c | sort -n
      1 dbus
      1 ntp
      1 rpc
      1 rpcuser
      2 avahi
      2 haldaemon
      2 postfix
    166 grid
    400 root
   1370 oracle

So the 'oracle' user has 1370 processes. That's high. And this is the actual number where the nproc limit applies.

'ps -Lf' can show the detail. And even without '-L' we can display the NLWP which is the number of threads per process:

ps -o nlwp,pid,lwp,args -u oracle | sort -n
   1  8444  8444 oracleOPRODP3 (LOCAL=NO)
   1  9397  9397 oracleOPRODP3 (LOCAL=NO)
   1  9542  9542 oracleOPRODP3 (LOCAL=NO)
   1  9803  9803 /u00/app/oracle/product/agent12c/core/ /u00/app/oracle/product/agent12c/core/ agent /u00/app/oracle/product/agent12c/agent_inst/sysman/log/emagent.nohup
  19 11966 11966 /u00/app/11.2.0/grid/bin/oraagent.bin
1114  9963  9963 /u00/app/oracle/product/agent12c/core/ ... emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain

The Oracle 12c EM agent has started 1114 threads and the grid infrastructure 'oraagent.bin' has 19 threads. In addition to that I've a lot of other monothreaded processes. This is how we reach 1370 which is the exact value to compare to the nproc limit.

So what are the good values to set? About the high number of threads for EM agent 12c, there are a few bugs. And I suspect that 1000 threads is too much, especially when checking them with 'jstack' I see that they are "CRSeOns" threads that should not be used in and higher. But that's another problem which I'm currently investigating. When you reach the nproc limit, the user will not be able to create new processes. clone() calls will return EAGAIN and that is reported by Oracle as:

ORA-27300: OS system dependent operation:fork failed with status: 11 
ORA-27301: OS failure message: Resource temporarily unavailable 

And that is clearly bad when it concerns an +ASM instance or archiver processes.

The goal of the nproc limit is only to prevent 'fork bombs' where a process forks forever and exhausts all resources. So there is no problem to increase this limit. However if you set it high for some users ('oracle' and 'grid' usually), it can be a good idea to monitor the number of processes with the ps h -L above. Because having too many processes is suspect and increasing the limit just hides a process leak and defer the failure.

In 'ps h -L -o' The argument 'h' is there to remove the header line, and '-L' to show all processes including LWP. Then you can count with 'wc -l'.

The proof

In order to be sure that 'ps h -L' gives the exact number, I have tested it. In case you want to check this on your system, here is how to do it. And please report any difference.

First, set your limit to 1024 processes. This is a limit for my user, and the limit is set for my shell and all its child processes:

[oracle@VM211 ocm]$ ulimit -u 1024

Now you can check it:

[oracle@VM211 ocm]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15919
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Then you can run a small C program ( that calls fork() in a loop until it fails with EAGAIN:

[oracle@VM211 ocm]$ ./testnproc
parent says fork number 871 sucessful
 child says fork number 872 pid 1518
parent says fork number 872 sucessful
 child says fork number 873 pid 1519
parent says fork number 873 sucessful
 child says fork number 874 pid 1520
parent says fork number 874 sucessful
parent says fork number 875 failed (nproc: soft=1024 hard=1024) with errno=11

And finally, because the processes sleep for a while, you can check how many processes you have. I do that from another user account for the simple reason that I need to create 2 more processes ('ps' and 'wc') for that:

[root@VM211 ocm]# ps h -Lu oracle | wc -l
Recommended values for Oracle

Currently this is what is set on Oracle linux 6 for 11gR2 by the preinstall package (in /etc/security/limits.conf):

oracle   soft   nproc    16384
oracle   hard   nproc    16384

For 12c, these are set in /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf which overrides /etc/security/limits.conf:

oracle soft nproc 16384
oracle hard nproc 16384

And just for your information, here is what is set in the ODA X4-2:

oracle soft nproc 131072

So what do you want to set? You probably don't want it too low and experience 'resource temporarily unavailable'. But what you don't want either is 100000 processes on your server. So my recommendation is to set it high but monitor it when the number of processes reaches something that is not sensible. Then you prevent having the system down in case of process leak, but you can detect it and ask for a patch.

SQL Server 2014: Availability groups failover impossible with Failover Cluster Manager

Fri, 2014-06-06 01:27

A couple of weeks ago, I worked for a customer that wanted to implement SQL Server 2012 (and not SQL Server 214) AlwaysOn with availability groups. During our tests we performed a bunch of failover tests and the customer tried to perform a failover of one of the installed availability group by using the failover cluster manager (FCM). Of course, I told him this is not best practice because the failover cluster manager is not aware of the synchronization state of the availability group. But with SQL Server 2014, the story has changed because I noticed a different behavior. I would like to share this information with you in this posting.

But let me first demonstrate the SQL Server 2012 behavior with the following example:

I have an availability group named SQL12 configured with 2 synchronous replicas and automatic failover. However, as you can see, the synchronization state of my availability database is not healthy as shown below:




Now, if I try to failover my availability group using the failover cluster manager ...






.... the next available node is SQL143 ...






... and we can notice the failover did not occur as expected because the SQL12 resource is still on the SQL141 node as shown in the following picture:




Having a look at the cluster error log does not help us in this case because we have a classic 1069 error number without helpful information:





Generating detailed cluster error logs could help us but I prefer to directly look at the SQL Server side for the moment. The AlwaysOn_health extended event is a good start to check for some existing records associated to the problem.




We have indeed some information about the failover attempt. First the SQL141SQL12 replica state changed from PRIMARY_NORMAL to RESOLVING_NORMAL due to the manual failover issued by the FCM.




Then we can see an error message that explains that the dummy database is changing its role from PRIMARY to RESOLVING because there is a problem with a role synchronization. This error is issued by the forced failover of the availability group resource that I used.




Finally, we notice the failover process did not complete succesfully and the dummy database failbacks on the SQL141SQL12 replica (availability_replica_id 1B9007CA-EE3F-486D-A974-838EFED0203D associated to the SQL141SQL12 replica in my case)




On the other side, the SQL143SQL12 secondary replica also features a lot of useful information:






To summarize, this test demonstrates clearly that the FCM is not aware of the availability databases synchronization state inside an availability group. Using FCM may result in unintended outcomes, including unexpected downtime!

Now, it's time to test the same scenario with SQL Server 2014 and a configured availability group. During the failover attempt, I get the following error message:






If we take a look at the possible owners of the corresponding availability group resource, we can see that the current cluster node that hosts the availability group is the only possible owner - unlike SQL Server 2012.




As a reminder, possible and preferred owners are resetted during the availability group creation and failover based on the primary replica and its secondaries. For fun, I decided to compare the two processes issued by the both versions of SQL Server and availability groups (SQL Server 2012 and SQL Server 2014) after having triggered an availability group failover and generating the associated cluster logs:


Get-ClusterLog -UseLocalTime -Span 5

Result with SQL Server 2012

000005d4.00000778::2014/05/26-22:10:55.088 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (sql12, 1) 000005d4.00000778::2014/05/26-22:10:55.088 INFO [GUM] Node 1: executing request locally, gumId:215, my action: /rcm/gum/AddPossibleOwner, # of updates: 1 000005d4.00000778::2014/05/26-22:10:55.088 INFO [RCM] rcm::RcmGum::AddPossibleOwner(sql12,1) 000005d4.00000778::2014/05/26-22:10:55.103 ERR   mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.' 000005d4.00000778::2014/05/26-22:10:55.103 WARN [DM] Aborting group transaction 29:29:613+1 000005d4.00000778::2014/05/26-22:10:55.103 ERR   [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed' 000005d4.00000778::2014/05/26-22:10:55.103 WARN [RCM] sql12 cannot be hosted on node 3 000005d4.00000778::2014/05/26-22:10:55.103 WARN [RCM] Possible owners: 000005d4.00000778::2014/05/26-22:10:55.103 WARN     2 000005d4.00000778::2014/05/26-22:10:55.103 WARN     1 000005d4.00000778::2014/05/26-22:10:55.103 WARN 000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmApi::RemovePossibleOwner: (sql12, 2) 000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:215, my action: /rcm/gum/RemovePossibleOwner, # of updates: 1 000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::RemovePossibleOwner(sql12,2) 000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] Removing node 2 from resource 'sql12'. 000005d4.00000778::2014/05/26-22:10:55.103 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message 000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (sql12, 2) 000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:216, my action: /rcm/gum/AddPossibleOwner, # of updates: 1 000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::AddPossibleOwner(sql12,2) 000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] Adding node 2 to resource 'sql12'. 000005d4.00000778::2014/05/26-22:10:55.103 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message 000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:217, my action: /rcm/gum/SetGroupPreferredOwners, # of updates: 1 000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::SetGroupPreferredOwners(sql12, 000005d4.00000778::2014/05/26-22:10:55.103 INFO     1 000005d4.00000778::2014/05/26-22:10:55.103 INFO     2 000005d4.00000778::2014/05/26-22:10:55.103 INFO


Result with SQL Server 2014

000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (AdventureWorksGrp, 1) 000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:230, my action: /rcm/gum/AddPossibleOwner, # of updates: 1 000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::AddPossibleOwner(AdventureWorksGrp,1) 000005d4.00000bb0::2014/05/26-22:14:54.578 ERR   mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.' 000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [DM] Aborting group transaction 29:29:627+1 000005d4.00000bb0::2014/05/26-22:14:54.578 ERR   [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed' 000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [RCM] AdventureWorksGrp cannot be hosted on node 3 000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [RCM] Possible owners: 000005d4.00000bb0::2014/05/26-22:14:54.578 WARN     2 000005d4.00000bb0::2014/05/26-22:14:54.578 WARN     1 000005d4.00000bb0::2014/05/26-22:14:54.578 WARN 000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmApi::RemovePossibleOwner: (AdventureWorksGrp, 2) 000005d4.00000c34::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:230, my action: /rcm/gum/RemovePossibleOwner, # of updates: 1 000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::RemovePossibleOwner(AdventureWorksGrp,2) 000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] Removing node 2 from resource 'AdventureWorksGrp'. 000005d4.00000c34::2014/05/26-22:14:54.578 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message 000005d4.000011f4::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:231, my action: /rcm/gum/SetGroupPreferredOwners, # of updates: 1 000005d4.000011f4::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::SetGroupPreferredOwners(AdventureWorksGrp, 000005d4.000011f4::2014/05/26-22:14:54.578 INFO     1 000005d4.000011f4::2014/05/26-22:14:54.578 INFO


As I said earlier, possible and preferred owners properties are managed automatically by SQL Server AlwaysOn. We can see here this is done by the cluster resource control manager and the following functions:

rcm::RcmApi::AddPossibleOwner(), rcm::RcmApi::RemovePossibleOwner() and rcm::RcmApi::SetGroupPreferredOwners () .

You can notice that two nodes are added as possible owners with SQL Server 12. However, with SQL Server 2014 only one node is added as possible owner of the concerned availability group resource. Interesting change isn't it?