
Yann Neuhaus


Understanding JVM Java memory leaks

Thu, 2014-06-26 21:38

Many people think that Java frees them from memory management because the JVM uses a garbage collector. The goal of this collector is to free objects that are no longer used by the program, without the developer having to declare that an object can be collected. Everything is automatic.

It is really helpful and avoids wasting time on manual memory management. But as it is an automatic process, it can produce issues of its own, better known as memory leaks.


What is a memory leak?

A leak appears when an object is no longer used by the program but is still referenced from somewhere, so the garbage collector cannot delete it. The memory space used by this object is never released, and the total memory used by the program keeps growing. This degrades performance over time, and eventually the JVM may run out of memory.
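A minimal sketch of the pattern (the class and names are hypothetical, for illustration only): objects added to a long-lived static collection stay strongly reachable, so the collector can never reclaim them even though the program will never use them again.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    // A static collection lives as long as its class loader:
    // everything added here stays strongly reachable "forever".
    static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest() {
        byte[] buffer = new byte[1024 * 1024]; // 1 MB of per-request data
        CACHE.add(buffer); // remembered but never evicted -> the leak
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            handleRequest();
        }
        // All 100 buffers are still reachable; the GC cannot free any of them.
        System.out.println("retained buffers: " + CACHE.size());
    }
}
```

Run long enough, a loop like this ends in java.lang.OutOfMemoryError: Java heap space, even though none of the buffers will ever be read again.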


Identifying memory leaks

Memory leaks are difficult to identify and require a good knowledge of the application. Usually they are revealed by an Out Of Memory Error exception (also called OOM). But note that not every Out Of Memory Error implies a memory leak, and not every memory leak ends in an Out Of Memory Error.

An Out Of Memory Error is a first sign, but you must distinguish between a “normal” OOM and a memory leak.

For example, if the program loads data from external files and one day a file is bigger than expected, that can result in an OOM. But this one is “normal”: the application was simply not designed to handle such a big file.

On the other hand, if the program processes data of similar size each time and you get an OOM, it may be a memory leak. In addition, a memory leak generally eats free space gradually; if your memory filled up suddenly, it is more likely a “normal” OOM and you should look at the stack trace to find the origin.

Another symptom of a memory leak is an allocation issue. When too much heap space is taken and never freed, allocating new objects becomes difficult, especially when the objects are large. When an object is first created, it goes to a part of the heap called Eden; if it survives several garbage collections, it is promoted to the tenured space. Usually the tenured space contains only long-lived objects. But when Eden is filled up by leaked objects, new objects are allocated directly in the tenured space, and you can see an abnormal accumulation of objects there. So if you get an OOM in the tenured space, it might be an allocation issue (or simply a heap configuration that is too small).

Heap usage can be checked with tools such as VisualVM, JConsole, or Java Mission Control (7u40+ JRE only). I will describe the use of these tools in future blog posts.


How to prevent memory leaks

As said before, you cannot really prevent memory leaks, as they come from the design of the application. If you do not know what type of application you are running or how it is designed internally, you cannot prevent leaks.

Many applications, whether desktop or web-based, use many threads. Thread-local storage (ThreadLocal) lets a thread keep references to objects so that they can be reused later in the thread, for example in methods, without being passed as arguments. But those values reference their classes, which in turn reference their class loader. This way, when a web application is redeployed (e.g. in Tomcat), a new class loader is created to load the new version, but the old one cannot be deallocated as long as pooled threads still hold ThreadLocal values loaded by it, or use things from the permgen. As you may know, the permgen is a part of JVM memory that is usually not collected by the GC. And this produces memory leaks.

What you have to remember is to limit the use of such patterns and to be careful about the implementation and design of the application you are using. Of course, memory leaks can have other causes, but the principle is always the same: objects that are no longer needed remain reachable, so the garbage collector never releases them.
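A sketch of the safe pattern (names are hypothetical): a ThreadLocal used on a pooled thread should be cleared with remove() once the work is done, otherwise the long-lived thread keeps the value, and through it the class loader, alive across redeployments.

```java
public class ThreadLocalDemo {
    // Hypothetical per-thread scratch buffer, reused across calls
    // on the same thread instead of being passed as an argument.
    static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(1024));

    static String render(String name) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // reuse the same buffer on every call
        return sb.append("hello ").append(name).toString();
    }

    public static void main(String[] args) {
        try {
            System.out.println(render("world"));
        } finally {
            // On an application-server thread pool, the thread outlives the
            // deployed application: without remove(), the value (and its
            // class loader) stays referenced by the long-lived thread.
            BUFFER.remove();
        }
    }
}
```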

Oracle Parallel Query: Did you use MapReduce for years without knowing it?

Thu, 2014-06-26 06:42

I read this morning that MapReduce is dead. The first time I heard about MapReduce was when a software architect proposed to stop writing SQL on Oracle Database and replace it with MapReduce processing. Because the project had to deal with a huge amount of data in a short time, and had enough budget to buy as many cores as needed, they wanted the scalability of parallel distributed processing.

The architect explained how you can code filters and aggregations as Map and Reduce functions and then distribute the work over hundreds of CPU cores. Of course it is very interesting, but it was not actually new. I had been doing this for years on Oracle with Parallel Query. And not only filters and aggregations, but joins as well - and without having to rewrite the SQL statements.

I don't know if MapReduce is dead, but for 20 years we have been able to just flip a switch (ALTER TABLE ... PARALLEL ...) and get scalability through parallel processing - provided that we understand how it works.
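As a sketch on the demo schema used below (degree 4, matching the execution plans that follow), the "switch" is plain DDL, and the query itself stays untouched:

```sql
-- Ask for a degree-4 parallel scan of EMP; DEPT stays serial
ALTER TABLE emp PARALLEL 4;

-- The same join + group by as in the plans below: no rewrite needed,
-- the optimizer distributes scan, join and aggregation by itself
SELECT d.dname, e.job, COUNT(*)
FROM   emp e JOIN dept d ON d.deptno = e.deptno
GROUP  BY d.dname, e.job;
```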

Reading a parallel query execution plan is not easy. In this post, I'll just show the basics. If you need to go further, you should have a look at some Randolf Geist presentations and read his Understanding Parallel Execution article. My goal is not to go very deep, but only to show that it is not that complex.

I'll explain how Parallel query works by showing an execution plan for a simple join between DEPT and EMP tables where I want to read EMP in parallel - and distribute the join operation as well.

For the fun of it - and maybe because it is easier to read the first time - I've done the execution plan on an Oracle 7.3.3 database (1997):




Let's start from the end. I want to read the EMP table with several processes (4, because I've set the parallel degree to 4 on table EMP). The table is not partitioned. It is a heap table where rows are scattered in the segment without any specific clustering, so each process will handle an arbitrary range of blocks, and this is why you see an internal query filtering on ROWID between :1 and :2. My session process, known as the 'coordinator' and represented in green below, has divided the range of rowids (it's a full table scan, reading all blocks from start to high water mark) and has mandated 4 'producer' processes to do the full scan, each on its own part. Those producers are represented in dark blue below.

But then there is a join to do. The coordinator could collect all the rows from the 'producer' processes and do the join itself, but that is expensive and does not scale. We want the join to be distributed as well. Each producer process could read the DEPT table and do the join, which is fine only if it is a small table. But anyway, we don't want the DEPT table to be read in parallel, because we have not set a parallel degree on it. So the DEPT table will be read by only one process: my session process, which does all the non-parallel (aka serial) work in addition to its 'coordinator' role.

Then we have a new set of 4 processes that will do the hash join. They need rows from DEPT and rows from EMP. These are the 'consumer' processes that consume rows from the 'producers'; they are represented in pink below. And they can't receive them randomly: because it is a join, each 'consumer' process must get the pairs of rows that match on the join column. In the plan above, you see internal queries on internal 'table queue' names. The parallel full scan on EMP distributes its rows: it's a PARALLEL_TO_PARALLEL distribution, the parallel producers sending their rows to parallel consumers. The serial full scan on DEPT distributes its rows as well: it's a PARALLEL_FROM_SERIAL distribution, the parallel consumers receiving their rows from the serial coordinator process. The key for both distributions is given by a hash function on the join column DEPTNO, so that rows are spread over the 4 consumer processes while keeping the same DEPTNO in the same process.

We also have a group by operation that is done in parallel. But the processes that do the join on DEPTNO cannot do the group by, which is on other columns (DNAME, JOB). So we have to distribute the rows again, this time with a distribution key on the DNAME and JOB columns. The join consumer processes are thus also producers for the group by operation, and a new set of consumer processes does the group by, in light blue below. That distribution is a PARALLEL_TO_PARALLEL as well, from 4 producers arranged by (DEPTNO) to 4 consumers arranged by (DNAME, JOB).

At the end, only one process receives the result and sends it to the client: the coordinator, which is serial. So it's a PARALLEL_TO_SERIAL distribution.

Now let's leave my Oracle 7.3.3 PLAN_TABLE and upgrade to 12c, which can show more detailed and more colorful execution plans. See here how to get it.

I've added some color boxes to show the four parallel distributions that I've detailed above:

  • :TQ10001 Parallel full scan of EMP distributing its rows to the consumer processes doing the join.
  • :TQ10000 Serial full scan of DEPT distributing its rows to the same processes, with the same hash function on the join column.
  • :TQ10002 The join consumers receiving both, then becoming producers that send rows to the consumer processes doing the group by
  • :TQ10003 Those consumer processes doing the group by and sending the rows to the coordinator for the final result.



So what is different here?

First, we are in 12c and the optimizer may choose to broadcast all the rows from DEPT instead of using hash distribution. It's the new HYBRID HASH distribution. That decision is made at runtime when there are very few rows, and this is why they are counted by the STATISTICS COLLECTOR.

We don't see the predicate on rowid ranges, but the BLOCK ITERATOR is there to show that each process reads its range of blocks.

And an important point is illustrated here.

Intra-operation parallelism can have a high degree (here I've set it to 4, meaning that each parallel operation can be distributed over 4 processes). But inter-operation parallelism is limited to one set of producers sending rows to one set of consumers: we cannot have two consumer operations running at the same time. This is why :TQ10001 and :TQ10003 have the same color: it's the same set of processes that acts as the EMP producers and, once finished, is reused as the GROUP BY consumers.

And there are additional limitations when the coordinator is also involved in a serial operation. For those reasons, in a parallel query plan, some non-blocking operations (those that can pass rows upward on the fly as they receive them from below) have to buffer the rows before continuing. Here you see the BUFFER SORT (which buffers but does not sort - the name is misleading), which keeps all the rows from DEPT in memory (or in tempfiles when it is big).

Besides the plan, SQL Monitoring shows the activity from ASH and the time spent in each parallel process:




My parallel degree was 4, so I had 9 processes working on my query: 1 coordinator and two sets of 4 processes. The coordinator started by distributing the work plan to the other processes, then had to read DEPT and distribute its rows, and once that completed, it started to receive the result and send it to the client. The blue set of processes started by reading EMP and distributing its rows, and when that was done it was able to process the group by. The red set of processes did the join. The goal is to spread the DB time over all the processes running in parallel, so that the response time equals the longest contribution instead of the total. Here it's the coordinator that took 18 milliseconds. The query duration was 15 milliseconds:




This is the point of parallel processing: we can do a 32 ms workload in only 15 ms, because several CPUs were running at the same time. Of course, we need enough resources (CPU, I/O and temp space). It's not new. We don't have to define complex MapReduce functions - just use plain old SQL and set a parallel degree. You can use all the cores in your server. You can use all the servers in your cluster. If you're I/O bound on the parallel full scans, you can even use your Exadata storage cells to offload some work. And in the near future, CPU processing will be even more efficient thanks to in-memory columnar storage.

Linux: how to monitor the nofile limit

Wed, 2014-06-18 01:47

In a previous post I explained how to measure the number of processes counted against the nproc limit checked by fork() and clone() calls. There is another limit in /etc/security/limits.conf - or in /etc/security/limits.d - that is displayed by 'ulimit -n'. It's the number of open files - 'nofile' - and here again we need to know exactly what kind of files are counted.



'nofile' is another limit that may not be easy to monitor, because if you just count the 'lsof' output you will include a lot of lines that are not file descriptors. So how can we count the number of file descriptors in a process?



'lsof' is a utility that shows all the open files. Let's take an example:

I get the pid of my pmon process:

[oracle@VM211 ulimit]$ ps -edf | grep pmon
oracle   10586     1  0 19:21 ?        00:00:02 ora_pmon_DEMO
oracle   15494 15290  0 22:12 pts/1    00:00:00 grep pmon


And I list the open files for that process

[oracle@VM211 ulimit]$ lsof -p 10586
ora_pmon_ 10586 oracle  cwd  DIR  252,0      4096 /app/oracle/product/12.1/dbs
ora_pmon_ 10586 oracle  rtd  DIR  252,0      4096 /
ora_pmon_ 10586 oracle  txt  REG  252,0 322308753 /app/oracle/product/12.1/bin/oracle
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150175744_0
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_0
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_1
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_2
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_3
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_4
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_5
ora_pmon_ 10586 oracle  mem  REG  252,0   1135194 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle  mem  REG  252,0   6776936 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle  mem  REG  252,0     14597 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle    0r CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    1w CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    2w CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    3r CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    4r REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb
ora_pmon_ 10586 oracle    5r DIR    0,3         0 /proc/10586/fd
ora_pmon_ 10586 oracle    6u REG  252,0      1544 /app/oracle/product/12.1/dbs/hc_DEMO.dat
ora_pmon_ 10586 oracle    7u REG  252,0        24 /app/oracle/product/12.1/dbs/lkDEMO_SITE1
ora_pmon_ 10586 oracle    8r REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb

I've removed hundreds of lines with FD=mem and size=4M. I'm in AMM with memory_target=800M, and the SGA is implemented as /dev/shm granules. With lsof we see all of them, and with a large memory_target there can be thousands of them (even though the granule becomes 16M when memory_target is larger than 1GB). But don't worry: they don't count against the 'nofile' limit. Only 'real' file descriptors are counted - those with a numeric FD.

So, if you want to know the processes that are near the limit, you can use the following:

[oracle@VM211 ulimit]$ lsof | awk '$4 ~ /[0-9]+[rwu -].*/{p[$1"\t"$2"\t"$3]=p[$1"\t"$2"\t"$3]+1}END{for (i in p) print p[i],i}' | sort -n | tail
15 ora_dmon_    10634   oracle
16 ora_dbw0_    10608   oracle
16 ora_mmon_    10626   oracle
16 ora_rsm0_    10722   oracle
16 tnslsnr      9785    oracle
17 automount    1482    root
17 dbus-daem    1363    dbus
20 rpc.mount    1525    root
21 ora_lgwr_    10610   oracle
89 master       1811    root


The idea is to filter the output of lsof with awk, keeping only the numeric file descriptors, and to aggregate per process. Then we sort and show the highest counts. Here the Postfix master process has 89 files open, followed by the log writer.

You can get the same information from the /proc filesystem, where file handles are in /proc/<pid>/fd:

for p in /proc/[0-9]* ; do echo $(ls $p/fd | wc -l) $(cat $p/cmdline) ; done | sort -n | tail
15 ora_dmon_DEMO
16 ora_dbw0_DEMO
16 ora_mmon_DEMO
16 ora_rsm0_DEMO
16 /app/oracle/product/12.1/bin/tnslsnrLISTENER-inherit
17 automount--pid-file/var/run/
17 dbus-daemon--system
20 rpc.mountd
21 ora_lgwr_DEMO
89 /usr/libexec/postfix/master


Same result, much quicker, and with more information about the process. This is the way I prefer, but remember that if you want to see all processes, you should be logged in as root.


The proof

As I did for nproc, I have written a small C program that opens the files passed as arguments and keeps them open for a few seconds, so that I'm sure I'm counting the right things.

I encourage you to do the same on a test system and to let me know if your results differ.

First, I set my nofile limit to only 10

ulimit -n 10


Then, let's open 7 files. Together with stdin, stdout and stderr, that makes 10 file handles:

[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp &
open file 1 of 7 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 7 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 7 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 7 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 7 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 7 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 7 getrlimit nofile: soft=10 hard=10 myfile7.tmp


I was able to open those 7 files. Then I check lsof:

[oracle@VM211 ulimit]$ lsof | grep openfiles
openfiles 21853    oracle  cwd       DIR  0,24    380928    9320 /tmp/ulimit
openfiles 21853    oracle  rtd       DIR 252,0      4096       2 /
openfiles 21853    oracle  txt       REG  0,24      7630    9494 /tmp/ulimit/openfiles
openfiles 21853    oracle  mem       REG 252,0    156928 1579400 /lib64/
openfiles 21853    oracle  mem       REG 252,0   1926800 1579401 /lib64/
openfiles 21853    oracle    0u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    1u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    2u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    3r      REG  0,24         0    9487 /tmp/myfile1.tmp
openfiles 21853    oracle    4r      REG  0,24         0    9488 /tmp/myfile2.tmp
openfiles 21853    oracle    5r      REG  0,24         0    9489 /tmp/myfile3.tmp
openfiles 21853    oracle    6r      REG  0,24         0    9490 /tmp/myfile4.tmp
openfiles 21853    oracle    7r      REG  0,24         0    9491 /tmp/myfile5.tmp
openfiles 21853    oracle    8r      REG  0,24         0    9492 /tmp/myfile6.tmp
openfiles 21853    oracle    9r      REG  0,24         0    9493 /tmp/myfile7.tmp


We see our 10 file handles, which proves that only numeric FDs are counted against the nofile limit of 10. You see stdin, stdout and stderr as FDs 0, 1 and 2, and then my 7 files opened read-only.

Let's try to open one more file:

[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp myfile8.tmp
open file 1 of 8 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 8 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 8 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 8 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 8 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 8 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 8 getrlimit nofile: soft=10 hard=10 myfile7.tmp
open file 8 of 8 getrlimit nofile: soft=10 hard=10 myfile8.tmp
fopen() number 8 failed with errno=24


Here the limit is reached and fopen() fails with errno 24 (EMFILE: too many open files), because we have reached nofile=10.



When counting processes for the nproc limit, we saw that threads must be counted as processes. For the nofile limit we don't need to detail the threads, because all the threads of a process share the same file descriptor table.


Recommended values

Currently, this is what is set on Oracle Linux 6 for 11gR2 (in /etc/security/limits.conf):

oracle   soft   nofile    1024
oracle   hard   nofile    65536


For 12c, these are set in /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf which overrides /etc/security/limits.conf:

oracle soft nofile 1024
oracle hard nofile 65536


Do you think it's a bit low? Just for information, here is what is set in the ODA X4-2:

oracle soft nofile 131072


In any case, it is a good idea to check whether you are reaching the limit, and the scripts above based on lsof or /proc should help with that.

SQL Server: How to find the default data path?

Sun, 2014-06-15 18:16

I have read a lot of SQL Server blog posts and articles trying to find the default data path. This post covers different SQL Server versions (SQL Server 2005, 2008, 2012 and 2014) and provides a generic script with different methods.
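As a sketch of one such method (not necessarily the one used in the generic script): on SQL Server 2012 and later, the instance default paths are exposed through SERVERPROPERTY, while older versions typically fall back to the registry or to the location of the master database files.

```sql
-- SQL Server 2012+ only: both properties return NULL on older versions
SELECT SERVERPROPERTY('InstanceDefaultDataPath') AS DefaultDataPath,
       SERVERPROPERTY('InstanceDefaultLogPath')  AS DefaultLogPath;
```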

SQL Server 2014: sys.dm_exec_query_profiles, a new promising feature

Fri, 2014-06-13 03:52

Among the hidden features of SQL Server 2014 there is a very interesting dynamic management view named sys.dm_exec_query_profiles. It will help database administrators and experienced developers troubleshoot long-running queries in real time. I'm pretty sure you have had to deal with the following questions: When will this query finish? What percentage of the total workload has this request performed? Which steps are the longest? Before SQL Server 2014, it was impossible to answer them!

But first, I have to admit that this new DMV raised a lot of questions during my tests. I will try to share my thoughts and findings with you in this post.

My test consisted of running this long query:


select
    YEAR(TransactionDate) AS year_tran,
    MONTH(TransactionDate) AS month_tran,
    FIRST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS first_product_nb,
    LAST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS last_product_nb
from AdventureWorks2012.dbo.bigTransactionHistory as a
    join AdventureWorks2012.dbo.bigProduct as p
        on a.ProductID = p.ProductID


On my computer this query takes 5’47’’ to run. Parallelism is enabled, and my SQL Server instance can use up to 8 processors. The first time, I was disappointed by the output of the sys.dm_exec_query_profiles DMV during my query execution:


select * from sys.dm_exec_query_profiles




… No output! Ok, what are we supposed to do to see something with sys.dm_exec_query_profiles? Keep good habits and go back to the SQL Server BOL, which says:

To serialize the requests of sys.dm_exec_query_profiles and return the final results to SHOWPLAN XML, use SET STATISTICS PROFILE ON; or SET STATISTICS XML ON;

In other words, to have a chance to see something in this DMV, we must use session options like STATISTICS PROFILE or STATISTICS XML, or force SQL Server to display the execution plan after query execution with SQL Server Management Studio. My first thought was: why do we have to use additional options to produce output for this new DMV? These constraints severely limit the scenarios where we can use it. After some reflection and discussions with other French MVPs, I think this is normal: tracking execution plan information is expensive, and we could potentially bring a production server to its knees by enabling this feature for the whole workload.

But using additional session options can be impractical in production, because it requires running the query yourself, which is not always possible. Fortunately, Microsoft provides the query_post_execution_showplan event, which can be used in a SQL Server trace or an extended event session. However, the implementation of this event has a significant impact on the performance of a SQL Server instance: even with a short-circuit predicate, the event fires for every SQL statement executed, because the query duration is not known ahead of time (please refer to this Microsoft Connect item). To summarize: in an OLTP production environment this event should only be used for short troubleshooting periods. In an OLAP environment the story is different, because we don't deal with a lot of short queries, but only with long-running queries issued by cube processing or ETL processes, for example.

After discussing the pros and cons of this new feature, let's go back to my earlier T-SQL statement, now with the session option SET STATISTICS PROFILE ON:


SET STATISTICS PROFILE ON;

select
    YEAR(TransactionDate) AS year_tran,
    MONTH(TransactionDate) AS month_tran,
    FIRST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS first_product_nb,
    LAST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS last_product_nb
from AdventureWorks2012.dbo.bigTransactionHistory as a
    join AdventureWorks2012.dbo.bigProduct as p
        on a.ProductID = p.ProductID


In other session:


select * from sys.dm_exec_query_profiles




This DMV provides a lot of useful information, more granular than SET STATISTICS IO, because the counters returned are per operator per thread (node_id / physical_operator_name and thread_id columns). In my case the SQL Server instance can use up to 8 processors, and we can notice that some operators (node_id) have several threads running in parallel (thread_id). Furthermore, two other columns are interesting: estimate_row_count and row_count. The former is the number of rows an operator is estimated to process, and the latter is the number of rows it has processed so far (remember, these counters are per operator per thread). Comparing the two gives a completion percentage per operator (or per operator per thread). Finally, the information provided by this DMV falls into two categories: information related to the query execution engine (row_count, estimate_row_count) and information related to the storage engine (logical_read_count, write_page_count, lob_logical_read_count, etc.).

As you can see, this DMV provides a lot of information, and you will certainly customize the final output depending on what you want to focus on.

Here is an example of a custom script that uses sys.dm_exec_query_profiles, sys.objects, sys.indexes, sys.dm_exec_requests, sys.dm_exec_sessions, and sys.dm_os_waiting_tasks to cover both the real-time execution information and the related waits occurring at the same time.

USE AdventureWorks2012;
GO

SELECT
    qp.node_id,
    qp.session_id,
    o.name AS [object_name],
    i.name AS index_name,
    qp.physical_operator_name + QUOTENAME(CAST(COUNT(qp.thread_id) AS VARCHAR(4))) AS physical_operator_name,
    SUM(qp.estimate_row_count) AS estimate_row_count,
    SUM(qp.row_count) AS row_count,
    CASE
        WHEN SUM(qp.row_count) * 1. / SUM(qp.estimate_row_count + .00001) <= 1
            THEN CAST(CAST(SUM(qp.row_count) * 100. / SUM(qp.estimate_row_count + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %'
        ELSE '100 % + (' + CAST(CAST((SUM(qp.row_count) * 100. / SUM(qp.estimate_row_count + .00001)) - 100 AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %)'
    END AS [completed_%],
    -- Query execution engine
    CAST(CAST(SUM(qp.elapsed_time_ms) * 100. / (SUM(SUM(qp.elapsed_time_ms)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_elapsed_time_%],
    CAST(CAST(SUM(qp.cpu_time_ms) * 100. / (SUM(SUM(qp.cpu_time_ms)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_cpu_%],
    -- Storage engine
    CAST(CAST(SUM(qp.logical_read_count) * 100. / (SUM(SUM(qp.logical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_logical_read_%],
    CAST(CAST(SUM(qp.physical_read_count) * 100. / (SUM(SUM(qp.physical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_physical_read_%],
    CAST(CAST(SUM(qp.lob_logical_read_count) * 100. / (SUM(SUM(qp.lob_logical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [lob_logical_read_%],
    CAST(CAST(SUM(qp.lob_physical_read_count) * 100. / (SUM(SUM(qp.lob_physical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [lob_physical_read_%],
    CAST(CAST(SUM(qp.write_page_count) * 100. / (SUM(SUM(qp.write_page_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_write_%],
    SUM(qp.logical_read_count) AS total_logical_read_count,
    SUM(qp.lob_logical_read_count) AS total_lob_logical_read_count,
    SUM(qp.lob_physical_read_count) AS total_lob_physical_read_count,
    SUM(qp.lob_read_ahead_count) AS total_lob_read_ahead_count,
    SUM(qp.physical_read_count) AS total_physical_read_count,
    SUM(qp.read_ahead_count) AS total_read_ahead_count,
    SUM(qp.write_page_count) AS total_write_page_count,
    -- Both
    SUM(qp.cpu_time_ms) AS total_cpu_time_ms,
    SUM(qp.elapsed_time_ms) AS total_elapsed_time_ms,
    -- Wait info
    COALESCE(wait.wait_info, '') AS wait_info
FROM sys.dm_exec_query_profiles AS qp (NOLOCK)
    LEFT JOIN sys.objects AS o (NOLOCK)
        ON qp.object_id = o.object_id
    LEFT JOIN sys.indexes AS i (NOLOCK)
        ON qp.index_id = i.index_id
            AND qp.object_id = i.object_id
OUTER APPLY (
    SELECT STUFF(
        (SELECT
             ',' + ws.wait_type + ' ' + QUOTENAME(CAST(SUM(COALESCE(ws.wait_duration_ms, 0)) AS VARCHAR(20)) + ' ms')
         FROM sys.dm_exec_requests AS r (NOLOCK)
             JOIN sys.dm_exec_sessions AS s (NOLOCK)
                 ON r.session_id = s.session_id
             JOIN sys.dm_os_waiting_tasks AS ws (NOLOCK)
                 ON ws.session_id = s.session_id
         WHERE s.session_id = qp.session_id
         GROUP BY ws.wait_type
         FOR XML PATH (''), TYPE).value('.', 'varchar(max)'), 1, 1, '')
) AS wait (wait_info)
GROUP BY qp.node_id, qp.session_id, qp.physical_operator_name, o.name, i.name, COALESCE(wait.wait_info, '')
ORDER BY qp.node_id
OPTION (RECOMPILE);


The first section of the above script concerns the general information:




1: The above script groups the information per operator. The number in brackets is the number of threads used by the operator.

2: Completion is the ratio between the rows processed so far by the corresponding operator and the total estimated rows. As we can expect, the estimated rows and the actual rows processed by an operator can differ. The script highlights this gap by showing the difference as a percentage in brackets when row_count is greater than estimate_row_count. This information can be useful to “estimate” the overall completion of the query.

The total_elapsed_time_% and total_cpu_% columns show each operator's relative share of the elapsed time and CPU time consumed by all the query execution operators.

3: For some physical operators like Clustered Index Scan, Table Scan, etc., we will show the related table and index objects.


The second section concerns the storage engine statistics for the query execution:




The above information can be useful to highlight the top operators depending on the resource or the operation type. For example, in my case I can notice that the clustered index scan of the bigTransactionHistory table is the main consumer of logical reads. Furthermore, we can observe that the sort operator is the root cause of spilling. Finally, as expected, we can notice that by design the hash match operator consumes a lot of CPU and is the top operator in this case.


The next section shows the absolute values per operator per type of resources and operation:




Finally, the last section shows, in real time, the wait types during the query execution, with the duration aggregated over all related threads shown in brackets:




In my case we retrieve the wait type related to parallel query execution (CXPACKET) and certainly one of the root causes of the CXPACKET wait type (IO_COMPLETION), but here we don't have sufficient information to verify it. We can just notice that the query execution generates CXPACKET and IO_COMPLETION wait types in real time.

At this point we can ask why we don't have any remaining time information. In fact, computing an accurate remaining time is very difficult because a lot of factors must be correlated together. We can take a naive approach by computing the remaining time from the row_count, estimated_row_count and elapsed time columns for each operator, but you will be disappointed when testing this… According to my own tests, I can claim that the result is never as accurate as you would expect. Please feel free to share your results or thoughts about computing an accurate remaining time with us!
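For completeness, here is what that naive approach looks like as a sketch: shell arithmetic with invented numbers, with the variable names borrowed from the DMV columns. It is deliberately simplistic, which is exactly why its result is rarely trustworthy.

```shell
# Naive remaining-time estimate for a single operator: scale the
# elapsed time by the ratio of estimated rows still to process.
# Per-row cost is not constant during execution, so treat the
# result as a rough hint at best.
elapsed_ms=40000
row_count=2000000
estimated_row_count=5000000

remaining_ms=$(( elapsed_ms * (estimated_row_count - row_count) / row_count ))
echo "naive remaining time: ${remaining_ms} ms"
```

The same arithmetic could be expressed directly in the T-SQL above, but the underlying assumption (constant throughput per row) is what breaks down in practice.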

The sys.dm_exec_query_profiles DMV can be correlated with external tools like XPERF or Perfmon. I will prepare a future blog post (as soon as I have some time) to share my experience with you.

Happy query profiling!

Linux: how to monitor the nproc limit

Tue, 2014-06-10 01:14

You probably know about the 'nproc' limit in Linux, which is set in /etc/security/limits.conf and checked with 'ulimit -u'. But do you know how to monitor it and be alerted when you're close to the fixed limit?

Nproc and ps

Nproc is defined at OS level to limit the number of processes per user. Oracle documentation recommends the following:

oracle soft nproc 2047
oracle hard nproc 16384

But that is often too low, especially when you have the Enterprise Manager agent or other java programs running.

Do you want to check that you are far from the limit? Then you can use 'ps'. But beware: 'ps' by default does not show all processes.
In Linux, when doing multithreading, each thread is implemented as a light-weight process (LWP), and you must use the '-L' option to see all of them.

Let's take an example. I have a system where 'ps -u oracle' returns 243 lines. But including LWPs reveals a lot more processes, a number that is close to the limit:

$ ps h -Led -o user | sort | uniq -c | sort -n
      1 dbus
      1 ntp
      1 rpc
      1 rpcuser
      2 avahi
      2 haldaemon
      2 postfix
    166 grid
    400 root
   1370 oracle

So the 'oracle' user has 1370 processes. That's high. And this is the actual number where the nproc limit applies.

'ps -Lf' can show the detail. And even without '-L' we can display the NLWP which is the number of threads per process:

ps -o nlwp,pid,lwp,args -u oracle | sort -n
   1  8444  8444 oracleOPRODP3 (LOCAL=NO)
   1  9397  9397 oracleOPRODP3 (LOCAL=NO)
   1  9542  9542 oracleOPRODP3 (LOCAL=NO)
   1  9803  9803 /u00/app/oracle/product/agent12c/core/ /u00/app/oracle/product/agent12c/core/ agent /u00/app/oracle/product/agent12c/agent_inst/sysman/log/emagent.nohup
  19 11966 11966 /u00/app/11.2.0/grid/bin/oraagent.bin
1114  9963  9963 /u00/app/oracle/product/agent12c/core/ ... emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain

The Oracle 12c EM agent has started 1114 threads and the grid infrastructure 'oraagent.bin' has 19 threads. In addition to that I've a lot of other monothreaded processes. This is how we reach 1370 which is the exact value to compare to the nproc limit.

So what are good values to set? About the high number of threads for the EM agent 12c, there are a few bugs. And I suspect that 1000 threads is too much, especially since, when checking them with 'jstack', I see that they are "CRSeOns" threads that should not be used in and higher. But that's another problem, which I'm currently investigating.

When you reach the nproc limit, the user will not be able to create new processes: clone() calls will return EAGAIN, which is reported by Oracle as:

ORA-27300: OS system dependent operation:fork failed with status: 11 
ORA-27301: OS failure message: Resource temporarily unavailable 

And that is clearly bad when it concerns an +ASM instance or archiver processes.

The goal of the nproc limit is only to prevent 'fork bombs', where a process forks forever and exhausts all resources. So there is no problem with increasing this limit. However, if you set it high for some users ('oracle' and 'grid' usually), it can be a good idea to monitor the number of processes with the 'ps h -L' command above. Having too many processes is suspect, and increasing the limit just hides a process leak and defers the failure.

In 'ps h -L -o', the argument 'h' removes the header line and '-L' shows all processes including LWPs. You can then count them with 'wc -l'.
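A minimal alerting sketch built on this counting could look as follows. It is only a sketch: the 80 % threshold, the messages, and running it from cron are assumptions to adapt to your environment.

```shell
# Sketch: warn when the current user's LWP count approaches the
# soft nproc limit. Threshold and messages are illustrative.
user_to_check=$(id -un)
threshold_pct=80

# Count every process including light-weight processes (threads),
# in the spirit of the 'ps h -L' commands shown above
count=$(ps h -eL -o user | grep -c "^${user_to_check}" || true)

# Soft nproc limit of the current shell (may be 'unlimited')
limit=$(ulimit -u)

if [ "$limit" = "unlimited" ]; then
    echo "${user_to_check}: ${count} LWPs (no nproc limit set)"
else
    pct=$(( count * 100 / limit ))
    echo "${user_to_check}: ${count}/${limit} LWPs (${pct}%)"
    if [ "$pct" -ge "$threshold_pct" ]; then
        echo "WARNING: ${user_to_check} is close to the nproc limit" >&2
    fi
fi
```

Scheduled from cron, the WARNING line could be redirected to your alerting tool; raising the limit without such monitoring only hides a process leak.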

The proof

In order to be sure that 'ps h -L' gives the exact number, I have tested it. In case you want to check this on your system, here is how to do it. And please report any difference.

First, set your limit to 1024 processes. This is a limit for my user, and the limit is set for my shell and all its child processes:

[oracle@VM211 ocm]$ ulimit -u 1024

Now you can check it:

[oracle@VM211 ocm]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15919
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Then you can run a small C program that calls fork() in a loop until it fails with EAGAIN:

[oracle@VM211 ocm]$ ./testnproc
parent says fork number 871 sucessful
 child says fork number 872 pid 1518
parent says fork number 872 sucessful
 child says fork number 873 pid 1519
parent says fork number 873 sucessful
 child says fork number 874 pid 1520
parent says fork number 874 sucessful
parent says fork number 875 failed (nproc: soft=1024 hard=1024) with errno=11

And finally, because the processes sleep for a while, you can check how many processes you have. I do that from another user account for the simple reason that I need to create 2 more processes ('ps' and 'wc') for that:

[root@VM211 ocm]# ps h -Lu oracle | wc -l
Recommended values for Oracle

Currently this is what is set on Oracle Linux 6 for 11gR2 by the preinstall package (in /etc/security/limits.conf):

oracle   soft   nproc    16384
oracle   hard   nproc    16384

For 12c, these are set in /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf which overrides /etc/security/limits.conf:

oracle soft nproc 16384
oracle hard nproc 16384

And just for your information, here is what is set in the ODA X4-2:

oracle soft nproc 131072

So what do you want to set? You probably don't want it too low, or you will experience 'resource temporarily unavailable' errors. But what you don't want either is 100,000 processes on your server. So my recommendation is to set it high but monitor it, and alert when the number of processes reaches something that is not sensible. Then you prevent having the system down in case of a process leak, but you can still detect it and ask for a patch.

SQL Server 2014: Availability groups failover impossible with Failover Cluster Manager

Fri, 2014-06-06 01:27

A couple of weeks ago, I worked for a customer that wanted to implement SQL Server 2012 (and not SQL Server 2014) AlwaysOn with availability groups. During our tests, we performed a bunch of failover tests and the customer tried to fail over one of the installed availability groups by using the failover cluster manager (FCM). Of course, I told him this is not a best practice because the failover cluster manager is not aware of the synchronization state of the availability group. But with SQL Server 2014, the story has changed because I noticed a different behavior. I would like to share this information with you in this posting.

But let me first demonstrate the SQL Server 2012 behavior with the following example:

I have an availability group named SQL12 configured with 2 synchronous replicas and automatic failover. However, as you can see, the synchronization state of my availability database is not healthy as shown below:




Now, if I try to failover my availability group using the failover cluster manager ...






.... the next available node is SQL143 ...






... and we can notice the failover did not occur as expected because the SQL12 resource is still on the SQL141 node as shown in the following picture:




Having a look at the cluster error log does not help us in this case because we have a classic 1069 error number without helpful information:





Generating detailed cluster error logs could help us but I prefer to directly look at the SQL Server side for the moment. The AlwaysOn_health extended event is a good start to check for some existing records associated to the problem.




We have indeed some information about the failover attempt. First, the SQL141\SQL12 replica state changed from PRIMARY_NORMAL to RESOLVING_NORMAL due to the manual failover issued by the FCM.




Then we can see an error message that explains that the dummy database is changing its role from PRIMARY to RESOLVING because there is a problem with a role synchronization. This error is issued by the forced failover of the availability group resource that I used.




Finally, we notice the failover process did not complete successfully and the dummy database fails back to the SQL141\SQL12 replica (availability_replica_id 1B9007CA-EE3F-486D-A974-838EFED0203D is associated with the SQL141\SQL12 replica in my case).




On the other side, the SQL143\SQL12 secondary replica also features a lot of useful information:






To summarize, this test demonstrates clearly that the FCM is not aware of the availability databases synchronization state inside an availability group. Using FCM may result in unintended outcomes, including unexpected downtime!

Now, it's time to test the same scenario with SQL Server 2014 and a configured availability group. During the failover attempt, I get the following error message:






If we take a look at the possible owners of the corresponding availability group resource, we can see that the current cluster node that hosts the availability group is the only possible owner - unlike SQL Server 2012.




As a reminder, possible and preferred owners are reset during availability group creation and failover, based on the primary replica and its secondaries. For fun, I decided to compare the two processes issued by both versions of SQL Server and availability groups (SQL Server 2012 and SQL Server 2014) after triggering an availability group failover and generating the associated cluster logs:


Get-ClusterLog -UseLocalTime -Span 5

Result with SQL Server 2012

000005d4.00000778::2014/05/26-22:10:55.088 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (sql12, 1)
000005d4.00000778::2014/05/26-22:10:55.088 INFO [GUM] Node 1: executing request locally, gumId:215, my action: /rcm/gum/AddPossibleOwner, # of updates: 1
000005d4.00000778::2014/05/26-22:10:55.088 INFO [RCM] rcm::RcmGum::AddPossibleOwner(sql12,1)
000005d4.00000778::2014/05/26-22:10:55.103 ERR   mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.'
000005d4.00000778::2014/05/26-22:10:55.103 WARN [DM] Aborting group transaction 29:29:613+1
000005d4.00000778::2014/05/26-22:10:55.103 ERR   [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed'
000005d4.00000778::2014/05/26-22:10:55.103 WARN [RCM] sql12 cannot be hosted on node 3
000005d4.00000778::2014/05/26-22:10:55.103 WARN [RCM] Possible owners:
000005d4.00000778::2014/05/26-22:10:55.103 WARN     2
000005d4.00000778::2014/05/26-22:10:55.103 WARN     1
000005d4.00000778::2014/05/26-22:10:55.103 WARN
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmApi::RemovePossibleOwner: (sql12, 2)
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:215, my action: /rcm/gum/RemovePossibleOwner, # of updates: 1
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::RemovePossibleOwner(sql12,2)
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] Removing node 2 from resource 'sql12'.
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (sql12, 2)
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:216, my action: /rcm/gum/AddPossibleOwner, # of updates: 1
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::AddPossibleOwner(sql12,2)
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] Adding node 2 to resource 'sql12'.
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:217, my action: /rcm/gum/SetGroupPreferredOwners, # of updates: 1
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::SetGroupPreferredOwners(sql12,
000005d4.00000778::2014/05/26-22:10:55.103 INFO     1
000005d4.00000778::2014/05/26-22:10:55.103 INFO     2
000005d4.00000778::2014/05/26-22:10:55.103 INFO


Result with SQL Server 2014

000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (AdventureWorksGrp, 1)
000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:230, my action: /rcm/gum/AddPossibleOwner, # of updates: 1
000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::AddPossibleOwner(AdventureWorksGrp,1)
000005d4.00000bb0::2014/05/26-22:14:54.578 ERR   mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.'
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [DM] Aborting group transaction 29:29:627+1
000005d4.00000bb0::2014/05/26-22:14:54.578 ERR   [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed'
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [RCM] AdventureWorksGrp cannot be hosted on node 3
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [RCM] Possible owners:
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN     2
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN     1
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmApi::RemovePossibleOwner: (AdventureWorksGrp, 2)
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:230, my action: /rcm/gum/RemovePossibleOwner, # of updates: 1
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::RemovePossibleOwner(AdventureWorksGrp,2)
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] Removing node 2 from resource 'AdventureWorksGrp'.
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message
000005d4.000011f4::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:231, my action: /rcm/gum/SetGroupPreferredOwners, # of updates: 1
000005d4.000011f4::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::SetGroupPreferredOwners(AdventureWorksGrp,
000005d4.000011f4::2014/05/26-22:14:54.578 INFO     1
000005d4.000011f4::2014/05/26-22:14:54.578 INFO


As I said earlier, possible and preferred owners properties are managed automatically by SQL Server AlwaysOn. We can see here this is done by the cluster resource control manager and the following functions:

rcm::RcmApi::AddPossibleOwner(), rcm::RcmApi::RemovePossibleOwner() and rcm::RcmApi::SetGroupPreferredOwners().

You can notice that two nodes are added as possible owners with SQL Server 2012. However, with SQL Server 2014, only one node is added as possible owner of the concerned availability group resource. Interesting change, isn't it?

How to list all Oracle system schemas

Thu, 2014-06-05 04:51

Do you want to know which users come with the Oracle Database and which are the ones you have created? This is not that easy, especially prior to Oracle 12c. You know SYS and SYSTEM, but there are many others - especially when you have installed all options.

And it is important to know them all, e.g. when you import, apply a password policy, etc.

Let's see the options we have.


12c shows Oracle maintained users

Oracle 12c has introduced the ORACLE_MAINTAINED column in the DBA_USERS dictionary view.

It is YES for the users created by Oracle scripts (create database, catalog.sql, catproc.sql). So it's the right place to check when you are in 12c.


12c Common users

Oracle 12c has another column, COMMON, to distinguish user commonality. This is for CDBs: the Oracle dictionary is stored in the root container, and each PDB has links to it. And what is nice is that even in a non-CDB the column is at 'YES' for Oracle maintained users and at 'NO' for created users (except if you create C## common users in a CDB).


Default passwords

The table sys.default_pwd$ stores the default password hash values so that we can check which users can log in with a trivial password. Even if it includes all the Oracle maintained users, it also contains some well-known users which are not provided by database creation at all.


Sysaux occupants

Another idea, which I took from Martin Bach, is to crosscheck with v$sysaux_occupants. Some Oracle provided users are missing, but the nice thing is that there is a description for the schemas that are listed.


Not exported by DataPump

DataPump maintains its own list of users for which the tables are not exported, and you can query sys.ku_noexp_tab for that. The old exp also has its list, stored in sys.exu8usr. But neither of them is exhaustive.

Another exclusion list is the one used by logical standby: system.logstdby$skip which may be used as well to crosscheck (only records with 'action=0' are relevant for schema exclusion).

Let's check all that on a 12c non-CDB database. Here is the query that crosschecks all that information with outer joins:

select created,username,oracle_maintained,common,no_exp,no_expdp,no_sby,default_password,sysaux,occupant_desc
from dba_users
left outer join
 (select distinct name username,'Y' no_expdp from sys.ku_noexp_tab where obj_type='SCHEMA') using(username)
left outer join (select distinct name username,'Y' no_exp from sys.exu8usr) using(username)
left outer join (select distinct name username,'Y' no_sby from system.logstdby$skip_support where action in (0,-1)) using(username)
left outer join (select distinct user_name username,'Y' default_password from sys.default_pwd$) using(username)
left outer join (select schema_name username,'Y' sysaux,decode(count(*),1,min(occupant_desc)) occupant_desc from v$sysaux_occupants group by schema_name) using(username)
order by created,username;

and the result, ordered by creation date so that you can check which users were created by the database creation:

CREATED            USERNAME              O.MAINT COMMON NO_EXP NO_EXPDP NO_SBY DEF_PWD SYSAUX
28-FEB-14 08:25:02 AUDSYS Y YES Y Y Y Y Y
28-FEB-14 08:25:02 SYS Y YES Y Y Y Y Y
28-FEB-14 08:25:02 SYSBACKUP Y YES   Y Y Y
28-FEB-14 08:25:02 SYSDG Y YES   Y Y Y
28-FEB-14 08:25:02 SYSKM Y YES   Y Y Y
28-FEB-14 08:25:02 SYSTEM Y YES Y   Y Y Y
28-FEB-14 08:25:05 OUTLN Y YES Y   Y Y
28-FEB-14 08:29:33 GSMADMIN_INTERNAL Y YES   Y Y Y
28-FEB-14 08:29:33 GSMUSER Y YES   Y Y Y
28-FEB-14 08:29:43 DIP Y YES   Y Y Y
28-FEB-14 08:30:39 XS$NULL Y YES   Y Y Y
28-FEB-14 08:31:12 ORACLE_OCM Y YES   Y Y Y
28-FEB-14 08:39:55 DBSNMP Y YES   Y Y Y Y
28-FEB-14 08:39:56 APPQOSSYS Y YES   Y Y Y
28-FEB-14 08:40:03 ANONYMOUS Y YES   Y Y Y
28-FEB-14 08:40:03 XDB Y YES   Y Y Y Y
28-FEB-14 08:44:29 GSMCATUSER Y YES   Y Y Y
28-FEB-14 08:53:38 WMSYS Y YES   Y Y Y Y
28-FEB-14 08:55:31 OJVMSYS Y YES   Y Y Y
28-FEB-14 08:59:58 CTXSYS Y YES   Y Y Y Y
28-FEB-14 09:00:44 MDSYS Y YES   Y Y Y Y
28-FEB-14 09:00:44 ORDDATA Y YES   Y Y Y Y
28-FEB-14 09:00:44 ORDPLUGINS Y YES   Y Y Y Y
28-FEB-14 09:00:44 ORDSYS Y YES   Y Y Y Y
28-FEB-14 09:00:44 SI_INFORMTN_SCHEMA Y YES   Y Y Y Y
28-FEB-14 09:09:48 OLAPSYS Y YES   Y Y Y Y
28-FEB-14 09:10:17 MDDATA Y YES Y   Y Y
28-FEB-14 09:15:09 SPATIAL_WFS_ADMIN_USR Y YES Y   Y Y
28-FEB-14 09:15:13 SPATIAL_CSW_ADMIN_USR Y YES Y   Y Y
28-FEB-14 09:20:00 LBACSYS Y YES   Y Y Y
28-FEB-14 09:20:16 APEX_040200 Y YES       Y
28-FEB-14 09:20:16 APEX_PUBLIC_USER Y YES Y     Y
28-FEB-14 09:20:16 FLOWS_FILES Y YES       Y
28-FEB-14 09:45:17 DVF Y YES   Y Y Y
28-FEB-14 09:45:17 DVSYS Y YES   Y Y Y
21-MAR-14 18:15:19 FRANCK N NO Y
09-APR-14 20:33:34 RMAN N NO Y     Y
11-APR-14 09:04:55 DEMO N NO Y     Y
23-APR-14 20:53:09 HR N NO Y   Y Y
23-APR-14 20:53:09 IX N NO Y   Y Y
23-APR-14 20:53:09 OE N NO Y   Y Y
23-APR-14 20:53:09 PM N NO Y   Y Y
23-APR-14 20:53:09 SH N NO Y   Y Y
25-APR-14 21:26:33 SQLTXADMIN N NO Y
25-APR-14 21:26:33 SQLTXPLAIN N NO Y
27-MAY-14 21:48:24 SCOTT N NO Y   Y Y


So you see that the 12c columns are highly reliable. They include all users that were created during the database creation, but do not include the users that were created afterwards.

When not in 12c, the exclusion list coming from DataPump is quite good, but not exhaustive.

Sometimes the most useful features are not those that change the future of the database, but rather those little things that make life easier.

In DBA_USERS, we have two very useful new columns: ORACLE_MAINTAINED and LAST_LOGIN. That is a great improvement.

How to avoid Failover Cluster with PowerShell scripting

Wed, 2014-06-04 20:15

One of my colleagues asked me a few days ago if I could create a PowerShell script to get rid of Windows Server Failover Cluster. The reason is that it has, in some specific cases, the bad habit of failing over a resource or a service even when we don't want it to. To be honest, as a SQL Server DBA I use the Failover Cluster feature of Windows Server a lot without any problems, but for this specific case, I did not find a successful configuration.

Here, we are using Failover Cluster to provide a Virtual IP address for a database using Dbvisit Standby. The database is open only on one node and the switchover or failover operation is performed by hand, so we want to prevent failover of the Virtual IP resource to avoid redirecting client connections to a database that is not open.

At one of our clients, we have a Windows Server 2008 R2 cluster with two nodes and a FileShare. A Client Access Point has been provided as a service for client connections and this service must not fail over automatically in any case. For some reason, this service fails over and I cannot avoid it, even with special configuration settings.

As a solution to this problem, I propose to create a PowerShell script which will create a new IP address attached to the IPv4 network interface. This script reads a configuration file to collect information about the IP address, like the network address mask, the network interface name, the IP address, the nodes where to start and stop the IP address, etc.

Here is the format of our configuration file:

# Configuration file for VIP management of databases
# Configuration format :
# SID            : Oracle SID for which the VIP is configured - mandatory
# VIP_ADR     : Address of the VIP - mandatory
# VIP_MASK   : VIP Network address Mask - mandatory
# VIP_IF         : Network interface to attach the VIP (i.e Local Area Connection) - mandatory
# VIP_LSNR    : VIP Listener - optional
# VIP_Node1   : Name of Node 1 - mandatory
# VIP_Node2   : Name of Node 2 - mandatory
# Note :
# The usage of parameters delimiters allows to have potentially empty (not
# defined parameters)
# How to call the script :
# .\CreateIPAddress.ps1 'Start' 'DBITEST' [-F]
# Param 1: "Start" or "Stop" - mandatory
# Param 2: VIP Address name - mandatory
# Param 3: -F to force creation of the VIP Address locally even if it is not possible to check if the VIP Address exits on the second node - optional
# Example :
# DBITEST: Area Connection::test1:test2

In this blog, I will share how the script works step by step and what are the components/commands I had to learn about to create an IP on a network interface.

First of all, this PowerShell script uses the cmdlet Invoke-Command which uses Windows Remote Management (WinRM). If WinRM is not configured you can enable the WS-Management protocol on the local computer and set up the default configuration for remote management with the following command: Winrm quickconfig.

You also need to change the execution policy in your PowerShell session with Set-ExecutionPolicy Unrestricted to allow script execution. You should also open your PowerShell session with the Run As Administrator option.


The first thing I do is to clear my screen:


Now, I have to check if I receive the parameter I need for my script:


I have initialized my two mandatory parameters and perhaps a third one which is optional. It is time to read where my configuration file is located, test this location and initialize the variables.


I am now able to read my configuration file sequentially to find my VIP Address. I test the first character of each line to skip comment lines and, when I find a non-comment line, check whether its first parameter is equal to the VIP Address entered on the command line.


I have found my VIP Address in my configuration file. I will now split my configuration line to collect all the information I need and connect to my second node to get the different IP Addresses attached to its Network Adapters.

If the second node doesn't respond, we stop the script to avoid creating a local IP Address which already exists on the second node. But if the optional parameter "-F" has been entered, we force the creation.


If we run the command "Get-WmiObject -Class Win32_NetworkAdapterConfiguration -Filter IPEnabled=TRUE -ComputerName $node2 -erroraction stop", we have a list of each Network Adapter with its IP Addresses associated.


We have now to search in the different lines if our VIP Address exists. If we find it, we have to delete this IP Address and if it doesn't exist, we should print a message.


So far, we have checked remotely whether our IP Address existed on the remote server and, if that was the case, we have deleted this IP Address.

Now, based on our parameter, we will stop or start this IP Address locally.


Our VIP address is now started or stopped locally depending on the defined parameters. Finally, if we have not found the VIP Address in the configuration file, an error will be displayed and the script ends.


At present, we have a local VIP Address where the clients can connect. If our node goes down for whatever reason, when it comes back our VIP Address will still be on the same node. If we want to move it from node 1 to node 2, we just have to run our script on node 2 with the start parameter. This will stop the IP Address on node 1 and start the IP Address on node 2.

I hope this script and the different functions I have shown here were able to help you.

Configure a Kerberos Client and Alfresco as a Kerberized application

Wed, 2014-05-28 03:57

In the previous blog, I described how to install and manage a Kerberos server, but that's useless if there are no clients and if no application has been kerberized! That's why in this blog I will explain in the first part how to install a Kerberos client on Linux. The second part is dedicated to the configuration of a browser to use Kerberos tickets and the last part explains how to configure a popular application to use the newly created MIT Kerberos KDC. This application is Alfresco (a leader in Open Source solutions for Enterprise Content Management).

For this blog, let's define the following properties/variables:

  • = the DNS Domain
  • = the FQDN of the KDC
  • mpatou@EXAMPLE.COM = the principal of a test user
  • = the FQDN of the Alfresco host server

All configurations below have been tested on our infrastructure.

1. Install MIT Kerberos Client

In this part, I will only present how to install a Linux client. The Mac release is available as part of Mac OS X since version 10.3 (the current release is Mavericks: 10.9), so there is nothing to do to install a Kerberos client, but this installation isn't configured. The Windows installation is quite easy if you use the installer, but this isn't a perfect solution. I think the best solution for Windows would be to use the Windows implementation of Kerberos to configure the connection to a Linux KDC, but this is a little bit more complex and I personally have trouble configuring that...

So, all steps to install a Linux client are quite the same as those to install a Kerberos Server. Indeed, there is no specific source code for the client and so the basic installation is the same but it's not the case for the configuration.

Obviously, the first thing to do is to download the current release of the MIT Kerberos distribution for the target operating system. This could be done at the following URL: The current Linux release is krb5-1.12.1-signed.tar:

# wget
    --2014-04-01 14:00:28--
    Connecting to||:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 11950080 (11M) [application/x-tar]
    Saving to: “krb5-1.12.1-signed.tar”
    100%[===============================================>] 11,950,080  1.52M/s   in 7.3s

    2014-04-01 14:00:38 (1.56 MB/s) - “krb5-1.12.1-signed.tar” saved [11950080/11950080]

# tar  -xvf krb5-1.12.1-signed.tar

As you can see, this file is signed and you could (should) verify the integrity and identity of the software. This can be done, for example, using GNU Privacy Guard:

# gpg  --verify  krb5-1.12.1.tar.gz.asc

After that, just extract the MIT Kerberos source code:

# tar  -zxf  krb5-1.12.1.tar.gz
# cd  krb5-1.12.1/src/
# ./configure
# yum  install  *yacc*
# make
# make  install

At this step, Kerberos should be installed properly and the binaries, libraries and the documentation should be under /usr/local. The default location is sufficient for a client installation:

# krb5-config  --all
    Version:     Kerberos 5 release 1.12.1
    Vendor:      Massachusetts Institute of Technology
    Prefix:      /usr/local
    Exec_prefix: /usr/local

To set up a linux client, there is only one step remaining: tell Kerberos where to find a KDC. This is done through a configuration file named krb5.conf. In the following configuration, I've included two lines (forwardable and proxiable) that are important for Alfresco SSO using Kerberos to work properly. These two lines aren't mandatory for other kerberized applications:

# vi  /etc/krb5.conf
    [libdefaults]
        default_realm = EXAMPLE.COM
        forwardable = true                            {Line only important for Alfresco}
        proxiable = true                              {Line only important for Alfresco}

    [realms]
        EXAMPLE.COM = {
            kdc =
            admin_server =
            default_domain =
        }

    [domain_realm]
        .EXAMPLE.COM = EXAMPLE.COM
        EXAMPLE.COM = EXAMPLE.COM

That should be enough to obtain a ticket for the test user ("kinit mpatou" or "kinit mpatou@EXAMPLE.COM") and to delete this ticket:

# klist
    klist: Credentials cache file '/tmp/krb5cc_0' not found
# kinit  mpatou
    Password for mpatou@EXAMPLE.COM:
# klist
    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: mpatou@EXAMPLE.COM

    Valid starting               Expires               Service principal
    05/20/2014 10:54:48     05/20/2014 20:54:48     krbtgt/EXAMPLE.COM@EXAMPLE.COM
    renew until 05/21/2014 10:54:47
# kdestroy
# klist
    klist: Credentials cache file '/tmp/krb5cc_0' not found

2. Configure the Browser

Now the client should be able to acquire the first ticket (TGT), but that's not enough! The next step is to configure the browser to use tickets. Indeed, when the client tries to access a kerberized application, the browser has to present a ticket for that application to log the user in automatically.

a. Mozilla Firefox (Window/Linux/Mac)

Open a new window/tab in Mozilla Firefox:

  • Enter in the URL field "about:config"
  • All properties of Firefox should be displayed
  • Search for "network.negotiate" and set: network.negotiate-auth.delegation-uris = .EXAMPLE.COM
  • Set: network.negotiate-auth.trusted-uris = .EXAMPLE.COM

b. Google Chrome

i. Mac

Modify the Google Chrome application with:

open -a 'Google Chrome' --args --auth-schemes="basic,digest,ntlm,negotiate" --auth-server-whitelist="*EXAMPLE.COM" --auth-negotiate-delegate-whitelist="*EXAMPLE.COM"

ii. Linux

Modify Google Chrome with:

google-chrome --enable-plugins --args --auth-server-whitelist="*EXAMPLE.COM" --auth-negotiate-delegate-whitelist="*EXAMPLE.COM" --auth-schemes="basic,digest,ntlm,negotiate"

iii. Windows

Modify the Google Chrome shortcut with:

chrome.exe --auth-schemes="basic,digest,ntlm,negotiate" --auth-server-whitelist="*EXAMPLE.COM" --auth-negotiate-delegate-whitelist="*EXAMPLE.COM"

c. Internet Explorer

Open a new window/tab in Internet Explorer and:

  • Open the "tools" (ALT+X)
  • Click on "internet options" (o)
  • Choose the "Advanced" tab
  • In the security section, check "Enable Integrated Windows Authentication*"
  • Choose the "Security" tab
  • Select "Local intranet"
  • Click on "Sites"
  • Click on "Advanced"
  • Add "http://*" or/and "https://*"
  • Click on "Close"
  • Click on "OK"
  • Click on "OK"
  • Restart the computer

If everything works fine, the client should now be able to access kerberized applications. If it's not the case, a good way to start debugging the Kerberos installation is to use a network analyzer like Wireshark.

3. Configure Alfresco to use Kerberos SSO

Actual test configuration:

  • OS: Oracle Enterprise Linux 6
  • Alfresco: Alfresco Community Enterprise 4.2.c installed on /opt/alfresco
  • Application Server: Tomcat 7

So let's begin the configuration of a kerberized application with Alfresco. The first thing to know about Alfresco is that it offers two main web clients: Alfresco Explorer and Alfresco Share. The first one is the core of Alfresco, and the second one is the newer interface (mainly oriented toward collaboration) that uses a proxy to ask Explorer to do the job. The configuration of Alfresco Explorer is quite easy, but getting Alfresco Share to work is not, as Share adds a new layer above Explorer.

Unless otherwise specified, all modifications below should be done on the Alfresco host server.

For Kerberos to work properly, the maxHttpHeaderSize must be increased:

# vi  /opt/alfresco/alfresco-4.2.c/tomcat/conf/server.xml
   <Connector port="8080" URIEncoding="UTF-8" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" maxHttpHeaderSize="32768" />

 Then the authentication chain must allow Kerberos tickets to be used to authenticate a user:

# vi  /opt/alfresco/alfresco-4.2.c/tomcat/shared/classes/alfresco-global.properties
    authentication.chain=kerberos1:kerberos,alfrescoNtlm1:alfrescoNtlm

After that, the Kerberos subsystem must be configured:

# cd  /opt/alfresco/alfresco-4.2.c/tomcat/shared/classes/alfresco/extension/subsystems
# vi  Authentication/kerberos/kerberos1/

Once this is done, the KDC must be configured to recognize Alfresco as a kerberized application. For that purpose, enter the KDC kadmin interface. Caution: this requires the installation of a Kerberos server as described in the previous blog (the important part is that kadmin must be usable from a remote computer) and the installation of a Kerberos client on the Alfresco host. If kadmin doesn't work from a remote location, you will have to use the kadmin.local interface on the KDC host server.

# /usr/local/bin/kadmin
    addprinc  cifs/<alfresco_host_fqdn>@EXAMPLE.COM
    addprinc  HTTP/<alfresco_host_fqdn>@EXAMPLE.COM
    ktadd  -k  /etc/krb5cifs.keytab  cifs/<alfresco_host_fqdn>@EXAMPLE.COM
    ktadd  -k  /etc/krb5http.keytab  HTTP/<alfresco_host_fqdn>@EXAMPLE.COM


This will create two files named 'krb5cifs.keytab' and 'krb5http.keytab' on the host where kadmin was run. If you ran kadmin from the Alfresco host, there is nothing more to do, but if you ran kadmin or kadmin.local on the KDC host, these two files must be moved to the Alfresco host:

# cd  /etc
# scp  krb5cifs.keytab  root@<alfresco_host>:/etc
enter password:
# scp  krb5http.keytab  root@<alfresco_host>:/etc
enter password:

Create or update the config file for the Java Security. Be careful that this is the Java used by Alfresco:

# vi  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security/java.login.config
    Alfresco {
        com.sun.security.auth.module.Krb5LoginModule sufficient;
    };

    AlfrescoCIFS {
        com.sun.security.auth.module.Krb5LoginModule required
        storeKey=true
        useKeyTab=true
        keyTab="/etc/krb5cifs.keytab"
        principal="cifs/<alfresco_host_fqdn>";
    };

    AlfrescoHTTP {
        com.sun.security.auth.module.Krb5LoginModule required
        storeKey=true
        useKeyTab=true
        keyTab="/etc/krb5http.keytab"
        principal="HTTP/<alfresco_host_fqdn>";
    };

    ShareHTTP {
        com.sun.security.auth.module.Krb5LoginModule required
        storeKey=true
        useKeyTab=true
        keyTab="/etc/krb5http.keytab"
        principal="HTTP/<alfresco_host_fqdn>";
    };

    com.sun.net.ssl.client {
        com.sun.security.auth.module.Krb5LoginModule sufficient;
    };

    other {
        com.sun.security.auth.module.Krb5LoginModule sufficient;
    };

Configure the default Java Security to use our custom configuration by adding a line at the end of the file:

# vi  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security/java.security
    login.config.url.1=file:${java.home}/lib/security/java.login.config

Finally, update the share-config-custom.xml file to contain the Kerberos configuration (the configuration itself was attached to the original post as images):

# vi /opt/alfresco/alfresco-4.2.c/tomcat/shared/classes/alfresco/web-extension/share-config-custom.xml



That should be enough to get Kerberos SSO working for Alfresco Explorer and Alfresco Share. I think there are some additional steps to get the Alfresco CIFS working too, but I'm not sure. Moreover, you may get some strange exceptions that prevent Alfresco from authenticating your Kerberos ticket. That probably comes from the Java Cryptography Extension (JCE) missing on your Alfresco server. In that case, download the JCE corresponding to the Java version used by Alfresco (JCE 6 or JCE 7) and deploy it (I assume below that the JCE 7 archive is under /opt):

# unzip  /opt/
# cd  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security
# cp  local_policy.jar  local_policy.jar.orig
# cp  US_export_policy.jar  US_export_policy.jar.orig
# cd  /opt/UnlimitedJCEPolicy
# cp  local_policy.jar  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security/
# cp  US_export_policy.jar  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security/

I hope I was clear enough in my explanations and I hope I was able to share (a little bit?) my passion for open source solutions!

Good luck with Kerberos and Alfresco!

Oracle Exadata - poor optimization for FIRST_ROWS

Wed, 2014-05-28 03:44

In a previous blog, I discussed the difference between rownum and row_number(), in particular their behaviour of implicitly adding a first_rows(n) directive for the optimizer. That reminded me of an issue I forgot to blog about, which concerns both approaches. It was on an Exadata: a nice full table scan with SmartScan was taking a long time, and forcing an index access - with a very bad index - was faster. The query had a very selective predicate (and that's why SmartScan should be very good here), but the predicate could not be used by the index access.

In order to explain this, I'll do the same query but without any predicates. It's on a simple TEST table with 1 million rows.

select /*+ gather_plan_statistics */ * from (
  select * from TEST order by n 
) where rownum <= 10


And here is the execution plan with execution statistics

| Id  | Operation                               | Name | E-Rows | A-Rows | Buffers | Reads  |
|   0 | SELECT STATEMENT                        |      |        |     10 |    2137 |   2135 |
|*  1 |  COUNT STOPKEY                          |      |        |     10 |    2137 |   2135 |
|   2 |   VIEW                                  |      |   1000K|     10 |    2137 |   2135 |
|*  3 |    SORT ORDER BY STOPKEY                |      |   1000K|     10 |    2137 |   2135 |
|*  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS| TEST |   1000K|   1000K|    2137 |   2135 |


It sounds good. SmartScan is used (TABLE ACCESS STORAGE). We read 1 million rows, then sort them and keep only the top 10. But it takes longer than expected. Let's have a look at some SmartScan statistics:


NAME                                                                   VALUE
---------------------------------------------------------------- -----------
cell IO uncompressed bytes                                        17,596,416
cell blocks processed by cache layer                                   3,329
cell blocks processed by data layer                                    2,148
cell blocks processed by txn layer                                     3,329
cell num fast response sessions                                            1
cell num fast response sessions continuing to smart scan                   1
cell physical IO bytes eligible for predicate offload             17,498,112
cell physical IO interconnect bytes                               31,273,680
cell physical IO interconnect bytes returned by smart scan        15,848,144
cell scans                                                                 2
physical read bytes                                               17,506,304
physical read total bytes                                         32,923,648


Something is wrong here. We have 17 MB eligible for SmartScan, which matches the 2,135 block reads we see in the execution plan (I have the default 8k blocks).
But 31 MB went through the interconnect. What are those additional 15 MB? I'm doing only SmartScan here, according to the execution plan.

Well, that's easy to check. A full table scan must read all blocks up to the high water mark.


select blocks,blocks*8192 MBYTES from user_tables where table_name='TEST';
---------- ----------
      4013   32874496


So my table has 4013 blocks and is about 32 MB. Not only is my execution plan above not optimal, it is false. It is simply impossible to full scan my table with only 2,137 block reads; I have to read 4,013 blocks, which is more than 30 MB.
So in the statistics above, the right values are 'physical read total bytes' and 'cell physical IO interconnect bytes'. But it seems that only 17 MB of data was read through SmartScan.
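As a quick sanity check, the byte counts above can be converted back into 8k blocks (a small sketch using the statistic values shown; the 8192-byte block size is the default mentioned earlier):

```python
BLOCK_SIZE = 8192  # default 8k blocks

eligible = 17_498_112       # cell physical IO bytes eligible for predicate offload
total_read = 32_923_648     # physical read total bytes
reported_read = 17_506_304  # physical read bytes (what the plan accounts for)

print(eligible // BLOCK_SIZE)      # 2136 blocks, matching the ~2135 reads in the plan
print(total_read // BLOCK_SIZE)    # 4019 blocks, what a real full scan needs
print(total_read - reported_read)  # 15417344 bytes (~15 MB) missing from the statistics
```

Note that 4019 blocks is exactly the Buffers/Reads figure shown in the corrected execution plan further down.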

Here we have to know a little bit more about an Exadata optimization that is not very well documented. The clue is 'FIRST ROWS' in the execution plan, just after 'STORAGE'. If it's a special feature, we can check how to enable/disable it with an underscore parameter. Names and descriptions are in sys.x$ksppi:


SQL> select KSPPINM,KSPPDESC from sys.x$ksppi where KSPPDESC like '%first%rows%';

Enable smart scan optimization for fast response (first rows)

cost ratio for sort eimination under first_rows mode

enable the use of first K rows due to rownum predicate

Optimizer index bias over FTS/IFFS under first K rows mode


From its description, _kcfis_fast_response_enabled is related to SmartScan and to 'first rows' (which we see in the execution plan, and which comes from the rownum <= 10 predicate).
So let's disable it:


alter session set "_kcfis_fast_response_enabled"=false;


and execute again, with execution plan and statistics.


| Id  | Operation                               | Name | E-Rows | A-Rows | Buffers | Reads  |
|   0 | SELECT STATEMENT                        |      |        |     10 |    4019 |   4017 |
|*  1 |  COUNT STOPKEY                          |      |        |     10 |    4019 |   4017 |
|   2 |   VIEW                                  |      |   1000K|     10 |    4019 |   4017 |
|*  3 |    SORT ORDER BY STOPKEY                |      |   1000K|     10 |    4019 |   4017 |
|*  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS| TEST |   1000K|   1000K|    4019 |   4017 |


The first observation is that execution was much faster. And here we see the roughly 4,000 blocks of my table that have to be read.


NAME                                                                                VALUE
---------------------------------------------------------------- ------------------------
cell IO uncompressed bytes                                                     33,120,256
cell blocks processed by cache layer                                                5,913
cell blocks processed by data layer                                                 4,043
cell blocks processed by txn layer                                                  5,913
cell physical IO bytes eligible for predicate offload                          32,915,456
cell physical IO interconnect bytes                                            29,833,624
cell physical IO interconnect bytes returned by smart scan                     29,825,432
cell scans                                                                              2
physical read bytes                                                            32,923,648
physical read total bytes                                                      32,923,648


All statistics are right now: 32 MB eligible for SmartScan, 32 MB returned by SmartScan.

So what happened?

When we use first_rows(10), either explicitly or coming from rownum <= 10, Oracle knows that we need only 10 rows. And Exadata has an optimization that avoids SmartScan for only a few rows, because SmartScan has a startup overhead. This is the 'Enable smart scan optimization for fast response (first rows)' feature, and you can see it in the 'cell num fast response sessions' statistic above. No SmartScan occurs when the result is expected to come quickly. However, if it takes longer than expected, Oracle can switch to SmartScan later ('cell num fast response sessions continuing to smart scan'). But there are two problems with that.
First, the statistics are wrong. The statistics from the first phase (the non-SmartScan one) are simply lost. We see it in the execution plan, where about 2,000 reads are missing, and we see it in 'physical read bytes', where about 15 MB are missing.
Second, and worse, this optimization is not good at all in our case. It doesn't matter that we need only 10 rows: we still have to read 1 million rows, because we have to sort them before filtering. And 1 million rows is not a good candidate for the 'fast response (first rows)' optimization. Here, about 15 MB were read without SmartScan, and when I checked the wait events, they were all single block reads. The fastest machine had chosen the slowest way to get data :(

This is just one example of first rows issues. There are many others. In general, the first_rows_n optimization is good when it avoids sorting all the rows, by using an index that is maintained in order. And by the way, if you see 'first rows' without an order by, it's suspicious: who wants the top-n of a random row set?
In order to have better control over first rows optimizations, I prefer to avoid rownum and use the row_number() function or the 12c FETCH FIRST ROWS syntax. See the previous blog about that.

Tempdb enhancements with SQL Server 2014

Sun, 2014-05-25 11:40

SQL Server 2014 is definitely designed for performance, and I will try to demonstrate it in this blog post. I like to talk about hidden performance features because they generally do not require any changes to applications, unlike in-memory tables (aka Hekaton tables) for instance.

Since SQL Server 2005, several improvements have been made to tempdb. Tempdb caching is one of them and reduces page allocation contention. Basically, to create a table, SQL Server must first build the system catalog entries related to system pages. Then, SQL Server has to allocate an IAM page and find a mixed extent in an SGAM page to store data and mark it in the PFS page (as a reminder, a mixed extent is chosen by default unless uniform extents are forced with trace flag 1118). Finally, the allocation must be recorded in the system pages. When a table is dropped, SQL Server has to undo everything it did to create the table. All of this implies locks and latches during the allocation process, the same ones used for creating and dropping a temporary table. However, in tempdb, tables are created and dropped very quickly, which can generate page allocation contention, especially on the PFS, SGAM and GAM system pages (the famous PAGELATCH_UP wait type against the concerned pages). The bottom line is that SQL Server can cache some of the metadata and page allocations of temporary objects for easier and faster reuse, with less contention.


In addition, for a temporary table to be cached, it must first be created inside a stored procedure, and some situations prevent this caching, such as:


  • Using named constraints
  • Using DDL after the temporary table creation
  • Creating the table in a different scope
  • Using a stored procedure with the recompile option

By executing the following T-SQL script with SQLQueryStress we can easily show that temporary tables are not reused by SQL Server.

use [AdventureWorks2012]
go

create table #test (
    TransactionID bigint,
    ProductID int,
    TransactionDate datetime,
    Quantity int,
    ActualCost money
)

insert #test
select top 10000 * from AdventureWorks2012.dbo.bigTransactionHistory

select
    ProductID,
    sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;


I used 8 concurrent threads with 100 iterations during this test.




At the same time, I enabled the following perfmon counters:


Counters monitored (min / avg / max values were shown in screenshots): Average latch wait time (ms), Latch waits / sec, Temp tables creation rate / sec, Cache objects in use.

Now, if I rewrite the same ad-hoc T-SQL statement as a stored procedure and perform the same test, we can notice a clear speed improvement:

use [AdventureWorks2012]
go

create procedure [dbo].[sp_test_tempdb]
as

create table #test (
    TransactionID bigint,
    ProductID int,
    TransactionDate datetime,
    Quantity int,
    ActualCost money
)

insert #test
select top 10000 * from AdventureWorks2012.dbo.bigTransactionHistory

select
    ProductID,
    sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;




Counters monitored (min / avg / max values were shown in screenshots): Average latch wait time (ms), Latch waits / sec, Temp tables creation rate / sec, Cache objects in use.

As expected, this improvement is due to the tempdb caching mechanism. We can notice here that SQL Server reuses cached objects ("Cache objects in use" counter > 0), which are in fact the temporary table of the stored procedure. Using cached objects drastically decreases the temporary table creation rate ("Temp tables creation rate / sec" is equal to 0 here).

The cached objects themselves are visible by querying sys.tables in the tempdb context. For example, during the first test we can easily observe that SQL Server does not completely deallocate a temporary table used in a stored procedure. Cached objects can be recognized by their negative object id. While SQL Server uses a temporary table, its name is #test; when SQL Server keeps it cached without deallocating the associated pages, the name becomes an 8-character hexadecimal string that maps to the object id: #AF42A2AE is the hexadecimal representation of the #test temporary table whose object id is -1354587474.
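The mapping between the negative object id and the hexadecimal name can be checked with a few lines (a sketch of the two's-complement conversion, not anything SQL Server itself exposes):

```python
def cached_temp_table_name(object_id: int) -> str:
    # The cached name is the 8-hex-digit, two's-complement
    # representation of the (negative) 32-bit object id.
    return "#{:08X}".format(object_id & 0xFFFFFFFF)

print(cached_temp_table_name(-1354587474))  # → #AF42A2AE
```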








Furthermore, we notice several records in the above results because I used SQLQueryStress with 8 concurrent threads, which implies concurrent executions of the stored procedure with separate cached objects in tempdb. Only 4 records are visible here (I didn't show the entire result), but in fact 8 records were retrieved.

As I said earlier, DDL statements after the creation of the temporary table inhibit SQL Server's ability to cache the temporary object and can decrease the overall performance of the stored procedure (one can ask what counts as a DDL statement here, because DROP TABLE #table is apparently not considered as such: it does not impact the tempdb caching mechanism). In my sample, SQL Server proposes to create the following index on the ProductID column to improve the query:


create nonclustered index idx_test_transaction_product_id on #test (
    ProductID
)

Let's go ahead: we trust SQL Server and add the creation of the index after the creation of the temporary table in the stored procedure:

use [AdventureWorks2012]
go

create procedure [dbo].[sp_test_tempdb]
as

create table #test (
    TransactionID bigint,
    ProductID int,
    TransactionDate datetime,
    Quantity int,
    ActualCost money
)

--create index for ProductID predicate
create nonclustered index idx_test_transaction_product_id on #test (
    ProductID
)

insert #test
select top 10000 * from AdventureWorks2012.dbo.bigTransactionHistory

select
    ProductID,
    sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;


However, the result is not as good as we would expect ...




If we take a look at the perfmon counters values:

Counters monitored (min / avg / max values were shown in screenshots): Average latch wait time (ms), Latch waits / sec, Temp tables creation rate / sec, Temp tables for destruction, Cache objects in use.

For this test I added a new perfmon counter, "Temp tables for destruction", which clearly indicates that the temporary tables will be destroyed by SQL Server because they cannot be reused in this case: the index creation DDL prevents the tempdb caching mechanism.

Here comes a new SQL Server 2014 feature that introduces a way of declaring nonclustered indexes directly in the CREATE TABLE statement, which can be a good workaround for the preceding test.


alter procedure [dbo].[sp_test_tempdb]
as

create table #test (
    TransactionID bigint,
    ProductID int index idx_test_transaction_product_id, --< index created "on the fly"
    TransactionDate datetime,
    Quantity int,
    ActualCost money
)

insert #test
select top 1000000 * from AdventureWorks2012.dbo.bigTransactionHistory

select
    ProductID,
    sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;


After running the test, we can notice that the "Temp tables creation rate" and "Temp tables for destruction" counter values are again equal to zero. SQL Server reused the cached temporary table during the test, as shown by the "Cache objects in use" counter.


Counters monitored (min / avg / max values were shown in screenshots): Average latch wait time (ms), Latch waits / sec, Temp tables creation rate / sec, Temp tables for destruction, Cache objects in use.

However, even though SQL Server 2014 still benefits from the tempdb caching mechanism with this new syntax, the result is offset by the total execution time, as the following picture showed:




The global execution time is larger than in the first stored procedure test with the temporary table and no nonclustered index (02:44 vs. 00:21 in my case). This is because inserting data into a table with a nonclustered index can take more time than inserting into a table without any index. In a real production environment, however, we will probably encounter situations where the cost of inserting data into an indexed table is outweighed by the gain for the subsequent reads. If you have some examples, please feel free to share them with us!

Another interesting feature, present for many versions, is the concept of eager writes, which prevents flooding the buffer pool with newly created pages coming from bulk activities that need to be written to disk. Eager writing is another background process that helps reduce the pressure on the well-known lazy writer and checkpoint background processes, and increases IO performance by gathering pages before writing them to disk. Basically, SQL Server tracks these pages in a circular list in memory. When the list is full, old entries are removed by writing them to disk if they are still dirty.
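The circular-list behaviour described above can be sketched as a toy model (an illustration only, not SQL Server's actual implementation; the capacity and page ids are made up):

```python
from collections import deque

class EagerWriteList:
    """Toy model of an eager-write list: a bounded queue of tracked pages
    where the oldest entry is flushed to disk, if still dirty, to make room."""
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = deque()
        self.flushed = []  # page ids written to disk

    def track(self, page_id, dirty=True):
        if len(self.entries) == self.capacity:
            old_id, old_dirty = self.entries.popleft()
            if old_dirty:  # old entries are written out only if still dirty
                self.flushed.append(old_id)
        self.entries.append((page_id, dirty))

lst = EagerWriteList(capacity=4)
for page in range(6):  # a bulk operation creating 6 new pages
    lst.track(page)
print(lst.flushed)     # → [0, 1]
```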

Let me show you with the following T-SQL script on the SQL Server 2012 instance. I used trace flag 3917 to show eager write activity (thanks to Bob Dorr for this tip).


use AdventureWorks2012;
go

-- create procedure sp_test_tempdb_2
-- bulk activity by using select into #table
CREATE procedure sp_test_tempdb_2
as

select
    bth.*,
    p.Name AS ProductName,
    p.Color
into #test
from AdventureWorks2012.dbo.bigTransactionHistory as bth
    join AdventureWorks2012.dbo.bigProduct as p
        on bth.ProductID = p.ProductID
where p.Color in ('White')
    and p.Size = 'M'
option (maxdop 1);

select
    TransactionDate,
    ProductID,
    ProductName,
    Quantity
    --Quantity * ActualCost AS total_individual_sale
from (
    select
        ROW_NUMBER() OVER (PARTITION BY TransactionDate ORDER BY Quantity DESC) AS num,
        *
    from #test
) transaction_production_sales_top_ten
where num <= 10
option (maxdop 1);

drop table #test
go

-- use trace flag 3917 to show eager write activity (be careful, the output may be verbose)
dbcc traceon(3917);
dbcc traceon(3605);
go

-- cycle the error log for an easier read
exec sp_cycle_errorlog;
go

-- execute the stored procedure dbo.sp_test_tempdb_2
exec dbo.sp_test_tempdb_2;
go

-- read the error log file
exec xp_readerrorlog;

Below a sample of the SQL Server error log:




We can notice that SQL Server writes up to 32 contiguous dirty pages to disk in my test.

Even if this process is optimized to write pages to disk efficiently, there is still IO activity. SQL Server 2014 enhances this process by relaxing the need to flush these pages to disk as quickly as in older versions. SQL Server recognizes the bulk activity, and the concerned pages are loaded, queried and released without any disk flushing activity.

The same test performed on the SQL Server 2014 environment gives the following result:




The eager write process was not triggered this time. So let's compare with a simulated workload, using ostress this time. Ostress is a stress tool provided by the RML utilities. I used ostress with 4 threads and 1000 iterations each. SQLQueryStress generated a bunch of ASYNC_NETWORK_IO waits during my tests, which would potentially distort the final result.

So, I used the following script for both environments (SQL Server 2012 and SQL Server 2014):

"C:Program FilesMicrosoft CorporationRMLUtilsostress.exe" -Slocalhost -dAdventureWorks2012 -Q"exec dbo.sp_test_tempdb_2" -n4 -r1000 -N –q


SQL Server 2012




… the corresponding io file stats:


SELECT
    d.name AS database_name,
    f.name AS [file_name],
    f.physical_name,
    f.type_desc,
    vf.num_of_reads,
    vf.num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vf
INNER JOIN sys.databases AS d ON d.database_id = vf.database_id
INNER JOIN sys.master_files AS f ON f.file_id = vf.file_id
    AND f.database_id = vf.database_id
WHERE f.database_id = db_id('tempdb')




… and the corresponding wait types:


Wait types observed (total wait time, total wait count and average wait time were shown in a screenshot).

SQL Server 2014










Wait types observed (total wait time, total wait count and average wait time were shown in a screenshot).

Oracle 12c extended datatypes better than CLOB?

Fri, 2014-05-23 08:15

12c has introduced character strings that can go above 4000 bytes. In previous versions, only in PL/SQL were we allowed VARCHAR2 values up to 32k. In SQL, the VARCHAR2 datatype was limited to 4000 bytes and CHAR to 2000. That became a bit small, especially when a lot of applications needed to store unicode characters.
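To put that 4000-byte limit in perspective for multibyte data, here is a quick illustration of UTF-8 encoding sizes (a sketch; the character choices are just examples):

```python
# One accented Latin character takes 2 bytes in UTF-8, the euro sign takes 3.
assert len("é".encode("utf-8")) == 2
assert len("€".encode("utf-8")) == 3

# A 4000-byte VARCHAR2 therefore holds at most:
print(4000 // 2)  # 2000 accented characters
print(4000 // 3)  # 1333 three-byte characters
```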

From 12c we can have SQL datatypes of up to 32k bytes for VARCHAR2, CHAR and RAW. This is not allowed by default: we need to set max_string_size=extended and recompile the views with utl32k. A nice improvement. But is it a good idea to use this new feature when we already have CLOB for large character strings? The New Features documentation is clear about that: extended datatypes have been introduced for compatibility with other databases - not to replace existing features.

I will not go into the details of how they are stored; information about that is available elsewhere. See for example @ludodba's recent blog post "Where are Extended Data Types stored?". Extended datatypes are stored as chained rows if you just extend an existing table, or as a LOB if you define them on a new table. Chained rows are clearly not a good option, so, given that you (re)create the tables, their storage is similar to CLOB.

But there is something I don't like about LOBs: they are fetched row by row. When you select a row, you get only the LOB locator, and you get the CLOB content later, when you access it through that locator. Did you ever try to Data Pump a table with LOBs through a network_link? Huge number of roundtrips and very bad performance. It's one of the rare cases where doing expdp/impdp with a dumpfile is better. For very large objects, you will do several roundtrips anyway, so this is not an issue there. But for character strings of just a few kilobytes, storing them as LOBs introduces an inefficient overhead.

Let's compare the fetch behaviour with these new extended datatypes. For my demo, I'll use a table with a CLOB column "C" and an extended VARCHAR2 column "E", and insert the same data into both columns.


SQL> create table TEST ( C clob , E varchar2(9000) );
Table created.
SQL> insert into TEST select lpad(rownum,9000,'x'),lpad(rownum,9000,'x') from dual connect by level <= 10;
10 rows created.


Here is the autotrace when reading the CLOB from 10 rows:


SQL> set autotrace trace stat
SQL> select C from TEST;

 10 rows selected.

           2  recursive calls
           0  db block gets
          27  consistent gets
          20  physical reads
           0  redo size
       93936  bytes sent via SQL*Net to client
        2722  bytes received via SQL*Net from client
          22  SQL*Net roundtrips to/from client
           0  sorts (memory)
           0  sorts (disk)
          10  rows processed


For only 10 rows, I've made 22 roundtrips. This is the problem with LOBs: too many roundtrips. And there is another problem that I'll not show here: you can fetch the LOB a long time later, even after the cursor is closed. It does a consistent read, so you have to set your undo_retention accordingly.

Now here is the same data from the extended varchar2 column:


SQL> select E from TEST;

 10 rows selected.

           1  recursive calls
           0  db block gets
          56  consistent gets
           0  physical reads
           0  redo size
       90501  bytes sent via SQL*Net to client
         492  bytes received via SQL*Net from client
           2  SQL*Net roundtrips to/from client
           0  sorts (memory)
           0  sorts (disk)
          10  rows processed


Here I got the same volume (10 times 9000 characters), but this time I did only 2 roundtrips.
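As a back-of-the-envelope check (my own toy model, not anything Oracle documents), the figures above fit a pattern of roughly two extra roundtrips per row for LOB access, against a fixed two roundtrips for a regular fetch:

```python
def estimated_roundtrips(rows, fetched_as_lob):
    # Toy model fitted to the autotrace figures above:
    # a LOB column costs ~2 extra roundtrips per row (locator + content),
    # while a varchar2 column is returned inline with the result set.
    return 2 + 2 * rows if fetched_as_lob else 2

print(estimated_roundtrips(10, True))   # 22, as seen for the CLOB column
print(estimated_roundtrips(10, False))  # 2, as seen for the extended varchar2
```

Of course the real number depends on array size and LOB prefetch settings; this only illustrates why the gap grows linearly with the row count.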

Let's go further and trace with sql_trace. LOB calls are instrumented since 11g so we can see them from the trace file:


PARSING IN CURSOR #139894737850360 len=18 dep=0 uid=103 oct=3 lid=103 tim=8952647276 hv=844696927 ad='77e1a518' sqlid='132sh6wt5k3az'
select C from TEST
PARSE #139894737850360:c=0,e=82,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895260
EXEC #139894737850360:c=0,e=69,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=897416
FETCH #139894737850360:c=0,e=97,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=848200
LOBREAD: c=999,e=10893,p=2,cr=1,cu=0,tim=8952659696
FETCH #139894737850360:c=0,e=43,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=860372
LOBREAD: c=1000,e=1614,p=2,cr=1,cu=0,tim=8952662447
FETCH #139894737850360:c=0,e=47,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=863495
LOBREAD: c=2000,e=657,p=2,cr=1,cu=0,tim=8952664615
FETCH #139894737850360:c=0,e=40,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=865575
LOBREAD: c=0,e=706,p=2,cr=1,cu=0,tim=8952666808
FETCH #139894737850360:c=0,e=37,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=867552
LOBREAD: c=1000,e=949,p=2,cr=1,cu=0,tim=8952669193
FETCH #139894737850360:c=0,e=92,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=869825
LOBREAD: c=0,e=844,p=2,cr=1,cu=0,tim=8952671276
FETCH #139894737850360:c=0,e=68,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=872168
LOBREAD: c=1000,e=756,p=2,cr=1,cu=0,tim=8952673521
FETCH #139894737850360:c=0,e=44,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=874712
LOBREAD: c=999,e=914,p=2,cr=1,cu=0,tim=8952676180
FETCH #139894737850360:c=0,e=64,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=877352
LOBREAD: c=0,e=931,p=2,cr=1,cu=0,tim=8952678875
FETCH #139894737850360:c=0,e=52,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=879774
LOBREAD: c=1000,e=795,p=2,cr=1,cu=0,tim=8952681136
FETCH #139894737850360:c=1000,e=7,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=891850
STAT #139894737850360 id=1 cnt=10 pid=0 pos=1 obj=100085 op='TABLE ACCESS FULL TEST (cr=16 pr=0 pw=0 time=60 us cost=5 size=20980 card=10)'
CLOSE #139894737850360:c=0,e=31,dep=0,type=0,tim=8952684289


And here is the sql_trace output for the same data from the extended datatype column.


PARSING IN CURSOR #139895028091224 len=18 dep=0 uid=103 oct=3 lid=103 tim=8954178349 hv=1829009117 ad='7b48ba08' sqlid='4kq232tqh8xqx'
select E from TEST
PARSE #139895028091224:c=0,e=64,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895345
EXEC #139895028091224:c=0,e=36,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895435
FETCH #139895028091224:c=1000,e=896,p=0,cr=8,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=899458
FETCH #139895028091224:c=4000,e=3660,p=0,cr=48,cu=0,mis=0,r=9,dep=0,og=1,plh=1357081020,tim=89535
STAT #139895028091224 id=1 cnt=10 pid=0 pos=1 obj=100085 op='TABLE ACCESS FULL TEST (cr=56 pr=0 pw=0 time=670 us cost=5 size=20010 card=10)'
CLOSE #139895028091224:c=0,e=13,dep=0,type=0,tim=8954214996


So there is one big advantage over CLOBs: the column values are returned without additional roundtrips.

That means that if you have character strings that may be between 4k and 32k, then extended datatypes can be a good option. It's a new feature however, designed for another goal (easy migration from other databases). So it's something to test carefully, and the tests must cover all your infrastructure components (backups, exports, replication, etc.).

Check SQL Server Error Log size with a policy

Fri, 2014-05-23 00:52

In many blogs or articles, we can read that it is recommended to recycle the error log to make sure that its size does not grow too big.

In the Best Practice Analyzer (BPA) for SQL Server, this is a rule.

But what is a good size for the error log file?

I have often heard “it depends”, but in my opinion, this is not a good answer.

SQL Server 2014 : New incremental statistics

Tue, 2014-05-20 20:27

Do you use partitioning with SQL Server? If yes, this post is for you, because there is good news about partitioning with SQL Server 2014. I remember an old job with a healthcare customer for whom we decided to implement a partitioning strategy for a big transaction archive table of medical records. We used SQL Server 2005 and we already faced some challenges, like statistics updates. Database administrators who use partitioning with versions older than SQL Server 2014 know what I mean.

The first news covered in this blog post is the new incremental statistics strategy provided with SQL Server 2014. This feature allows updating statistics without reading the entire table, which can be a problem when your table becomes very big! Imagine for instance that you have only one small active partition with recent records and a lot of changes (update, insert or delete), against 30 big read-only partitions with billions of records. With SQL Server you have different strategies to update statistics, using either the sample method or the full scan method. The latter is more reliable with partitioned tables, because in this case we have to deal with skewed data, but it requires more time (and more resources) because SQL Server has to read all the table records. With big partitioned tables we talk about potentially many hours. Adding a new non-empty partition can also be problematic, because we have to deal with the same issue: the new data is not represented in the concerned statistic.

Using incremental statistics, it is possible to update only one or several partitions as necessary. The information is then gathered and merged with the existing information to create the final statistic. We will see later how this new mechanism works.

During my tests I will use a modified transaction history table in the AdventureWorks2012 database. The table records are generated with Adam Machanic's T-SQL script. This table contains 49270382 records for a total size of 2 GB. We will partition the bigTransactionHistory table as follows:

USE AdventureWorks2012;
GO

-- Partition function with a partitioning strategy by year.
-- The first day of the year will be used as boundary of the next partition
CREATE PARTITION FUNCTION BigTransactionRange (DATETIME)
AS RANGE RIGHT FOR VALUES
(
   '20050101', '20060101', '20070101', '20080101',
   '20090101', '20100101'
);
GO

-- Partition scheme
-- All partitions will be in the primary filegroup
CREATE PARTITION SCHEME TransactionsPS1
AS PARTITION BigTransactionRange
ALL TO ( [PRIMARY] );
GO


Now we can create a clustered index by using the partition scheme TransactionsPS1 and the new STATISTICS_INCREMENTAL option.

CREATE CLUSTERED INDEX idx_bigtransactionhistory_transactiondate
ON [dbo].[bigTransactionHistory] ([TransactionDate])
WITH (SORT_IN_TEMPDB = ON, STATISTICS_INCREMENTAL = ON);
GO


The sys.stats system view has a new is_incremental column which indicates if a statistic is incremental or not.

SELECT
       stats_id,
       name AS stat_name,
       is_incremental
FROM sys.stats
WHERE object_id = object_id('bigtransactionhistory');




The new partition configuration of the bigTransactionHistory table is the following:

SELECT
       p.object_id,
       o.name AS table_name,
       p.partition_number,
       p.rows,
       au.total_pages,
       au.total_pages / 128 AS total_size_mb,
       au.type_desc,
       g.name AS [filegroup_name],
       RVL.value AS left_range_boundary,
       RVR.value AS right_range_boundary
       --PF.[name], RV.boundary_id, RV.[value]
FROM sys.partitions AS p
       JOIN sys.objects AS o
             ON o.object_id = p.object_id
       JOIN sys.indexes AS i
             ON p.object_id = i.object_id
                    AND p.index_id = i.index_id
       JOIN sys.allocation_units AS au
             ON p.hobt_id = au.container_id
       JOIN sys.filegroups AS g
             ON g.data_space_id = au.data_space_id
       LEFT JOIN sys.partition_schemes AS PS
             ON ps.data_space_id = i.data_space_id
       LEFT JOIN sys.partition_functions AS PF
             ON PF.function_id = ps.function_id
       LEFT JOIN sys.partition_range_values AS RVL
             ON RVL.function_id = PF.function_id
                    AND RVL.boundary_id + 1 = p.partition_number
       LEFT JOIN sys.partition_range_values AS RVR
             ON RVR.function_id = PF.function_id
                    AND RVR.boundary_id = p.partition_number
WHERE p.object_id = object_id('bigtransactionhistory')
       AND p.index_id = 1;




Now let’s take a look at the statistics by using the DBCC SHOW_STATS command:

DBCC SHOW_STATISTICS('bigtransactionhistory','idx_bigtransactionhistory_transactiondate');












We retrieve the same information as in older versions. In fact, we are reading statistic data from the final statistic; you will understand why later in this blog post. Note that the final statistic can still have a maximum of 200 steps. In other words, incremental statistics do not change the story here: we still have a maximum of 200 steps for the entire table. I hope we will see some improvements, maybe in the next version.

Now let’s fill up the table with some records which will have a transaction date greater than 2011-01-01 in order to move them to the last partition.

INSERT dbo.bigTransactionHistory
SELECT TOP 1
       TransactionID + 1,
       ProductID,
       DATEADD(year, 1, TransactionDate),
       Quantity + Quantity,
       ActualCost * 1.2
FROM dbo.bigTransactionHistory
ORDER BY TransactionID DESC;


The new row has been moved to the correct partition number 8:




If we take a look at the statistic histogram of the index idx_bigtransactionhistory_transactiondate we can notice that the new record is not propagated to the existing histogram.




Of course, a statistics update is mandatory here. Before SQL Server 2014 we had to update the statistics on the entire table, but as said earlier, we can now use a new option: WITH RESAMPLE ON PARTITIONS (). The concerned partition is number 8 in my case. The word RESAMPLE is important here, because all pages in the new statistic tree structure must be aligned on the same sample.

UPDATE STATISTICS dbo.BigTransactionHistory(idx_bigtransactionhistory_transactiondate) WITH RESAMPLE ON PARTITIONS(8);




Good news: the incremental statistics work! In addition, the following table gives an idea of the performance we can obtain with incremental statistics (I compared both methods: with full scan and with resample on partitions):


Update statistics option | Elapsed Time (ms) | CPU Time (ms)

I guess you can easily imagine the result with this new method on a real production environment …

What about auto update statistics with incremental statistics? As you certainly know, SQL Server uses a specific algorithm to automatically update statistics when the table has more than 500 rows: the update operation is triggered when the number of modifications reaches 20% of the total existing rows + 500 rows. With tables that have billions of rows, we can spend a lot of time without any automatic update statistics operation. Besides, adding a new partition that does not modify more than 20% of the total rows will not trigger an automatic update statistics operation, and the new data will not be represented in the statistics. Fortunately, incremental statistics change the story here. The update statistics operation can be triggered per partition, when the number of modifications reaches the threshold value = (total rows / number of partitions) * 20%.
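To make the difference concrete, here is a quick sketch (my own helpers, mirroring the two rules as stated above) comparing the classic auto-update threshold with the per-partition one:

```python
def classic_threshold(total_rows):
    # Classic rule described above: 20% of existing rows + 500 changes
    return int(total_rows * 0.20) + 500

def incremental_threshold(total_rows, partitions):
    # Per-partition rule with incremental statistics:
    # (total rows / number of partitions) * 20%
    return int(total_rows / partitions * 0.20)

rows = 46895292  # row count of bigTransactionHistory used in this test
print(classic_threshold(rows))         # 9379558 changes for the whole table
print(incremental_threshold(rows, 8))  # 1172382 per partition (int() truncates)
```

With 8 partitions, the per-partition trigger fires roughly 8 times earlier than the classic one, which is exactly the point of the feature.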

Let’s try with the following test. In my case the bigTransactionhistory table contains 46895292 rows. According to the above formula the theoretical threshold should be: 46895292 / 8 * 0.2 = 1172383. Thus, I will update 1172892 rows in the partition number 7 to be sure to issue an automatic update statistic operation by using a SELECT statement with a predicate on the TransactionDate column.

UPDATE TOP (1172892) bigTransactionHistory
SET TransactionDate = DATEADD(dd, 1, TransactionDate)
WHERE $PARTITION.BigTransactionRange(TransactionDate) = 7;


I can confirm that the number of changes is over the threshold by looking at the rowmodctr column of the legacy system view sysindexes:

SELECT
       i.indid,
       o.name AS table_name,
       i.name AS index_name,
       STATS_DATE(o.id, i.indid) AS last_date_updated,
       i.rowmodctr AS number_of_changes,
       st.is_incremental
FROM sysobjects AS o
       JOIN sysindexes AS i
             ON o.id = i.id
       JOIN sys.stats AS st
             ON st.object_id = o.id
                    AND st.stats_id = i.indid
WHERE o.xtype = 'U'
       AND i.name IS NOT NULL
       AND o.id = object_id('bigtransactionhistory')
ORDER BY last_date_updated
GO




Then, I perform a SELECT statement with a predicate on the TransactionDate column …

SELECT TOP 10 * FROM dbo.bigTransactionHistory WHERE TransactionDate > '20110101';


… and I can conclude that the SELECT statement has triggered an automatic update statistics operation, since the rowmodctr column value is back to 0 for idx_bigtransactionhistory_transactiondate.




The automatic update statistics operation has been triggered for only 2.5% of total row changes in the entire table in my case.

Now, as promised, I will give you more information about statistic pages with incremental statistics. First, I would like to thank Frédéric Pichaud, Senior Escalation Engineer at Microsoft France, for giving us the following information. We can use a new internal dynamic management function to figure out how statistic pages are organized with incremental statistics.

SELECT
       s.name,
       s.is_incremental,
       sp.node_id,
       sp.first_child,
       sp.next_sibling,
       sp.last_updated,
       sp.modification_counter,
       sp.rows,
       sp.rows_sampled,
       sp.steps,
       sp.left_boundary,
       sp.right_boundary
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties_internal(s.object_id, s.stats_id) AS sp
WHERE s.object_id = object_id('bigtransactionhistory')
       AND s.name = 'idx_bigtransactionhistory_transactiondate'
ORDER BY s.name, node_id;




We have to focus on 3 columns here: node_id, first_child and next_sibling. node_id is an identifier for a statistic page. The page with node_id equal to 1 is the final statistic page we've seen earlier with DBCC SHOW_STATISTICS. This page contains the total number of rows of the bigTransactionHistory table (column rows = 46895292). We can notice that each partition, easily identifiable by its left and right boundaries, has its own statistic page (node_id 3 to 9). The first_child and next_sibling columns help us rebuild the complete tree of statistic objects, as shown below:




From my understanding, the green squares are the statistic pages at the leaf level, each linked to a partition, and the red squares are the binary merge pages. For instance, the pages with node_id = 8 (4790286 rows) and node_id = 9 (1 row) are merged into the page with node_id = 10 (4790286 + 1 = 4790287). In turn, the page with node_id = 7 (7672943 rows) and the page with node_id = 10 are merged into the page with node_id = 12 (7672943 + 4790287 = 12463230), and so on. The page with node_id = 1 is the final statistic page that merges all the other pages. The page with node_id = 2 is a page reserved for future needs.
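The merge arithmetic described above can be checked quickly (row counts taken from the figures quoted in the text):

```python
# Row counts of the leaf statistic pages quoted above (node_id -> rows)
leaf = {7: 7672943, 8: 4790286, 9: 1}

node_10 = leaf[8] + leaf[9]   # merge of pages 8 and 9
node_12 = leaf[7] + node_10   # merge of page 7 with merge page 10

print(node_10)  # 4790287
print(node_12)  # 12463230
```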

Earlier in the blog post, we triggered an automatic update statistics operation, and partition number 10 was concerned. The page with node_id = 8 is linked to this partition, and we can notice the update statistics propagation throughout the statistic object tree: only the statistic page of this partition is updated, along with all the merge pages.




Finally, we can use trace flag 2309 to see the detail of a page in the statistic tree. After enabling this trace flag, we can use the DBCC SHOW_STATISTICS command with an additional third parameter: node_id.

DBCC TRACEON(2309);
GO

DBCC SHOW_STATISTICS('bigtransactionhistory', 'idx_bigtransactionhistory_transactiondate', 8);




The result is the same as a classic DBCC SHOW_STATISTICS command, but the detail is only for the concerned page of the tree. Enabling this trace flag is for debugging purposes only. For instance, if we suspect a general update statistics issue, we can try to identify which pages are relevant.

Database administrators who have faced statistics problems with partitioned tables will probably have fun trying this new feature. See you soon for the next good news about partitioning ;-)

Oracle policy managed databases: Policies and policy sets

Tue, 2014-05-20 15:32

In this post, I will continue my experiments with Oracle policy managed databases with Grid Infrastructure and present one of the new features introduced with Grid Infrastructure 12c last June: cluster configuration policies and policy sets. This feature allows the administrator to dynamically reconfigure all server pools at once, with a single command, according to the business workload.


To take a simple example, imagine you have two servers allocated to your production instances, and two servers allocated to your development instances. This distribution matches your daily workload. But you have very intensive batches running every night on your production databases. The production servers are heavily used, while the two development servers are up and running but not used at all. What a waste of resources!


In this case you have several solutions:


1) Shut down the development servers during the night because they are not used. It will at least save energy :-)

2) Review your needs and allocate more server(s) to the production system

3) Create a crazy script to reconfigure each server pool, one by one, to adapt the configuration to the real needs, with srvctl commands run serially.

4) Use policies and policy sets


As you may expect, I will develop the last solution!


A policy set is a "document" (Oracle definition) which contains one or more policies. Each policy contains the configuration of all server-pools in the cluster: MIN_SIZE and MAX_SIZE, IMPORTANCE, server list, etc.

It enables the administrator to activate one policy or another, according to the workload or business needs, to dynamically and immediately change the whole configuration of the cluster server pools. For instance: use two servers for production and two servers for development during working days, and three servers for production and one server for development during nights and weekends.


First of all, here is my current configuration:


[oracle@node01 ~]$ srvctl config srvpool
Server pool name: Free
Importance: 0, Min: 0, Max: -1
Candidate server names:
Server pool name: Generic
Importance: 0, Min: 0, Max: -1
Candidate server names:
Server pool name: srvprod
Importance: 0, Min: 2, Max: 2
Category: hub
Candidate server names:
Server pool name: srvtest
Importance: 0, Min: 2, Max: 2
Category: hub
Candidate server names:


I have two nodes running in my srvprod pool and two nodes running in the srvtest pool.

An instance DB1 (prod) is running on nodes 1 and 4:

[oracle@node01 ~]$ srvctl status database -d DB1
Instance DB1_1 is running on node node01
Instance DB1_2 is running on node node04


And an instance DB2 (dev) is running on nodes 2 and 3:

[oracle@node01 ~]$ srvctl status database -d DB2
Instance DB2_1 is running on node node02
Instance DB2_2 is running on node node03


The simplest way to benefit from policies and policy sets is to create a text file with the wanted configuration. The following steps explain how to proceed.


Step 1: Create a new policy set

The crsctl tool offers an option to generate a new policy set file, based on the current active configuration. We must connect as grid:


[grid@node01 ~]$ mkdir -p /u00/app/oracle/admin/policies
[grid@node01 ~]$ crsctl create policyset -file /u00/app/oracle/admin/policies/new_policy


The file has been generated:

[grid@node01 ~]$ ll /u00/app/oracle/admin/policies
total 4
-rw-r--r-- 1 grid oinstall 305 Apr 29 16:21 new_policy


Here is the file content:

[grid@node01 ~]$ cat /u00/app/oracle/admin/policies/new_policy
SERVER_POOL_NAMES=Free ora.srvprod ora.srvtest


The line "NAME=Default" corresponds to the name of the policy in the policy set. The policy set itself is represented by the file "new_policy" created above.


Step 2: Update the current policy set

To update the current policy set, we just have to edit our policy set file "new_policy" and add new policies above the Default one.

As an example, imagine we want to allocate one more server to production during the nights, and to allocate all servers to production during weekends, disabling the development instances. We append the following policies:



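The exact content of my appended policies is in the screenshot above, but to give an idea, a "Nights" policy in the policy set file looks something like this (a sketch based on the file format that crsctl generates; the MIN_SIZE/MAX_SIZE values here are illustrative):

```text
POLICY
  NAME=Nights
  SERVERPOOL
    NAME=ora.srvprod
    IMPORTANCE=0
    MAX_SIZE=3
    MIN_SIZE=3
  SERVERPOOL
    NAME=ora.srvtest
    IMPORTANCE=0
    MAX_SIZE=1
    MIN_SIZE=1
```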

Once the file is edited, it must be loaded into the policy set configuration using crsctl:

[grid@node01 ~]$ crsctl modify policyset -file /u00/app/oracle/admin/policies/new_policy


Step 3: Enable a new policy

Our policy set now contains three policies (Default, Nights and WeekEnds). The currently enabled policy is "Default":

[grid@node01 ~]$ crsctl status policyset | grep -i last


We are going to enable the "Nights" policy and see what happens. To enable a policy, use the following crsctl command as the grid user:


[grid@node01 ~]$ crsctl modify policyset -attr "LAST_ACTIVATED_POLICY=Nights"
CRS-2673: Attempting to stop 'ora.db2.db' on 'node02'
CRS-2677: Stop of 'ora.db2.db' on 'node02' succeeded
CRS-2672: Attempting to start 'ora.db1.db' on 'node02'
CRS-2676: Start of 'ora.db1.db' on 'node02' succeeded


In the previous output, we can see on node02 that Oracle has stopped DB2 and started DB1 instead. We can confirm with srvctl:


[grid@node01 ~]$ srvctl status srvpool
Server pool name: Free
Active servers count: 0
Server pool name: Generic
Active servers count: 0
Server pool name: srvprod
Active servers count: 3
Server pool name: srvtest
Active servers count: 1


[grid@node01 ~]$ srvctl status database -d DB1
Instance DB1_1 is running on node node01
Instance DB1_2 is running on node node04
Instance DB1_3 is running on node node02


[grid@node01 ~]$ srvctl status database -d DB2
Instance DB2_2 is running on node node03


Now I enable the "WeekEnds" policy. The remaining development server will be allocated to the production server pool, and 4 production instances will be running:


[grid@node01 ~]$ crsctl modify policyset -attr "LAST_ACTIVATED_POLICY=WeekEnds"
CRS-2673: Attempting to stop 'ora.db2.db' on 'node03'
CRS-2677: Stop of 'ora.db2.db' on 'node03' succeeded
CRS-2672: Attempting to start 'ora.DB1.db' on 'node03'
CRS-2676: Start of 'ora.DB1.db' on 'node03' succeeded


All nodes are now part of the production pool:

[grid@node01 ~]$ srvctl status srvpool
Server pool name: Free
Active servers count: 0
Server pool name: Generic
Active servers count: 0
Server pool name: srvprod
Active servers count: 4
Server pool name: srvtest
Active servers count: 0


Four production instances are opened:

[grid@node01 ~]$ srvctl status database -d DB1
Instance DB1_1 is running on node node01
Instance DB1_2 is running on node node04
Instance DB1_3 is running on node node02
Instance DB1_4 is running on node node03


And the service for development is completely down:

[grid@node01 ~]$ srvctl status database -d DB2
Database is not running.


To reset to the initial policy, just use the same command with the "Default" policy attribute:

[grid@node01 ~]$ crsctl modify policyset -attr "LAST_ACTIVATED_POLICY=Default"



This feature represents an easy way to reconfigure a whole policy managed cluster at will. I have demonstrated how to edit a policy set using a text file generated with crsctl, but Oracle also documents how to define a policy set manually with the command line. In my opinion, this is more tedious, and the text file is a better choice.

I only regret that Oracle has not provided a way to enable policies according to the database workload, or to schedule policies over time. Here, we must enable the policy manually or use the crontab...

Oracle SQL Monitoring reports in flash, html, text

Thu, 2014-05-15 13:28

I have recently posted about the way I like to extract execution plans. When you have the Tuning Pack, I suggest getting them with SQL Real-Time Monitoring as an active report. However, Martin Preiss said in a comment that he prefers the text format: easier to search, copy, and paste. And that's a very good remark. We still need plain text.

My point is that if you send me the active report, then I can extract the text version from it. The opposite is not possible. It's a bit tricky, and I'll show how to do it.

The active report is very nice: it is an html file with the xml data embedded in it. When you open it, it loads a flash application from the Oracle website and displays that nice report:




It's easy to navigate, drill down, etc. I often go to the html source in order to see some information that is present in the xml but not displayed, such as the optimizer parameters used by the query. Here is an extract from the html source:




Yes, I used to do that in 11g. And once, I asked someone to send me his 12c report, and I had a bad surprise. The xml is compressed and I can't see anything:




Oracle 12c introduced that nice optimization in order to have smaller files, but I preferred the previous behaviour. Look at the attributes: it's base64-encoded and compressed with zlib.

So don't worry, you have linux utilities to uncompress that:




From the sqlmon12c.html, I have extracted to sqlmon12c_decoded.html the report with xml data as I've shown above. Good. I got my xml back :)
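If you don't have the shell utilities at hand, the same decoding can be sketched in a few lines of Python (the function name is mine; it just reverses the base64 + zlib encoding described above):

```python
import base64
import zlib

def decode_sqlmon_payload(b64_text):
    # 12c active reports embed the xml base64-encoded and zlib-compressed
    return zlib.decompress(base64.b64decode(b64_text)).decode("utf-8")

# round-trip demo with a dummy payload
# (with a real report, paste the compressed attribute content instead)
xml = "<report>test</report>"
packed = base64.b64encode(zlib.compress(xml.encode("utf-8"))).decode("ascii")
print(decode_sqlmon_payload(packed))  # <report>test</report>
```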

Then back to Martin's comment. The active report is good for navigating. The xml is good for finding some hidden information. But if I want to paste a plan into the report I deliver to my customer, I prefer text. A table in text is fine for a Word document.

First I extract only the xml and include it within 'report' tags:



[sorry for the code as screenshot, but the html tags in the awk script are hard to display here] 


From that xml, I apply the sqlmonitorText.xsl XSLT template which is found in ORACLE_HOME:


xsltproc $ORACLE_HOME/rdbms/xml/orarep/sqlmonitor/sqlmonitorText.xsl sqlmon12c_extracted.xml > sqlmon12c_text.html


And here is the result, a nice HTML table that perfectly fits in my .doc document:




But there is something else that we cannot do with that. On large execution plans, I often have to move the cursor up and down in order to go from a child operation to its parent (for example: from a high 'execs' number within a nested loop, I go to the rowsource that provided that number of rows). So I need text. Plain text.

No need to call the DBMS_REPORT functions for that. Linux has text web browsers. For example:


elinks -dump -dump-width 150 sqlmon12c_text.html | grep -v "^ |--" > sqlmon12c_text.txt


...which gives me something useful for notepad.exe or vim:




So this is why I prefer to have the SQL Monitoring report in the active html format: I can read it immediately and convert it to what I want later. I have a script for that. A very dirty script, but I'll share it anyway. Just call it with the name of the active html report file. Here it is:

But if someone wants to make a clean script for that, don't hesitate to share :)

ESPC 2014: Upgrading To SharePoint 2013

Wed, 2014-05-14 23:19

I have decided to write a summary of the presentation "Upgrading SharePoint 2013" by Sam Hassani, which I followed at the European SharePoint Conference 2014. This is meant for the people who did not have the chance to attend the event, which took place from 5 to 8 May 2014 in Barcelona!


For more details on the Conference presentations, have a look at the agenda.


History of upgrades in SharePoint

Historically, there are three upgrade methods:

  • In-place upgrade: SharePoint Services are overwritten by the newest version, and the content databases are changed.
  • Gradual upgrade: Each group of site collections is upgraded. The data in the groups is copied from the original database to a new database before the data is upgraded to the new SharePoint Services. The original SharePoint Services data is maintained in the original database until it is deleted.
  • Database-attach: In this upgrade, you need to create and configure the new SharePoint farm first. Then you copy the content and service application databases from the old SharePoint farm, and attach and upgrade the databases: the data is then upgraded to the new version. In a second step, site owners can upgrade the individual site collections.

The available upgrade methods depend on the SharePoint version:

SharePoint 2003 to 2007

  • Gradual upgrade
  • In-place upgrade
  • Database attach

SharePoint 2007 to 2010

  • In-place upgrade
  • Database attach

SharePoint 2010 to 2013

  • Database attach

As you can see, for SharePoint 2013 the in-place and gradual upgrade methods are no longer available.


Supported databases

These are the supported databases:

  • Content database
  • Search admin database
  • Profile database
  • Social database
  • Managed Metadata database
  • Subscription database
  • PerformancePoint database
  • Secure Store database
  • Business Data Connectivity database


Upgrade - general process


Before upgrading a Site Collection

First, run the Site Collection Health Checks. They will look for common known issues and inform you about the feasibility of your upgrade.

You can also create an evaluation site collection to preview the site content in the new version of SharePoint.

Then, you must configure upgrade throttling to prevent overload from self-service site collection upgrades. If an upgrade is not possible due to throttling, it is queued. Finally, you should migrate to claims authentication before upgrading to SharePoint 2013: convert the web application to use claims, then perform the user migration.


Site Collection upgrade

You cannot do a “big bang” upgrade. First, upgrade the SharePoint databases, then upgrade the site collections: these are two different steps that you have to follow.

Moreover, it is better to provide a self-service site collection upgrade capability to site collection admins. They can easily do the upgrade themselves, moving gradually over to the 2013 experience: this is much easier to grasp for end users.

Furthermore, site collection admins can choose not to upgrade a site collection if required.


Unsupported with Site Collection in SharePoint 2010 mode

If site collection admins choose not to upgrade a site collection, it is impossible to access:

  • All new SharePoint 2013 features
  • 2010 Web Analytics
  • 2010 Office Web Applications


Service Applications Migration Process



If you want to add a solution to the farm, use Add-SPSolution.

If you want to deploy a solution to the farm, use Install-SPSolution.

Globally, you have three typical scenarios:

  • An existing solution works for both 2010 and 2013 mode sites.
  • An existing solution works for 2010 mode sites, but additional functionalities are required for 2013 mode sites.
  • An existing solution works for 2010 mode sites, but a new solution is required for 2013 mode sites.



Always ensure that the environment is fully functional before you begin your upgrade. Cleanup and preparation are required.

You must treat pre-production testing like production. Formulate and validate an upgrade plan!

You should test carefully, without skipping any errors or warnings: be methodical and procedural.

One day @ Bob Ward's SQLOS Workshop

Wed, 2014-05-14 23:07

We had the opportunity, with David Barbarin, to fly to Wroclaw (Poland) to participate for the first time in SQLDay 2014, the annual conference of the Polish SQL Server User Group. We don't speak Polish, but we met Bob Ward, who held a very interesting level-500 workshop titled "Inside SQLOS and Memory". This pre-conference seminar was, for us, the opportunity to benefit from Bob Ward's expertise.


This event was held at the football stadium in Wroclaw, where the European football cup 2012 took place.


I will not reproduce the entire content of the workshop here, but I will give you the basics of its program.


Inside SQLOS

After a good review of fundamentals like preemptive and non-preemptive scheduling, we started with an introduction to SQLOS.

As a reminder, SQLOS is not just an executable or a Windows service, but an abstraction layer of SQL Server and an API with two main libraries:

  • SQLDK.dll (SQL Server Development Kit)
  • SQLOS.dll (main dll)

Of course, we have other libraries, like sqllang.dll, which contains the T-SQL language and the Query Processor, and sqlmin.dll, which contains well-known components of SQL Server such as the LazyWriter, the checkpoint process, the lock monitor, etc.

After a basic presentation of SQLOS, Bob Ward showed us how to perform some debugging using WinDbg.

The debugger comes with the Debugging Tools for Windows, available here.

After the installation, you find the executable windbg.exe in C:\Program Files (x86)\Windows Kits\8.1\Debuggers\x64.

In addition to this debugger, you must use "symbols". Symbols are used to decode memory addresses into their corresponding function names, so that the memory contents become easy to understand. The command line to use is:


windbg -y srv*g:\symbols* -pn sqlservr.exe


I will not explore this point further.

Then, after seeing context switching (SQLOS and Windows) and how SQLOS executes threads, it is good to know how many worker threads are available for SQL Server processes. The corresponding instance setting is 'max worker threads', and its default depends on the number of logical CPUs and the platform.

By default, this value is 0, which lets SQL Server choose the maximum number of worker threads automatically:

  • For 64-bit with 64 CPUs or fewer, SQL Server calculates 512 + (16 * (number of CPUs - 4))
  • For 64-bit with more than 64 CPUs, SQL Server calculates 512 + (32 * (number of CPUs - 4))
  • To optimize max worker threads, you may want to read this msdn article.
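As a small illustration of these formulas (the function name and the floor at 512 for very small CPU counts are my own assumptions, not SQL Server code), the default calculation can be sketched as:

```python
def max_worker_threads(logical_cpus: int) -> int:
    """Default 'max worker threads' for 64-bit SQL Server when the setting is 0."""
    base = 512
    if logical_cpus <= 4:
        return base  # assumption: the default never drops below 512
    per_cpu = 16 if logical_cpus <= 64 else 32
    return base + per_cpu * (logical_cpus - 4)

# 16 CPUs -> 512 + 16*12 = 704; 96 CPUs -> 512 + 32*92 = 3456
print(max_worker_threads(16), max_worker_threads(96))
```

For a typical 16-CPU 64-bit server, SQL Server therefore ends up with 704 worker threads by default.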

During the break, there was a quick exchange between David and Bob, with the computer sticking out its tongue… it is of course a pure coincidence.


Inside Memory

The next part of the workshop concerned the memory management of SQLOS. As you may know, there are some fundamental changes in SQL Server 2012. First, Bob Ward explained that the single-page and multi-page allocator concepts no longer exist.

Now, we have virtual allocators, workspaces, a fragment manager, block allocators, and the buffer pool (this last one no longer acts as an allocator by itself). These fundamental changes provide a more consistent model for all memory management. 'Max server memory' now really means max server memory and no longer concerns only the buffer pool!

Then, Bob Ward gave us a good overview of the memory nodes with the NUMA architecture. There are some improvements with SQL Server 2012 (affinity, mapping of the SQL Server memory NUMA nodes directly to the hardware NUMA nodes, etc.).

Unfortunately, after this chapter, it was time to go back home. We did not have the opportunity to see the last part of the workshop, but we left with a lot of good things in mind!


Finally, I would like to thank our Polish colleagues for this event and Bob for this wonderful workshop day.

I hope that in the future, we can organize the same event in Switzerland or in France, with the participation of Bob Ward.

Exploring Oracle SE & EE performance statistics with Orachrome Lighty

Mon, 2014-05-12 04:58

At dbi services, we really like Lighty for Oracle, a tool which helps explore performance statistics, whether they originate from AWR (available only in Enterprise Edition with the Diagnostic Pack option) or from Statspack (available in Standard Edition). As a matter of fact, we like it so much that we have become an Orachrome partner.

If you want to try it out for yourself, there is a free trial here. If you have Statspack installed (see spdoc.txt in ORACLE_HOME/rdbms/admin to install it), you can use Lighty immediately.
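For reference, a minimal Statspack setup could look like the following sketch (run from SQL*Plus as SYSDBA; scheduling details and passwords are left out):

```sql
-- Create the PERFSTAT schema and the Statspack objects
@?/rdbms/admin/spcreate.sql

-- Take a snapshot manually (spauto.sql can schedule one per hour)
EXEC statspack.snap;

-- Raise the snapshot level to 7 to also collect segment statistics
EXEC statspack.modify_statspack_parameter(i_snap_level => 7);
```

Level 7 is what later enables the Top Segments view shown below.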

Let me show you how to set it up. First, you have to define a connection (shortcut: Ctrl-A):




It is a JDBC connection, without the need to install anything else. You just need the listener host, port, and service:




As you are on Oracle Standard Edition, you have neither the Diagnostic Pack nor the Tuning Pack.

But you have installed Statspack, and you will install L-ASH very soon:




Now, you can install L-ASH as documented here. But the goal of this post is to show what you can get immediately. Go to the menu: Reports > DB Statistics > Past Activity (Statspack).

You can select a day and see the snapshots. The color bar immediately shows when the activity was high:




Then you can see the graphical equivalent of my preferred Statspack sections:

Top Events:




Top Segments (change the Statspack level to 7 for that):




And of course Top SQL Statements:




For each statement (click on its sql_id), you will have everything: SQL text, execution plan, and statistics.

If you like the tool, it's time to go further and install L-ASH, so that you can mine the data from the Activity Viewer tab, which is the most valuable feature of Lighty for Oracle.