
Yann Neuhaus

All blog entries from http://www.dbi-services.com/

Oracle TNS_ADMIN issues due to bad environment settings

Mon, 2014-04-14 18:11

Recently, I faced a TNS resolution problem at a customer site. The root cause was a bad environment setting: the customer called the service desk because a DBLINK was pointing to the wrong database.

The users were supposed to be redirected to a development database, but the DBLINK was redirecting them to a validation database instead. The particularity of this environment is that the development and validation databases run on the same server, but in different Oracle homes, each home having its own tnsnames.ora. Both tnsnames.ora files contain common alias names, but pointing to different databases. Not exactly best practice, but this is not the topic here.

The problem started with some difficulty reproducing the case: our service desk could not reproduce the situation until we understood that the customer was accessing the database remotely via a development tool (through the listener), while we were connected locally on the server.

Let me present the case with my environment.

First, this is the database link concerned by the issue:

 

SQL> select * from dba_db_links;
OWNER      DB_LINK              USERNAME                       HOST       CREATED
---------- -------------------- ------------------------------ ---------- ---------
PUBLIC     DBLINK               DBLINK                         MYDB       21-MAR-14

 

And this is the output when we try to display the instance name through the DBLINK, when connected locally:

 

SQL> select instance_name from v$instance@DBLINK;
INSTANCE_NAME
----------------
DB2

 

The user is redirected to the remote database, as expected. Now, let's see what happens when connected through the SQL*Net layer:

 

[oracle@srvora01 ~]$ sqlplus system@DB1
SQL*Plus: Release 11.2.0.3.0 Production on Mon Mar 24 10:07:45 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
 
Enter password:
 
Connected to:
 
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
 
SQL> select instance_name from v$instance@DBLINK;
INSTANCE_NAME
----------------
DB1

 

Here we can see that the user is not redirected to the same database (here, for demonstration purposes, to the database itself).

The first thing to check is the TNS_ADMIN variable, if it exists:

 

[oracle@srvora01 ~]$ echo $TNS_ADMIN
/u00/app/oracle/product/11.2.0/db_3_0/network/admin

 

Here is the content of the tnsnames.ora file at that location:

 

[oracle@srvora01 ~]$ cat /u00/app/oracle/product/11.2.0/db_3_0/network/admin/tnsnames.ora
DB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = tcp)(HOST = srvora01)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DB1)
    )
  )
MYDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = tcp)(HOST = srvora01)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = DB2)
    )
  )

 

Clearly, we have a problem with TNS resolution: the local connection resolves the MYDB alias correctly, while the remote connection resolves the same alias to a different database. In this case, there are two possible explanations:

  • The tnsnames.ora is not correctly configured: this is not the case, as you can see above
  • Another tnsnames.ora file exists somewhere on the server and is used by remote connections

To confirm that the second hypothesis is the correct one, we can use the strace tool:

 

SQL> set define #
SQL> select spid from v$process p join v$session s on p.addr=s.paddr and s.sid=sys_context('userenv','sid');
SPID
------------------------
5578

 

SQL>  host strace -e trace=open -p #unix_pid & echo $! > .tmp.pid
Enter value for unix_pid: 5578
SQL> Process 5578 attached - interrupt to quit

 

SQL> select instance_name from v$instance@DBLINK;
open("/u00/app/oracle/product/11.2.0/db_3_0/network/admin/tnsnames.ora", O_RDONLY) = 8
open("/etc/host.conf", O_RDONLY)        = 8
open("/etc/resolv.conf", O_RDONLY)      = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 10
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 10
open("/etc/hostid", O_RDONLY)           = -1 ENOENT (No such file or directory)
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 10INSTANCE_NAME
----------------
DB2

 

The DBLINK is resolved using the file /u00/app/oracle/product/11.2.0/db_3_0/network/admin/tnsnames.ora.

Now, when connected remotely:

 

SQL> set define #
SQL> select spid from v$process p join v$session s on p.addr=s.paddr and s.sid=sys_context('userenv','sid');
SPID
------------------------
6838

 

SQL> host strace -e trace=open -p #unix_pid & echo $! > .tmp.pid
Enter value for unix_pid: 6838
SQL> Process 6838 attached - interrupt to quit

 

SQL> select instance_name from v$instance@DBLINK;
open("/u00/app/oracle/network/admin/tnsnames.ora", O_RDONLY) = 8
open("/etc/host.conf", O_RDONLY)        = 8
open("/etc/resolv.conf", O_RDONLY)      = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 9
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 9
open("/etc/hostid", O_RDONLY)           = -1 ENOENT (No such file or directory)
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 9INSTANCE_NAME
----------------
DB1

 

Here the DBLINK is resolved with the file /u00/app/oracle/network/admin/tnsnames.ora.

 

Two different tnsnames.ora files are used according to the connection method! If we query the content of the second tnsnames.ora, we have an explanation for our problem:

 

[oracle@srvora01 ~]$ cat /u00/app/oracle/network/admin/tnsnames.ora
MYDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = tcp)(HOST = srvora01)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = DB1)
    )
  )

 

It is not clearly documented by Oracle, but the database session can inherit the environment variables in three different ways:

  • When you connect locally to the server (no SQL*Net, no listener), the Oracle session inherits the client environment
  • When you connect remotely to a service statically registered on the listener, the Oracle session inherits the environment which started the listener
  • When you connect remotely to a service dynamically registered on the listener, the Oracle session inherits the environment which started the database

In our case, the database had been restarted with the wrong TNS_ADMIN value set. The database then registered this value for remote connections. We can check this with the following method:

 

[oracle@srvora01 ~]$ ps -ef | grep pmon
oracle    3660     1  0 09:02 ?        00:00:00 ora_pmon_DB1
oracle    4006     1  0 09:05 ?        00:00:00 ora_pmon_DB2
oracle    6965  3431  0 10:44 pts/1    00:00:00 grep pmon

 

[oracle@srvora01 ~]$ strings /proc/3660/environ | grep TNS_ADMIN
TNS_ADMIN=/u00/app/oracle/network/admin
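
Similarly, for a service statically registered on the listener (the second case above), the session would inherit the listener's environment. A minimal sketch of the same check against the listener process, assuming the process name is tnslsnr and replacing <listener_pid> with the PID reported by ps:

[oracle@srvora01 ~]$ ps -ef | grep tnslsnr
[oracle@srvora01 ~]$ strings /proc/<listener_pid>/environ | grep TNS_ADMIN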

 

Note that we can also get the value of TNS_ADMIN from within a session using the dbms_system.get_env procedure.
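
Here is a minimal sketch of that check from within a SQL*Plus session (dbms_system is owned by SYS, so it requires the execute privilege on it):

set serveroutput on
declare
  l_value varchar2(4000);
begin
  -- reads the variable from the environment of the server process
  sys.dbms_system.get_env('TNS_ADMIN', l_value);
  dbms_output.put_line('TNS_ADMIN=' || l_value);
end;
/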

The solution was to restart the database with the correct TNS_ADMIN value:

 

[oracle@srvora01 ~]$ echo $TNS_ADMIN
/u00/app/oracle/product/11.2.0/db_3_0/network/admin

 

[oracle@srvora01 ~]$ sqlplus / as sysdba
 
SQL*Plus: Release 11.2.0.3.0 Production on Mon Mar 24 10:46:03 2014
 
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
 
 
Connected to:
 
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

 

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

 

SQL> startup
ORACLE instance started.
Total System Global Area 1570009088 bytes
Fixed Size                  2228704 bytes
Variable Size            1023413792 bytes
Database Buffers          536870912 bytes
Redo Buffers                7495680 bytes
Database mounted.
Database opened.

 

[oracle@srvora01 ~]$ ps -ef | grep pmon
oracle    4006     1  0 09:05 ?        00:00:00 ora_pmon_DB2
oracle    7036     1  0 10:46 ?        00:00:00 ora_pmon_DB1
oracle    7116  3431  0 10:46 pts/1    00:00:00 grep pmon

 

[oracle@srvora01 ~]$ strings /proc/7036/environ | grep TNS_ADMIN
TNS_ADMIN=/u00/app/oracle/product/11.2.0/db_3_0/network/admin

 

The value for TNS_ADMIN is now correct.

 

[oracle@srvora01 ~]$ sqlplus system@DB1
 
SQL*Plus: Release 11.2.0.3.0 Production on Mon Mar 24 10:47:21 2014
 
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
 
Enter password:
 
Connected to:
 
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

 

SQL> select instance_name from v$instance@DBLINK;
INSTANCE_NAME
----------------
DB2

 

Remote connections are now using the right tnsnames.ora.

I hope this will help you with your TNS resolution problems.

Oracle 12c Adaptive Plan & inflection point

Fri, 2014-04-11 08:25

The Oracle 12c Adaptive Plan feature was already presented by Nicolas Jardot at OOW 2013: Solving customer issues with the 12c Optimizer.

I recently had to answer several questions about its behavior at execution time. Maybe the term 'adaptive' is misleading: it's not that a join can stop mid-execution and restart with another join method. Even with an adaptive plan, only one join method will ultimately be applied. The feature only defers a decision that was made at parse time in previous versions and that is now made at execution time - after reading a few rows.

In order to show what happens exactly at execution time, I will reproduce in this post the kind of exercise that we do in our Oracle 12c New Features workshop training sessions.
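
As a quick pointer, here is a minimal sketch of how the resulting plan can be displayed in 12c once the statement has been run: the '+adaptive' format of dbms_xplan marks the plan lines that ended up inactive once the inflection point was resolved.

select * from table(dbms_xplan.display_cursor(format=>'+adaptive'));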

How to install and manage a Kerberos Server

Thu, 2014-04-10 19:04


For some time now, I have been working on how to set up a Single Sign-On (SSO) solution in my company. As a big fan of Open Source solutions, I have obviously proposed the implementation of a solution based on Kerberos. What I mean by that is a solution based on the true Kerberos, i.e. MIT Kerberos. Indeed, Kerberos was originally a research project at the Massachusetts Institute of Technology (MIT) in the early 1980s.

Before starting this kind of project, it's important to clearly define and have in mind the following points:

  • Architecture of the enterprise
  • Operating systems used by end users
  • Operating systems used by applications which must be kerberized
  • Is it difficult to kerberize these applications?

The answers to these questions provide insight into which types of solutions are possible. For example, if there are no restrictions on which operating system an end user can use (Windows, different Linux distributions, ...), then introducing a solution based on a Linux Kerberos could be a good idea. That's why in this blog, I will first explain how to install an MIT Kerberos Server. In the second part, I will focus on how to manage this server.

 

1. Install MIT Kerberos

It's important to note that the server and the client should share the same release and that the MIT Kerberos server can only be installed on a Unix-like operating system. The Mac release has been available as part of Mac OS X since version 10.3 (the current release is Mavericks: 10.9). The Key Distribution Center (KDC) is the Kerberos server where all identities (users, computers and kerberized applications) will be stored.

For this installation, let's define the following properties/variables:

  • example.com = the DNS Domain
  • EXAMPLE.COM = the KDC REALM, which should be the DNS domain in UPPERCASE. If there is more than one KDC, all realm names must be unique and self-descriptive
  • kdc01oel.example.com = the FQDN of the KDC
  • 192.168.1.2 = the IP of kdc01oel.example.com

So let's begin with the installation. Obviously, the first thing to do is to download the current release of the MIT Kerberos distribution for the target operating system. This can be done at the following URL: http://web.mit.edu/kerberos/dist/index.html. The current Linux release is krb5-1.12.1-signed.tar:

[root@oel opt]# wget http://web.mit.edu/kerberos/dist/krb5/1.12/krb5-1.12.1-signed.tar
--2014-04-01 14:00:28--  http://web.mit.edu/kerberos/dist/krb5/1.12/krb5-1.12.1-signed.tar
Resolving web.mit.edu... 23.58.214.151
Connecting to web.mit.edu|23.58.214.151|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11950080 (11M) [application/x-tar]
Saving to: “krb5-1.12.1-signed.tar”

100%[===============================================>] 11,950,080  1.52M/s   in 7.3s
2014-04-01 14:00:38 (1.56 MB/s) - “krb5-1.12.1-signed.tar” saved [11950080/11950080]
[root@oel opt]# tar -xvf krb5-1.12.1-signed.tar
krb5-1.12.1.tar.gz
krb5-1.12.1.tar.gz.asc

As you can see, this file is signed, and you can (and should) verify the integrity and identity of the software. This can be done, for example, using GNU Privacy Guard (it requires another file that can be found on the MIT Kerberos download page):

[root@oel opt]# gpg --verify krb5-1.12.1.tar.gz.asc

After that, just extract the MIT Kerberos source code and build it:

[root@oel opt]# tar -zxf krb5-1.12.1.tar.gz
[root@oel opt]# cd krb5-1.12.1/src/
[root@oel src]# ./configure
...
[root@oel src]# yum install *yacc*
...
[root@oel src]# make
...
[root@oel src]# make install
...

At this step, Kerberos should be installed properly and the binaries, libraries and the documentation should be under /usr/local. The default location is sufficient in almost all cases:

[root@oel src]# krb5-config --all
Version:     Kerberos 5 release 1.12.1
Vendor:      Massachusetts Institute of Technology
Prefix:      /usr/local
Exec_prefix: /usr/local

Now that Kerberos is installed properly, the next step is to configure it. This is done through a configuration file named krb5.conf:

[root@oel src]# vi /etc/krb5.conf

[libdefaults]
  default_realm = EXAMPLE.COM
  forwardable = true
  proxiable = true

[realms]
  EXAMPLE.COM = {
    kdc = kdc01oel.example.com:88
    admin_server = kdc01oel.example.com:749
    default_domain = example.com
  }

[domain_realm]
  .example.com = EXAMPLE.COM
  example.com = EXAMPLE.COM

[logging]
  kdc = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmin.log
  default = FILE:/var/log/krb5lib.log

To avoid hostname resolution issues, the file /etc/hosts must contain the fully qualified domain name of the server as well as the IP address:

[root@oel src]# vi /etc/hosts

127.0.0.1         localhost   kdc01oel
192.168.1.2       kdc01oel.example.com   kdc01oel

The next thing to do is to create the realm and the KDC database. Let's begin with the creation of the database parent folder:

[root@oel src]# cd /usr/local
[root@oel local]# mkdir /usr/local/var
[root@oel local]# mkdir /usr/local/var/krb5kdc
[root@oel local]# chmod 700 /usr/local/var/krb5kdc

The krb5.conf file (just above) is the generic Kerberos configuration file, but the KDC also has its own configuration file (kdc.conf). Create this file and populate it as follows:

[root@oel local]# cd /usr/local/var/krb5kdc/
[root@oel krb5kdc]# vi kdc.conf

[kdcdefaults]
  kdc_ports = 749,88

[realms]
  EXAMPLE.COM = {
    database_name = /usr/local/var/krb5kdc/principal
    admin_keytab = /usr/local/var/krb5kdc/kadm5.keytab
    acl_file = /usr/local/var/krb5kdc/kadm5.acl
    key_stash_file = /usr/local/var/krb5kdc/.k5.EXAMPLE.COM
    kdc_ports = 749,88
    max_life = 10h 0m 0s
    max_renewable_life = 7d 0h 0m 0s
  }

So let's create the Kerberos database using this configuration file:

[root@oel krb5kdc]# /usr/local/sbin/kdb5_util create -s
Loading random data
Initializing database '/usr/local/var/krb5kdc/principal' for realm 'EXAMPLE.COM',
master key name 'K/M@EXAMPLE.COM'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify:
[root@oel krb5kdc]#

If there is any error at this point, it is most likely due to a misconfiguration of the /etc/krb5.conf file or because Kerberos can't resolve the hostname (the /etc/hosts file isn't configured properly).

This finally concludes the first part about the installation of the MIT Kerberos Server.


2. Manage the KDC

For this part, I assume that the KDC is set up according to what I've explained above. In the previous part, I only showed how to install the KDC; it isn't actually running yet. So the first thing to do is to configure who will be able to connect to the KDC (that means obtain a ticket) and with which permissions.

To enter the KDC administration console, use kadmin.local (only for the local machine):

[root@oel krb5kdc]# /usr/local/sbin/kadmin.local
Authenticating as principal root/admin@EXAMPLE.COM with password.
kadmin.local:

Once in kadmin.local, several commands can be used to manage the KDC. The following command lists them all:

kadmin.local:  ?
Available kadmin.local requests:

add_principal, addprinc, ank
                         Add principal
delete_principal, delprinc
                         Delete principal
modify_principal, modprinc
                         Modify principal
rename_principal, renprinc
                         Rename principal
change_password, cpw     Change password
get_principal, getprinc  Get principal
list_principals, listprincs, get_principals, getprincs
                         List principals
add_policy, addpol       Add policy
modify_policy, modpol    Modify policy
delete_policy, delpol    Delete policy
get_policy, getpol       Get policy
list_policies, listpols, get_policies, getpols
                         List policies
get_privs, getprivs      Get privileges
ktadd, xst               Add entry(s) to a keytab
ktremove, ktrem          Remove entry(s) from a keytab
lock                     Lock database exclusively (use with extreme caution!)
unlock                   Release exclusive database lock
purgekeys                Purge previously retained old keys from a principal
get_strings, getstrs     Show string attributes on a principal
set_string, setstr       Set a string attribute on a principal
del_string, delstr       Delete a string attribute on a principal
list_requests, lr, ?     List available requests.
quit, exit, q            Exit program.

So, for example, let's create two principals: one with administrator capabilities (xxx/admin) and another one without:

kadmin.local:  addprinc mpatou/admin
WARNING: no policy specified for mpatou/admin@EXAMPLE.COM; defaulting to no policy
Enter password for principal "mpatou/admin@EXAMPLE.COM":
Re-enter password for principal "mpatou/admin@EXAMPLE.COM":
Principal "mpatou/admin@EXAMPLE.COM" created.

kadmin.local:  addprinc mpatou
WARNING: no policy specified for mpatou@EXAMPLE.COM; defaulting to no policy
Enter password for principal "mpatou@EXAMPLE.COM":
Re-enter password for principal "mpatou@EXAMPLE.COM":
Principal "mpatou@EXAMPLE.COM" created.

With a new "listprincs", the two new principals should be displayed, but for now mpatou/admin has no administrative access because this account isn't declared in the access control list. In the ACL file, permissions can be defined using the following characters:

  • A = Addition of users or policies into the KDC database
  • D = Deletion of users or policies from the KDC database
  • M = Modification of users or policies in the KDC database
  • C = Changing principals' passwords in the KDC database
  • I = Inquiries into the database, to list principal information
  • L = Listing of the principals contained in the database
  • * = Grants the user all of the above permissions

Moreover, the wildcard (*) can be used to match more than one user. For example, */admin will match all administrative accounts. Let's create the ACL file as defined in the KDC configuration file (kdc.conf):

kadmin.local:  exit
[root@oel krb5kdc]# vi /usr/local/var/krb5kdc/kadm5.acl
*/admin@EXAMPLE.COM        ADMCIL

So there is kadmin.local for local administration, but there is also a remote administration console: kadmin. This remote access can't be used for now because it needs a file that doesn't exist yet. This file (a "keytab") stores a principal and an encryption key derived from the principal's password. It can be used to log into Kerberos without being prompted for a password, which is why it is useful for all kerberized applications.

[root@oel krb5kdc]# /usr/local/sbin/kadmin.local
Authenticating as principal root/admin@EXAMPLE.COM with password.
kadmin.local:  ktadd -k /usr/local/var/krb5kdc/kadm5.keytab kadmin/admin kadmin/changepw
Entry for principal kadmin/admin with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/admin with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/admin with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/admin with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/changepw with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/changepw with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/changepw with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/changepw with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
kadmin.local:  exit

The location of this kadm5.keytab file is also defined in the kdc.conf file. Now the KDC server processes are ready to start:

[root@oel krb5kdc]# /usr/local/sbin/krb5kdc
[root@oel krb5kdc]# /usr/local/sbin/kadmind

If there is no error, then the KDC should be running and ready to reply to any client with a valid principal. The easiest way to test it is to try to obtain a TGT (Ticket Granting Ticket) using the kinit command:

[root@oel krb5kdc]# cd /usr/local/bin
[root@oel bin]# klist
klist: Credentials cache file '/tmp/krb5cc_0' not found
[root@oel bin]# kinit mpatou
Password for mpatou@EXAMPLE.COM:
[root@oel bin]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: mpatou@EXAMPLE.COM

Valid starting       Expires              Service principal
04/03/2014 09:54:48  04/03/2014 19:54:48  krbtgt/EXAMPLE.COM@EXAMPLE.COM
    renew until 04/04/2014 09:54:47
[root@oel bin]# kdestroy
[root@oel bin]# klist
klist: Credentials cache file '/tmp/krb5cc_0' not found

The klist command can be used to list all existing tickets, whereas kdestroy is used to remove them. The KDC is now fully operational, and some optional additional steps can be done (e.g. setting up slave KDCs).

This finally concludes this blog about how to install an MIT Kerberos Server. If you need more information about Kerberos (MIT, Heimdal or the Active Directory implementation), I strongly suggest reading the book "Kerberos, The Definitive Guide" by Jason Garman. This book was for me the best source of knowledge on this subject.

Best practice for the sending of an Oracle execution plan

Tue, 2014-04-08 07:12

You have a query that takes too long and you want help analyzing the execution plan? Then you need to provide it with relevant information and correctly formatted. Autotrace is not a good option, as it does not bind variables the same way your application does. Explain plan only shows estimations, but if we have a performance issue, the estimation is probably what is wrong. I prefer SQL Monitoring when the Tuning Pack is available, or the cursor plan with row-source statistics in all other cases. If you post it in a forum, don't forget to keep it formatted, or it will be impossible to read.

Here are the two ways I prefer to get an execution plan, depending on whether you have the tuning pack licence or not.

 

Without Tuning Pack

I set the sqlplus environment to nicely spool to a text file and set the STATISTICS_LEVEL to ALL in order to gather plan execution statistics:

 

set pagesize 10000 linesize 300 trimspool on serveroutput off
alter session set statistics_level=all;

 

Then I execute the query. Don't forget to set the current_schema to the user that executes the statement, and to bind variables:

 

-- set the schema
alter session set current_schema=SCOTT;
-- define variables
variable minimum_salary number
-- bind values
exec :minimum_salary := 3000
-- execute the query
SELECT * FROM DEPT JOIN EMP USING(DEPTNO) WHERE SAL>:minimum_salary;

 

Finally I get the execution plan to a text file:

 

spool plan.txt
select * from table(dbms_xplan.display_cursor(format=>'allstats last +outline +peeked_binds'));
spool off

 

The plan will show both the estimations (E-Rows) and the actual number of rows (A-Rows) from the last execution. Note that if it is a parallel query statement, you must omit 'last' from the format, or you will get statistics only for the coordinator process.
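
For example, for a parallel statement the call would simply become (same format string without 'last'):

select * from table(dbms_xplan.display_cursor(format=>'allstats +outline +peeked_binds'));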

 

With Tuning Pack

When you have tuning pack, you have access to the great SQL monitoring feature.

SQL> show parameter pack

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
control_management_pack_access       string      DIAGNOSTIC+TUNING

 

I set the sqlplus environment to nicely spool to an html file:

 

set pagesize 0 linesize 10000 trimspool on serveroutput off long 1000000000 longc 1000000000 echo off feedback off

 

Then I execute the query. Don't forget to set the current_schema to the user that executes the statement, and to bind variables.

One difference here: the MONITOR hint to force SQL Monitoring.

 

-- set the schema
alter session set current_schema=SCOTT;
-- define variables
variable minimum_salary number
-- bind values
exec :minimum_salary := 3000
-- execute the query
SELECT /*+ monitor */ * FROM DEPT JOIN EMP USING(DEPTNO) WHERE SAL>:minimum_salary;

 

Finally, I get the execution plan to an html file:

 

spool plan.htm
select dbms_sqltune.report_sql_monitor(report_level=>'all',type=>'active') from dual;
spool off

 

The html file is very small, but it will be rendered by a flash-based active report automatically loaded from the oracle.com website.

You can see both output, plan.txt, and plan.htm from this zip: xplans.zip

 

Here is how they look (but please never send me screenshots of execution plans...):

 

CapturePlantxt.PNG

 

And the colorful active report from SQL Monitoring:

 

CapturePlanHtm.PNG

 

The goal of this blog post is simply to let you copy/paste the formatting and plan gathering code, so if you have any improvement ideas, please share.

SQL Server: transparent data encryption, key management, backup strategies

Mon, 2014-03-31 01:55

Transparent Data Encryption requires the creation of a database encryption key. This key is part of the SQL Server encryption hierarchy, with the DPAPI at the top of the tree. Then, if we go through the tree from top to bottom, we find the service master key, the database master key, the server certificate or asymmetric key, and finally the database encryption key (AKA the DEK). In this hierarchy, each encryption key is protected by its parent. Encryption key management is one of the toughest tasks in cryptography. Improperly managing the encryption keys can compromise the entire security strategy.

Here is some basic advice on encryption keys:

  • Limit encryption key access to those who really need it
  • Backup encryption keys and secure them. This is important to be able to restore them in case of corruption or disaster recovery scenarios
  • Rotate the encryption keys on a regular basis. Key rotation based on a regular schedule should be part of the IT policy. Leaving the same encryption key in place for lengthy periods of time gives hackers and other malicious persons the time to attack it. By rotating your keys regularly, your keys become a moving target, much harder to hit.

SQL Server uses the ANSI X9.17 hierarchical model for key management, which has certain advantages over a flat single-key model, particularly in terms of key rotation. With SQL Server, rotating the encryption key that protects the database encryption key only requires decrypting and re-encrypting an insignificantly small amount of symmetric key data, not the entire database.
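
In practice, such a rotation boils down to re-encrypting only the DEK with a new certificate, along the lines of this minimal sketch (TDE_Cert_new is a placeholder name; the full walkthrough below uses the same statements):

-- create the new server certificate in master
USE [master];
GO
CREATE CERTIFICATE TDE_Cert_new WITH SUBJECT = 'TDE certificate (rotated)';
GO
-- re-encrypt only the database encryption key, not the data itself
USE AdventureWorks2012;
GO
ALTER DATABASE ENCRYPTION KEY ENCRYPTION BY SERVER CERTIFICATE TDE_Cert_new;
GO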

However, managing the rotation of the encryption keys is very important. Imagine a scenario with a rotation scheduled every day (yes, we are paranoid!!!) and a backup strategy with a full backup every Sunday and a transaction log backup every night for the rest of the week.

 

Sunday      Monday      Tuesday     Wednesday   Thursday    Friday      Saturday
FULL        LOG         LOG         LOG         LOG         LOG         LOG
TDE_Cert1   TDE_Cert2   TDE_Cert3

 

Here is an interesting question I had to answer: If I have a database page corruption on Tuesday morning that requires a restore of the concerned page from the full backup and the couple of backup logs from Monday to Tuesday, does it work with only the third encryption key? In short: do I need all the certificates TDE_Cert1, TDE_Cert2 and TDE_Cert3 in this case?

To answer this, let’s try with the AdventureWorks2012 database and the table Person.Person.

First, we can see the current server certificate used to protect the DEK of the AdventureWorks2012 database (we can correlate this with the certificate thumbprint):

SELECT
       name AS certificate_name,
       pvt_key_encryption_type_desc AS pvt_key_encryption,
       thumbprint
FROM master.sys.certificates
WHERE name LIKE 'TDE_Cert%';
GO

 

billet5_tde_certificate_1

 

SELECT
       DB_NAME(database_id) AS database_name,
       key_algorithm,
       key_length,
       encryptor_type,
       encryptor_thumbprint
FROM sys.dm_database_encryption_keys
WHERE database_id = DB_ID('AdventureWorks2012')

 

billet5_tde_dek_1

 

Now, we perform a full backup of the AdventureWorks2012 database followed by the database log backup:

BACKUP DATABASE [AdventureWorks2012]
TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB.BAK'
WITH INIT, STATS = 10;

BACKUP LOG [AdventureWorks2012]
TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB.TRN'
WITH INIT, STATS = 10;

 

billet5_tde_bckp_1

 

Then, according to our rotation strategy, we replace the old server certificate TDE_Cert with the new one, TDE_Cert_2, to protect the DEK:

-- Create a new server certificate
USE [master];
GO

CREATE CERTIFICATE TDE_Cert_2 WITH SUBJECT = 'TDE Certificate 2';
GO

-- Encrypt the DEK with the new server certificate TDE_Cert_2
USE AdventureWorks2012;
GO

ALTER DATABASE ENCRYPTION KEY
ENCRYPTION BY SERVER CERTIFICATE TDE_Cert_2;
GO

-- Drop the old server certificate TDE_Cert
USE [master];
GO

DROP CERTIFICATE TDE_Cert;
GO

SELECT
       name AS certificate_name,
       pvt_key_encryption_type_desc AS pvt_key_encryption,
       thumbprint
FROM master.sys.certificates
WHERE name LIKE 'TDE_Cert%';
GO

 

billet5_tde_dek_2

 

SELECT
       DB_NAME(database_id) AS database_name,
       key_algorithm,
       key_length,
       encryptor_type,
       encryptor_thumbprint
FROM sys.dm_database_encryption_keys
WHERE database_id = DB_ID('AdventureWorks2012')

 

billet5_tde_certificate_2

 

We then perform a new log backup:

BACKUP LOG [AdventureWorks2012]
TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_2.TRN'
WITH INIT, STATS = 10;

 

billet5_tde_bckp_2

 

Finally, we repeat the same steps as above a last time (rotate the server certificate and perform a new log backup):

-- Create a new server certificate
USE [master];
GO

CREATE CERTIFICATE TDE_Cert_3 WITH SUBJECT = 'TDE Certificate 3';
GO

-- Encrypt the DEK with the new server certificate TDE_Cert_3
USE AdventureWorks2012;
GO

ALTER DATABASE ENCRYPTION KEY
ENCRYPTION BY SERVER CERTIFICATE TDE_Cert_3;
GO

-- Drop the old server certificate TDE_Cert_2
USE [master];
GO

DROP CERTIFICATE TDE_Cert_2;
GO

SELECT
       name AS certificate_name,
       pvt_key_encryption_type_desc AS pvt_key_encryption,
       thumbprint
FROM master.sys.certificates
WHERE name LIKE 'TDE_Cert%';
GO

 

billet5_tde_certificate_3

 

SELECT
       DB_NAME(database_id) AS database_name,
       key_algorithm,
       key_length,
       encryptor_type,
       encryptor_thumbprint
FROM sys.dm_database_encryption_keys
WHERE database_id = DB_ID('AdventureWorks2012')

 

billet5_tde_dek_3

 

BACKUP LOG [AdventureWorks2012]
TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_3.TRN'
WITH INIT, STATS = 10;

 

billet5_tde_bckp_3

 

So we have completed our backup strategy with a full backup and a sequence of 3 transaction log backups before introducing a database corruption. At the same time, we have performed a rotation of 3 server certificates as encryption keys. Now it's time to corrupt a data page that belongs to the table Person.Person in the AdventureWorks2012 database:

-- First we check IAM page to get a page ID that belongs to the Person.Person table
DBCC IND(AdventureWorks2012, 'Person.Person', 1);
GO

 

billet5_tde_dbcc_ind_person_person

 

Then we randomly take a page from the result, with the ID 2840. To quickly corrupt the page, we use the undocumented DBCC WRITEPAGE command as follows (/!\ Don't use DBCC WRITEPAGE in a production environment /!\):

 

ALTER DATABASE AdventureWorks2012 SET SINGLE_USER;
GO

DBCC WRITEPAGE(AdventureWorks2012, 1, 2840, 0, 2, 0x1111, 1);
GO

ALTER DATABASE AdventureWorks2012 SET MULTI_USER;
GO


We corrupt the page with the ID 2840 by writing two bytes with the value 0x1111 at offset 0. The last parameter (directORBufferpool) allows page checksum failures to be simulated by bypassing the buffer pool and flushing the concerned page directly to disk. We have to switch the AdventureWorks2012 database to single user mode in order to use this option.

Now let's try to get data from the Person.Person table:

USE AdventureWorks2012;
GO

SELECT * FROM Person.Person;
GO

 

As expected, a logical consistency I/O error with an incorrect checksum occurs while reading the Person.Person table, with the following message:

 

billet5_tde_error_consistency

 

At this point, we have two options:

  • Try to run DBCC CHECKDB with the REPAIR option, but we will likely lose data in this case
  • Restore the page ID 2840 from a consistent full back up and the necessary sequence of transaction log backups after taking a tail log backup

We are reasonable and decide to restore the page 2840 from the necessary backups, but first, we have to perform a tail log backup:

 

USE [master];
GO

-- tail log backup
BACKUP LOG [AdventureWorks2012]
TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_TAILLOG.TRN'
WITH NORECOVERY, INIT, STATS = 10;

...

Now we begin our restore process by trying to restore the concerned page from the full backup, but we encounter the first problem:

 

-- Restore the page ID 2840 from the full backup
RESTORE DATABASE AdventureWorks2012 PAGE = '1:2840'
FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB.BAK'
WITH NORECOVERY, STATS = 10;
GO


billet5_tde_restore_page_full_backup_error


According to the above error message, we can't restore the page from this full backup media because it is protected by a server certificate. The displayed thumbprint corresponds to the TDE_Cert certificate, which was deleted during the rotation. At this point, we understand why it is important to have a backup of the server certificate stored somewhere: this is where the key management basics mentioned at the beginning come back into play.
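
For reference, a hedged sketch of how such a certificate backup is typically taken right after the certificate is created (the file names follow the pattern used in the restore below):

BACKUP CERTIFICATE TDE_Cert
TO FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert.cer'
WITH PRIVATE KEY (
       FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert.pvk',
       ENCRYPTION BY PASSWORD = 'P@$$w0rd'
);
GO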

Of course, we were on the safe side and performed a backup of each server certificate after their creation. Thus, we can restore the server certificate TDE_Cert:

 

USE [master];
GO

CREATE CERTIFICATE TDE_Cert
FROM FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert.cer'
WITH PRIVATE KEY (
       FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert.pvk',
       DECRYPTION BY PASSWORD = 'P@$$w0rd'
);
GO


Then, if we try to restore the page from the full database backup, it now works:

billet5_tde_restore_page_full_backup_success

 

To continue with the restore process, we now have to restore the transaction log backup sequence, beginning with the ADVENTUREWORKS2012_DB.TRN media:

 

RESTORE LOG [AdventureWorks2012]
FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB.TRN'
WITH NORECOVERY, STATS = 10;
GO


billet5_tde_restore_page_full_backup_success


Then we try to restore the second transaction log backup ADVENTUREWORKS2012_DB_2.TRN and we face the same problem as with the full backup earlier. To open the backup media, we first have to restore the certificate with the thumbprint displayed below:

 

RESTORE LOG [AdventureWorks2012]
FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_2.TRN'
WITH NORECOVERY, STATS = 10;
GO


billet5_tde_restore_page_tran_log_backup_1_success


Ok, we have to restore the TDE_Cert_2 certificate …

 

CREATE CERTIFICATE TDE_Cert_2
FROM FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert_2.cer'
WITH PRIVATE KEY (
       FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert_2.pvk',
       DECRYPTION BY PASSWORD = 'P@$$w0rd'
);
GO


… and we retry to restore the second transaction log. As expected, it works:

 

billet5_tde_restore_page_tran_log_backup_2_success

At this point, we have only two transaction log backups left to restore: ADVENTUREWORKS2012_DB_3.TRN and the tail log backup ADVENTUREWORKS2012_DB_TAILLOG.TRN. Fortunately, these last two backup media are encrypted by TDE_Cert_3, which is the current server certificate protecting the DEK.

 

RESTORE LOG [AdventureWorks2012]
FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_3.TRN'
WITH NORECOVERY, STATS = 10;
GO


billet5_tde_restore_page_tran_log_backup_3_success

 

RESTORE LOG [AdventureWorks2012]
FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_TAILLOG.TRN'
WITH RECOVERY, STATS = 10;
GO


billet5_tde_restore_page_tran_log_backup_4_success

 

The restore process is now finished and we can read the data from the Person.Person table without any problem:

 

USE AdventureWorks2012;
GO

SELECT * FROM Person.Person


billet5_tde_select_person_person_table

 

To summarize, we have seen in this post the importance of good key management alongside the backup / restore strategy. Of course, we chose a paranoid scenario to highlight the problem quickly, but you can easily transpose it to a normal context with a normal rotation schedule of the encryption keys - whether it is a server certificate, an asymmetric key, or a third-party tool.

So what about you, how do you manage your backup strategy with the rotation of the encryption keys?

Kerberos SSO with Documentum D2 3.1 SP1

Thu, 2014-03-27 02:06


Last week, when I was working on how to set up Kerberos SSO for D2 3.1 SP1, I faced some issues due to documentation that doesn't seem to be up to date… In fact, our version is D2 3.1 SP1 and there is no specific documentation for SP1. Consequently, I read both the D2 3.1 Installation Guide and the D2 4.0 Installation Guide. The first time I read the D2 3.1 documentation, I found it very light and I knew I would have some problems. Fortunately, I already had experience with Kerberos, essentially with the MIT distribution on Linux (how to set up a KDC, kerberize an application, manage users, etc.).

The first important thing to know is that, as D2 isn't a WDK client, setting up SSO using Kerberos only involves the D2 host server and the Active Directory server. There is no need to set up the Content Server to use Kerberos. So here is the configuration that was used:

  • Microsoft Active Directory on Windows Server 2008 R2. Let’s name this server ad001 and the related domain domain.com
  • D2 3.1 SP1 on Microsoft Windows Server 2008 R2. Let's name this server wd231

The second important thing to know is that not all Directory Servers are supported. Indeed, Documentum doesn't support a Linux Key Distribution Center (KDC).


1. D2 3.1 documentation steps


The first part of this blog describes the steps the official D2 3.1 Installation Guide provides to help administrators set up SSO using Kerberos in D2. You will see that those steps aren't very descriptive, but with a little bit of imagination, you can do something with them.

a. Edit the shiro.ini file

Open the file shiro.ini and add the following lines:

[main]
D2-Kerberos=eu.c6.d2.web.filters.authc.D2KerberosHttpAuthenticationFilter
D2-Kerberos.servicePrincipal=HTTP/computerName.domainName
D2-Kerberos.krbConfLocation=C:/Windows/krb5.ini
D2-Kerberos.keyTabLocation=C:/computerName.keytab
D2-Kerberos.docbases=docbase1,login1,password1,domain1|docbase2,...
D2-Kerberos.debug=true
[urls]
/** = D2-Kerberos

b. On the AD

Create a user on the AD with the computer name of your application server and add the following options:

  • Use Kerberos DES encryption types for this account
  • This account supports Kerberos AES 128 bit encryption

Generate a keytab using the command below. Well, in fact the official documentation only displays the command and doesn't explain what it is or where to execute it.

ktpass1.png

Transfer this keytab to the D2 host server.

c. krb5.ini

[libdefaults]
    default_realm = DOMAINNAME
[realms]
    DOMAINNAME = {
        kdc = adserver.domainName
    }

d. On Windows 7

  1. Run gpedit.msc
  2. Expand “Local Computer Policy” / “Computer Configuration” / “Windows Settings” / “Security Settings” / “Local Policies” / “Security Options”
  3. Double click “Network security: Configure encryption types allowed for Kerberos”
  4. Select all
  5. Press “OK”

 

2. D2 3.1 SP1 steps


The second part of this blog presents the steps that must be done to get Kerberos SSO working with D2 3.1 SP1. If you only follow the steps described in the official documentation, as some explanations are missing, you will probably run into issues. Here are the steps that were required to get SSO working on our D2 3.1 SP1:

Let's begin with the beginning, contrary to the official documentation, which begins with the end. How logical!

a. On the Active Directory

Create a user on the Active Directory with the following properties:

  • Username doesn't matter (it doesn't need to be the D2 server hostname). Let's name this user: dmskrbsso
  • Password: dmskrbssoPassword
  • Password never expires
  • This account supports Kerberos AES 128 bit encryption
  • Trust for Delegation to any service (Kerberos Only)
  • This account supports Kerberos DES encryption types

This last setting isn't mandatory, as Kerberos will always use the most secure encryption type available (AES 256 in general). So there is no problem if your Active Directory admin doesn't want to enable DES encryption, as it isn't secure enough.

When the user is successfully created, open an administrator command prompt and create the keytab for D2 using the following command:

ktpass2.png
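
Since the command itself only appears as a screenshot, here is a hedged sketch of what the ktpass call typically looks like for this setup (run on the AD server from an administrator prompt; the principal, account and output file names follow the values defined above and should be adapted to your environment):

ktpass /princ HTTP/wd231.domain.com@DOMAIN.COM /mapuser dmskrbsso@domain.com /pass dmskrbssoPassword /crypto AES128-SHA1 /ptype KRB5_NT_PRINCIPAL /out C:\wd231.keytab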

This command creates a file named wd231.keytab, which should be transferred to the D2 host server. Let's place this file at the following location: C:/Kerberos/wd231.keytab

b. On Documentum Administrator

Create a repository user with the same Name, Login Name and password as the one created on the AD:

  • State: Active
  • Name: dmskrbsso
  • Login Name: dmskrbsso
  • Login Domain: domain.com
  • User Source: Inline Password
  • Password: dmskrbssoPassword
  • Privileges: None
  • Extended Privileges: None
  • Client Capability: Consumer

This first user just reflects the new user created on the AD, but I think this user isn't mandatory.

Create another repository user which will be used by the shiro.ini file to connect all other users through SSO:

  • State: Active
  • Name: d2krbsso
  • Login Name: d2krbsso
  • Login Domain: domain.com (This is mandatory! The SSO will not work without the AD domain here)
  • User Source: Inline Password
  • Password: d2krbssoPassword
  • Privileges: Superuser
  • Extended Privileges: None
  • Client Capability: Consumer


From a command prompt on the D2 server, execute the following command to get the encrypted password of the user d2krbsso:

java.png

This command assumes that your D2-Client web application is located at “C:/Tomcat6D2-Client31SP1/webapps/D2-Client/”. Remember the result of this command, as the encrypted password will be needed later in the shiro.ini file. Let's name this encrypted password userEncryptedPw.

c. krb5.ini

Create the file C:/Kerberos/krb5.ini with the following content:

[libdefaults]
  default_realm = DOMAIN.COM
[realms]
  DOMAIN.COM = {
    kdc = ad001.domain.com
  }
[domain_realm]
  .domain.com = DOMAIN.COM
  domain.com = DOMAIN.COM
[logging]
  default = FILE:C:/Kerberos/logs/kdc_default.log
  kdc = FILE:C:/Kerberos/logs/kdc.log

Don’t forget to create the folder C:/Kerberos/logs/.

d. Edit the shiro.ini file

The shiro.ini file is the file used by D2 to authenticate users through Kerberos. This file can be found at “C:/Tomcat6D2-Client31SP1/webapps/D2-Client/WEB-INF/classes/shiro.ini”. Replace the properties in this file with the following:

[main]
D2-Kerberos=com.emc.d2.web.filters.authc.D2KerberosHttpAuthenticationFilter
D2-Kerberos.servicePrincipal=HTTP/
D2-Kerberos.krbConfLocation=C:/Kerberos/krb5.ini
D2-Kerberos.keyTabLocation=C:/Kerberos/wd231.keytab
D2-Kerberos.docbases=docbase1,d2krbsso,userEncryptedPw,DOMAIN.COM
D2-Kerberos.debug=true
[urls]
/**=D2-Kerberos

docbase1 corresponds to a repository using Kerberos. You can set more than one docbase using the following property:
D2-Kerberos.docbases=docbase1,login1,password1,domain1|docbase2,...

Maybe you didn't notice the difference with the official documentation, but if you look at the property named “D2-Kerberos”, you will see that the Java class D2KerberosHttpAuthenticationFilter isn't in the same package. In our D2 3.1 SP1, this class is located in com.emc.d2.web.filters, whereas the D2 3.1 official documentation places it in eu.c6.d2.web.filters. Funnily enough, the D2 4.0 official documentation indicates yet another location: eu.c6.d2.portal.server.filters.

e. Verify the Java Runtime Environment

It seems that Kerberos SSO for D2 requires a Java 1.6 JRE. The simplest way to verify this is to check the service configuration of your application server. Another way is to take a look at the registry:

  • Run regedit
  2. Navigate to HKEY_LOCAL_MACHINE / SOFTWARE / Wow6432Node / Apache Software Foundation / Procrun 2.0 / tomcat6D231SP1 / Parameters / Java
  3. Verify that the JVM parameter points to Java 1.6: C:/Program Files/Java/jre6/bin/server/jvm.dll

This assumes that the JAVA_HOME of your application server is C:/Program Files/Java/jre6/.
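
A quick additional check from a command prompt, assuming the path above:

"C:\Program Files\Java\jre6\bin\java" -version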

f. Install the Java Cryptography Extension (JCE)

Download the Java Cryptography Extension (JCE) Unlimited Strength policy files for the Java version used by the JVM and copy both jars into:
C:/Program Files/Java/jre6/lib/security/

g. Restart

Restart your D2 application server and look for errors in the Tomcat log file:
C:/Tomcat6D2-Client31SP1/logs/tomcat6d231sp1-stdout.YYYY-MM-DD.log

h. User configuration

For D2 Kerberos SSO, there is no need to change anything in the user properties. That means that, for example, a Login Domain of LDAP and a User Source of LDAP are fine.

i. On client computer

  1. Run gpedit.msc
  2. Expand “Local Computer Policy” / “Computer Configuration” / “Windows Settings” / “Security Settings” / “Local Policies” / “Security Options”
  3. Double click “Network security: Configure encryption types allowed for Kerberos”
  4. Select all
  5. Press “OK”


Please be aware that the D2-Client URL must be detected by Internet Explorer as an “Intranet Site”. This can be configured through the Internet Explorer options.

This finally concludes the configuration of Kerberos SSO in D2 3.1 SP1. To get a Kerberos ticket, just log in on the client machine with a user defined in AD; if you have followed the steps above, SSO should work. If it doesn't, please let me know and I might be able to help.
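
As a quick hedged check on the client side, the klist utility shipped with Windows 7 lists the tickets in the current user's cache; after browsing to the D2-Client URL, a service ticket for the HTTP service principal of the D2 server should appear (klist purge clears the cache if you need to retest):

C:\> klist
C:\> klist purge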

Good luck!

Transparent data encryption, key management and backup strategies

Wed, 2014-03-26 01:55

Transparent Data Encryption requires the creation of a database key encryption. The database key is a part of the hierarchy of SQL Server encryption tree with at the top of the tree the DPAPI. Then if we traverse the tree from the top to bottom we can find the service master key, the database master key, the server certificate or the asymmetric key and finally the database encryption key (AKA the DEK). In this hierarchy each encryption key is protected by its parent. Encryption key management is one of the toughest tasks in cryptography. Managing improperly the encryption keys can compromises the entire security strategy. Here the basis of encryption key:

  • Limit encryption key access to only those who really need it
  • Backup encryption keys and secure them. This is important we can restore them in case of corruption or disaster recovery scenarios
  • Rotate the encryption keys on regular basis. Key rotation based on a regular schedule should be part of the IT policy. Leave the same encryption key in place for lengthy periods of time give hackers and other malicious persons the time to attack it. By rotating your keys regularly your keys become a moving target, much harder to hit.

SQL Server uses the ANSI X.917 hierarchical model for key management which has certain advantages over a flat single-model for encryption keys, particularly in terms of key rotation. With SQL Server, rotate the encryption key that protects the database encryption key requires decrypting and reencrypting an insignificantly small amount of symmetric key data and not the entire database.

However manage the rotate of the encryption key is very important. Imagine a scenario with a schedule rotate every day (yes we are paranoid!!!) and you have a strategy backup with a full back up every Sunday and a transaction log backup every night between Monday and Sunday.

 

Sunday

Monday

Tuesday

Wednesday

Thursday

Friday

Saturday

FULL

LOG

LOG

LOG

LOG

LOG

LOG

TDE_Cert1

TDE_Cert2

TDE_Cert3

 

Here an interesting question I had to answer: If I have a database page corruption on Thuesday morning that requires a restore of the concerned page from the full backup and the couple of backup logs from Monday to Tuesday does it work with only the third encryption key? In others do I need all the certificates TDE_Cert1, TDE_Cert2 and TDE_Cert3 in this case?

To answer, let’s try with the AdventureWorks2012 database and the table Person.Person.

First we can see the current server certificate used to protect the DEK of the AdventureWorks2012 database (we can correlate with the certificate thumbprint) :

SELECT        name AS certificate_name,        pvt_key_encryption_type_desc AS pvt_key_encryption,        thumbprint FROM master.sys.certificates WHERE name LIKE 'TDE_Cert%'; GO

 

billet5_tde_certificate_1

 

SELECT        DB_NAME(database_id) AS database_name,        key_algorithm,        key_length,        encryptor_type,        encryptor_thumbprint FROM sys.dm_database_encryption_keys WHERE database_id = DB_ID('AdventureWorks2012')

 

billet5_tde_dek_1

 

Now we perform a full backup of the AdventureWorks2012 database following by the database log backup:

BACKUP DATABASE [AdventureWorks2012] TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB.BAK' WITH INIT, STATS = 10;   BACKUP LOG [AdventureWorks2012] TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB.TRN' WITH INIT, STATS = 10;

 

billet5_tde_bckp_1

 

Then according to our rotate strategy we change the old server certificate TDE_Cert by the new one TDE_Cert_2 to protect the DEK

-- Create a new server certificate USE [master]; GO   CREATE CERTIFICATE TDE_Cert2 WITH SUBJECT = 'TDE Certificat 2';   -- Encrypt the DEK by the new server certificate TDE_Cert_2 USE AdventureWorks2012; GO   ALTER DATABASE ENCRYPTION KEY ENCRYPTION BY SERVER CERTIFICATE TDE_Cert_2; GO -- Drop the old server certificate TDE_Cert USE [master]; GO   DROP CERTIFICATE TDE_Cert; GO   SELECT        name AS certificate_name,        pvt_key_encryption_type_desc AS pvt_key_encryption,        thumbprint FROM master.sys.certificates WHERE name LIKE 'TDE_Cert%'; GO

 

billet5_tde_dek_2

 

SELECT        DB_NAME(database_id) AS database_name,        key_algorithm,        key_length,        encryptor_type,        encryptor_thumbprint FROM sys.dm_database_encryption_keys WHERE database_id = DB_ID('AdventureWorks2012')

 

billet5_tde_certificate_2

 

We perform again a new backup log:

BACKUP LOG [AdventureWorks2012] TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_2.TRN' WITH INIT, STATS = 10;

 

billet5_tde_bckp_2

 

Finally we repeat the same steps as above a last time (rotate the server certificate and perform a new log backup) :

-- Create a new server certificate USE [master]; GO   CREATE CERTIFICATE TDE_Cert3 WITH SUBJECT = 'TDE Certificat 3';   -- Encrypt the DEK by the new server certificate TDE_Cert_3 USE AdventureWorks2012; GO   ALTER DATABASE ENCRYPTION KEY ENCRYPTION BY SERVER CERTIFICATE TDE_Cert_3; GO   -- Drop the old server certificate TDE_Cert USE [master]; GO   DROP CERTIFICATE TDE_Cert_2; GO   SELECT        name AS certificate_name,        pvt_key_encryption_type_desc AS pvt_key_encryption,        thumbprint FROM master.sys.certificates WHERE name LIKE 'TDE_Cert%'; GO

 

billet5_tde_certificate_3

 

SELECT        DB_NAME(database_id) AS database_name,        key_algorithm,        key_length,        encryptor_type,        encryptor_thumbprint FROM sys.dm_database_encryption_keys WHERE database_id = DB_ID('AdventureWorks2012')

 

billet5_tde_dek_3

 

BACKUP LOG [AdventureWorks2012] TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_3.TRN' WITH INIT, STATS = 10;

 

billet5_tde_bckp_3

 

So, we have achieved our backup strategy with a full backup and a sequence of 3 transaction logs backups before to initiate next a database corruption. In the same time we have perform the rotate of 3 server certificates as encryption keys. Now it’s time to corrupt a data page that belongs to the table Person.Person into the AdventureWorks2012 database:

-- First we check IAM page to get a page ID that belongs to the Person.Person table DBCC IND(AdventureWorks2012, 'Person.Person', 1); GO

 

billet5_tde_dbcc_ind_person_person

 

Then we take randomly page from the result with the ID 2840. Then to corrupt quickly the page we use the undocumented DBCC WRITEPAGE as following (/! Don’t use DBCC WRITEPAGE in production environment /!)

 

ALTER DATABASE AdventureWorks2012 SET SINGLE_USER; GO   DBCC WRITEPAGE(AdventureWorks2012, 1, 2840, 0, 2, 0x1111, 1); GO   ALTER DATABASE AdventureWorks2012 SET MULTI_USER; GO


We corrupt the page with ID 2840 by introducing at the offset 0 two bytes with a global value of 0x1111. The last directORBufferpool option allows page checksum failures to be simulated by bypassing the bufferpool and flush directly the concerned page to the disk. We have to switch the AdventureWorks2012 database in the single user mode in order to use this option.

No let’s trying to get data from the Person.Person table:

USE AdventureWorks2012; GO   SELECT * FROM Person.Person; GO

 

As expected a logical consistency I/O error with an incorrect checksum occurs during the reading of the Person.Person table with the following message:

 

billet5_tde_error_consistency

 

At this point we had two options:

  • Trying to run DBCC CHECKDB and the REPAIR option but we can likely loss data in this case
  • Restore the page ID 2840 from a consistent full back up and the necessary sequence of transaction log backups after taking a tail log backup

We are reasonable and we decide to restore the page 2840 from the necessary backups but first we have to take a tail log backup:

 

USE [master]; GO   -- tail log backup BACKUP LOG [AdventureWorks2012] TO DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_TAILLOG.TRN' WITH NORECOVERY, INIT, STATS = 10;

...

Now we begin our restore process by trying to restore the corrupted page from the full backup, but we hit a first problem:

 

-- Restore the page ID 2840 from the full backup
RESTORE DATABASE AdventureWorks2012 PAGE = '1:2840'
FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB.BAK'
WITH NORECOVERY, STATS = 10;
GO


billet5_tde_restore_page_full_backup_error


According to the above error message, we cannot restore the page from this full backup media because it is protected by a server certificate. The displayed thumbprint corresponds to the TDE_Cert certificate, which was dropped during the rotation. At this point we understand why it is important to keep a backup of each server certificate somewhere safe: this is the very basis of encryption and key management.
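As a reminder, such a backup of the server certificate is typically taken right after its creation. A minimal sketch (the file paths and password are illustrative, not those of the original environment):

USE [master];
GO

-- Back up the server certificate and its private key to files kept away from the server
BACKUP CERTIFICATE TDE_Cert
TO FILE = 'E:\TDE_Cert.cer'
WITH PRIVATE KEY
(
       FILE = 'E:\TDE_Cert.pvk',
       ENCRYPTION BY PASSWORD = 'P@$$w0rd'
);
GO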

Fortunately, we are safe: we did back up each server certificate after its creation, so we can restore the server certificate TDE_Cert:

 

USE [master];
GO

CREATE CERTIFICATE TDE_Cert
FROM FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert.cer'
WITH PRIVATE KEY
(
       FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert.pvk',
       DECRYPTION BY PASSWORD = 'P@$$w0rd'
);
GO


If we then retry the page restore from the full database backup, it now works:

billet5_tde_restore_page_full_backup_success

 

To continue the restore process, we now have to restore the transaction log backup sequence, beginning with the ADVENTUREWORKS2012_DB.TRN media:

 

RESTORE LOG [AdventureWorks2012] FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB.TRN' WITH NORECOVERY, STATS = 10; GO


billet5_tde_restore_page_full_backup_success


Then we try to restore the second transaction log backup, ADVENTUREWORKS2012_DB_2.TRN, and we face the same problem as with the full backup earlier. Before we can open the backup media, we have to restore the certificate with the thumbprint displayed below:

 

RESTORE LOG [AdventureWorks2012] FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_2.TRN' WITH NORECOVERY, STATS = 10; GO


billet5_tde_restore_page_tran_log_backup_1_success


Ok we have to restore the TDE_Cert_2 certificate …

 

CREATE CERTIFICATE TDE_Cert_2
FROM FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert_2.cer'
WITH PRIVATE KEY
(
       FILE = 'E:SQLSERVERENCRYPTEDBACKUPTDE_Cert_2.pvk',
       DECRYPTION BY PASSWORD = 'P@$$w0rd'
);
GO


… And we retry to restore the second transaction log. As expected it works:

 

billet5_tde_restore_page_tran_log_backup_2_success

At this point, only two transaction log backups remain to be restored: ADVENTUREWORKS2012_DB_3.TRN and the tail-log backup. Fortunately, these last two backup media are encrypted by TDE_Cert_3, which is the current server certificate protecting the DEK.

 

RESTORE LOG [AdventureWorks2012] FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_3.TRN' WITH NORECOVERY, STATS = 10; GO


billet5_tde_restore_page_tran_log_backup_3_success

 

RESTORE LOG [AdventureWorks2012] FROM DISK = 'E:SQLSERVERENCRYPTEDBACKUPADVENTUREWORKS2012_DB_TAILLOG.TRN' WITH RECOVERY, STATS = 10; GO


billet5_tde_restore_page_tran_log_backup_4_success

 

The restore process is now finished and we can read data from the Person.Person table without any problem:

 

USE AdventureWorks2012; GO   SELECT * FROM Person.Person


billet5_tde_select_person_person_table

To summarize, this post has shown how important good key management is for the backup / restore strategy. Of course, we used a paranoid scenario to highlight the problem quickly, but you can easily transpose it to a normal context with a reasonable rotation schedule of the encryption keys, whether they are server certificates, asymmetric keys or keys managed by a third-party tool. And you, how do you manage your backup strategy together with the rotation of encryption keys?
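As a starting point, a query along these lines (my own sketch, assuming the pvt_key_last_backup_date column exposed by sys.certificates in recent versions) shows which server certificates currently protect a DEK and when their private key was last backed up:

SELECT
       c.name AS certificate_name,
       c.pvt_key_last_backup_date,
       DB_NAME(dek.database_id) AS encrypted_database
FROM master.sys.certificates AS c
       LEFT JOIN sys.dm_database_encryption_keys AS dek
             ON dek.encryptor_thumbprint = c.thumbprint
WHERE c.name LIKE 'TDE_Cert%';
GO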



The consequences of NOLOGGING in Oracle

Sun, 2014-03-23 11:04

While answering a question on an Oracle forum about NOLOGGING consequences, I provided a test case that deserves a bit more explanation. Nologging operations are good for generating minimal redo on bulk operations (direct-path inserts, index creation/rebuild). But if we have to restore a backup that was made before the nologging operation, we lose data. And even if we can accept that, we have some manual operations to do.

Here is the full testcase.

 

I create a tablespace and backup it:


RMAN> create tablespace demo datafile '/tmp/demo.dbf' size 10M;
Statement processed

RMAN> backup tablespace demo;
Starting backup at 23-MAR-14
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=30 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00005 name=/tmp/demo.dbf
channel ORA_DISK_1: starting piece 1 at 23-MAR-14
channel ORA_DISK_1: finished piece 1 at 23-MAR-14
piece handle=/u01/app/oracle/fast_recovery_area/U1/backupset/2014_03_23/o1_mf_nnndf_TAG20140323T160453_9lxy0pfb_.bkp tag=TAG20140323T160453 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 23-MAR-14

 

 

I create a table and an index, both in NOLOGGING


RMAN> create table demo ( dummy not null ) tablespace demo nologging as select * from dual connect by level <=1000;
Statement processed

RMAN> create index demo on demo(dummy) tablespace demo nologging;
Statement processed

 

Note how I like 12c for doing anything from RMAN...

Because I will need it later, I do a treedump of my index:


RMAN> begin
2>  for o in (select object_id from dba_objects where owner=user and object_name='DEMO' and object_type='INDEX')
3>   loop execute immediate 'alter session set tracefile_identifier=''treedump'' events ''immediate trace name treedump level '||o.object_id||'''';
4> end loop;
5> end;
6> /
Statement processed

 

Here is the content of my treedump trace file:


----- begin tree dump
branch: 0x140008b 20971659 (0: nrow: 2, level: 1)
   leaf: 0x140008c 20971660 (-1: nrow: 552 rrow: 552)
   leaf: 0x140008d 20971661 (0: nrow: 448 rrow: 448)
----- end tree dump

 

Because of the nologging, the tablespace is 'unrecoverable' and we will see what it means.


RMAN> report unrecoverable;
Report of files that need backup due to unrecoverable operations
File Type of Backup Required Name
---- ----------------------- -----------------------------------
5    full or incremental     /tmp/demo.dbf

 

RMAN tells me that I need to do a backup, which is the right thing to do after nologging operations. But here my goal is to show what happens when we have to restore a backup that was done before the nologging operations.

I want to show that the issue does not only concern the data that I've loaded, but any data that may come later into the blocks that have been formatted by the nologging operation. So I'm deleting the rows and inserting a new one.


2> delete from demo;
Statement processed

RMAN> insert into demo select * from dual;
Statement processed

 

Time to restore the tablespace from the backup that has been done before the nologging operation:


RMAN> alter tablespace demo offline;
Statement processed

RMAN> restore tablespace demo;
Starting restore at 23-MAR-14
using channel ORA_DISK_1

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00005 to /tmp/demo.dbf
channel ORA_DISK_1: reading from backup piece /u01/app/oracle/fast_recovery_area/U1/backupset/2014_03_23/o1_mf_nnndf_TAG20140323T160453_9lxy0pfb_.bkp
channel ORA_DISK_1: piece handle=/u01/app/oracle/fast_recovery_area/U1/backupset/2014_03_23/o1_mf_nnndf_TAG20140323T160453_9lxy0pfb_.bkp tag=TAG20140323T160453
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
Finished restore at 23-MAR-14

RMAN> recover tablespace demo;
Starting recover at 23-MAR-14
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 23-MAR-14

RMAN> alter tablespace demo online;
Statement processed

 

We can check the unrecoverable tablespace


RMAN> report unrecoverable;
Report of files that need backup due to unrecoverable operations
File Type of Backup Required Name
---- ----------------------- -----------------------------------
5    full or incremental     /tmp/demo.dbf

 

but we don't know which objects are concerned until we try to read from them:


RMAN> select /*+ full(demo) */ count(*) from demo;
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of sql statement command at 03/23/2014 16:05:03
ORA-01578: ORACLE data block corrupted (file # 5, block # 131)
ORA-01110: data file 5: '/tmp/demo.dbf'
ORA-26040: Data block was loaded using the NOLOGGING option


RMAN> select /*+ index(demo) */ count(*) from demo;
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of sql statement command at 03/23/2014 16:05:04
ORA-01578: ORACLE data block corrupted (file # 5, block # 140)
ORA-01110: data file 5: '/tmp/demo.dbf'
ORA-26040: Data block was loaded using the NOLOGGING option

 

So I can't read from the table because of block (file # 5, block # 131) which is corrupted and I can't read from the index because of block (file # 5, block # 140) which is corrupted. The reason is that recovery was not possible on them as there was no redo to protect them from the time they were formatted (by the nologging operation).
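As a side note, a way to list all such blocks without hitting them one by one is to validate the datafiles and query the corruption view. A minimal sketch, reusing the DEMO tablespace of this test case:

RMAN> validate tablespace demo;
RMAN> select * from v$database_block_corruption;

The blocks formatted by the nologging operation should show up there with CORRUPTION_TYPE=NOLOGGING.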

Let's see which blocks were reported:


RMAN> select segment_type,header_file,header_block , dbms_utility.make_data_block_address(header_file,header_block) from dba_segments where owner=user and segment_name='DEMO';
SEGMENT_TYPE       HEADER_FILE HEADER_BLOCK
------------------ ----------- ------------
DBMS_UTILITY.MAKE_DATA_BLOCK_ADDRESS(HEADER_FILE,HEADER_BLOCK)
--------------------------------------------------------------
INDEX                        5          138
                                                      20971658
 
TABLE                        5          130
                                                      20971650
 
RMAN> select dbms_utility.make_data_block_address(5, 140) from dual;

DBMS_UTILITY.MAKE_DATA_BLOCK_ADDRESS(5,140)
-------------------------------------------
                                   20971660

 

The full scan failed as soon as it reads the block 131 which is the first one that contains data. The segment header block itself was protected by redo.

For the index, the query failed on block 140, which is the first leaf (this is why I did a treedump above). The root branch (which always comes right after the segment header) seems to be protected by redo even for a nologging operation. The reason I checked that is that in the first test case I posted in the forum, I had a very small table for which the index had only one leaf - which is the root branch as well - so the index was still recoverable.

The important point to know is that the index is still valid:


RMAN> select status from all_indexes where index_name='DEMO';
STATUS  
--------
VALID   

 

And the only solution is to truncate the table:


RMAN> truncate table demo;
Statement processed


RMAN> select /*+ full(demo) */ count(*) from demo;
  COUNT(*)
----------
         0

RMAN> select /*+ index(demo) */ count(*) from demo;
  COUNT(*)
----------
         0

 

no corruption anymore, but no data either...

Last point: if only the indexes are unrecoverable, you can rebuild them. But because the index is still flagged VALID, Oracle will try to read it in order to rebuild it - and fail with ORA-26040. You have to make them unusable first, as sketched below.
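A minimal sketch of that sequence for the DEMO index of this test case:

RMAN> alter index demo unusable;
RMAN> alter index demo rebuild;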

The core message is:

  • Use nologging only when you accept to lose data and to have some manual operations to do after recovery (so document it): truncate the table, make indexes unusable and rebuild them.
  • Back up the unrecoverable tablespaces as soon as you can after your nologging operations (see the sketch after this list).
  • If you need redo for other goals (such as a standby database), use force logging.
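A minimal sketch of those last two points, reusing this post's DEMO tablespace:

RMAN> report unrecoverable;
RMAN> backup tablespace demo;
RMAN> alter database force logging;

Note that FORCE LOGGING makes the database ignore NOLOGGING attributes, so the bulk operations will generate full redo again.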

Oracle APEX: redirect all pages to one

Fri, 2014-03-21 16:51

This blog will explain how to proceed in order to redirect all the application pages to one single page, prior to their processing and rendering, with Oracle APEX 4.0 and above.

In some cases, like security checks, you may want to redirect all calls of any page in your APEX application to one single page, with particular processing, before the target page is rendered or even processed.

I ran into this need again recently while developing an application.

 

The problem to solve:

The application user has to change his password after the first login or after a password reset. No page other than the password change page should be accessible until the change succeeds. This page is part of the standard administration, as the user can also choose to change his password at any time. It therefore already contains links to navigate within the application, which needs to be prevented. In addition, any page call within the application from direct URL entry must be prevented as well (no cheating allowed).

 

Where would you put some code, in APEX, which needs to be accessed by all pages in the application?

Natural answer would be Page 0, also known as Global Page.

So what? Let's put some branch into the 'Before Header' part of the global page, which would redirect as soon as any page rendering starts, preventing the display of the current page.

Unfortunately, this doesn't work...

Why? Simply because APEX doesn't handle it. The first indicator is that you can't access any branching in the 'Before Header' section of the global page, unlike any normal page.

 

GlobalPage_NoBranch_Compare

 

You would have to use a trick, via the 'Create - Add a page control to this page' wizard, to add that kind of branching. So you shouldn't do it...

 

What else can we do?

APEX offers another nice feature called by any page of the application: Application Processes.

And that's the key. The branching does not happen as easily as when done declaratively in a page, but adding these few lines of PL/SQL to the process will do it:

 

  htp.init;
  owa_util.redirect_url('f?p=&APP_ID.:CHGPWD:&SESSION.');
  apex_application.stop_apex_engine;

 

In order for this to work, you also need to tell it when to run. In our case: 'On Load: Before Header (page template header)'.

 

ApplicationProcess_start_point

 

This application process will do the job, but it is even overeager, because the target page itself will end up in an endless redirect loop. So you need to set a condition on that process so that it is NOT called when the current page is the target page. In my case I have also excluded the login page of the application, as the user is allowed to get back to it even though the password has not been changed yet. In order to combine the page check with the user state verification, the execution of the application process is linked to the result of a function returning a boolean, as shown in the condition below.

 

Condition

 

Below is the code of the function called. To keep it clean and avoid hard-coded page numbers, you can use page aliases and retrieve the page number as shown:

 

 FUNCTION is_change_password_required
    RETURN BOOLEAN IS
      v_username VARCHAR2(30);
      v_chg_pwd_required BOOLEAN;
      v_found NUMBER;
    BEGIN
      v_chg_pwd_required := TRUE;
      v_username := NVL(v('APP_USER'), USER);
     
      -- Check if current page is Login or change password
      -- to enforce login and avoid looping on change password
      SELECT COUNT(*) INTO v_found
      FROM apex_application_pages
      WHERE application_id = v('APP_ID')
        AND page_id = v('APP_PAGE_ID')
        AND page_alias IN ('LOGIN', 'CHGPWD');
     
      IF v_found > 0 THEN
        v_chg_pwd_required := FALSE;
      ELSE
        v_chg_pwd_required := is_new_or_reset_password(v_username);
      END IF;
     
      RETURN v_chg_pwd_required;
     
    END is_change_password_required;

 

I hope you will enjoy this solution for your own password change management or other purpose.

SQL Server 2014: Buffer pool extension - an interesting new feature

Thu, 2014-03-20 02:15

Buffer pool extension (aka BPE) is an interesting feature of SQL Server 2014. Database administrators already know that memory is one of the key assets available to SQL Server. This is especially true for buffer pool memory, which plays an important role in storing data pages and reducing the I/O demand. Nowadays, even if memory has become cheaper, we often face limitations on mid-size hardware configurations: increasing memory can be difficult and may require an expensive hardware upgrade. At the same time, increasing disk capacity is no longer a challenge and can be handled more flexibly. Furthermore, solid-state drive technology has become relatively affordable over time. In short, adding disks to a server can be more cost-effective than adding memory for mid-size hardware configurations. BPE was designed to address this issue by using non-volatile storage to extend the buffer pool.

As mentioned by Microsoft the buffer pool extension provides benefits for specific cases:

  • Increased random I/O throughput. Indeed, flash-based storage performs well with random IO but sequential IO can be less efficient than using a disk-based storage
  • Reduced I/O latency by using a smart L1 and L2 caches with the buffer manager. Pages are moved between the L2 cache (SSD) and L1 cache (memory) by the buffer manager before it has to fetch the page from the disk with generally more latency.
  • Only clean pages are written to the BPE. This ensures no data loss.

When I investigated this feature in more detail, I had the opportunity to read the official documentation, the blog posts already published by Microsoft, and the very interesting article Turbocharging DBMS Buffer Pool Using SSDs written by Do, J., Zhang, D., Patel, J., DeWitt, D., Naughton, J., and A. Halverson. I also listened to one of the interviews of Evgeny Krivosheev (SQL Server Program Manager) about the buffer pool extension feature. I would like to point out that there are blog posts on the web which claim that some workloads work better with the buffer pool extension feature. Of course, the quick answer is yes, for the reasons described above, but that is not my goal here. This post is just an opportunity to share some of my investigations with you - but first, a warning: we are still talking about CTP2 of SQL Server 2014, and some of the tools I used do not work perfectly. We will see this later in the post.

The first thing I noticed is the decision to store data on the SSD at page granularity rather than at table or file granularity. This point is important because different pages of the same table can live either in memory, on the SSD, or on disk. As mentioned above, only clean pages are written to the buffer pool extension. There is still one exception, which consists of writing a modified (dirty) page both into the database mdf file and into the BPE. This is a kind of optimization called dual-write optimization.
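To see where the pages of a given table currently live, a query along these lines can be used (my own sketch, joining the buffer descriptors to the allocation metadata; LOB allocation units are ignored for simplicity):

-- Run in the user database (e.g. AdventureWorks2012):
-- cached pages per object, split between RAM and the BPE file
SELECT
       o.name AS object_name,
       CASE bd.is_in_bpool_extension WHEN 1 THEN 'SSD' ELSE 'RAM' END AS location,
       COUNT_BIG(*) AS nb_pages
FROM sys.dm_os_buffer_descriptors AS bd
       JOIN sys.allocation_units AS au
             ON au.allocation_unit_id = bd.allocation_unit_id
       JOIN sys.partitions AS p
             ON p.hobt_id = au.container_id AND au.type IN (1, 3)
       JOIN sys.objects AS o
             ON o.object_id = p.object_id
WHERE bd.database_id = DB_ID()
GROUP BY o.name, bd.is_in_bpool_extension
ORDER BY o.name, location;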

Let's start with my first test, in which I used a custom AdventureWorks2012 database with two bigTransactionHistory tables. Here are some details about the size and number of rows of each of those tables:

The bigTransactionHistory table contains approximately 31 million rows for a size of roughly 2 GB (data and indexes).

 

USE AdventureWorks2012; GO   EXEC sp_spaceused'dbo.bigTransactionHistory'; GO

 

billet4_1_adventureworks_bigtransactionhistory_size

 

EXEC sp_spaceused'dbo.bigTransactionHistory_2'; GO

 

billet4_2_adventureworks_bigtransactionhistory_2_size

 

In my test, the AdventureWorks2012 database is configured as follows:

 

USE AdventureWorks2012; GO   SELECT        name,        physical_name FROM sys.database_files; GO

 

billet4_3_adventureworks_db_configuration

 

The mdf and ldf files are stored both on a slow drive (E:) which is a WDC USB disk (5200 RPM) with SATA/300. This is not an optimal configuration but for this test it's ok.

The buffer pool maximum size on my SQL Server instance is deliberately capped at 700 MB, as sketched below.
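For reference, a minimal sketch of how such a cap can be set (assuming 'show advanced options' is already enabled on the instance):

EXEC sp_configure 'max server memory (MB)', 700;
RECONFIGURE;
GO

The configured value is then verified with the sys.configurations query below.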

 

SELECT        name,        value_in_use FROM sys.configurations WHERE name = 'max server memory (MB)'

 

billet4_5_bpe_configuration

 

Finally, I set up the maximum size of the buffer pool extension to 2 GB:

 

ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
(
       FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\ssdbuffer_pool.bpe',
       SIZE = 2048 MB
);
GO

SELECT
       [path],
       state_description,
       current_size_in_kb / 1024 AS current_size_mb
FROM sys.dm_os_buffer_pool_extension_configuration;

 

billet4_5_bpe_configuration

 

The buffer pool extension file is stored on fast solid-state drive (C:) which is a Samsung SSD 840 EVO with a SATA/600

At this point, we can start the first test by loading the entire bigTransactionHistory_2 table into the buffer pool with a cold cache. At first, data pages are stored only in volatile memory (RAM), as expected after clearing the buffer cache.

 

-- Clean buffer pool + BPE
DBCC DROPCLEANBUFFERS;
GO

-- Check pages in BP for the AdventureWorks2012 database
SELECT
       CASE is_in_bpool_extension
             WHEN 1 THEN 'SSD'
             ELSE 'RAM'
       END AS location,
       COUNT(*) AS nb_pages,
       COUNT(*) * 8 / 1024 AS size_in_mb,
       COUNT(*) * 100. / (SELECT COUNT(*) FROM sys.dm_os_buffer_descriptors) AS percent_
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID('AdventureWorks2012')
GROUP BY is_in_bpool_extension;

 

billet4_6_bpe_configuration

 

Before the load, I also created an extended events session in order to track the page movements between the non-volatile storage (SSD) and the volatile memory (RAM) for the buffer cache. During my test, I found out that some events are not tracked correctly (for example sqlserver.buffer_pool_extension_pages_read), probably because we are still in CTP2 of SQL Server 2014 …

 

CREATE EVENT SESSION [buffer_pool_extension] ON SERVER
ADD EVENT sqlserver.buffer_pool_eviction_thresholds_recalculated,
ADD EVENT sqlserver.buffer_pool_extension_pages_evicted,
ADD EVENT sqlserver.buffer_pool_extension_pages_read,
ADD EVENT sqlserver.buffer_pool_extension_pages_written
ADD TARGET package0.ring_buffer
WITH (
       MAX_MEMORY = 4096 KB,
       EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
       MAX_DISPATCH_LATENCY = 30 SECONDS
);
GO

ALTER EVENT SESSION [buffer_pool_extension] ON SERVER
STATE = START;
GO

 

After loading the bigTransactionHistory_2 table, data pages exist both in memory and in the buffer pool extension. The bigTransactionHistory_2 table doesn't fit entirely in the allocated volatile memory (RAM); this is why we find some pages (64% of the total) in the buffer pool extension. However, we notice that SQL Server did not fill all the volatile memory with bigTransactionHistory_2 pages before spilling to the buffer pool extension. The algorithm is smart enough to leave space in RAM for future requests. Although the solid-state drive is fast, RAM is still faster (nanoseconds vs microseconds).

 

billet4_7_pages_in_bpe

 

Then, if we take a look at the pages in the buffer pool extension in more details with the same dynamic management view sys.dm_os_buffer_descriptors, we retrieve some page information like page id and file id - but nothing more, maybe because we are still in the CTP version …

SELECT * FROM sys.dm_os_buffer_descriptors WHERE is_in_bpool_extension = 1;

 

billet4_8_pages_in_bpe_2

 

Don't panic! There are other ways to find out which pages in the buffer pool extension belong to bigTransactionHistory_2. For example, we can correlate the information from the new dynamic management function sys.dm_db_database_page_allocation() (which replaces the undocumented but useful DBCC IND command since SQL Server 2012) with the information provided by the dynamic management view sys.dm_os_buffer_descriptors. First, I store the list of pages allocated to bigTransactionHistory_2 into a tempdb table. Then we clear the buffer pool (since gathering this list loads data pages into the buffer cache and evicts other pages concerned by our first test) and reload the table before joining with sys.dm_os_buffer_descriptors.

 

SET NOCOUNT ON;

SELECT allocated_page_page_id
INTO #allocation_page_big_TransactionHistory_2
FROM sys.dm_db_database_page_allocation(DB_ID(), OBJECT_ID('dbo.bigTransactionHistory_2'), NULL, NULL, 'DETAILED')
WHERE is_allocated = 1;
GO

DBCC DROPCLEANBUFFERS;
GO

SELECT * FROM AdventureWorks2012.dbo.bigTransactionHistory_2;
GO

SELECT
       CASE is_in_bpool_extension
             WHEN 1 THEN 'SSD'
             ELSE 'RAM'
       END AS location,
       COUNT(*) AS nb_pages,
       COUNT(*) * 8 / 1024 AS size_in_mb
FROM #allocation_page_big_TransactionHistory_2 AS tmp
       JOIN sys.dm_os_buffer_descriptors AS bd
             ON tmp.allocated_page_page_id = bd.page_id
GROUP BY is_in_bpool_extension;

 

billet4_9_pages_in_bpe_3

 

So, we can verify that the majority of data pages in the buffer pool and the buffer pool extension are concerned by the bigTransactionHistory_2. Now let’s take a look at the extended events session ring buffer target:

 

WITH xe_session_bpe AS
(
       SELECT
             CAST(target_data AS XML) AS target_data_xml
       FROM sys.dm_xe_sessions AS s
             JOIN sys.dm_xe_session_targets AS st
                   ON s.[address] = st.event_session_address
       WHERE s.name = 'buffer_pool_extension'
)
SELECT
       N.X.value('(./@timestamp)[1]', 'datetime') AS event_date,
       N.X.value('(./@name)[1]', 'varchar(100)') AS event_type,
       N.X.value('(./data/value)[1]', 'int') AS nb_pages,
       N.X.value('(./data/value)[2]', 'bigint') AS first_page_file_id,
       N.X.value('(./data/value)[3]', 'bigint') AS first_page_id
FROM xe_session_bpe AS bpe
CROSS APPLY bpe.target_data_xml.nodes('/RingBufferTarget/event') AS N(X);
GO


We can observe that during the load of the bigTransactionHistory_2 table, data pages have been written into the buffer pool extension (event type = buffer_pool_extension_pages_written). Sometimes only one page is written, sometimes a bunch of pages are written at once.


billet4_10_xe


We can also check whether the page with ID 394453 in the first file of the AdventureWorks2012 database (file id = 1) belongs to the bigTransactionHistory_2 table:

 

SELECT
       OBJECT_NAME(object_id) AS table_name
FROM sys.dm_db_database_page_allocation(DB_ID(), OBJECT_ID('dbo.bigTransactionHistory_2'), NULL, NULL, 'LIMITED')
WHERE allocated_page_file_id = 1
       AND allocated_page_page_id = 394453;


billet4_11_bpe_page_id_check

      

After this first test, we can push things a little bit further and find out whether a second load of the same table triggers reads of data pages from the buffer pool extension. I used the following perfmon counters (they can also be queried from T-SQL, as sketched after the list):

  • SQL Server:Buffer Manager:Extension page reads/sec
  • SQL Server:Buffer Manager:Extension page writes/sec
  • SQL Server:Buffer Manager:Page reads/sec
  • SQL Server:Buffer Manager:Page writes/sec
  • SQL Server:Readahead pages/sec
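A minimal sketch of that T-SQL, against sys.dm_os_performance_counters (keep in mind that these /sec counters are cumulative values that must be sampled twice to compute a rate):

SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
      AND counter_name IN ('Extension page reads/sec',
                           'Extension page writes/sec',
                           'Page reads/sec',
                           'Page writes/sec',
                           'Readahead pages/sec');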

 

-> Reading the bigTransactionHistory_2 table with a cold cache:

 

billet4_12_bpe_perfmon_1

 

 

Phase 1: After the table load starts, we observe that SQL Server fetches data pages from disk into the buffer cache. SQL Server also uses its read-ahead mechanism, based on the index allocation map pages of bigTransactionHistory_2. This is why the buffer cache hit ratio stays at a constant value (98%) during the whole load. At the end of phase 1, the page life expectancy counter drops quickly because the buffer pool (in volatile memory) is full.

Phase 2: During this phase, the buffer pool extension takes over. Data pages are written into the file on the solid-state drive (Extension page writes/sec).


-> Reading the bigTransactionHistory_2 with a warm cache:


billet4_12_bpe_perfmon_2


The buffer pool extension is used as expected when we read the bigTransactionHistory_2 table a second time. Data pages are already present both in the buffer pool (volatile memory) and in the buffer pool extension on the solid-state drive. Furthermore, there are far fewer reads from the mdf file than during the first read of bigTransactionHistory_2 with a cold cache. However, even though the buffer pool extension is used correctly, I notice that reading the bigTransactionHistory_2 table from a warm cache takes longer than reading the same table from a cold cache (1 minute and 4 seconds versus 39 seconds). This is a very strange result from my point of view, even though sequential IO is used in both cases (scanning the bigTransactionHistory_2 table triggers the read-ahead mechanism, which itself uses sequential IO by design).

I decided to run the same tests again and take a look at the resource monitor and the disk usage of sqlservr.exe. Here are the results:

-> With a cold cache (disk E: is used by SQL Server) :
The IO write throughput of the disk E is approximately 10 MB/s


billet4_13_reosurce_monitor_1


-> With a warm cache (disk C: is used by SQL Server):


billet4_13_reosurce_monitor_2


 

The IO write throughput of disk C: is approximately 7 MB/s. According to what we have seen, we could conclude that the longer duration of the second test with a warm cache is caused by the lower throughput observed on the solid-state drive C:. However, this throughput looked strange, so I decided to verify whether it is really the maximum throughput I can get from my solid-state drive by using CrystalDiskMark.

-> Disk E: (disk-based storage) : 33 MB / s

 

billet4_crystaldiskmark_mssql_data

 

-> Disk C: (flash-based storage) : 308 MB / s

 

billet4_crystaldiskmark_bpe_data

 

You can see that we are far from reaching the maximum throughput of the solid-state drive. A mystery to be solved with Microsoft … Finally, I decided to launch a last test by using both tables bigTransactionHistory and the bigTransactionHistory_2 this time. The total size of the two tables is bigger than the total size of the buffer pool and its extension (3 GB versus 2.7 GB approximately).

 

billet4_14_perfmon

 

The buffer pool extension is also used as a warm cache in this case (Extension page reads/sec). However, data pages are also written to the buffer pool extension at the same time (Extension page writes/sec), because all the data pages cannot fit in the buffer pool and the buffer pool extension together. We can see that data pages continue to be read from the mdf file during the test (Readahead pages/sec and Page reads/sec). According to the paper Turbocharging DBMS Buffer Pool Using SSDs, I guess SQL Server uses the LRU-2 algorithm to evict pages while reading both tables. If we take a closer look at the result of the extended events target, we can see a new event triggered during the test: buffer_pool_eviction_thresholds_recalculated.


billet4_15_xevent_eviction_threshold


When the buffer pool extension begins to fill up, SQL Server decides which pages will be evicted by dynamically calculating two thresholds. Those thresholds determine the movement of data pages between the buffer pool (volatile memory), the buffer pool extension (solid-state drive) and the database data file (disk-based storage). As you can imagine, the location of a data page depends on its "temperature". During my test, I noticed that the buffer_pool_eviction_thresholds_recalculated event does not provide all of its action values, like number_evictions_to_bpool_extension or page_not_on_LRU, which could be very useful for troubleshooting or for understanding how eviction occurs. Again, I guess this may be due to the CTP2 of SQL Server … Please feel free to share your experience about testing the buffer pool extension!

Investigating Oracle lock issues with event 10704

Fri, 2014-03-14 02:44

Did you ever encounter unexplained Oracle lock issues? They may be coming from unindexed foreign keys (which is worse in 11g). It's not easy to monitor. Of course you can check Oracle locks from V$LOCKED_OBJECT, but that is a solution only for locks that remain. Some Oracle locks are there only for a short duration. How do you check which lock is acquired by a statement?

Event 10704 is the solution. I'll show some examples in order to explain which lines from the trace are interesting, and how to interpret them.
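For reference, a minimal sketch of how such a trace can be enabled and disabled for the current session (level 3 is a commonly used level for this event; adjust to your needs):

SQL> alter session set events '10704 trace name context forever, level 3';
SQL> -- run the statements to investigate, then:
SQL> alter session set events '10704 trace name context off';

The lock information is then written to the session trace file in the diagnostic trace directory.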

Troubleshooting: ORA-29283 when oracle is member of a group with read/write privileges

Mon, 2014-03-10 22:53

Recently, I worked on an Oracle database migration for a customer and I faced a strange behavior when accessing a file stored on the file system of the Oracle database. The database has been migrated on a new server, and consolidated with existing databases. Several jobs are running into the database and need to access the file system to read or write some files. A particular file is received daily from a remote job over FTP. This file is owned by the "dc_ftp" user in /home/dc_ftp.

 

First, this is the directory where the file is located:

 

SQL> select directory_name, directory_path from dba_directories where directory_name='MYDIR';
DIRECTORY_NAME            DIRECTORY_PATH
------------------------- -------------------------------------------------------
MYDIR                     /home/dc_ftp/out

 

The file is named myfile.txt and exists in /home/dc_ftp/out:

 

dc_ftp@srvora01$ ls /home/dc_ftp/out/
myfile.txt

 

During functional tests, I was able to successfully open a file in this directory from the database:

 

SQL> declare
  p_file utl_file.file_type;
begin
  p_file := utl_file.fopen ('MYDIR', 'myfile.txt', 'w');
end;
/
 
PL/SQL procedure successfully completed.

 

But members of the application team, connected from SQL Developer, were not able to read this file:

 

SQL> declare
  p_file utl_file.file_type;
begin
  p_file := utl_file.fopen ('MYDIR', 'myfile.txt', 'r');
end;
/  2    3    4    5    6
declare
*
ERROR at line 1:
ORA-29283: invalid file operation
ORA-06512: at "SYS.UTL_FILE", line 536
ORA-29283: invalid file operation
ORA-06512: at line 4

 

My first hypotheses, such as misconfiguration in SQL Developer tool, wrong entries in the TNS Names, or wrong permissions on the listener binaries, were quickly eliminated.

 

Both instance and host are correct, so the TNS entry seems to be OK:

 

SQL> select instance_name, host_name from v$instance;
INSTANCE_NAME    HOST_NAME
---------------- ----------------------------------------------------------------
DB11G            srvora01

 

The listener is owned by oracle, as expected:

 

oracle@srvora01$  ps -ef | grep tnslsnr
oracle    9684     1  0 16:33 ?        00:00:01 /u00/app/oracle/product/11.2.0/db_3_0/bin/tnslsnr LISTENER -inherit

 

And to finish, the problem occurs with or without SQL Developer.

 

The next step was to check permissions. First at the DB level:

 

SQL> select grantee, privilege, table_name from dba_tab_privs where table_name='MYDIR';
GRANTEE                        PRIVILEGE                                TABLE_NAME
------------------------------ ---------------------------------------- ---------------
SYSTEM                         WRITE                                    MYDIR
SYSTEM                         READ                                     MYDIR

 

Then at the operating system level:

oracle@srvora01$ ls -altr /home/dc_ftp/out
total 12
-rw-r-----. 1 dc_ftp dc_ftp   16 Feb 24 16:48 myfile.txt
drwxr-x---. 2 dc_ftp dc_ftp 4096 Feb 24 16:48 .
drwxr-xr-x. 5 dc_ftp dc_ftp 4096 Feb 24 16:54 ..

 

We can see that the files and the directory are not directly owned by oracle, but oracle is a member of the dc_ftp group. It gets the read privilege on the file through the dc_ftp group:

 

oracle @ srvora01$ id oracle
uid=54321(oracle) gid=54321(oinstall) groups=54321(oinstall),54322(dba),54325(oper),54330(dc_ftp)

 

Finally, the oracle user can clearly access the file from an operating system point of view, since it can run the "cat" command on the file:

 

oracle@srvora01$ cat /home/dc_ftp/out/myfile.txt
test
testéàè

 

Now let's see which problem occurs when Oracle tries to access the  directory from the session.

 

SQL> set define #
SQL> column spid new_value unix_pid
SQL> select spid from v$process p join v$session s on p.addr=s.paddr and s.sid=sys_context('userenv','sid');
SPID
------------------------
10914

 

SQL> host strace -e trace=open -p #unix_pid & echo $! > .tmp.pid
Process 10914 attached - interrupt to quit

 

SQL> declare
  p_file utl_file.file_type;
begin
  p_file := utl_file.fopen ('MYDIR', 'myfile2.txt', 'w');
end;
/
  2    3    4    5    6 
 
declare
*
ERROR at line 1:
ORA-29283: invalid file operation
ORA-06512: at "SYS.UTL_FILE", line 536
ORA-29283: invalid file operation
ORA-06512: at line 4

 

open("/home/dc_ftp/out/myfile2.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = -1 EACCES (Permission denied)

 

The previous message clearly indicates that it is a permission issue. Oracle has no privilege to access the directory to read or write...

We can verify that it works when the file myfile.txt is made readable by users outside the owner and group, i.e. when setting the permissions to 644:

 

root@srvora01# chmod 644 /home/dc_ftp/out/myfile.txt
root@srvora01# ls -altr /home/dc_ftp/out
total 12
-rw-r--r--. 1 dc_ftp dc_ftp   16 Feb 24 16:48 myfile.txt
drwxr-xr-x. 2 dc_ftp dc_ftp 4096 Feb 24 16:48 .
drwxr-xr-x. 5 dc_ftp dc_ftp 4096 Feb 24 16:54 ..

 

oracle@srvora01$ sqlplus system/****@DB11G

 

SQL> declare
  p_file utl_file.file_type;
begin
  p_file := utl_file.fopen ('MYDIR', 'myfile.txt', 'r');
end;
/  2    3    4    5    6
 
PL/SQL procedure successfully completed.

 

This is the workaround.

But what if the file must not be readable by the rest of the world? The file is generated with 640 permissions for a good reason here, so I had to fix the issue properly.

Fortunately, I came across a Metalink note which helped me find the solution to my problem. According to note ID 832323.1, if the file or directory is accessed through a group that oracle is a member of (here the dc_ftp group has read privileges on the file), adding oracle to that group while the database is running may cause the ORA-29283 error to occur.

Fine! But the database was restarted several times during the migration process, because of some initialization parameters to apply. I was totally sure that the database had been restarted after the oracle user had been added to the dc_ftp group! Moreover, I got no error when connected locally on the server. The error only occurred when connected remotely through SQL*Net, using sqlplus system/****@DB11G for instance, that is through the listener...

That's when I had an idea. If the database processes start with the oracle user's permissions, and if the database needs to be restarted for group changes to take effect, then the listener must be affected too! And the listener was never restarted during the migration: it was only reloaded to register the new instance. This would explain why only sessions opened through the listener are affected...

I restarted the listener and... all problems went away!!

 

oracle@srvora01$ lsnrctl stop listener
oracle@srvora01$ lsnrctl start listener
oracle@srvora01$ sqlplus system/*****@DB11G

 

SQL> declare
  p_file utl_file.file_type;
begin
  p_file := utl_file.fopen ('MYDIR', 'myfile.txt', 'r');
end;
/  2    3    4    5    6
 
PL/SQL procedure successfully completed.

 

To summarize, do not forget that the listener is also impacted when you change group memberships or permissions while it is running. It is better to restart it rather than simply reload it when you make these kinds of modifications.

SQL Server 2014: Deleting files of a hekaton table

Thu, 2014-03-06 00:49

A recurring question I have often heard about Hekaton objects is the following: is an accidental deletion of the compilation files of a hekaton table on the file system irreversible, and could it compromise the execution of SQL Server?

To check the SQL Server behaviour in such a situation, we can perform the following test with an in-memory optimized table:

 

CREATE TABLE [dbo].[StorageTestTable]
(
       [c1] [int] NOT NULL,
       [c2] [char](100) COLLATE Latin1_General_100_BIN2 NOT NULL,
       CONSTRAINT [index91113] PRIMARY KEY NONCLUSTERED HASH ([c1]) WITH (BUCKET_COUNT = 1048576)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
GO

During the creation of the hekaton table, several files are also created on the file system. We can easily identify that the files belong to the hekaton table by using the naming convention:

xtp_[database_id]_[object_id]

 

billet_3_procmon_create_files_with_compil_5

 

The database id is 8 and the object id is 277576027. We can retrieve the object name by using T-SQL:

 

SELECT
    name AS table_name,
    type_desc,
    is_memory_optimized,
    durability_desc
FROM sys.tables
WHERE [object_id] = 277576027;
GO

 

 billet_3_select_object_name

 

Now, let’s try to delete all the files by using the command line del:

 

billet_3_cmd_delete_file

 

We can see that all the files except the dll have been deleted. If we take a look at this file with the Process Explorer tool from Sysinternals to see whether the dll is held open by a specific application, we notice that SQL Server is holding a handle on it (as expected).


billet_3_procexplorer

 

 

At this point, if we try to select data from the Hekaton table dbo.StorageTestTable, we notice it still works … that’s great !


SELECT * FROM dbo.StorageTestTable

 

billet_3_select_data_hekaton_table


 

Let's try something else and delete the dll file after shutting down the SQL Server instance.

 

SHUTDOWN WITH NOWAIT;


This time, there is no "access denied" error message.

 

billet_3_delete_dll_file

 

Now it’s time to restart the SQL Server instance and then take a look at the folder. What a surprise! All files have been recreated …

 

billet_3_procmon_create_files_with_compil_5


 

... and if we try to select data from the corresponding hekaton table a second time, we can see it still works. But how? In fact, SQL Server only needs the dll file associated with the hekaton table to use it. Each time the SQL Server instance or the database is restarted, the hekaton table dll is recompiled and injected into the SQL Server process. To do this, SQL Server uses both a C compiler and a linker located under the Binn\Xtp\VC\bin folder of the instance directory. On my lab machine, it is:

C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Binn\Xtp\VC\bin


billet_3_cl_link_exe
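To check which hekaton dlls are currently loaded by the instance, a query along these lines can help (my own sketch; it relies on the description exposed by sys.dm_os_loaded_modules for natively compiled modules):

SELECT name, description
FROM sys.dm_os_loaded_modules
WHERE description = 'XTP Native DLL';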


 

As a reminder, creating a dll in C is a multistage process divided into two steps: compilation and linking. We can check how SQL Server compiles and generates the different files of the hekaton table during the start / restart phase by using the procmon tool from Sysinternals. We just need to filter on the process names sqlservr.exe, cl.exe and link.exe.

First, when a Hekaton database is shut down, the hekaton table files are deleted, as we can see in the following picture:

 

billet_3_procmon_delete_files

 

The procmon tool traces give us the following result:

 

billet_3_procmon_close_files


The CreateFile() function is called with the following desired access:

 

billet_3_procmon_create_file_delete_flag


...

The Delete flag tells us that the file will be deleted as soon as all of its handles are closed (the CloseFile() calls issued later). This means that all the hekaton table compilation files are opened with this specific flag and are therefore deleted during the database shutdown.

Next, during the restart of the SQL Server instance, a procmon capture gives us an interesting picture. For convenience, I've broken down the result into three pictures with explanations:

 

billet_3_procmon_create_files_with_compil

 

...

 

 billet_3_procmon_create_files_with_compil_1

 

...

 

billet_3_procmon_create_files_with_compil_2

 

SQL Server recreates the folder hierarchy and the files associated with the hekaton table (disposition = Create), but that's not all. Later in the procmon trace, we can see that the C compiler is called by the sqlservr.exe process, as shown in the picture below:

 

 billet_3_procmon_create_files_with_compil_3

 

In my test screenshot above, the process ID (PID) 1144 is my SQL Server instance. Finally, we can see further down that the compiler calls the linker; process ID (PID) 1560 is the C compiler (cl.exe).

 

billet_3_procmon_create_files_with_compil_4

 

Once my SQL Server instance is up and running, all the files are recreated. What we have seen is in fact a classic C compilation process. My purpose is not to give a C compilation course, but just enough for you to understand how the dll is built. Here is a very simplified schema of the hekaton dll compilation:

Source code file (.c) --> compiler (cl.exe) --> object file (.obj) --> linker (link.exe) --> dynamic library (.dll)


billet_3_procmon_create_files_with_compil_5

 

What about the other files? The pdb file is created by the linker and is related to the target executable or DLL. It contains the complete debug information, among which the symbols that we can use with windbg (one of my favorite tools for trying to understand how SQL Server works). The file with the .out extension is a verbose log file that contains messages for troubleshooting. Finally, the xml file contains the MAT (Mixed Abstract Tree) representation, which SQL Server transforms into a PIT (Pure Imperative Tree) representation in order to map SQL-like data types to C-like data types; we find the result of that transformation in the C file.

What can we conclude, now that we have seen that the SQL Server process only uses the dll file during its execution? I would say it's good news, because an accidental deletion of these files will not compromise the correct execution of SQL Server and hekaton objects. However, as usual, let's keep good habits: don't touch them, it's still a bad idea!

Oracle 12c In-Memory option: waiting for 12.1.0.2

Tue, 2014-03-04 03:31

Oracle has announced a big feature that should come in the next 12c patch set: the In-Memory option. It will soon be in beta version, so nothing will be published about it until it comes to production. Before that phase, I'm going to explain what we can expect from that feature, besides the 'In-Memory','columnar storage', and 'hybrid' buzzwords.

First, it is redundant storage aimed at improving query performance.
But this is not new: indexes and materialized views have the same goal.

Second: Redundant data will be stored In-Memory and will be optimized for it.
That is not new either if we consider TimesTen. TimesTen is an In-Memory relational database that can also be used in front of Oracle Database when used as an In-Memory Database Cache.

What is new is that it is a columnar storage. Columnar storage is already used by Oracle for compression (Hybrid Columnar Compression) but here, the goal is different. Very different. HCC uses CPU to save I/O and disk, and is mainly for read-only data. In-Memory option is for actively updated data residing in memory and the goal is to use less CPU. Let's explain this.

I'll use a well-known relational table, SCOTT.EMP, where I highlight the columns and rows that I'll need for my query:

 

EMPNO ENAME  JOB       MGR  HIREDATE   SAL  COMM DEPTNO
----- ------ --------- ---- --------- ---- ----- ------
 7369 SMITH  CLERK     7902 17-DEC-80  800           20
 7499 ALLEN  SALESMAN  7698 20-FEB-81 1600   300     30
 7521 WARD   SALESMAN  7698 22-FEB-81 1250   500     30
 7566 JONES  MANAGER   7839 02-APR-81 2975           20
 7654 MARTIN SALESMAN  7698 28-SEP-81 1250  1400     30
 7698 BLAKE  MANAGER   7839 01-MAY-81 2850           30
 7782 CLARK  MANAGER   7839 09-JUN-81 2450           10
 7788 SCOTT  ANALYST   7566 19-APR-87 3000           20
 7839 KING   PRESIDENT      17-NOV-81 5000           10
 7844 TURNER SALESMAN  7698 08-SEP-81 1500     0     30
 7876 ADAMS  CLERK     7788 23-MAY-87 1100           20
 7900 JAMES  CLERK     7698 03-DEC-81  950           30
 7902 FORD   ANALYST   7566 03-DEC-81 3000           20
 7934 MILLER CLERK     7782 23-JAN-82 1300           10

 

This is a table. It has two dimensions: rows and columns. How do you read it ? There are two ways:

  • Read it by column: I have employee numbers 7369, 7499, 7521, ... Their names are respectively SMITH, ALLEN, WARD, ... The first one is a CLERK, the second and third ones are SALESMEN, ...
  • Read it by row: I have an employee that I identify as number 7369, his name is SMITH, he is a CLERK, his manager is... I have another employee which is number 7499, his name is ALLEN...

Which way do you choose ? You probably read it by row. Because it is easier to think about it when we group the entities we are talking about. When you have a new employee, you fill a form with his name, job, hiredate, etc. When you update some information, you lock the row, saying 'I'm working on employee 7369', and update the attributes you want to change. You do that because you prefer to cluster together the information that are related, it's easier to manage for our mind. Note that when you program, you often have the same logic. You manage objects, you implement CRUD operation to change their attributes. And your computer knows that, so it tries to be more efficient on clustered data, with prefetching or cache.

Relational Database did the same choice when physically manipulating data.
The SQL language let you have both approaches. You update one row, such as:
   UPDATE EMP SET ... WHERE EMPNO=7369;
or you can update one column, such as
   UPDATE EMP SET SAL=SAL*1.1;


But physically, data is stored by row and is manipulated by row, as when you read the table row by row. And it is a very efficient approach when you manipulate several attributes from one row: you do only one I/O to get the block from disk, you store only one block in cache, and the CPU is able to process everything in one block. Great.

But now let's see what happens when you don't need all the columns. Let's say you want to find who has a salary of at least 3000.

SELECT ENAME from EMP where SAL>=3000;

You have to read the whole table (full table scan) and read each block, and in each block read each row, and for each row find the salary, compare it to 3000 and if it is higher, find the name.

Note that I'm not even talking about I/O. Even when my table fits in memory, I have a lot of information to read: I have to transfer each value from memory to the CPU, and use CPU cycles to process each value.

The first problem is that you have read information about 14 employees when you're interested in only 3 of them. So you build an index on SAL. From the index you get the rowids of the employees that have SAL>=3000, and then for each of them you go to the row to get the name. OK, that's not bad, but when you have a lot of queries like that, you will have to create a lot of indexes. You can create an index on each column, but then there is the cost of combining the results. We are on OLTP, actively updated data, so bitmap indexes are not a solution here. Or you can create an index for each kind of filtering you do, but that's a lot of indexes to maintain.

But then there is the second problem. Even if you only have to process the 3 employees you are interested in, you have to read the whole row, which is stored in variable length, from beginning to end, in order to find the few columns that you need. That is: read the row directory, follow the pointer to the row, read the EMPNO length, skip it, read ENAME, get it, ... This is in memory, and it has to be transferred to the CPU to be processed, word by word.

To address that problem, you can add to your index all the columns you're interested in. But then you will have too many indexes to maintain. And you know how hard it is to determine which indexes are useful or not.

Now let's see how that works with the columnar approach we have seen above. Without any indexes, we read the SAL column and the result is that items 8, 9 and 13 are at or above 3000. Then we read the ENAME column and just have to get items 8, 9 and 13. And that's all!

Without any additional structure, we read only what we are interested in. And there is more. On current processors, when we have the same operation to perform on several values, we can do it in one go with a vector operation. Just pass a vector of salaries to the CPU and ask it to compute '>=3000' on all of them. This is known as 'Single Instruction Multiple Data' (SIMD). It means fewer round trips between memory and CPU.

The first CPU operation is a 'find >=3000' on the following vector:

800 1600 1250 2975 1250 2850 2450 3000 5000 1500 1100 950 3000 1300

The second CPU operation is a 'find values from previous result index' on the following vector:

SMITH ALLEN WARD JONES MARTIN BLAKE CLARK SCOTT KING TURNER ADAMS JAMES FORD MILLER

Compare that with the row approach, navigating within the block, each pointer or each column value being a CPU operation... It's hundreds of CPU cycles that will be saved. This is something that was possible before only on supercomputers. Today our processors can do Single Instruction Multiple Data vector processing, and this is why in-memory columnar storage is coming everywhere.

And the volume to transfer can be decreased further when we compress those vectors. Remember when I read the JOB column: the second and third employees are both SALESMAN. This is RLE compression, where you don't repeat the previous value. You can do that within a column, but probably not within a row (or your data model is not that relational, I guess).

This is the goal of columnar storage: increase efficiency of memory-cpu processing for queries that don't need to read the whole row, and without having to create indexes. Oracle has always been able to handle mixed workload quite well (OLTP and reporting in the same database), thanks to the read consistency implementation. So it is a natural evolution to go to the 'hybrid' approach that adds analytic capabilities to pure OLTP data models. Of course, this is not the kind of thing that is easy to implement. The columnar storage must be maintained, and must be optimized for an efficient CPU processing. The values must be easy to compare byte per byte in a single CPU operation, even when compressed. And CPU vector operation often requires that the values have the same size, so this may require fixed size storage instead of the usual variable length storage that reduces space.

So behind the feature that is supposed to accelerate our application just by switching it on, we can expect some complex technology that is able to bring In-Memory capabilities while staying in our Oracle OLTP database. In relational databases, the first goal for years was to optimize the access to the storage. And this is done by regrouping all information related to the same object, in order to get them in one I/O. Now, the goal is to improve the transfer between memory and CPU by regrouping similar information from several objects, in order to process them in vectors by the CPU. This is an orthogonal approach: row storage vs. columnar storage. Oracle will implement both. And we, the DBAs, have to understand how it works in order to use it in the right place.

OOW 2013: Solving customer issues with the 12c Optimizer (III) - Histogram on varchar2 column

Sun, 2014-03-02 23:30

It has been a long time since the last Oracle 12c Optimizer article, but this one is the last in the series about customer issues in Oracle 11g. The goal of that series is to show how the new 12c version of Oracle Database helps solve these issues. This blog post focuses on a well-known Oracle behavior when gathering histograms on a varchar2 column.

The three cases have been presented during the session "Solving critical customer issues with the Oracle 12c Optimizer" at the Oracle Open World 2013, the UKOUG Tech13 and the dbi's 12c appetizers.

SQL Server 2014 : SELECT ALL USERS SECURABLES & DB admins

Wed, 2014-02-26 04:35

Microsoft will introduce four new security permissions in SQL Server 2014. One of them, called SELECT ALL USER SECURABLES, is the subject of this post. As explained by Microsoft, SQL Server 2014 will allow a database administrator to manage data without seeing sensitive data or personally identifiable information. We can achieve greater compliance, but we must be careful about how the phrase "manage without seeing sensitive data" is interpreted. Let me explain.

With a naïve approach we could misunderstand these terms and think: "oh great, with SQL Server 2014 we will prevent a database administrator from seeing any data (sensitive or not)". But in fact, I think (personal point of view) that on reflection this is not exactly the goal of this server permission.

Basically, when granted, the "SELECT ALL USER SECURABLES" permission allows a user to view data in all the databases he can connect to. For auditing purposes the new permission is very interesting, and when denied it can also be used to prevent someone from reading data. Remember that SQL Server 2005 introduced the concept of "securable", and the server itself is one of them. One of the server permissions provided is CONTROL SERVER, often associated with the sysadmin fixed role because it grants full server-level permissions and automatically grants full access to all databases. But unlike the sysadmin fixed role, it allows a more granular approach to granting and denying access to individual securables (a login in our case) without bypassing the permission check algorithm. Thus, applying security principles to the DBA team begins by avoiding the sysadmin role and preferring the CONTROL SERVER permission. At this point, I know we could claim that this is not that easy because CONTROL SERVER has multiple caveats, but let me demonstrate with a practical example how we can use the new server permission "SELECT ALL USER SECURABLES" to restrict database administrators from viewing data in all databases.

First we create a server role (user-defined server roles were introduced with SQL Server 2012) for the database administrator team:

 

CREATE SERVER ROLE dba;
GO

 

Then we create a database administrator login dba1:

 

CREATE LOGIN dba1 WITH PASSWORD = 'dba1', CHECK_POLICY = OFF;
GO

 

We add the login as a member of the dba server role:

 

ALTER SERVER ROLE dba ADD MEMBER dba1;
GO

 

Now it's time to combine the CONTROL SERVER and SELECT ALL USER SECURABLES server permissions to prevent a database administrator from reading data in any database.

 

GRANT CONTROL SERVER TO dba;
GO

DENY SELECT ALL USER SECURABLES TO dba;
GO

 

We can now verify that it works by connecting with the login dba1.

 

SELECT SUSER_NAME() AS login_name, USER_NAME() as [user_name];
GO

 

In my configuration I have a database named SensitiveDB with a table named SensitiveTable.

 

[Image: billet2_newsecuritysql14_1]

 

USE SensitiveDB;
GO

SELECT * FROM INFORMATION_SCHEMA.TABLES;
GO

 

[Image: billet2_newsecuritysql14_2]

 

SELECT * FROM INFORMATION_SCHEMA.COLUMNS;
GO

 

[Image: billet2_newsecuritysql14_3]

 

SELECT
       SPECIFIC_CATALOG,
       SPECIFIC_SCHEMA,
       SPECIFIC_NAME,
       ROUTINE_TYPE
FROM INFORMATION_SCHEMA.ROUTINES;

 

[Image: billet2_newsecuritysql14_4]

 

And so on... At this point we can view all the objects in the database and their definitions. Now let's try to read data from the table SensitiveTable or from the view V_SensitiveTable:

 

SELECT * FROM dbo.SensitiveTable;

SELECT * FROM dbo.V_SensitiveTable;

 

[Image: billet2_newsecuritysql14_5]

 

As expected, we don't have the permission to read data anywhere in the database. Great, we've done a good job: database administrators can manage the SQL Server instance but they cannot read data! But is it really over? The answer is of course no. Why? At this point we have to remember what the CONTROL SERVER permission implies. It grants full permissions on the server and, automatically, on all the databases. It means that a person who has this server permission can grant himself permissions on a securable even if an explicit DENY exists for this securable. It is easy to retrieve which permissions are denied:

 

-- Check if dba1 is member of a server role
SELECT
       sp.name,
       sp.principal_id as login_principal_id,
       sp.type_desc,
       sp2.name AS role_name,
       sp2.principal_id as role_principal_id
FROM sys.server_principals AS sp
       LEFT JOIN sys.server_role_members AS sr
             ON sp.principal_id = sr.member_principal_id
       LEFT JOIN sys.server_principals AS sp2
             ON sp2.principal_id = sr.role_principal_id
WHERE sp.name = 'dba1';

 

[Image: billet2_newsecuritysql14_6]

 

-- Check permissions for all concerned principal_id
SELECT
       grantee_principal_id,
       class_desc,
       [permission_name],
       state_desc
FROM sys.server_permissions
WHERE grantee_principal_id IN (267, 268);

 

[Image: billet2_newsecuritysql14_7]

 

The login dba1 can now grant the "SELECT ALL USER SECURABLES" permission to the dba role ...

 

USE [master];
GO

GRANT SELECT ALL USER SECURABLES TO dba;
GO

 

… and read data in the SensitiveTable

 

[Image: billet2_newsecuritysql14_8]

 

What a surprise! Is it a bug, a mistake, or a Microsoft marketing lie? I admit I had this reaction when I saw it, and I'm pretty sure others did as well! After a long discussion with some other SQL Server MVPs and friends, I don't think so: even if a database administrator does not need to read data to perform his tasks, there can be cases where it is necessary, for example troubleshooting a long-running query with a specific set of data. If you need a more restrictive role, you have to look at granting only the relevant subsets of CONTROL SERVER. I remember a discussion with one of my customers about implementing separation of duties for database administrators, and particularly one question: is the goal to prevent the DBA staff from performing their tasks? Of course not, but we want to audit the action when a database administrator needs to elevate his privileges. Fortunately, SQL Server provides an audit feature which can help to detect a malicious use of such permissions.

We could set up an audit that tracks the GRANT SELECT ALL USER SECURABLES event and records it into the Windows security log or into audit files to which the database administrator has no access. Furthermore, since SQL Server 2012 we can fail an action if it cannot be recorded to the audit target (ON_FAILURE = FAIL_OPERATION).

 

USE [master]
GO

-- Creation of the audit target
CREATE SERVER AUDIT [HIGH_PERMISSIONS_GRANTED]
TO FILE
(
     FILEPATH = N'E:\AUDIT'
     ,MAXSIZE = 1024 MB
     ,MAX_FILES = 10
     ,RESERVE_DISK_SPACE = ON
)
WITH
(
     QUEUE_DELAY = 1000
     ,ON_FAILURE = FAIL_OPERATION
)
WHERE ([statement] LIKE 'GRANT SELECT ALL USER SECURABLES%');
GO

ALTER SERVER AUDIT [HIGH_PERMISSIONS_GRANTED] WITH (STATE = ON);
GO

-- Creation of the audit specification
CREATE SERVER AUDIT SPECIFICATION [HIGH_PERMISSIONS_GRANTED_specification]
FOR SERVER AUDIT [HIGH_PERMISSIONS_GRANTED]
ADD
(
       SERVER_PERMISSION_CHANGE_GROUP
       -- To be completed by other useful actions to track
)
WITH (STATE = ON);
GO

 

Now if dba1 grants the server permission "SELECT ALL USER SECURABLES", the event is automatically recorded to the corresponding audit file:

 

[Image: billet2_newsecuritysql14_9]

 

In this case, the audit trail can be used to quickly detect the malicious use of some grant actions and limit the damage. Of course, behind the scenes, a robust process must be in place before implementing the tool.

To summarize, SQL Server 2014 facilitates the segregation of duties by implementing new server permissions, but segregation of duties does not mean preventing every action in all cases. To be efficient, segregation of duties relies on the implementation of automatic and preventive controls. Feel free to share your opinion on this subject!

Oracle Partitioned Sequences - a future new feature in 12c?

Thu, 2014-02-20 19:46

Disclaimer: I'll talk about an undocumented feature that appeared in Oracle 12c - undocumented except if you consider that being exposed in DBA_SEQUENCES as the PARTITION_COUNT column is a kind of documentation. So, as the syntax is not documented (I got it only by guessing), you should not use it in production: there are probably some reasons why Oracle did not expose an interesting feature like that.

These days, most of your primary keys probably come from a sequence: a generated number that is always increasing, to be sure to have no duplicates. And of course you have a (unique) index on it. But what do you see when there is high activity inserting rows concurrently into that table?

Here is what you can see:

[Image: partitioned-sequence-capture-1_20140220-150723_1.PNG]

That came from a retail database where we did a load test simulating the order entry rate that was expected for production.
The AWR report shows the segments concerned:

[Image: partitioned-sequence-capture-2.PNG]

No doubt: index hot block contention on primary key.

It is a well-known issue. Because the index is sorted, the value coming from the sequence always goes to the last block. All sessions have to access the same block and there is 'buffer busy wait' contention on it. And when you are in RAC, that block has to be updated by the different nodes and it's even worse, showing all those 'gc' wait events.

In order to illustrate the different solutions, I have created a DEMO table with 1 million rows. Then I insert 100,000 additional rows and check how many blocks were touched:

create table DEMO_TABLE (id,n , constraint DEMO_PK primary key(id) ) 
 as select rownum,rownum from 
  (select * from dual connect by level<=1000),
  (select * from dual connect by level <=1000);

create sequence DEMO_SEQUENCE start with 1e7;

insert into DEMO_TABLE select DEMO_SEQUENCE.nextval,rownum 
 from (select * from dual connect by level<=1000),
 (select * from dual connect by level <=100);


And here is the number of index blocks touched:

select object_name,data_object_id,sum(tch) from sys.x$bh 
 join user_objects on user_objects.data_object_id=x$bh.obj 
 where object_name like 'DEMO%' 
 group by object_name,data_object_id order by 1 desc;
OBJECT_NAME   DATA_OBJECT_ID   SUM(TCH)
DEMO_TABLE             97346        243
DEMO_PK                97347        243

And I'm interested in index block splits as well:

select name,value from v$mystat 
 join v$statname using(statistic#) 
 where name like 'leaf node%splits';
NAME                        VALUE
leaf node splits              200
leaf node 90-10 splits        200


I touched only 200 blocks (the size of the index entry is approximately the same as the table row size). The highest block is filled with increasing values, and only when it is full does the insert allocate a new block and move to it. This is optimal for one session, but it is the cause of contention on multi-threaded inserts, because all sessions are touching the same block at the same time.

So what's the solution? Of course, you want to have your primary key values distributed.
A reverse key index? Let's see:

alter index DEMO_PK rebuild reverse;
Index altered.

Then I run the same inserts and here are the statistics about index blocks:

OBJECT_NAME   DATA_OBJECT_ID   SUM(TCH)
DEMO_TABLE             97349        247
DEMO_PK                97352       4392

NAME                        VALUE
leaf node splits              182
leaf node 90-10 splits          0


Now I touched 4,000 blocks (without any block split, because each new value fits in the 5% pctfree left after my index rebuild). Great: no contention. I mean... no buffer contention.

But think about it. Because of my reverse distribution, I now touch all the index blocks. Do they still fit in my buffer cache? Probably not. And one day, when I have more data, I will encounter I/O issues.

If you want to see an illustration of the contention I am talking about here, you can check slides 14 to 19 from the Oracle Real Performance Tour, where Tom Kyte, Andrew Holdsworth & Graham Wood have shown a nice demo of it.
So we want to spread the values over several index blocks, but not over all index blocks. Hash partitioning can be used for that. Let's have 4 partitions:

alter table DEMO_TABLE disable constraint DEMO_PK;
create index DEMO_PK on DEMO_TABLE(id) global partition by hash(id) partitions 4;
alter table DEMO_TABLE enable constraint DEMO_PK;

and the result is quite good:

OBJECT_NAME   DATA_OBJECT_ID   SUM(TCH)
DEMO_TABLE             97353        245
DEMO_PK                97357         76
DEMO_PK                97358         76
DEMO_PK                97359         76
DEMO_PK                97360         76

NAME                        VALUE
leaf node splits              213
leaf node 90-10 splits        213

I've distributed my inserts over 4 partitions, having 4 hot blocks instead of one. This is a way to prevent buffer busy waits when a few concurrent sessions are inserting new values.

But the distribution is done on the value coming from the sequence, so each session will touch each of the 4 hot blocks in turn. Even if this reduces the probability of contention, it is not optimal. And if you go to RAC, you will see those 'gc' wait events again, with the hot blocks being accessed by all nodes.

The actual solution to the problem is not to distribute the inserts based on the value, but to base the distribution key on the session identification. If each session has its own index block to insert into, then all contention is gone.

This is exactly what will be addressed by the 'partitioned sequence' feature that is not (yet) documented.

It generates sequence values in different ranges, and the range depends on the session (probably a hash function of the instance number and the session id).
I go back to my non-partitioned, non-reverse index and alter the sequence as:
alter sequence DEMO_SEQUENCE partition 4;
And here are the statistics about index blocks:

OBJECT_NAME   DATA_OBJECT_ID   SUM(TCH)
DEMO_TABLE             97361        397
DEMO_PK                97364        404

NAME                        VALUE
leaf node splits              351
leaf node 90-10 splits        351


First, you see that my table is larger. This is because the numbers from the partitioned sequence are larger: they are built by prefixing the sequence number with a partition value, which makes the binary representation larger. The table here has only two columns, but on a real table the difference will not be so important. The index is bigger, that's a fact. However, if you compare it with the reverse key index (which has a lot of free space in its blocks), it is much better here. And you can reduce the sequence max value if you want a smaller id. But the very good thing is that instances and sessions will work on different ranges, avoiding block contention, while keeping the index maintenance optimal in the buffer cache.

Here is my sequence:

select sequence_name,min_value,max_value,last_number,partition_count from user_sequences;
SEQUENCE_NAME   MIN_VALUE   MAX_VALUE                      LAST_NUMBER   PARTITION_COUNT
DEMO_SEQUENCE           1   9999999999999999999999999999      10100000                 4

Ok, that feature is really nice, but you can't use it until it is documented in a future release. So you have to do it yourself: concatenate a value hashed from the instance/session in front of the number coming from the sequence. Of course, at the Real Performance Tour the undocumented solution was not mentioned, and the solution presented was prefixing the sequence. Here is just an example:

insert into DEMO_TABLE 
 select 1e30*mod(to_number(sys_context('userenv','sid')),4)+DEMO_SEQUENCE.nextval,...
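Spelled out as a complete, self-contained sketch (reusing the DEMO_TABLE and DEMO_SEQUENCE created above, and filling the second column with rownum as in the earlier insert; the 1e30 multiplier and the modulo 4 spread are just example values, not a recommendation):

-- Sketch only: spread concurrent inserts over 4 ranges derived from the session id,
-- so each session tends to hit its own index leaf blocks.
insert into DEMO_TABLE
 select 1e30*mod(to_number(sys_context('userenv','sid')),4)+DEMO_SEQUENCE.nextval, rownum
 from (select * from dual connect by level<=1000),
      (select * from dual connect by level <=100);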
and here are the index statistics:

OBJECT_NAME   DATA_OBJECT_ID   SUM(TCH)
DEMO_TABLE             97365        351
DEMO_PK                97366        351

NAME                        VALUE
leaf node splits              433
leaf node 90-10 splits        371

So while dreaming about a feature that you may be able to use in the future, you can achieve the same goal if you're ready to change your code. Anyway, achieving scalability and good performance under high load often requires touching the code a bit.

SQL Server 2014: Hekaton memory optimized tables, hash indexes, and bucket counts

Thu, 2014-02-13 23:16

For my first blog post at dbi services, I think it is a good opportunity to begin by discussing SQL Server 2014, Hekaton memory-optimized tables, and hash indexes. When you create a memory-optimized table, you have to consider the number of buckets to reserve for its associated hash index. It's an important configuration aspect because after creating the table, you cannot change the number of buckets without recreating it.

For SQL Server database administrators this is unusual, because this parameter does not exist for ordinary indexes (clustered, nonclustered) or for other special indexes (XML or spatial indexes). Furthermore, Microsoft recommends reserving a number of buckets equal to at least two times the number of distinct index key values in the table! The first time I read the documentation, I asked myself a few questions:

  • Why reserve a fixed number of buckets?
  • Why round the number of buckets to the power of 2?

In order to find the answers, I first had to understand how a hash index works.

Let's start from a basic definition of a hash table: it is a data structure used to implement an unsorted associative array that can map a key to a value. To compute this key, a hash function is used. The hash result is then stored into an array of buckets, as shown in figure 1 (simplified schema).

 

[Image: bucket_hash_index]

 

Notice that the hashed keys are not sorted. For lookup operations it does not matter, but for range scans a hash index does not perform well. In the hash table world, the ideal situation is one key per bucket, allowing constant-time lookups, but in reality this is not the case because of hash collisions (two distinct pieces of data can have the same hash value). However, the hash function should be efficient enough to spread keys into buckets uniformly, to avoid this phenomenon as much as possible.

There are several strategies to resolve hash collisions and the lack of buckets. SQL Server uses a method called separate chaining: when a hash collision occurs, or when the number of buckets is not sufficient, the row is added to the row chain of the selected bucket. As a consequence, the cost of a table operation is that of scanning the entries in the row chain of the selected bucket for the concerned index key. If the distribution is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket, also called the load factor. The load factor is the total number of entries divided by the number of buckets. The larger the load factor, the slower the hash table. It means that for a fixed number of buckets, the lookup time will grow with the number of entries.

 

[Image: bucket_hash_collision]

 

Now let's start with a practical example of a memory-optimized table:

 

CREATE TABLE [dbo].[HekatonTableDurable]
(
       [c1] [int] NOT NULL,
       [c2] [char](100) COLLATE Latin1_General_100_BIN2 NOT NULL,
   CONSTRAINT [pk_HekatonTableDurable] PRIMARY KEY NONCLUSTERED HASH
       (
             [c1]
       ) WITH (BUCKET_COUNT = 1000)
)
WITH
(
       MEMORY_OPTIMIZED = ON,
       DURABILITY = SCHEMA_AND_DATA
)
GO

 

We create a simple memory-optimized table with two columns and 1000 buckets to store the hash index entries. After creating this table, we can check the number of buckets actually allocated, and we notice the final number is not the same but rounded up to the next power of two.

 

SELECT
       OBJECT_NAME(object_id) AS table_name,
       name AS index_name,
       index_id,
       type_desc,
       [bucket_count]
FROM sys.hash_indexes
GO

 

[Image: number_of_buckets_after_creating_hk_table]

 

This fact is very interesting, because using a power of two to determine the position of the bucket pointer in the array is very efficient. Generally, with hash tables, the hash function process is done in two steps to determine the index in the bucket array:

  • hash value = hash function (key)
  • position in the array = hash value % array size

In this case the modulo operator could be expensive, and fortunately it can be replaced by a bitwise AND operator (because the array size is a power of two), which reduces the operation to masking and improves speed. To verify, we can do a simple test with a console application in C#:

 

using System;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            int i = 0;
            int result = 0;
            int maxloop = 1000000000;

            Stopwatch timer = new Stopwatch();

            // Modulo operator
            timer.Start();

            while (i < maxloop)
            {
                result = modulo_operation(i, 4);
                i++;
            }

            timer.Stop();

            Console.WriteLine("With modulo : (ms) ");
            Console.WriteLine(timer.ElapsedMilliseconds);

            i = 0;

            // Bitwise AND
            timer.Restart();

            while (i < maxloop)
            {
                result = bitwiseand(i, 4);
                i++;
            }

            timer.Stop();

            Console.WriteLine("With bitwise and : (ms)");
            Console.WriteLine(timer.ElapsedMilliseconds);

            Console.ReadLine();
        }

        // modulo
        public static int modulo_operation(int x, int y)
        {
            return x % y;
        }

        // bitwise AND (valid because y is a power of two)
        public static int bitwiseand(int x, int y)
        {
            return (x & (y - 1));
        }
    }
}

 

By executing this console application on my dev environment (Intel Core i7-3610QM 2.30 GHz), we can see that the bitwise AND function performs faster than the modulo function over 1 billion iterations:

 

[Image: csharp_test]

 

Then we can play with different combinations of data rows and hash index buckets to understand why the number of buckets is very important.


Test 1: Filling up the table with 500 distinct rows and 1024 buckets


declare @i int = 0;

while @i < 500
begin
       insert HekatonTableDurable values (@i, 'test_' + cast(@i as varchar(50)));
       set @i += 1;
end

 

By using the DMV sys.dm_db_xtp_hash_index_stats, we can see useful information about bucket usage.
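A query along these lines can be used to retrieve those figures (a minimal sketch; the full version joining sys.hash_indexes is shown later in this post):

-- Bucket usage statistics for the hash indexes of our memory-optimized table.
SELECT
       OBJECT_NAME(object_id) AS table_name,
       total_bucket_count,
       empty_bucket_count,
       avg_chain_length,
       max_chain_length
FROM sys.dm_db_xtp_hash_index_stats
WHERE object_id = OBJECT_ID('dbo.HekatonTableDurable');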

 

[Image: hash_index_stats]

 

We retrieve the total number of buckets allocated for the hash index, equal to 1024. Furthermore, we notice that we have 545 empty buckets and consequently 1024 - 545 = 479 buckets in use. With a perfect distribution of key values in the bucket array we should have 500 buckets in use, so we certainly already have some hash collisions. We can verify this in the value of the max_chain_length column, which tells us there is at least one bucket with a row chain length of 2. In other words, we have at least one bucket that contains two rows with the same index hash. Notice that the load factor is equal to 500 / 1024 = 0.49.


Test 2: Filling up the memory-optimized table with a load factor of 3. We will insert 3072 rows for a total bucket count of 1024
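As a minimal sketch of this load (assuming the HekatonTableDurable definition from Test 1, emptied beforehand because memory-optimized tables cannot be truncated in SQL Server 2014):

-- Sketch only: reload the table with 3072 distinct rows (1024 buckets * load factor 3).
DELETE FROM dbo.HekatonTableDurable;
GO

declare @i int = 0;

while @i < 3072
begin
       insert dbo.HekatonTableDurable values (@i, 'test_' + cast(@i as varchar(50)));
       set @i += 1;
end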


 

[Image: number_of_buckets_after_with_load_factor_3_hk_table]


 

As expected, all buckets are used (empty_bucket_count = 0). We have an average row chain length of 3 (avg_chain_length = 3) in the bucket array, equal to the load factor. Some row chain lengths are equal to five, certainly due to hash collisions (max_chain_length = 5).


Test 3: Filling up the memory-optimized table with duplicate rows

 

Before filling up the table, some changes in the schema are required. Let's create a second non-unique hash index (ix_OrderDate):

 

IF OBJECT_ID('dbo.HekatonTableDurable') IS NOT NULL
       DROP TABLE [dbo].[HekatonTableDurable];
GO

CREATE TABLE [dbo].[HekatonTableDurable]
(
       [c1] [int] NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1024),
       [c2] [char](100) COLLATE Latin1_General_100_BIN2 NOT NULL INDEX ix_OrderDate HASH WITH (BUCKET_COUNT = 1024)
)
WITH
(
       MEMORY_OPTIMIZED = ON,
       DURABILITY = SCHEMA_AND_DATA
)
GO

 

Then we can fill up the table with 500 rows that contain some duplicate rows:

 

declare @i int = 0;

while @i < 500
begin
       insert dbo.HekatonTableDurable values (@i, case @i % 4 WHEN 0 THEN 'test_' + cast(@i as varchar(50)) ELSE 'test' END);
       set @i += 1;
end

 

Here is an overview of the number of duplicate rows in the table:

 

select
       c2,
       count(*)
from dbo.HekatonTableDurable
group by c2
order by count(*) desc

 

[Image: number_of_duplicate_rows_after_loading_hk_table]

 

We can now check how the buckets are used for the two hash indexes:

 

select
       object_name(i.object_id) as table_name,
       i.name as index_name,
       his.total_bucket_count,
       his.empty_bucket_count,
       his.avg_chain_length,
       his.max_chain_length
from sys.dm_db_xtp_hash_index_stats as his
       join sys.hash_indexes as i
             on his.index_id = i.index_id
                    and his.object_id = i.object_id
where i.object_id = object_id('dbo.HekatonTableDurable')

 

We retrieve the same figures for the primary key hash index as in the first test (545 empty buckets, hence 479 in use). This is because the hash function used by SQL Server is deterministic: the same index key is always mapped to the same bucket in the array. The index key is relatively well distributed. However, we can see a different result for the second hash index: we only have 119 buckets in use to store 500 rows. Furthermore, we notice that the maximum row chain length is very high and corresponds in fact to the number of duplicate values we inserted (375).


Test 4: Filling up the table with an insufficient number of buckets to uniformly store the hash indexes

 

For this last test, we will request 10 buckets to store 8192 distinct rows (a load factor of roughly 819 based on the requested count; remember that the bucket count will be rounded up to the next power of two).

 

IF OBJECT_ID('dbo.HekatonTableDurable') IS NOT NULL
       DROP TABLE [dbo].[HekatonTableDurable];
GO

CREATE TABLE [dbo].[HekatonTableDurable]
(
       [c1] [int] NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 10),
       [c2] [char](100) COLLATE Latin1_General_100_BIN2 NOT NULL INDEX ix_OrderDate HASH WITH (BUCKET_COUNT = 10)
)
WITH
(
       MEMORY_OPTIMIZED = ON,
       DURABILITY = SCHEMA_AND_DATA
)
GO

 

[Image: number_of_buckets_worse_case_hk_table]

 

In this case, every bucket is in use (empty_bucket_count = 0) and the values of the avg_chain_length and max_chain_length columns are pretty close. This is an indication of a lack of buckets, resulting in very long row chains.

In this post, we have seen the basic concepts of hash tables, the importance of a correct configuration of the bucket count, and how to use the dynamic management view sys.dm_db_xtp_hash_index_stats to get useful statistics for hash indexes.

Delayed Transaction Durability: A new feature in SQL Server 2014 CTP2

Thu, 2014-02-13 06:59

Delayed Transaction Durability is a new SQL Server 2014 feature that was not included in SQL Server 2014 CTP1 but discreetly appeared in CTP2. It is a very interesting functionality that helps reduce the I/O contention on transaction log writes.

In other database engines, you will find equivalents such as:

  • Oracle: 'COMMIT WRITE BATCH WAIT|NOWAIT' (link here)
  • MySQL: 'group_commit' since version 4.x (link here)
  • PostgreSQL: 'group_commit' introduced in version 9.2 (link here)
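Coming back to SQL Server 2014, here is a minimal T-SQL sketch of how the feature is used (the database TestDB and the table dbo.SomeTable are hypothetical placeholders):

-- Allow delayed durability at the database level (DISABLED | ALLOWED | FORCED).
ALTER DATABASE TestDB SET DELAYED_DURABILITY = ALLOWED;
GO

-- Then opt in per transaction: the commit returns before the log block is flushed to disk.
BEGIN TRANSACTION;
       UPDATE dbo.SomeTable SET SomeColumn = 1 WHERE SomeId = 42;
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);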

Oracle and Dbvisit Replicate: migrating without downtime... and without stress

Thu, 2014-02-13 03:35

I have just attended the Zero Downtime Migration for Oracle webinar, presented by Chris Lawless, who recently moved from Product Manager for Golden Gate to Product Manager at Dbvisit. I will detail here a very important point raised by Chris Lawless: migration by replication is not only considered to avoid a service outage. That outage is needed during the migration itself (it can range from a few minutes to several hours depending on the size and complexity of the solution), but also for the tests before the Go/No-Go decision to reopen the service on the target.

 

Migrating by replication

A migration by replication is first of all about avoiding stress, costs, and risk.

A classic upgrade (dbua, for example) has the advantage of guaranteeing data integrity without having to involve the application teams: no need to have the list of schemas, synonyms or db links, nor to check whether there are particular operations (nologging) or particular data types (XMLTYPE). No need to wonder about triggers, delete on cascade, etc.

Of course the application teams will be involved for testing. But during the migration itself, the DBA can handle it without knowing what is in the database. That is why it is the most widely used method. It is only when the duration of the service outage is a problem that another solution must be considered, because a classic upgrade can take an hour (depending on the size of the dictionary).

A migration (when not only the Oracle version changes, but also the storage, the server, the platform) can take longer. Solutions exist to reduce this duration (Transportable Tablespaces, for example), but of course the operation will be more complex. And more complex means more time, more risk, more stress.

Hence the idea of migrating by replication: without downtime, without risk, without stress.

Here, the DBA cannot do it blindly. It is a project to set up with the application teams, to determine what to replicate and when. In my opinion this is actually an advantage: the migration is also a good opportunity to do a bit of housekeeping. And why not take the opportunity to move to UTF-8 as well?

And it is in any case fairly easy to test whether our application supports logical replication. Because of course, the goal is not to spend two months solving replication problems on triggers, on cascade constraints, materialized views, unsupported column types, etc.

So why less risk and less stress? Because we can set up the replication without disturbing production and let it run for several days or even weeks. We only decide to switch the application over to the target when everything is ready and validated.

The migration takes place in 3 phases:

  • Setting up the replication.
    This requires no service outage, only a brief TM Share lock on the tables to make sure there is no uncommitted data at the starting point of the data copy. The whole configuration will of course have been prepared beforehand on a test environment.
    Once the lock has been taken and released, the data is copied and then the replication applies the changes.
  • Validating the target.
    Take all the time needed to test the application on the target environment, with real production data. We can compare the two environments to check for regressions. We can validate performance, since the real production load arrives on the target. We can even already connect the production applications that work read-only: reporting, exports, ...
    As for the source, the impact on production is very low, since it is only a matter of reading the generated redo, and that can be done on another server.
    There is no deadline for these tests other than the desired migration date. This is the big difference with a classic migration, where the testing phase, before the service is reopened, must be both fast and reliable.
  • Switching production over.
    The applications now connect to the target. We do it for all of them at the same time so that everybody sees all the updates. But in any case the replication is still there: no data would be lost if we forgot one.
    This is done without stress, since the whole environment has been validated beforehand, over several days, on normal production load. The outage lasts a few seconds, the time for the applications to reconnect. And no surprise here: the target environment has been validated and optimized well in advance.

 

Temporary license and 12c support

Chris Lawless also answered two questions about licensing and about 12c support.

In terms of cost, a product like Dbvisit Replicate, which already has a very reasonable license price, can also be rented for a limited time if desired: for example for the 3 months of the migration project. And the 30-day trial version can even be used to do a Proof of Concept at no cost, in order to validate the compatibility of the application with logical replication.

So why not spend one day testing the replication of our application? Dbvisit Replicate can be installed in an hour and quickly configured to replicate a schema in real time. This will allow us to evaluate the configuration effort based on the number of exceptions to handle.

Since Oracle 12c is a supported target, this is a viable solution when coming from any version above 9.2. A Pluggable Database target is not yet supported, but in any case, Dbvisit is mostly aimed at keeping costs down, with Standard Edition.

 

When to migrate to 12c?

So, to ease the fear of moving to 12c (the historical fear of the first releases of a new version), why not replicate your 10g or 11g production for a month (trial version) and calmly test your application on the 12c target with production data?