
Yann Neuhaus

All blog entries from http://www.dbi-services.com/

Tempdb enhancements with SQL Server 2014

Sun, 2014-05-25 11:40

SQL Server 2014 is definitely designed for performance, and I will try to demonstrate it in this blog post. I like to talk about hidden performance features because they generally do not require any application changes, unlike in-memory tables for instance (aka Hekaton tables).

Since SQL Server 2005, some improvements have been made for tempdb. Tempdb caching is one of them and helps reduce page allocation contention. Basically, to create a table, SQL Server must first build the system catalog entries related to system pages. Then SQL Server has to allocate an IAM page, find a mixed extent in an SGAM page to store data, and mark it in the PFS page (as a reminder, a mixed extent is chosen by default unless a uniform extent is forced with trace flag 1118). Finally, the allocation must be recorded in the system pages. When a table is dropped, SQL Server has to revert everything it did to create the table. The whole allocation process implies locks and latches, and the same ones are used for creating and dropping a temporary table. However, in tempdb, tables are created and dropped very quickly, which can generate page allocation contention, especially on the PFS, SGAM and GAM system pages (the famous PAGELATCH_UP wait type against the concerned pages). The bottom line is that SQL Server can cache some of the metadata and page allocations of temporary objects for easier and faster reuse with less contention.
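If you want to check whether you are facing this kind of contention, here is a minimal sketch of a diagnostic query (it only relies on the standard sys.dm_os_waiting_tasks DMV; the filter on database_id 2 targets tempdb):

-- minimal sketch: look for allocation page latch contention in tempdb (database_id = 2)
SELECT session_id,
       wait_type,
       wait_duration_ms,
       resource_description   -- format dbid:fileid:pageid, e.g. 2:1:1 (PFS), 2:1:2 (GAM), 2:1:3 (SGAM)
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'
      AND resource_description LIKE '2:%';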

 

In addition, to be cached, a temporary table must first be used inside a stored procedure, but some situations can prevent this caching, such as:

 

  • Using named constraints (see the sketch after this list)
  • Using DDL after the temporary table creation
  • Creating the table in a different scope
  • Using a stored procedure with the WITH RECOMPILE option
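As an illustration of the first point, here is a hedged sketch (names are purely illustrative) of a temporary table that cannot benefit from tempdb caching because of its explicitly named constraint:

create table #test_named_constraint
(
     id int not null,
     constraint PK_test_named primary key (id)   -- explicitly named constraint: the table cannot be cached
);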

By executing the following T-SQL script with SQLQueryStress we can easily show that temporary tables are not reused by SQL Server.

use [AdventureWorks2012]
go

create table #test
(
     TransactionID bigint,
     ProductID int,
     TransactionDate datetime,
     Quantity int,
     ActualCost money
)

insert #test
select top 10000 *
from AdventureWorks2012.dbo.bigTransactionHistory

select
     ProductID,
     sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;

 

I used 8 concurrent threads with 100 iterations during this test.

 

Blog_9_-sqlstress_test1


 

At the same time, I enabled the following perfmon counters:

 

Counter name                     | Min value | Avg value | Max value
Average latch wait time (ms)     | 1,043     | 3,327     | 7,493
Latch wait / sec                 | 110,014   | 242,468   | 965,508
Temp tables creation rate / sec  | 4,001     | 16        | 21,146
Cache objects in Use             | 0         | 0         | 0


Now, if I rewrite the same ad hoc T-SQL statement as a stored procedure and perform the same test, we can notice some speed improvements:

use [AdventureWorks2012]
go

create procedure [dbo].[sp_test_tempdb]
as

create table #test
(
     TransactionID bigint,
     ProductID int,
     TransactionDate datetime,
     Quantity int,
     ActualCost money
)

insert #test
select top 10000 *
from AdventureWorks2012.dbo.bigTransactionHistory

select
     ProductID,
     sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;

 

blog_9_-_sqlstress_test2

 

Counter name                     | Min value | Avg value | Max value
Average latch wait time (ms)     | 0         | 0,855     | 1,295
Latch wait / sec                 | 0         | 4405,145  | 5910,304
Temp tables creation rate / sec  | 0         | 0         | 0
Cache objects in Use             | 0         | 7,048     | 8

 

As expected, this improvement is due to the tempdb caching mechanism. We can notice here that SQL Server reuses cached objects ("Cache objects in Use" counter > 0), which are in fact the temporary table of the stored procedure. Using cached objects drastically decreases the temporary table creation rate ("Temp tables creation rate / sec" is equal to 0 here).

The cached objects themselves are visible through the sys.tables system view in the tempdb context. For example, during the first test, we can easily observe that SQL Server does not completely deallocate a temporary table used in a stored procedure. The relationship can be made with the object id column, which holds a negative number. While SQL Server uses a temporary table, its name is #test; when SQL Server no longer uses it but keeps the associated pages allocated, the name is an 8-character hexadecimal string that maps to the object id value. #AF42A2AE is the hexadecimal representation of the #test temporary table with the object id equal to -1354587474.
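A minimal sketch of that check (nothing more than the sys.tables view mentioned above, queried in the tempdb context; cached objects show up with the hexadecimal names described above):

use tempdb;
go

select name, object_id, create_date
from sys.tables
order by create_date desc;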

 

blog_9_-_tempdb_caching

 

...

 

blog_9_-_tempdb_caching_2

 

Furthermore, we can notice several records in the above results because I used SQLQueryStress with 8 concurrent threads, which implies concurrent executions of the stored procedure with separate cached objects in tempdb. We can see 4 records here (I did not show the entire result), but in fact we retrieved 8 records.

As I said earlier, DDL statements after the creation of the temporary table inhibit SQL Server's ability to cache the temporary object and can decrease the global performance of the stored procedure (we can ask here what a DDL statement really is, because DROP TABLE #table is apparently not considered as such: it does not impact the tempdb caching mechanism). In my sample, SQL Server proposes to create the following index on the ProductID column to improve the query:

 

create nonclustered index idx_test_transaction_product_id
on #test (ProductID)


Let's go ahead: we trust SQL Server and add the index creation after the temporary table creation in the stored procedure:

use [AdventureWorks2012]
go

create procedure [dbo].[sp_test_tempdb]
as

create table #test
(
     TransactionID bigint,
     ProductID int,
     TransactionDate datetime,
     Quantity int,
     ActualCost money
)

--create index for ProductID predicate
create nonclustered index idx_test_transaction_product_id
on #test (ProductID)

insert #test
select top 10000 *
from AdventureWorks2012.dbo.bigTransactionHistory

select
     ProductID,
     sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;

 

However, the result is not as good as we would expect ...

 

blog_9_-_sqlstress_test3

 

If we take a look at the perfmon counter values:

Counter name                     | Min value | Avg value | Max value
Average latch wait time (ms)     | 0,259     | 0,567     | 0,821
Latch wait / sec                 | 0         | 2900      | 4342
Temp tables creation rate / sec  | 3,969     | 5,09      | 8,063
Temp tables for destruction      | 0         | 27,02     | 58
Cache objects in Use             | 6         | 7,9       | 8


 

For this test I added a new perfmon counter, "Temp tables for destruction", which clearly indicates that the temporary tables will be destroyed by SQL Server because they cannot be cached in this case: the index creation DDL prevents the tempdb caching mechanism.

Here comes a new SQL Server 2014 feature that introduces a way to declare nonclustered indexes directly in the table creation DDL, which can be a good workaround for the preceding test.

 

alter procedure [dbo].[sp_test_tempdb]
as

create table #test
(
     TransactionID bigint,
     ProductID int index idx_test_transaction_product_id, --< index created "on the fly"
     TransactionDate datetime,
     Quantity int,
     ActualCost money
)

insert #test
select top 1000000 *
from AdventureWorks2012.dbo.bigTransactionHistory

select
     ProductID,
     sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;

 

After running the test, we can notice that the "Temp tables creation rate" and "Temp tables for destruction" counter values are again equal to zero. SQL Server reused the cached temporary table during the test, as shown by the "Cache objects in Use" counter.

 

Counter name                     | Min value | Avg value | Max value
Average latch wait time (ms)     | 0         | 0,262     | 0,568
Latch wait / sec                 | 0         | 1369      | 3489
Temp tables creation rate / sec  | 0         | 5,09      | 8,063
Temp tables for destruction      | 0         | 0         | 0
Cache objects in Use             | 6         | 7,9       | 8

 

However, even if we can still benefit from the tempdb caching mechanism with SQL Server 2014 and this new tip, the above result has to be contrasted with the total execution duration, as shown in the following picture:

 

blog_9_-_sqlstress_test5


 

The global execution time is larger than in the first test with the stored procedure and the temporary table without any nonclustered index (02:44 vs. 00:21 in my case). This is because inserting data into a table with a nonclustered index can take more time than inserting into a table without any index. In a real production environment, we will probably encounter situations where the cost of inserting data into an indexed table is substantial compared to the gain for the subsequent reads. If you have some examples, please feel free to share them with us.


Another interesting feature, present for many versions now, is the concept of eager writes, which prevents flooding the buffer pool with newly created pages coming from bulk activities that need to be written to disk. Eager writing is another background process that helps reduce the pressure on the well-known lazy writer and checkpoint background processes, and increases IO performance by gathering pages before writing them to disk. Basically, SQL Server tracks these pages in a circular list in memory. When the list is full, old entries are removed by writing them to disk if they are still dirty.

Let me show you with the following T-SQL script on the SQL Server 2012 instance. I used trace flag 3917 to show eager write activity (thanks to Bob Dorr for this tip).

 

use AdventureWorks2012;
go

-- create procedure sp_test_tempdb_2
-- bulk activity by using select into #table
CREATE procedure sp_test_tempdb_2
as

select
     bth.*,
     p.Name AS ProductName,
     p.Color
into #test
from AdventureWorks2012.dbo.bigTransactionHistory as bth
     join AdventureWorks2012.dbo.bigProduct as p
          on bth.ProductID = p.ProductID
where p.Color in ('White')
     and p.Size = 'M'
option (maxdop 1);

select
     TransactionDate,
     ProductID,
     ProductName,
     Quantity
     --Quantity * ActualCost AS total_individual_sale
from
(
     select
          ROW_NUMBER() OVER (PARTITION BY TransactionDate ORDER BY Quantity DESC) AS num,
          *
     from #test
) transaction_production_sales_top_ten
where num <= 10
option (maxdop 1);

drop table #test
go

-- use traceflag 3917 to show eager write activity (be careful, the output may be verbose)
dbcc traceon(3917);
dbcc traceon(3605);
go

-- cycle the errorlog for an easier read
exec sp_cycle_errorlog;
go

-- execute the stored procedure dbo.sp_test_tempdb_2
exec dbo.sp_test_tempdb_2;
go

-- read the error log file
exec xp_readerrorlog;


Below is a sample of the SQL Server error log:

 

blog_9_-_sql12_eager_writes

 

We can notice that SQL Server writes up to 32 contiguous dirty pages to disk in my test.

Even if this process is optimized to write pages efficiently to disk, we still have IO activity. SQL Server 2014 enhances this process by relaxing the need to flush these pages to disk as quickly as in older versions. SQL Server recognizes the bulk activity, and the concerned pages are loaded, queried and released without any disk flushing activity.

The same test performed on the SQL Server 2014 environment gives the following result:

 

blog_9_-_test_select_into_eager_write_sql14

 

The eager write process was not triggered this time. So let's compare with a simulated workload, using ostress this time. Ostress is a stress tool provided by the RML utilities. I used it with 4 threads and 1000 iterations each. SQLQueryStress generated a bunch of ASYNC_NETWORK_IO waits during my tests, which potentially distorts the final result.

So I used the following command for both environments (SQL Server 2012 and SQL Server 2014):

"C:Program FilesMicrosoft CorporationRMLUtilsostress.exe" -Slocalhost -dAdventureWorks2012 -Q"exec dbo.sp_test_tempdb_2" -n4 -r1000 -N –q

 

SQL Server 2012

 

blog_9_-_ostress_sql12

 

… the corresponding io file stats:

 

SELECT
     d.name AS database_name,
     f.name AS [file_name],
     f.physical_name,
     f.type_desc,
     vf.num_of_reads,
     vf.num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vf
INNER JOIN sys.databases AS d ON d.database_id = vf.database_id
INNER JOIN sys.master_files AS f ON f.file_id = vf.file_id
     AND f.database_id = vf.database_id
where f.database_id = db_id('tempdb')

 

blog_9_-_ostress_sql12_tempdb_io

 

… and the corresponding wait types:

 

Wait type                     | Total wait (ms) | Total wait count | Avg wait time (ms)
PAGEIOLATCH_UP                | 452737834       | 3333841          | 135
PAGEIOLATCH_EX                | 343071451       | 4696853          | 73
PREEMPTIVE_OS_ENCRYPTMESSAGE  | 929             | 29527            | 0
PAGELATCH_SH                  | 603             | 201              | 3

 

SQL Server 2014

 

blog_9_-_ostress_sql14

 

...

 

blog_9_-_ostress_sql14_tempdb_io

 

...

 

Wait type                     | Total wait (ms) | Total wait count | Avg wait time (ms)

Oracle 12c extended datatypes better than CLOB?

Fri, 2014-05-23 08:15

12c has introduced character strings that can go above 4000 bytes. In previous versions, only in PL/SQL were we allowed to have VARCHAR2 up to 32k. In SQL, the VARCHAR2 datatype was limited to 4000 bytes and CHAR was limited to 2000. That became a bit small, especially when a lot of applications needed to store Unicode characters.

From 12c, we can have SQL datatypes of up to 32k bytes for VARCHAR2, CHAR and RAW. It's not allowed by default: we need to set max_string_size=extended and recompile the views with utl32k. Nice improvement. But is it a good idea to use that new feature when we already have CLOB for large character strings? The New Features documentation is clear about that: extended datatypes have been introduced to be compatible with other databases - not to replace existing features.
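As a quick reminder - and only as a hedged sketch for a non-CDB single instance, the exact procedure is in the documentation and differs for multitenant and RAC - enabling extended datatypes looks like this:

SQL> shutdown immediate;
SQL> startup upgrade;
SQL> alter system set max_string_size=extended;
SQL> @?/rdbms/admin/utl32k.sql
SQL> shutdown immediate;
SQL> startup;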

I will not go into the details of how they are stored. Information about that is available elsewhere. See for example @ludodba's recent blog post Where are Extended Data Types stored?. Extended datatypes are stored as chained rows if you just extend an existing table, or as a LOB if you define them on a new table. Chained rows are clearly not a good option, so, given that you (re)create the tables, their storage is similar to a CLOB.

But there is something that I don't like with LOBs: they are fetched row by row. When you select a row, you get only the handle, and you get the CLOB later, when you access it through the handle. Did you ever try to datapump a table with LOBs through a network_link? A huge number of roundtrips and very bad performance. It's one of the rare cases where doing expdp/impdp with a dumpfile is better. For very large objects you will do several roundtrips anyway, so this is not an issue. But for character strings that are just a few kilobytes, storing them as LOBs introduces an unnecessary overhead.

Let's compare the fetch behaviour with those new extended datatypes. For my demo, I'll use a table with a CLOB column "C" and an extended VARCHAR2 column "E", and insert the same data into both columns.

 

SQL> create table TEST ( C clob , E varchar2(9000) );
Table created.
SQL> insert into TEST select lpad(rownum,9000,'x'),lpad(rownum,9000,'x') from dual connect by level <= 10;
10 rows created.

 

Here is the autotrace when reading the CLOB from 10 rows:

 

SQL> set autotrace trace stat
SQL> select C from TEST;

 10 rows selected.

 Statistics
 ----------------------------------------------------------
           2  recursive calls
           0  db block gets
          27  consistent gets
          20  physical reads
           0  redo size
       93936  bytes sent via SQL*Net to client
        2722  bytes received via SQL*Net from client
          22  SQL*Net roundtrips to/from client
           0  sorts (memory)
           0  sorts (disk)
          10  rows processed

 

For only 10 rows, I've made 22 roundtrips. This is the problem with LOBs: too many roundtrips. Well, there is another problem that I'll not show here: you can fetch the LOB a long time afterwards, even when the cursor is closed. It does a consistent read, so you have to set your undo_retention accordingly.
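Just as an illustration of that last point (3600 is an arbitrary example value, not a recommendation), checking and raising the parameter looks like:

SQL> show parameter undo_retention
SQL> alter system set undo_retention=3600;   -- seconds; choose a value that covers your late LOB fetches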

Now here is the same data from the extended varchar2 column:

 

SQL> select E from TEST;

 10 rows selected.

 Statistics
 ----------------------------------------------------------
           1  recursive calls
           0  db block gets
          56  consistent gets
           0  physical reads
           0  redo size
       90501  bytes sent via SQL*Net to client
         492  bytes received via SQL*Net from client
           2  SQL*Net roundtrips to/from client
           0  sorts (memory)
           0  sorts (disk)
          10  rows processed

 

Here I got the same volume (10 times 9000 characters) but this time I did only 2 roundtrips.

Let's go further and trace with sql_trace. LOB calls are instrumented since 11g so we can see them from the trace file:

 

PARSING IN CURSOR #139894737850360 len=18 dep=0 uid=103 oct=3 lid=103 tim=8952647276 hv=844696927 ad='77e1a518' sqlid='132sh6wt5k3az'
select C from TEST
END OF STMT
PARSE #139894737850360:c=0,e=82,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895260
EXEC #139894737850360:c=0,e=69,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=897416
FETCH #139894737850360:c=0,e=97,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=848200
LOBREAD: c=999,e=10893,p=2,cr=1,cu=0,tim=8952659696
FETCH #139894737850360:c=0,e=43,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=860372
LOBREAD: c=1000,e=1614,p=2,cr=1,cu=0,tim=8952662447
FETCH #139894737850360:c=0,e=47,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=863495
LOBREAD: c=2000,e=657,p=2,cr=1,cu=0,tim=8952664615
FETCH #139894737850360:c=0,e=40,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=865575
LOBREAD: c=0,e=706,p=2,cr=1,cu=0,tim=8952666808
FETCH #139894737850360:c=0,e=37,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=867552
LOBREAD: c=1000,e=949,p=2,cr=1,cu=0,tim=8952669193
FETCH #139894737850360:c=0,e=92,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=869825
LOBREAD: c=0,e=844,p=2,cr=1,cu=0,tim=8952671276
FETCH #139894737850360:c=0,e=68,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=872168
LOBREAD: c=1000,e=756,p=2,cr=1,cu=0,tim=8952673521
FETCH #139894737850360:c=0,e=44,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=874712
LOBREAD: c=999,e=914,p=2,cr=1,cu=0,tim=8952676180
FETCH #139894737850360:c=0,e=64,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=877352
LOBREAD: c=0,e=931,p=2,cr=1,cu=0,tim=8952678875
FETCH #139894737850360:c=0,e=52,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=879774
LOBREAD: c=1000,e=795,p=2,cr=1,cu=0,tim=8952681136
FETCH #139894737850360:c=1000,e=7,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=891850
STAT #139894737850360 id=1 cnt=10 pid=0 pos=1 obj=100085 op='TABLE ACCESS FULL TEST (cr=16 pr=0 pw=0 time=60 us cost=5 size=20980 card=10)'
CLOSE #139894737850360:c=0,e=31,dep=0,type=0,tim=8952684289

 

And the sql_trace with the same data from the extended datatype.

 

PARSING IN CURSOR #139895028091224 len=18 dep=0 uid=103 oct=3 lid=103 tim=8954178349 hv=1829009117 ad='7b48ba08' sqlid='4kq232tqh8xqx'
select E from TEST
END OF STMT
PARSE #139895028091224:c=0,e=64,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895345
EXEC #139895028091224:c=0,e=36,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895435
FETCH #139895028091224:c=1000,e=896,p=0,cr=8,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=899458
FETCH #139895028091224:c=4000,e=3660,p=0,cr=48,cu=0,mis=0,r=9,dep=0,og=1,plh=1357081020,tim=89535
STAT #139895028091224 id=1 cnt=10 pid=0 pos=1 obj=100085 op='TABLE ACCESS FULL TEST (cr=56 pr=0 pw=0 time=670 us cost=5 size=20010 card=10)'
CLOSE #139895028091224:c=0,e=13,dep=0,type=0,tim=8954214996

 

So there is one big advantage over CLOB: the column values are returned without additional roundtrips.

That means that if you have character strings that may be between 4k and 32k, then extended datatypes can be a good option. It's a new feature however, designed for another goal (easy migration from other databases). So it's something to test carefully, and the tests must cover all your infrastructure components (backups, exports, replication, etc.).

Check SQL Server Error Log size with a policy

Fri, 2014-05-23 00:52

In many blogs or articles, we can read that it is recommended to recycle the error log regularly to make sure that it does not become too big.

In the Best Practice Analyzer (BPA) for SQL Server, this is a rule.

But what is a good size for the error log file?

I have often heard "it depends", but in my opinion, this is not a good answer.

SQL Server 2014 : New incremental statistics

Tue, 2014-05-20 20:27

Do you use partitioning with SQL Server? If yes, this post is for you, because there is good news about partitioning with SQL Server 2014. I remember an old job with a healthcare customer where we decided to implement a partitioning strategy for a big transaction archive table of medical records. We used SQL Server 2005 and we already faced some challenges, like statistics updates. Database administrators who use partitioning with versions older than SQL Server 2014 know what I mean.

The first piece of news covered in this blog post is the new incremental statistics strategy provided with SQL Server 2014. This feature allows updating statistics without reading the entire table, which can be a problem when your table becomes very big! Imagine, for instance, that you have only one small active partition with recent records and a lot of changes (updates, inserts or deletes) against 30 big read-only partitions with billions of records. With SQL Server you have different strategies to update statistics, using either the sample method or the full scan method. The latter is more reliable with partitioned tables, because in this case we have to deal with skewed data, but it requires more time (and more resources) because SQL Server has to read all the table records. With big partitioned tables, we are potentially talking about many hours. Adding a new non-empty partition can also be problematic, because we have to deal with the same issue: the new data is not represented in the concerned statistic.

Using incremental statistics, it is possible to update only one or several partitions as necessary. The information is then gathered and merged with the existing information to create the final statistic. We will see later how this new mechanism works.

During my tests I will use a modified transaction history table in the AdventureWorks2012 database. The table records are generated by Adam Machanic's T-SQL script. This table contains 49270382 records for a total size of 2 GB. We will partition the bigTransactionHistory table as follows:

USE AdventureWorks2012;
GO

-- Partition function with a partitioning strategy by year.
-- The first day of the year will be used as boundary of the next partition
CREATE PARTITION FUNCTION BigTransactionRange(DATETIME)
AS RANGE RIGHT FOR VALUES
(
   '20050101', '20060101', '20070101', '20080101',
   '20090101', '20100101'
);
GO

-- Partition scheme
-- All partitions will be in the primary filegroup
CREATE PARTITION SCHEME TransactionsPS1
AS PARTITION BigTransactionRange
ALL TO ( [PRIMARY] );
GO

 

Now we can create a clustered index on the partition scheme TransactionsPS1 with the new STATISTICS_INCREMENTAL option.

CREATE CLUSTERED INDEX idx_bigtransactionhistory_transactiondate
ON [dbo].[bigTransactionHistory]([TransactionDate])
WITH (SORT_IN_TEMPDB = ON, STATISTICS_INCREMENTAL = ON);
GO
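Note that the index option used above is not the only way to get incremental statistics; as far as I know (a hedged sketch, to be verified against the documentation for your build), they can also be enabled for auto-created statistics at the database level, or on a manually created statistic:

-- allow auto-created statistics to be incremental on partitioned tables
ALTER DATABASE AdventureWorks2012 SET AUTO_CREATE_STATISTICS ON (INCREMENTAL = ON);

-- or create a statistic explicitly as incremental (statistic name chosen here for the example)
CREATE STATISTICS stat_bigtransactionhistory_productid
ON dbo.bigTransactionHistory (ProductID) WITH INCREMENTAL = ON;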

 

The sys.stats system view has a new is_incremental column which indicates if a statistic is incremental or not.

SELECT
     stats_id,
     name AS stat_name,
     is_incremental
FROM sys.stats
WHERE object_id = object_id('bigtransactionhistory')

 

blog_10_-_0-_incremental_stats_-_sys_stats

 

The new partition configuration of the bigTransactionHistory table is the following:

SELECT
     p.object_id,
     o.name AS table_name,
     p.partition_number,
     p.rows,
     au.total_pages,
     au.total_pages / 128 AS total_size_mb,
     au.type_desc,
     g.name AS [filegroup_name],
     RVL.value AS left_range_boundary,
     RVR.value AS right_range_boundary
     --PF.[name], RV.boundary_id, RV.[value]
FROM sys.partitions AS p
     JOIN sys.objects AS o
          ON o.object_id = p.object_id
     JOIN sys.indexes i
          ON p.object_id = i.object_id
               AND p.index_id = i.index_id
     JOIN sys.allocation_units AS au
          ON p.hobt_id = au.container_id
     JOIN sys.filegroups AS g
          ON g.data_space_id = au.data_space_id
     LEFT JOIN sys.partition_schemes AS PS
          ON ps.data_space_id = i.data_space_id
     LEFT JOIN sys.partition_functions AS PF
          ON PF.function_id = ps.function_id
     LEFT JOIN sys.partition_range_values AS RVL
          ON RVL.function_id = PF.function_id
               AND RVL.boundary_id + 1 = p.partition_number
     LEFT JOIN sys.partition_range_values AS RVR
          ON RVR.function_id = PF.function_id
               AND RVR.boundary_id = p.partition_number
WHERE p.object_id = object_id('bigtransactionhistory')
     AND p.index_id = 1;

 

blog_10_-_1-_incremental_stats_-_config_partitionnement

 

Now let’s take a look at the statistics by using the DBCC SHOW_STATS command:

DBCC SHOW_STATISTICS('bigtransactionhistory','idx_bigtransactionhistory_transactiondate');

 

blog_10_-_2-_incremental_stats_-_stat_header

 

...

 

blog_10_-_3-_incremental_stats_-_stat_density

 

...

 

blog_10_-_4-_incremental_stats_-_stat_histogramm

 

We retrieve the same information as in older versions. In fact, we are reading statistic data from the final statistic; you will understand why later in this blog post. Note that the final statistic can still have a maximum of 200 steps. In other words, incremental statistics do not change the story here: we still have a maximum of 200 steps for the entire table. I hope we will see some improvements, maybe in the next version.

Now let’s fill up the table with some records which will have a transaction date greater than 2011-01-01 in order to move them to the last partition.

INSERT dbo.bigTransactionHistory
SELECT TOP 1
     TransactionID + 1, ProductID, dateadd(year, 1, TransactionDate), Quantity + Quantity, ActualCost * 1.2
FROM dbo.bigTransactionHistory
ORDER BY TransactionID DESC

 

The new row has been moved to the correct partition number 8:

 

blog_10_-_5_-_incremental_stats_-_config_partitionnement_after_insert_one_record

 

If we take a look at the statistic histogram of the index idx_bigtransactionhistory_transactiondate we can notice that the new record is not propagated to the existing histogram.

 

blog_10_-_6_-_incremental_stats_-_stat_histogramm_apres_insertion_data

 

Of course, a statistics update is mandatory here. Before SQL Server 2014, we had to update the statistics on the entire table, but as said earlier, we can now use a new option: WITH RESAMPLE ON PARTITIONS(). The concerned partition is number 8 in my case. The word RESAMPLE is important here, because all pages in the statistic tree structure must be aligned on the same sample.

UPDATE STATISTICS dbo.BigTransactionHistory(idx_bigtransactionhistory_transactiondate) WITH RESAMPLE ON PARTITIONS(8);

 

blog_10_-_7_-_incremental_stats_-_stat_histogramm_apres_maj_incremental_statsjpg

 

Good news: incremental statistics work! In addition, the following table gives an idea of the performance gain we could obtain with incremental statistics (I compared both methods, with full scan and with resample on partitions):

 

Update statistics option         | Elapsed time (ms) | CPU time (ms)
WITH FULLSCAN                    | 3085              | 20935
WITH RESAMPLE ON PARTITIONS(8)   | 6                 | 0

 

I guess you can easily imagine the result with this new method on a real production environment …

What about automatic updates of incremental statistics? As you certainly know, SQL Server uses a specific algorithm to automatically update statistics when the table has more than 500 rows: the update operation is triggered when the number of modifications reaches 20% of the total existing rows + 500. With tables that have billions of rows, we can spend a lot of time without any automatic update statistics operation. Besides, adding a new partition that does not modify more than 20% of the total rows will not trigger an automatic update statistic operation, and no information about it will be available in the statistics. Fortunately, incremental statistics change the story here: the statistic update can be triggered per partition, when the number of modifications reaches the threshold value = (total rows / number of partitions) * 20%.

Let’s try with the following test. In my case the bigTransactionhistory table contains 46895292 rows. According to the above formula the theoretical threshold should be: 46895292 / 8 * 0.2 = 1172383. Thus, I will update 1172892 rows in the partition number 7 to be sure to issue an automatic update statistic operation by using a SELECT statement with a predicate on the TransactionDate column.

UPDATE TOP (1172892) bigTransactionHistory
SET TransactionDate = DATEADD(dd, 1, TransactionDate)
WHERE $PARTITION.BigTransactionRange(TransactionDate) = 7;

 

I can confirm the number of changes is over the threshold by viewing the rowmodctr column from the legacy system view sysindexes:

SELECT
     indid,
     o.name AS table_name,
     i.name AS index_name,
     STATS_DATE(o.id, i.indid) AS last_date_updated,
     rowmodctr AS number_of_changes,
     st.is_incremental
FROM sysobjects o
     JOIN sysindexes i
          ON i.id = o.id
     JOIN sys.stats st
          ON st.object_id = o.id
               AND st.stats_id = i.indid
WHERE xtype = 'U'
     AND i.name IS NOT NULL
     AND o.id = object_id('bigtransactionhistory')
ORDER BY last_date_updated
GO

 

blog_10_-_8_-_incremental_stats_-_rowmodctr

 

Then, I perform a SELECT statement with a predicate on the TransactionDate column …

SELECT TOP 10 * FROM dbo.bigTransactionHistory WHERE TransactionDate > '20110101';

 

… and I can conclude that the SELECT statement has triggered an automatic statistics update, since the rowmodctr column value is now equal to 0 for the idx_bigtransactionhistory_transactiondate index.

 

blog_10_-_9_-_incremental_stats_-_rowmodctr

 

In my case, the automatic update statistic operation was triggered with a change of only 2.5% of the total rows of the table.

Now, as promised, I will give you more information about statistic pages with incremental statistics. First, I would like to thank Frédéric Pichaud, Senior Escalation Engineer at Microsoft France, for the following information. We can use a new internal dynamic management function to figure out how statistic pages are organized with incremental statistics.

SELECT
     s.name,
     s.is_incremental,
     sp.node_id,
     sp.first_child,
     sp.next_sibling,
     sp.last_updated,
     sp.modification_counter,
     sp.rows,
     sp.rows_sampled,
     sp.steps,
     sp.left_boundary,
     sp.right_boundary
FROM sys.stats as s
CROSS APPLY sys.dm_db_stats_properties_internal(s.object_id, s.stats_id) AS sp
WHERE s.object_id = object_id('bigtransactionhistory')
     AND s.name = 'idx_bigtransactionhistory_transactiondate'
ORDER BY s.name, node_id;

 

blog_10_-_10_-_incremental_stats_-_dmf_internal

 

We have to focus on 3 columns here: node_id, first_child and next_sibling. node_id is an identifier of a statistic page. The page with node_id equal to 1 is the final statistic page we saw earlier with DBCC SHOW_STATISTICS. This page contains the total number of rows of the bigTransactionHistory table (rows column = 46895292). We can notice that each partition, easily identifiable by its left and right boundaries, has its own statistic page (node_id 3 to 9). The first_child and next_sibling columns help us rebuild the complete tree of statistic objects, as shown below:

 

blog_10_-_12_-_incremental_stats_-_statistic_object_tree

 

From my understanding, the green squares are the leaf-level statistic pages, each linked to a partition, and the red squares are the binary merge pages. For instance, the pages with node_id = 8 (4790286 rows) and node_id = 9 (1 row) are merged into the page with node_id = 10 (4790286 + 1 = 4790287 rows). In turn, the page with node_id = 7 (7672943 rows) and the page with node_id = 10 are merged into the page with node_id = 12 (7672943 + 4790287 = 12463230 rows), and so on. The page with node_id = 1 is the final statistic page that merges all the other pages. The page with node_id = 2 is a page reserved for future needs.

Earlier in the blog post, an automatic statistic update was triggered after we modified partition number 7. The page with node_id = 8 is linked to this partition, and we can see how the statistic update propagates throughout the statistic object tree: only the statistic page of this partition is updated, together with all the merge pages above it.

 

blog_10_-_13_-_incremental_stats_-_dmf_internal_2

 

Finally, we can use trace flag 2309 to see the detail of a single page in the statistic tree. After enabling this trace flag, we can use the DBCC SHOW_STATISTICS command with an additional third parameter, the node_id.

DBCC TRACEON(2309);
GO

DBCC SHOW_STATISTICS('bigtransactionhistory', 'idx_bigtransactionhistory_transactiondate', 8);

 

blog_10_-_14_-_incremental_stats_-_show_statistics_detail_page

 

The result is the same as with a classic DBCC SHOW_STATISTICS command, but the detail is only for the concerned page in the tree. Enabling this trace flag is only for debugging purposes. For instance, if we suspect a general update statistic issue, we can try to identify which pages are relevant.

Database administrators who have faced statistics problems with partitioned tables will probably have fun trying this new feature. See you soon for the next good news about partitioning ;-)

Oracle policy managed databases: Policies and policy sets

Tue, 2014-05-20 15:32

In this post, I will continue my experiments with Oracle policy-managed databases with Grid Infrastructure, and present one of the new features introduced with Grid Infrastructure 12c last June: cluster configuration policies and policy sets. It allows the administrator to dynamically reconfigure all server pools at once, with a single command, according to the business workload.

 

To take a simple example, imagine you have two servers allocated to your production instances and two servers allocated to your development instances. This distribution fits your daily workload. But you have very intensive batches running every night on your production databases: the production servers are heavily used, while the two development servers are up and running but not used at all. What a waste of resources!

 

In this case you have several solutions:

 

1) Shut down development servers during the night because they are not used. It will at least allow energy saving :-)

2) Review your needs and allocate more server(s) to the production system

3) Create a crazy script to reconfigure each server pool, one by one, to accommodate the configuration to the real needs, with srvctl commands run serially.

4) Use policies and policy sets

 

As you may expect, I will develop the last solution!

 

A policy set is a "document" (Oracle definition) which contains one or more policies. Each policy contains the configuration of all server-pools in the cluster: MIN_SIZE and MAX_SIZE, IMPORTANCE, server list, etc.

It makes the administrator able to activate one policy or another, according to the workload or business needs, and so to dynamically and immediately change the whole configuration of the cluster server pools. For instance: use two servers for production and two servers for development during working days, and three servers for production and one server for development during nights and weekends.

 

First of all, here is my current configuration:

 

[oracle@node01 ~]$ srvctl config srvpool
Server pool name: Free
Importance: 0, Min: 0, Max: -1
Category:
Candidate server names:
Server pool name: Generic
Importance: 0, Min: 0, Max: -1
Category:
Candidate server names:
Server pool name: srvprod
Importance: 0, Min: 2, Max: 2
Category: hub
Candidate server names:
Server pool name: srvtest
Importance: 0, Min: 2, Max: 2
Category: hub
Candidate server names:

 

I have two nodes running in my srvprod pool and two nodes running in the srvtest pool.

An instance DB1 (prod) is running on nodes 1 and 4:

[oracle@node01 ~]$ srvctl status database -d DB1
Instance DB1_1 is running on node node01
Instance DB1_2 is running on node node04

 

And an instance DB2 (dev) is running on nodes 2 and 3:

[oracle@node01 ~]$ srvctl status database -d DB2
Instance DB2_1 is running on node node02
Instance DB2_2 is running on node node03

 

The simplest way to benefit from policies and policy sets is to create a text file with the desired configuration. The following steps explain how to proceed.

 

Step 1: Create a new policy set

The crsctl tool offers an option to generate a new policy set file, based on the current active configuration. We must connect as grid:

 

[grid@node01 ~]$ mkdir -p /u00/app/oracle/admin/policies
[grid@node01 ~]$ crsctl create policyset -file /u00/app/oracle/admin/policies/new_policy

 

The file has been generated:

[grid@node01 ~]$ ll /u00/app/oracle/admin/policies
total 4
-rw-r--r-- 1 grid oinstall 305 Apr 29 16:21 new_policy

 

There is the file content:

[grid@node01 ~]$ cat /u00/app/oracle/admin/policies/new_policy
SERVER_POOL_NAMES=Free ora.srvprod ora.srvtest
POLICY
  NAME=Default
  SERVERPOOL
    NAME=ora.srvprod
    IMPORTANCE=0
    MAX_SIZE=2
    MIN_SIZE=2
    SERVER_CATEGORY=ora.hub.category
  SERVERPOOL
    NAME=ora.srvtest
    IMPORTANCE=0
    MAX_SIZE=2
    MIN_SIZE=2
    SERVER_CATEGORY=ora.hub.category

 

The line "NAME=Default" corresponds to the name of the policy in the policy set. The policy set itself is represented by the file "new_policy" created above.

 

Step 2: Update the current policy set

To update the current policy set, we just have to edit our policy set file "new_policy" and add new policies in addition to the Default one.

As an example, imagine we want to allocate one more server to production during the night, and all servers to production during weekends, disabling the development instances. We append the following policies:

 

POLICY
  NAME=Nights
  SERVERPOOL
    NAME=ora.srvprod
    IMPORTANCE=0
    MAX_SIZE=3
    MIN_SIZE=3
    SERVER_CATEGORY=ora.hub.category
  SERVERPOOL
    NAME=ora.srvtest
    IMPORTANCE=0
    MAX_SIZE=1
    MIN_SIZE=1
    SERVER_CATEGORY=ora.hub.category
POLICY
  NAME=WeekEnds
  SERVERPOOL
    NAME=ora.srvprod
    IMPORTANCE=0
    MAX_SIZE=4
    MIN_SIZE=4
    SERVER_CATEGORY=ora.hub.category
  SERVERPOOL
    NAME=ora.srvtest
    IMPORTANCE=0
    MAX_SIZE=0
    MIN_SIZE=0
    SERVER_CATEGORY=ora.hub.category

 

Once the file is edited, it must be loaded into the policy set configuration using crsctl:

[grid@node01 ~]$ crsctl modify policyset -file /u00/app/oracle/admin/policies/new_policy

 

Step 3: Enable a new policy

Our policy set now contains three policies (Default, Nights and WeekEnds). The currently enabled policy is "Default":

[grid@node01 ~]$ crsctl status policyset | grep -i last
LAST_ACTIVATED_POLICY=Default

 

We are going to enable the "Nights" policy and wee what happens. To enable a policy, use the following crsctl command as grid:

 

[grid@node01 ~]$ crsctl modify policyset -attr "LAST_ACTIVATED_POLICY=Nights"
CRS-2673: Attempting to stop 'ora.db2.db' on 'node02'
CRS-2677: Stop of 'ora.db2.db' on 'node02' succeeded
CRS-2672: Attempting to start 'ora.db1.db' on 'node02'
CRS-2676: Start of 'ora.db1.db' on 'node02' succeeded

 

In the previous output, we can see on node02 that Oracle has stopped DB2 and started DB1 instead. We can confirm with srvctl:

 

[grid@node01 ~]$ srvctl status srvpool
Server pool name: Free
Active servers count: 0
Server pool name: Generic
Active servers count: 0
Server pool name: srvprod
Active servers count: 3
Server pool name: srvtest
Active servers count: 1

 

[grid@node01 ~]$ srvctl status database -d DB1
Instance DB1_1 is running on node node01
Instance DB1_2 is running on node node04
Instance DB1_3 is running on node node02

 

[grid@node01 ~]$ srvctl status database -d DB2
Instance DB2_2 is running on node node03

 

Now I enable the "WeekEnds" policy. The remaing development server will be allocated to the production server pool, and 4 production instances will be running:

 

[grid@node01 ~]$ crsctl modify policyset -attr "LAST_ACTIVATED_POLICY=WeekEnds"
CRS-2673: Attempting to stop 'ora.db2.db' on 'node03'
CRS-2677: Stop of 'ora.db2.db' on 'node03' succeeded
CRS-2672: Attempting to start 'ora.DB1.db' on 'node03'
CRS-2676: Start of 'ora.DB1.db' on 'node03' succeeded

 

All nodes are now part of the production pool:

[grid@node01 ~]$ srvctl status srvpool
Server pool name: Free
Active servers count: 0
Server pool name: Generic
Active servers count: 0
Server pool name: srvprod
Active servers count: 4
Server pool name: srvtest
Active servers count: 0

 

Four production instances are opened:

[grid@node01 ~]$ srvctl status database -d DB1
Instance DB1_1 is running on node node01
Instance DB1_2 is running on node node04
Instance DB1_3 is running on node node02
Instance DB1_4 is running on node node03

 

And the service for development is completely down:

[grid@node01 ~]$ srvctl status database -d DB2
Database is not running.

 

To reset to the initial policy, just use the same command with the "Default" policy attribute:

[grid@node01 ~]$ crsctl modify policyset -attr "LAST_ACTIVATED_POLICY=Default"

 

Conclusion

This feature represents an easy way to reconfigure a whole policy-managed cluster at will. I have demonstrated how to edit a policy set using a text file generated with crsctl, but Oracle also documents how to define a policy set manually on the command line. In my opinion, this is more tedious and the text file is a better choice.

I only regret that Oracle has not created a way to enable policies according to the database workload, or to schedule policies over time. Here, we must enable the policy manually or use the crontab...

Oracle SQL Monitoring reports in flash, html, text

Thu, 2014-05-15 13:28

I have recently posted about the way I like to extract execution plans. When we have the Tuning Pack, I suggest getting them with SQL Real-Time Monitoring as an active report. However, Martin Preiss said in a comment that he prefers the text format - easier to search, copy, and paste. And that's a very good remark: we still need plain text.

My point is that if you send me the active report, then I can extract the text version from it. The opposite is not possible. It's a bit tricky, and I'll show you how to do it.

The active report is very nice: it is an html file with embedded xml data in it. When you open it, it loads a Flash application from the Oracle site and displays that nice report:

 

CaptureSQLMon001.PNG

 

It's easy to navigate, drill down, etc. I often go to the html source in order to see some information that is present in the xml but not displayed, such as the optimizer parameters used by the query. Here is an extract from the html source:

 

CaptureSQLMon002.PNG

 

Yes. I used to do that in 11g. And once, I asked someone to send me his 12c report and I had a bad surprise. The xml is compressed and I can't see anything:

 

CaptureSQLMon005.PNG

 

Oracle 12c introduced that nice optimization in order to have smaller files, but I preferred the previous format. Look at the attributes: it's base64 encoded and compressed with zlib.

So don't worry, you have linux utilities to uncompress that:

 

CaptureSQLMon007.PNG

 

From sqlmon12c.html, I have extracted the report with the xml data - as shown above - into sqlmon12c_decoded.html. Good, I got my xml back :)

Then back to Martin's comment. The active report is good for navigation. The xml is good for finding some hidden information. But if I want to paste a plan into the report I deliver to my customer, I prefer text. A text table is fine for a Word document.

First I extract only the xml and include it within 'report' tags:

 

CaptureSQLMon008.PNG

[sorry for the code as screenshot, but the html tags in the awk script are hard to display here] 

 

From that xml, I apply the sqlmonitorText.xsl XSLT template which is found in ORACLE_HOME:

 

xsltproc $ORACLE_HOME/rdbms/xml/orarep/sqlmonitor/sqlmonitorText.xsl sqlmon12c_extracted.xml > sqlmon12c_text.html

 

And here is the result, a nice HTML table that perfectly fits in my .doc document:

 

CaptureSQLMon003.PNG

 

But there is something else that we cannot do with that. On large execution plans, I often have to bring the cursor up and down in order to go from a child operation to its parent (example: from a high 'execs' number within a nested loop I go to the rowsource that provided that number of rows). So I need text. Plain text.

No need to call the DBMS_REPORT functions for that. Linux has text web browsers. For example:

 

elinks -dump -dump-width 150 sqlmon12c_text.html | grep -v "^ |--" > sqlmon12c_text.txt

 

...which gives me something useful for notepad.exe or vim:

 

CaptureSQLMon006.PNG

 

So this is why I prefer to have the SQL Monitoring report in the active html format: I can read it immediately, and convert it to what I want later. I have a script for that. A very dirty script. I'll share it anyway. Just call it with the name of the active html report file. here it is: decode-sqlmon.zip

But if someone wants to make a clean script for that, don't hesitate to share :)

ESPC 2014: Upgrading To SharePoint 2013

Wed, 2014-05-14 23:19

I have decided to write a summary of the presentation "Upgrading SharePoint 2013" by Sam Hassani, which I followed at the European SharePoint Conference 2014. This is meant for the people who did not have the chance to attend the event, which took place from 5 to 8 May 2014 in Barcelona!

 b2ap3_thumbnail_ESPC.JPG

For more details on the Conference presentations, have a look at the agenda.

 

History of upgrades in SharePoint

Historically, there are three upgrade methods:

  • In-place upgrade: SharePoint Services are overwritten by the newest version, and the content databases are changed.
  • Gradual upgrade: Each group of site collections is upgraded. The data in the groups is copied from the original database to a new database before the data is upgraded to the new SharePoint Services. The original SharePoint Services data is maintained in the original database until it is deleted.
  • Database-attach: In this upgrade, you need to create and configure the new SharePoint farm first. Then, you need to copy the content and service application databases from the old SharePoint farm, and attach and upgrade the databases: the data is then upgraded to the new version. In a second step, site owners can upgrade the individual site collections.

The available upgrade methods depend on the SharePoint version:


SharePoint 2003 to 2007

  • Gradual upgrade
  • In-place upgrade
  • Database attach


SharePoint 2007 to 2010

  • In-place upgrade
  • Database attach


SharePoint 2010 to 2013

  • Database attach

 
As you can see, for the upgrade to SharePoint 2013, the in-place and gradual upgrade methods are no longer available: only the database-attach method remains.

 

Supported databases

These are the supported databases:

  • Content database
  • Search admin database
  • Profile database
  • Social database
  • Managed Metadata database
  • Subscription database
  • PerformancePoint database
  • Secure Store database
  • Business Data Connectivity database

 

Upgrade - general process

PPT_Workshop_template_dbi_InSite.pptx---PowerPoint_20140507-085150_1.png

Before upgrading a Site Collection

First, run Site Collection Health Checks. It will look for common known issues and inform you on the feasibility of your upgrade.

You can also create an Evaluation Site Collection to preview the site content in the latest version of SharePoint Online.

Then, you should use upgrade throttling to prevent overload from self-service site collection upgrades: if an upgrade is not possible due to throttling, it is queued. Finally, you should migrate to claims authentication before upgrading to SharePoint 2013: convert the web application to use claims, then perform the user migration.

 

Site Collection upgrade

You cannot do "Big Bang" upgrades. First, upgrade the SharePoint databases, then upgrade the site collections. These are two different steps that you have to follow.

Moreover, it is better to provide a self-service site collection upgrade capability to site collection admins. They can easily do the upgrade, moving gradually over to the 2013 experience: this is much more comprehensible for end users.

Furthermore, site collection admins can make the choice to not upgrade a site collection if it is required.

 

Unsupported with Site Collection in SharePoint 2010 mode

If site collection admins make the choice to not upgrade a site collection, it is impossible to access

  • All new SharePoint 2013 features
  • 2010 Web Analytics
  • 2010 Office Web Applications

 

Service Applications Migration Process

PrtScr-capture_67_20140507-085215_1.png

Customizations

If you want to add a solution to the farm, use Add-SPSolution.

If you want to deploy a solution to the farm, use Install-SPSolution.

Globally, you have three typical scenarios:

  • An existing solution works for both 2010 and 2013 mode sites.
  • An existing solution works for 2010 mode sites, but additional functionalities are required for 2013 mode sites.
  • An existing solution works for 2010 mode sites, but a new solution is required for 2013 mode sites.

 

Conclusion

Always ensure that the environment is fully functioning before you begin your upgrade. Cleanup and preparation are required.

You must treat pre-production testing like production. Formulate and validate an Upgrade Plan!

You should test carefully, without skipping any errors or warnings: be methodical and procedural.

One day @ Bob Ward's SQLOS Workshop

Wed, 2014-05-14 23:07

Together with David Barbarin, we had the opportunity to fly to Wroclaw (Poland) to participate for the first time in SQLDay 2014 (the annual conference of the Polish SQL Server User Group). We don't speak Polish, but we met Bob Ward, who held a very interesting level 500 workshop titled "Inside SQLOS and Memory". This pre-conference seminar was for us the opportunity to benefit from Bob Ward's expertise.

 b2ap3_thumbnail_BW4.jpg

This event was held at the football stadium in Wroclaw, where the 2012 European football championship took place.

 b2ap3_thumbnail_BW2.jpg

I will not explain or expose the entire content of the workshop but I will just give you the basics of the workshop program.

 

Inside SQLOS

After a good review of fundamentals like preemptive and non-preemptive scheduling, we started with an introduction of SQLOS.

As a reminder, SQLOS is not just an executable or a Windows Service, but an abstraction layer of SQL Server and an API with two main libraries:

  • SQLDK.dll (SQL Server Development Kit)
  • SQLOS.dll (main dll)

Of course, we have other libraries like sqllang.dll that contains the T-SQL language and the Query Processor and sqlmin.dll that contains well-known components of SQL Server such as LazyWriter, the checkpoint process, the lock monitor etc.

After a basic presentation of SQLOS, Bob Ward showed us how to perform some debugging using WinDbg.

His debugger comes from the Debugging Tools for Windows here.

After the installation, you find the executable windbg.exe in C:\Program Files (x86)\Windows Kits\8.1\Debuggers\x64.

In addition to this debugger, you must use "symbols". Symbols are used to decode memory addresses into their corresponding function names, so that we can easily make sense of the memory contents. The command line to use is:

 

windbg -y srv*g:\symbols*http://msdl.microsoft.com/download/symbols -pn sqlservr.exe

 

I will not explore this point further.

Then, after seeing the context switching (SQLOS and Windows) and how SQLOS executes threads, a good thing to know is how many worker threads are available for SQL Server tasks. The corresponding instance setting, 'max worker threads', depends on the number of logical CPUs and the platform.

By default, this value is 0, which lets SQL Server automatically choose the maximum number of worker threads.

  • For 64-bit and up to 64 logical CPUs, SQL Server calculates 512 + (16 * (number of CPUs - 4)); for example, with 16 logical CPUs this gives 512 + 16 * (16 - 4) = 704 worker threads.
  • For 64-bit and more than 64 logical CPUs, SQL Server calculates 512 + (32 * (number of CPUs - 4)).
  • To optimize max worker threads, you may want to read this msdn article.

During the break, there was a quick exchange between David and Bob, with the computer sticking out its tongue … it is of course pure coincidence.

 b2ap3_thumbnail_BW1.jpg

  Inside Memory

The next part of the workshop concerned the memory management of SQLOS. As you may know, there are some fundamental changes in SQL Server 2012. First, Bob Ward explained that the single-page and multi-page allocator concepts do not exist anymore.

Now, we have virtual allocators, workspaces, a fragment manager, block allocators and the buffer pool (this last one does not act as an allocator by itself). These fundamental changes provide a more consistent model for all memory management. Max server memory now really means max server memory and no longer concerns only the buffer pool!

Then, Bob Ward gave us a good overview of the memory nodes with NUMA architecture. There are some improvements with SQL Server 2012 (affinity, mapping of SQL Server memory numa nodes directly to the hardware NUMA nodes, etc.)

Unfortunately, after this chapter, it was time to go back home. We did not have the opportunity to see the last part of the workshop, but we left with a lot of good things in mind!

 b2ap3_thumbnail_BW3.jpg

Finally, I would like to thank our Polish colleagues for this event and Bob for this wonderful workshop day.

I hope that in the future, we can organize the same event in Switzerland or in France with the participation of Bob Ward.

Exploring Oracle SE & EE performance statistics with Orachrome Lighty

Mon, 2014-05-12 04:58

At dbi services, we really like Lighty for Oracle, a tool which helps explore performance statistics, whether they originate from AWR (available only in Enterprise Edition with the Diagnostic Pack option) or from Statspack (available in Standard Edition). As a matter of fact, we like it so much that we have become an Orachrome partner.

If you want to try it out for yourselves, there is a free trial here. If you have Statspack installed (see spdoc.txt in ORACLE_HOME/rdbms/admin to install it), you can immediately use Lighty.
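As a reminder - just a hedged sketch, the full procedure is described in spdoc.txt - the Statspack installation essentially comes down to running spcreate.sql as SYSDBA:

SQL> connect / as sysdba
SQL> @?/rdbms/admin/spcreate.sql
-- the script prompts for the PERFSTAT password and for its default and temporary tablespaces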

Let me show you how to install it. First, you have to define a connection (shortcut: Ctrl-A):

 

CaptureLighty001.PNG 

 

It is a jdbc connection without the need to install anything else. You just need the listener host, port, and service:

 

CaptureLighty002.PNG

 

As you are on Oracle Standard Edition, you have neither the Diagnostic Pack nor the Tuning Pack.

But you have installed Statspack, and you will install L-ASH very soon:

 

CaptureLighty003.PNG

 

Now, you can install L-ASH as documented here. But the goal of this post is to show what you can get immediately. Go to the menu: Reports > DB Statistics > Past Activity (Statspack)

You can select a day and see the snapshots. The color bar immediately shows when the activity was high:

 

CaptureLighty004.PNG

 

Then you can choose the graphical equivalent for my preferred Statspack sections:

Top Events:

 

CaptureLighty005.PNG

 

Top Segments (change the Statspack level to 7 for that):

 

CaptureLighty007.PNG

 

And of course Top SQL Statements:

 

CaptureLighty006.PNG

 

For each statement (click on its sql_id) you will have everything: SQL text, execution plan, statistics.

If you like the tool, it's time to go further and install L-ASH so that you can mine from the Activity Viewer tab, which is the most valuable feature of Lighty for Oracle.

Implementing policy-managed Oracle databases

Thu, 2014-05-08 18:43

Policy-managed databases appeared with Oracle 11g R2 but are not very common yet. I never had the opportunity to administer this kind of database. Since Oracle now presents policy-managed databases as the default installation mode and as a best practice, I wanted to know more about this feature, which fully automates the placement of databases within an Oracle cluster. In this post, I will describe how to implement and handle policy-managed databases, and in a future post I will detail the new Oracle 12c improvements.

 

Understanding the server-pool concept

The first and perhaps most important concept to understand is the server pool. A server pool is a logical group of servers within the cluster, sharing resources for databases and/or services. The following schema shows two server pools, with two nodes each.

 

rac_pmdb

 

A node can be part of several server pools, or be attached to a single named server pool, depending on the configuration. However, a node can be active in only one server pool at a time.

With server pools, the database instance is no longer assigned to a named server but to a server pool, and the cluster manages the instance placement across the nodes.

The scalability and availability of applications (databases, services...) within a server pool can be dynamically defined using the MIN_SIZE, MAX_SIZE and IMPORTANCE parameters. These parameters are defined at the server pool level:

  • MIN_SIZE represents the minimum number of nodes which should compose the server pool. If the number of nodes falls below the MIN_SIZE value, Oracle moves servers from other server pools - if available - to satisfy this parameter.
  • MAX_SIZE represents the maximum number of nodes available in a server pool. MAX_SIZE can be considered as a target rather than a guarantee: Oracle will grow a pool towards it only once the MIN_SIZE of every server pool is reached.
  • IMPORTANCE controls the availability of an application. Its value can range from 0 (not important) to 1000. The server pool with the highest importance value has resources allocated in priority: newly available nodes are assigned to the most important server pool first, and nodes can be de-assigned from a less important server pool to satisfy the MIN_SIZE of the most important one.

As you may expect, the IMPORTANCE parameter is used to guarantee the availability of business-critical services, to the detriment of less important applications, such as development.

 

Creating a server-pool

In this example, I will use a cluster composed of four identical nodes. Two are up and running; on the two remaining nodes, the cluster services are not running at this time.

First, I will create two server pools: one for production databases and one for development/test databases. As a first example, I set the MIN and MAX values to 1 and the IMPORTANCE value to 0, the minimum. I use the srvctl tool as the oracle user.

It is possible to create a server pool with or without specifying a list of nodes. If you specify one or several nodes as part of a server pool, only these listed nodes will be able to migrate to this server pool. These are the "candidate server names", which implicitly act as an exclusion list for all other nodes. If you do not specify any node name, all nodes can migrate to the server pool: every node is a "candidate" for it. You will see later that the "Candidate server names" field shows the nodes which have been explicitly assigned to a server pool.

Here I will not specify candidate servers, so all nodes within my cluster can migrate to any server pools.

 

[oracle@node01 ~]$ srvctl add srvpool -g srvtest -l 1 -u 1 -i 0
[oracle@node01 ~]$ srvctl add srvpool -g srvprod -l 1 -u 1 -i 0

 

This is the list of options used:

  • -g: name of the server pool
  • -l: minimum number of nodes
  • -u: maximum number of nodes
  • -i: importance value

 

We can get the status of the different server pools with the following command:

 

[oracle@node01 ~]$ srvctl status srvpool -detail
Server pool name: Free
Active servers count: 0
Active server names:
Server pool name: Generic
Active servers count: 0
Active server names:
Server pool name: srvprod
Active servers count: 1
Active server names: node02
NAME=node02 STATE=ONLINE
Server pool name: srvtest
Active servers count: 1
Active server names: node01
NAME=node01 STATE=ONLINE

 

As we can see, there are four server-pools: free, generic, srvtest and srvprod

  • The free server-pool contains unassigned servers. It is empty since the two available servers are already used in srvprod and srvtest pools.
  • The generic server pool contains servers running admin-managed databases. It is empty because there is no admin-managed database running in this cluster.
  • The srvprod and srvtest pools have been created above, and contain one node each due to the min/max value set to 1.

 

Assigning a database to a server-pool

We can assign a new database to a server-pool with dbca. In the following example, I have created the DB1 database in the existing srvprod server-pool. Note that policy managed is the default choice proposed starting with Oracle 12c.

 

01_policy_managed_dbca

 

02_policy_managed_dbca

 

Once the database is created, we can see that it has started on available nodes of the srvprod server-pool:

 

[oracle@node01 ~]$ srvctl status database -d DB1
Instance DB1_1 is running on node node01

 

As for an admin-managed database, the instance name gets a numeric suffix, separated from the database name by an underscore: here DB1 gets the instance name DB1_1.
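If you want to check from SQL*Plus which node each running instance is placed on, a simple query against GV$INSTANCE is enough (a minimal sketch, to be run from any instance of the cluster):

SQL> select inst_id, instance_name, host_name, status
  2  from gv$instance
  3  order by inst_id;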

 

I have created a second database DB2 in the srvtest server-pool.

 

[oracle@node01 ~]$ srvctl status database -d DB2
Instance DB2_1 is running on node node02

 

Impact of server-pool parameters

Let's study the impact of MIN_SIZE, MAX_SIZE and IMPORTANCE parameters. In this example, I will first change the value of these parameters for the srvprod server pool, and then I will start cluster services on the two remaining nodes.

 

The srvprod pool will have a MIN_SIZE of 2 and a MAX_SIZE of 3:

[oracle@node01 ~]$ srvctl modify srvpool -min 2 -max 3 -serverpool srvprod

 

We have now the following configuration:

 

[oracle@node01 ~]$ srvctl config srvpool
...
Server pool name: srvprod
Importance: 0, Min: 2, Max: 3
Category: hub
Candidate server names:
Server pool name: srvtest
Importance: 0, Min: 1, Max: 1
Category: hub
Candidate server names:

 

("Candidate server names" is empty since I have not defined any server as part of a server pool.)

 

Now I start cluster services on node 03 and node 04.

 

[root@node03 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

 

[root@node04 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

 

After a few minutes, I see that srvprod now has 3 nodes, satisfying the MAX_SIZE parameter. srvtest still has 1 node, which conforms to the MIN_SIZE and MAX_SIZE defined:

 

[oracle@node01 ~]$ srvctl status srvpool
...
Server pool name: srvprod
Active servers count: 3
Active server names: node01,node03,node04
NAME=node01 STATE=ONLINE
NAME=node03 STATE=ONLINE
NAME=node04 STATE=ONLINE
Server pool name: srvtest
Active servers count: 1
Active server names: node02
NAME=node02 STATE=ONLINE

 

The test database DB2 is running on node 02:

 

[oracle@node01 ~]$ srvctl status database -d DB2
Instance DB2_1 is running on node node02

 

Example 1: MIN_SIZE and MAX_SIZE

Let's see what happens if I stop cluster services on node 02 in the srvtest pool, with DB2 instance running.

 

[root@node02 bin]# date
Mon Apr 28 11:49:09 CEST 2014
[root@node02 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node02'
CRS-2673: Attempting to stop 'ora.crsd' on 'node02'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'node02'
CRS-2673: Attempting to stop 'ora.db2.db' on 'node02'
...

 

To see what is going on, I consult the crsd.log file on the master node of the cluster.

Tip

To identify the master node within a cluster, look at the "OCR MASTER" string value with a grep command on the crsd.log on any node. You will get "I AM THE NEW OCR MASTER" on the primary node, or "NEW OCR MASTER IS N" on secondary nodes, where N is the number of the node returned by 'olsnodes -n' command:

 

[grid@node01 ~]$ cat /u00/app/12.1.0/grid/log/node01/crsd/crsd.log | grep -i 'ocr master'
...
...
2014-04-28 04:53:16.388: [  OCRMAS][2831439616]th_master:12: I AM THE NEW OCR MASTER at incar 1. Node Number 1

 

[grid@node01 ~]$ olsnodes -n
node01  1
node02  2
node03  3
node04  4

 

The crsd.log file in /u00/app/12.1.0/grid/log/node01/crsd/crsd.log shows following events:

 

2014-04-28 11:50:50.878: [   CRSPE][2296276736] {1:4797:769} Pool ora.srvtest [min:1][max:1][importance:0] NO SERVERS ASSIGNED cannot be a donor  for pool ora.srvtest [min:1][max:1][importance:0] NO SERVERS ASSIGNED
2014-04-28 11:50:50.878: [   CRSPE][2296276736] {1:4797:769}Server [node04] has been un-assigned from the server pool: ora.srvprod
2014-04-28 11:50:50.880: [   CRSPE][2296276736] {1:4797:769}Server [node04] has been assigned to the server pool: ora.srvtest
...
2014-04-28 11:50:50.881: [   CRSPE][2296276736] {1:4797:769} Server [node04] has changed state from [ONLINE] to [RECONFIGURING]
...
2014-04-28 11:50:50.882: [   CRSPE][2296276736] {1:4797:769} CRS-2673: Attempting to stop 'ora.db1.db' on 'node04'
...
2014-04-28 11:51:15.708: [   CRSPE][2296276736] {1:4797:769} CRS-2677: Stop of 'ora.db1.db' on 'node04' succeeded
2014-04-28 11:51:15.708: [   CRSPE][2296276736] {1:4797:769} Updated state details for server node04 from [STOPPING RESOURCES] to : []
2014-04-28 11:51:15.708: [   CRSPE][2296276736] {1:4797:769} Server [node04] has changed state from [RECONFIGURING] to [ONLINE]
...
2014-04-28 11:51:15.718: [   CRSPE][2296276736] {1:4797:769} CRS-2672: Attempting to start 'ora.db2.db' on 'node04'
...
2014-04-28 11:51:54.749: [   CRSPE][2296276736] {1:4797:769} RI [ora.db2.db 1 1] new internal state: [STABLE] old value: [STARTING]
2014-04-28 11:51:54.749: [   CRSPE][2296276736] {1:4797:769} RI [ora.db2.db 1 1] new external state [ONLINE] old value: [OFFLINE] on node04 label = [Open]
2014-04-28 11:51:54.749: [   CRSPE][2296276736] {1:4797:769} Set LAST_SERVER to node04 for [ora.db2.db 1 1]
2014-04-28 11:51:54.750: [   CRSPE][2296276736] {1:4797:769} Set State Details to [Open] from [ Instance Shutdown] for [ora.db2.db 1 1]
2014-04-28 11:51:54.750: [   CRSPE][2296276736] {1:4797:769} CRS-2676: Start of 'ora.db2.db' on 'node04' succeeded

 

Node 04 has been de-assigned from the srvprod server pool and assigned to the srvtest server pool. We can see that the DB1 instance running on node04 has been shut down, and then the DB2 instance has started. We can confirm that with srvctl commands:

 

[oracle@node01 crsd]$ srvctl status database -d DB2
Instance DB2_1 is running on node node04

 

[oracle@node01 ~]$ srvctl status srvpool
...
Server pool name: srvprod
Active servers count: 2
Active server names: node01,node03
NAME=node01 STATE=ONLINE
NAME=node03 STATE=ONLINE
Server pool name: srvtest
Active servers count: 1
Active server names: node04
NAME=node04 STATE=ONLINE

 

What can we infer from that? Oracle has taken one node from the srvprod server pool because there were three servers in this pool (MAX_SIZE), while the mandatory number of servers (MIN_SIZE) is 2. The MIN_SIZE value remains satisfied for both the srvprod and srvtest pools, and both databases DB1 and DB2 are accessible.

 

Example 2: IMPORTANCE parameter

To demonstrate how this parameter works, I will edit properties for the srvprod server pool and set an IMPORTANCE of 1. I will then stop cluster services on node 03, which is part of srvprod server pool.

 

First, I set an IMPORTANCE value of 1 for the srvprod server pool:

 

[oracle@node01 ~]$ srvctl modify srvpool -importance 1 -serverpool srvprod
[oracle@node01 ~]$ srvctl config srvpool -serverpool srvprod
Server pool name: srvprod
Importance: 1, Min: 2, Max: 3
Category: hub
Candidate server names:

 

Now, I stop the cluster services on node 03. The number of remaining nodes in the srvprod server pool will fall below the MIN_SIZE value, with only 1 server remaining.

 

[root@node03 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node03'
CRS-2673: Attempting to stop 'ora.crsd' on 'node03'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'node03'
...
CRS-4133: Oracle High Availability Services has been stopped.

 

In crsd.log file, that is what we can see:

 

2014-04-28 06:27:29.222: [   CRSPE][2296276736] {1:4797:910}Server [node04] has been un-assigned from the server pool: ora.srvtest
2014-04-28 06:27:29.223: [   CRSPE][2296276736] {1:4797:910}Server [node04] has been assigned to the server pool: ora.srvprod
...
2014-04-28 06:27:29.224: [   CRSPE][2296276736] {1:4797:910} Created alert : (:CRSPE00165:) :  Server Pool ora.srvtest has fallen below minimum size
...
2014-04-28 06:27:29.224: [   CRSPE][2296276736] {1:4797:910} Server [node04] has changed state from [ONLINE] to [RECONFIGURING]
2014-04-28 06:27:29.224: [   CRSPE][2296276736] {1:4797:910} Evaluating stop of [ora.db2.db] on [node04] with parameters: Options:0x1;Identity: root;Reason:system;
...
2014-04-28 06:27:29.226: [   CRSPE][2296276736] {1:4797:910} CRS-2673: Attempting to stop 'ora.db2.db' on 'node04'
...
2014-04-28 06:28:00.352: [   CRSPE][2296276736] {1:4797:910} CRS-2677: Stop of 'ora.db2.db' on 'node04' succeeded
2014-04-28 06:28:00.352: [   CRSPE][2296276736] {1:4797:910} Updated state details for server node04 from [STOPPING RESOURCES] to : []
2014-04-28 06:28:00.352: [   CRSPE][2296276736] {1:4797:910} Server [node04] has changed state from [RECONFIGURING] to [ONLINE]
...
2014-04-28 06:28:00.360: [   CRSPE][2296276736] {1:4797:910} CRS-2672: Attempting to start 'ora.db1.db' on 'node04'
...
2014-04-28 06:28:25.563: [   CRSPE][2296276736] {1:4797:910} CRS-2676: Start of 'ora.db1.db' on 'node04' succeeded

 

Oracle has moved node 04 from srvtest server pool to srvprod server pool:

 

[oracle@node01 ~]$ srvctl status srvpool
...
Server pool name: srvprod
Active servers count: 2
Active server names: node01,node04
NAME=node01 STATE=ONLINE
NAME=node04 STATE=ONLINE
Server pool name: srvtest
Active servers count: 0
Active server names:

 

[oracle@node01 ~]$ srvctl status database -d DB1
Instance DB1_1 is running on node node01
Instance DB1_2 is running on node node04

 

[oracle@node01 ~]$ srvctl status database -d DB2
Database is not running.

 

As we can see, there is no server available in the srvtest server pool. No DB2 instance is running and the application is no longer accessible. However, the MIN_SIZE is still satisfied for the srvprod server pool: two nodes are running and two database instances are accessible.

Because the srvprod IMPORTANCE value is greater than the srvtest one, Oracle has preferred to keep the production service available, even if it means that the test pool ends up with no server at all...

 

Conclusion

With these two examples, I have shown how the MIN_SIZE, MAX_SIZE and IMPORTANCE parameters can impact application availability within a policy-managed database cluster. It shows how important it is to define these parameters according to your business rules, because they really help to keep resources available for business-critical services.

In a future post, I will show how we can use these parameters to dynamically adapt the available resources to the workload of an application.

Improving your SharePoint performance using SQL Server settings

Thu, 2014-05-08 00:26

SharePoint performance is a recurring problem and preoccupation. As Database Administrators, we have to deal with SharePoint when configuring SQL Server databases.

In this article, I will propose a list of SQL Server settings best practices aimed at reducing SharePoint performance issues.

This article is inspired by the presentation Optimizing SQL Server for Speedy SharePoint given by Sahil Malik at the European SharePoint Conference 2014 in Barcelona.

SQL Server 2014: SELECT INTO operators can run in parallel

Mon, 2014-05-05 18:34

A couple of weeks ago, I had to provide some architecture best practices for a "heavy writes" environment that consists of importing data from different sources into SQL Server tables. At the customer site, I was asked about interesting new SQL Server features that could be used to speed up the data import process. Indeed, SQL Server 2014 provides an interesting enhancement of the SELECT INTO statement, which is often used in ETL environments: the insert operations of a SELECT INTO command can now run in parallel.

First, let me say that for versions prior to SQL Server 2014, inserting data in parallel into a single table is already possible by using SSIS packages (for instance with several data import tasks running in parallel), but the destination table must exist beforehand. What counts here is the creation of the destination table on the fly.
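For completeness, here is a minimal sketch of that pre-2014 pattern (my own illustration, reusing the table names of this post): the destination is created first, then loaded with a plain INSERT ... SELECT; in such a plan only the SELECT side can run in parallel, the Insert operator itself stays serial.

-- create an empty destination table with the same structure
SELECT TOP (0) *
INTO [AdventureWorks2012_DW].dbo.bigTransactionHistory2
FROM AdventureWorks2012.dbo.bigTransactionHistory;

-- load it; the Insert operator is serialized here
INSERT INTO [AdventureWorks2012_DW].dbo.bigTransactionHistory2
SELECT *
FROM AdventureWorks2012.dbo.bigTransactionHistory;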

After the release of SQL Server 2014 RTM I decided to test this new enhancement on my lab environment. The laptop I used for my tests is a Lenovo T530 with the following specifications:

  • A quad-core Intel Core i7-3630QM CPU @ 2.40 GHz, with hyper-threading enabled
  • 16 GB RAM (max memory = 12GB for SQL Server)
  • Windows 8 Pro x64

In addition I used two databases to simulate my data import workload process:

  • An AdventureWorks2012 database and a bigTransactionHistory table with a size of approximately 6 GB and 156317184 data rows.
  • An empty AdventureWorks2012_DW database, into which we will import data from the bigTransactionHistory table of the AdventureWorks2012 database
  • The simple recovery model will be used for both databases during the tests
  • Each database file is distributed as follows:

 

database_name          file_id  type_desc   logical_name                    Drive  size_MB
AdventureWorks2012_DW  1        ROWS        AdventureWorks2012_DW           C:     6252
AdventureWorks2012_DW  2        LOG         AdventureWorks2012_DW_log       D:     8000
AdventureWorks2012_DW  65537    FILESTREAM  AdventureWorks2012_DW_xtp_ckpt  C:     0
AdventureWorks2012     1        ROWS        AdventureWorks2012_Data         C:     6413
AdventureWorks2012     2        LOG         AdventureWorks2012_Log          I:     8000

 

The C: drive is a solid-state Samsung 840 EVO drive with SATA 600. I decided to place both database mdf files on this disk because solid-state drives perform well for random IO. Unfortunately, I don’t have a second one to isolate each mdf database file, but in my context, it will be sufficient.

The D: drive is a mechanical 7200 RPM disk with SATA 600. The AdventureWorks2012_DW transaction log file is placed here.

Finally, the I: drive is a USB drive spinning at only 5200 RPM. It hosts the AdventureWorks2012 source database. In my context, the rotational speed of the I: drive is not important, because during the test we will only read the bigTransactionHistory table.

The hardware description of the environment is important here, because the performance gain depends on many factors like the number of available processors, the amount of memory, the performance of the overall storage, the location of the database files, etc. For example, I did not notice any positive performance impact during my first tests because my mdf database files were located on slow disks, which were the main bottleneck regardless of whether the SELECT INTO command ran in parallel or not.

So let's talk about the test. First, I populated the bigTransactionHistory table based on Adam Machanic's script, in order to have a sufficient amount of data for a good comparison.

 

blog_7_-_bigTransactionHistoryTable_configuration

 

Then, for each test, I used the following script to bulk insert data into a bigTransactionHistory2 table in the AdventureWorks2012_DW database:

 

blog_7_-_bigTransactionHistoryTable_to_bigTransactionHistory2

 

select *
into [AdventureWorks2012_DW].dbo.bigTransactionHistory2
from dbo.bigTransactionHistory
option (maxdop 4)

 

For the first test, I changed the compatibility level of the AdventureWorks2012 database to 110 in order to disable the parallel SELECT INTO capability before launching the data import script.

 

ALTER DATABASE AdventureWorks2012 SET COMPATIBILITY_LEVEL = 110;
GO

select *
into [AdventureWorks2012_DW].dbo.bigTransactionHistory2
from dbo.bigTransactionHistory
option (maxdop 4)

 

For the second test, I changed the compatibility level of the same database to 120 in order to enable the new parallel capability of the SELECT INTO statement. There is no need to clear the procedure cache here, because changing the compatibility level invalidates the cached plans automatically.

 

ALTER DATABASE AdventureWorks2012 SET COMPATIBILITY_LEVEL = 120;
GO

DROP TABLE [AdventureWorks2012_DW].dbo.bigTransactionHistory2;
GO

select *
into [AdventureWorks2012_DW].dbo.bigTransactionHistory2
from dbo.bigTransactionHistory
option (maxdop 4)

 

Here are the different results that I noticed:

 

blog_7_-_tableresulttest

 

First, I noticed a significant reduction of the execution time between the two tests (60%). Nevertheless, as expected, the total worker time in the second test is greater than in the first test (43%), but this CPU overhead is minimal compared to the overall performance gain of the SELECT INTO statement.

I also noticed a higher use of the C: disk bandwidth during the second test:

  • Test 1 (SELECT INTO serialized - max throughput around 100 MB/s)


blog_7_-_disk_ssd_througput_nonparallel


  • Test 2 (SELECT INTO in parallel - max throughput around 250 MB/s)


blog_7_-_disk_ssd_througput_parallel


As I said above, disk performance is important here. Indeed, we are trying to increase the speed of the insert operation, which implies increasing the disk throughput. If we already reach the maximum bandwidth of the disk with a serialized SELECT INTO, performing the data import with a parallelized SELECT INTO will not guarantee better performance.
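As a side note (my own check, not part of the original test), the cumulative write volume per database file can be compared between the two runs with sys.dm_io_virtual_file_stats; a minimal sketch for the destination database:

-- cumulative I/O per file of AdventureWorks2012_DW since instance startup
SELECT
    mf.physical_name,
    vfs.num_of_writes,
    vfs.num_of_bytes_written / 1024 / 1024 AS written_mb,
    vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(DB_ID('AdventureWorks2012_DW'), NULL) AS vfs
    JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id
       AND mf.file_id = vfs.file_id;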

So, using the second test, I then tried to confirm that the INSERT operator runs in parallel, just for fun. We can confirm the parallel INSERT operation by viewing the corresponding execution plan:

  • Test 1 (SELECT INTO serialized)


blog_7_-_bigTransactionHistory2_plan_nonparallel


By using sys.dm_os_tasks, sys.dm_os_threads and sys.dm_exec_requests, we can get a good overview of the threads dedicated to the statement and of the schedulers they run on:

select
    ta.scheduler_id,
    wo.os_thread_id,
    ta.pending_io_count,
    ta.pending_io_byte_average,
    ta.task_state
from sys.dm_os_tasks as ta
    join sys.dm_os_threads as wo
        on wo.worker_address = ta.worker_address
where ta.session_id = (select session_id
                       from sys.dm_exec_requests as r
                            cross apply sys.dm_exec_sql_text(r.sql_handle) as t
                       where t.text like '%bigTransactionHistory2%'
                         and t.text not like '%sys.dm_exec_query_stats%')


blog_7_-_scheduler_task_nonparallel


As expected, one thread was used for the first test on scheduler ID 3.

  • Test 2 (SELECT INTO in parallel)


blog_7_-_bigTransactionHistory2_plan_parallel

You will notice that all included Table Inserts are running in parallel.


blog_7_-_scheduler_task_parallel


Several threads, each mapped to a scheduler, are used by the statement during the data import.

In this blog post, I tried to demonstrate that using SELECT INTO in parallel can improve insert performance. In my opinion, this enhancement is an interesting addition for many SQL Server scenarios. I'm very excited to test it in the real world with customer workloads. Please feel free to share your thoughts or your experience with me.

Oracle: ROWNUM vs ROW_NUMBER() and 12c fetch first

Mon, 2014-05-05 01:24

Prior to Oracle 12c, there were two ways to do 'top-n' queries:

  • use rownum after sorting rows with "order by"
  • use row_number() over (order by)

Top-n queries are usually required for result pagination. The application shows the first 10 rows, then the next 10 on the following page, etc. In these queries, user interactions are often stateless, so we cannot just open a cursor and fetch 10 rows on each user request.
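Here is a minimal sketch of the three syntaxes mentioned above (my own example, assuming a hypothetical emp table sorted by sal):

-- 1) ROWNUM: sort in an inline view first, then filter
select *
from (select * from emp order by sal desc)
where rownum <= 10;

-- 2) ROW_NUMBER(): needed as soon as you want a page other than the first one (rows 11-20)
select *
from (select e.*, row_number() over (order by sal desc) rn from emp e)
where rn between 11 and 20;

-- 3) Oracle 12c fetch first syntax: rows 11 to 20 directly
select *
from emp
order by sal desc
offset 10 rows fetch next 10 rows only;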

Oracle 12c CDB - metadata & object links internals

Wed, 2014-04-30 05:38

Warning: this is only geek stuff about internals on multitenant database dictionary, metadata, and object links. It has nothing to do with the operations that you can do on your database. Don't try that in production or you can corrupt the whole dictionary.

In a 12c multitenant database (aka CDB), we know that each pluggable database is isolated in order to act as a standalone database. But they share some common resources, because that's the main reason to consolidate into one CDB. No problem for CPU, memory, redo and undo resources: they are managed by the instance, which exists at CDB level only. And most of the time the container can be identified from the service, from the tablespace, etc. It's not difficult for data either, because each PDB has its own set of tablespaces and the pluggable feature is just an extension of transportable tablespaces.

What was probably more challenging for Oracle 12c architects is the way to implement the sharing of the dictionary.

First, each PDB has its own metadata describing its own data. But the metadata for the dictionary itself must be shared. For example, all the dbms_xxx packages are stored only in the CDB$ROOT container, and the PDBs have metadata links for them.

In addition to that, some dictionary data must be shared as well, such as some reference tables (AUDIT_ACTIONS) or common repositories (AWR data exposed as DBA_HIST_ views). They are stored only in the CDB$ROOT container, and each PDB defines a view which is just an object link to them.

Finally, the CDB$ROOT container must be able to query data from all PDBs, for example using the new CDB_ views. They are exposed as container data objects, but they actually query data that is stored in each PDB.

But that sounds a bit like magic, doesn't it? The documentation doesn't go very far on how it's implemented internally. Fortunately, the Oracle scripts in ?/rdbms/admin give a few clues. They show how the SQL syntax is extended when running under "_ORACLE_SCRIPT"=true.

So, geek stuff is coming now. Let's try to create metadata and object links ourselves...

The following is done after setting "_ORACLE_SCRIPT"=true in our session.
You will see these new syntax elements: cdb$view(), sharing=metadata, sharing=object, common_data

 

Container data objects

First, let's see how the root can view data from other containers.

I am in the root container:


SQL> alter session set container=cdb$root;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT

SQL> show con_id
CON_ID
------------------------------
1

 

and create a regular table:

 

SQL> create table DEMO_REG_TABLE sharing=none as select 111 dummy from dual;
Table created.

SQL> select * from DEMO_REG_TABLE;
     DUMMY
----------
       111

 

Then I do exactly the same (but different data) in a PDB

 

SQL> alter session set container=pdb1;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
PDB1

SQL> show con_id
CON_ID
------------------------------
3

SQL> create table DEMO_REG_TABLE sharing=none as select 999 dummy from dual;
Table created.

SQL> select * from DEMO_REG_TABLE;
     DUMMY
----------
       999

 

Now back in the root container, I'll use the CDB$VIEW table function in order to see data from all PDB:

 

SQL> select * from  cdb$view(DEMO_REG_TABLE) where con_id in (1,3);
     DUMMY     CON_ID
---------- ----------
       999          3
       111          1

 

This is how container objects are defined. They use CDB$VIEW to run a query in each PDB, consolidate the result and add a CON_ID column to show where the data comes from.

You want to know how it is implemented? It seems that it is doing a parallel query on each PDB.
Here is the proof. Previously I used 'con_id in (1,3)' because I have not created my table in all PDBs.

 

SQL> select * from  cdb$view(DEMO_REG_TABLE);
select * from  cdb$view(DEMO_REG_TABLE)
*
ERROR at line 1:
ORA-12801: error signaled in parallel query server P002
ORA-00942: table or view does not exist

 

When the table is not found in a PDB the error comes from a parallel process.

 

Metadata links

Now I will create a function in both the CDB$ROOT container and in a PDB. But I don't want to have the code stored twice. I'll use SHARING=METADATA to define a metadata link.

 

SQL> alter session set container=cdb$root;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT

SQL> show con_id

CON_ID
------------------------------
1

SQL> create function DEMO_MDL_FUNCTION sharing=metadata
  2  return varchar2 as dummy varchar2(100); begin select max(dummy) into dummy from DEMO_REG_TABLE; return dummy; end;
  3  /
Function created.

SQL> select DEMO_MDL_FUNCTION from dual;
DEMO_MDL_FUNCTION
------------------------------
111

 

This is my function in CDB$ROOT, showing content from my regular table in CDB$ROOT.
Now doing exactly the same in a PDB:

 

SQL> alter session set container=pdb1;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
PDB1

SQL> show con_id
CON_ID
------------------------------
3

SQL> create function DEMO_MDL_FUNCTION sharing=metadata
  2  return varchar2 as dummy varchar2(100); begin select max(dummy) into dummy from DEMO_REG_TABLE; return dummy; end;
  3  /
Function created.

SQL> select DEMO_MDL_FUNCTION from dual;
DEMO_MDL_FUNCTION
------------------------------
999

 

Here it is. I have the same function in my PDB, showing content from the regular table in PDB.

I can see the metadata for my function from the SYS.SOURCE$ dictionary table.
Here in CDB$ROOT:

 

SQL> alter session set container=cdb$root;
Session altered.

SQL> select * from source$ where obj# in (select obj# from obj$ where name like 'DEMO%');

      OBJ#       LINE SOURCE
---------- ---------- ------------------------------
     95789          1 function DEMO_MDL_FUNCTION

 

but let's see what I have in my PDB:

 

SQL> alter session set container=pdb1;
Session altered.

SQL> select * from source$ where obj# in (select obj# from obj$ where name like 'DEMO%');

no rows selected

 

Nothing is stored in the PDB. Only the information (in OBJ$) that the object is a metadata link.

But there is another magic if I query the DBA_SOURCE view:

 

SQL> select * from dba_source where name like 'DEMO%';

OWNER NAME              TYPE      LINE TEXT                        ORIGIN_CON_ID
----- ----------------- --------- ---- --------------------------- -------------
SYS   DEMO_MDL_FUNCTION FUNCTION     1 function DEMO_MDL_FUNCTION              1

 

DBA_SOURCE shows information from the CDB$ROOT, following the metadata links, and adds an ORIGIN_CON_ID column to show whether the row comes from the PDB dictionary or from the CDB$ROOT dictionary. I will detail that later by creating a 'common data view'.

 

Object links

We have seen how the CDB$ROOT can store metadata for all the PDBs. We will use that to create a table with a metadata link. But in addition to that, we will create an object link so that the table in CDB$ROOT stores the data for all PDBs. I'll use SHARING=METADATA to create the table and SHARING=OBJECT when creating the view.

First I create the table in both containers:

 

SQL> alter session set container=cdb$root;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT

SQL> show con_id
CON_ID
------------------------------
1

SQL> create table DEMO_MDL_TABLE sharing=metadata as select 111 dummy from dual;
Table created.

SQL> alter session set container=pdb1;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
PDB1

SQL> show con_id

CON_ID
------------------------------
3

SQL> create table DEMO_MDL_TABLE sharing=metadata as select 999 dummy  from dual;
Table created.

 

So the table is created in both containers. I've inserted different data in order to understand what happens. Let's use CDB$VIEW to show the data from each container:

 

SQL> alter session set container=cdb$root;
Session altered.

SQL> select * from  cdb$view(DEMO_MDL_TABLE) where con_id in (1,3);

     DUMMY     CON_ID
---------- ----------
       999          3
       111          1

 

I have two tables with the same structure (because it is a metadata link). The CDB$ROOT one contains '111' and the PDB one contains '999'.

 

I'll create a view on it, defining it as an object link so that the data is shared:

 

SQL> alter session set container=cdb$root;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT

SQL> show con_id
CON_ID
------------------------------
1

SQL> create view DEMO_OBL_VIEW sharing=object as select * from DEMO_MDL_TABLE;
View created.

SQL> select * from DEMO_OBL_VIEW;
     DUMMY
----------
       111

 

The view in CDB$ROOT shows data from the table in CDB$ROOT. Now let's do the same in a PDB:

 

SQL> alter session set container=pdb1;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
PDB1

SQL> show con_id
CON_ID
------------------------------
3

SQL> create view DEMO_OBL_VIEW sharing=object as select * from DEMO_MDL_TABLE;
View created.

SQL> select * from DEMO_OBL_VIEW;
     DUMMY
----------
       111

 

The view in the PDB shows data from the table in CDB$ROOT. The query followed the object link instead of accessing the current container table.

How is data stored in that container table? It is stored from the root container. Think about AWR, which runs at CDB level and stores its data in WRM$ tables. Then each PDB can query them with the DBA_HIST_ views. But just in case you want to try, you cannot insert through an object link:

 

SQL> insert into DEMO_OBL_VIEW select 9999 dummy from dual;
insert into DEMO_OBL_VIEW select 9999 dummy from dual
            *
ERROR at line 1:
ORA-02030: can only select from fixed tables/views

 

This is a clue about the implementation: object links are accessed through fixed tables. And if you explain the plan from the PDB, you will see it:

 

---------------------------------------------
| Id  | Operation        | Name             |
---------------------------------------------
|   0 | SELECT STATEMENT |                  |
|   1 |  FIXED TABLE FULL| X$OBLNK$aed0818c |
---------------------------------------------

 

Common data views

Finally, let's see how a PDB can show data coming from the CDB$ROOT. Dictionary tables such as DBA_SOURCE must show common metadata as well as PDB metadata. It is defined as a 'common data view' and I'll create one here using COMMON_DATA:

 

SQL> alter session set container=cdb$root;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT

SQL> show con_id
CON_ID
------------------------------
1

SQL> create or replace view DEMO_INT_VIEW common_data (dummy,sharing) as select dummy,case when dummy='222' then 0 else 1 end from DEMO_MDL_TABLE;
View created.

SQL> select * from DEMO_INT_VIEW;

     DUMMY    SHARING
---------- ----------
       111          1
       222          0

 

I've added a 'SHARING' column, which is required when using COMMON_DATA, in order to flag rows that are shared with other containers (1) and rows that are not (0). Here the row with '222' is private to my container and the other one ('111') can be seen by the PDBs. And as usual, I'm doing exactly the same in the PDB:

 

SQL> alter session set container=pdb1;
Session altered.

SQL> show con_name
CON_NAME
------------------------------
PDB1

SQL> show con_id
CON_ID
------------------------------
3

SQL> create or replace view DEMO_INT_VIEW common_data (dummy,sharing) as select dummy,case when dummy='222' then 0 else 1 end from DEMO_MDL_TABLE;
View created.

SQL> select * from DEMO_INT_VIEW;
     DUMMY    SHARING ORIGIN_CON_ID
---------- ---------- -------------
       999          1             3
       111          1             1

 

When in the PDB, the COMMON_DATA view shows the shared rows from the CDB$ROOT table in addition to the rows from the PDB table.
Of course, having read what is above, you expect to see a parallel process and a fixed table:

 

SQL> set autotrace on
SQL> select * from DEMO_INT_VIEW;

     DUMMY    SHARING ORIGIN_CON_ID
---------- ---------- -------------
       111          1             1
       999          1             3

Execution Plan
----------------------------------------------------------
Plan hash value: 3158883863

--------------------------------------------------------------------------------------------
|Id  | Operation               | Name            |Pstart|Pstop |   TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT        |                 |      |      |       |      |            |
|  1 |  PX COORDINATOR         |                 |      |      |       |      |            |
|  2 |   PX SEND QC (RANDOM)   | :TQ10000        |      |      | Q1,00 | P->S | QC (RAND)  |
|  3 |    PX PARTITION LIST ALL|                 |    1 |    2 | Q1,00 | PCWC |            |
|  4 |     FIXED TABLE FULL    | X$COMVW$e40eb386|      |      | Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------

 

The fixed table returns each container data as a partition, each gathered by a parallel process and returned to the query.

This is enough about multitenant dictionary internals investigation.
If you want more, have a look at ?/rdbms/admin/noncdb_to_pdb.sql which exposes all the magic that is done to transform a standalone dictionary to a linked one.
If you want more conventional information about pluggable databases, and want to manipulate them as they are designed to be, please come to our Oracle 12c New Features workshops.

SQL Server 2014: In-memory tables, Bw-Tree, and storage

Sun, 2014-04-27 21:30

SQL Server 2014 introduces hash indexes with in-memory optimized tables. I described some of their aspects in a previous blog post. These indexes are very efficient for lookup operations, but have some limitations with operations like range scans, inequality predicates or ordered scans. These limitations come from the hash index design, which stores the rows of the index in a random order. Fortunately, non-clustered indexes for in-memory optimized tables (aka range indexes or Bw-Trees) solve this issue and, like hash indexes, involve a data row chain with their own structure at the leaf level.

In this blog post I would like to share with you an interesting consideration on Bw-Tree storage. As with a traditional b-tree index structure, accessing the in-memory table data rows themselves requires going through the Bw-Tree structure first, before retrieving the corresponding row data in the chain.

At the leaf level of the Bw-Tree, we have pointers to the first concerned data row in the row chain. The pages in the index are linked together by logical pointers (called page identifiers, or PIDs).

These PIDs are then translated to the physical address of a page in memory or to a flash offset on stable media storage. The Bw-Tree layer interacts with the cache layer, which "abstracts" the physical page by using a mapping table that maps logical pages to physical pages. The other non-leaf pages in the Bw-Tree structure use the same mechanism. Here is a simplified schema for retrieving the data through the Bw-Tree structure:

  • Bw-Tree --> Mapping Table --> Physical Page

Abstracting the physical layer has some advantages. Indeed, changing the physical location of a page only requires changing the corresponding entry in the mapping table. The Bw-Tree path is not affected by this change because each page is logically linked through its own PID.

Now let’s demonstrate an interesting behavior of Bw-Tree (range indexes) with several extreme tests. For this purpose, we create an in-memory optimized table named hekaton_table with the following definition:

 

CREATE TABLE dbo.hekaton_table
(
    id INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH ( BUCKET_COUNT = 1048576 ),
    col1 CHAR(100) COLLATE French_BIN2 NOT NULL,
    INDEX idx_hekaton_table_col1 NONCLUSTERED ( col1 )
)
WITH
(
    MEMORY_OPTIMIZED = ON,
    DURABILITY = SCHEMA_AND_DATA
)

 

This table contains a hash index as a primary key on the id column and a nonclustered range index idx_hekaton_table_col1 that concerns the col1 column.
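As a side note (my own example, not from the original post), the kind of predicate each index is good at can be illustrated as follows: a range predicate on col1 can be served by the Bw-Tree index, while the hash index on id only supports equality lookups:

-- range predicate on col1: a job for the nonclustered (Bw-Tree) index
SELECT id, col1
FROM dbo.hekaton_table
WHERE col1 >= 't_1000' AND col1 < 't_2000';

-- equality lookup on id: a job for the hash index
SELECT id, col1
FROM dbo.hekaton_table
WHERE id = 42;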

We will use the DMVs sys.dm_db_xtp_nonclustered_index_stats, sys.dm_db_xtp_memory_consumers and sys.dm_db_xtp_table_memory_stats during the test.

The first DMV sys.dm_db_xtp_nonclustered_index_stats includes statistics about operations on non-clustered indexes in memory-optimized tables.

select
    o.name as table_name,
    i.name as index_name,
    nis.internal_pages,
    nis.delta_pages,
    nis.leaf_pages
from sys.dm_db_xtp_nonclustered_index_stats as nis
    join sys.objects as o
        on o.object_id = nis.object_id
    join sys.indexes as i
        on i.object_id = nis.object_id
       and i.index_id = nis.index_id
where i.name = 'idx_hekaton_table_col1';

 

billet_6_-_sys_dm_db_xtp_nonclustered_index_stats

 

The second DMV, sys.dm_db_xtp_memory_consumers, provides useful information on the memory consumers. We are interested in the Range index heap consumer in this example.

 

select
    o.name as table_name,
    i.name as index_name,
    xmc.memory_consumer_desc,
    xmc.allocated_bytes / 1024 as allocated_kb,
    xmc.used_bytes
from sys.dm_db_xtp_memory_consumers as xmc
    join sys.objects as o
        on o.object_id = xmc.object_id
    join sys.indexes as i
        on i.object_id = xmc.object_id
       and i.index_id = xmc.index_id
where i.name = 'idx_hekaton_table_col1';

 

billet_6_-_sys_dm_db_xtp_nonclustered_index_stats_2

 

The third DMV, sys.dm_db_xtp_table_memory_stats, provides information about the memory consumed by an in-memory table.

 

select
    memory_used_by_table_kb,
    memory_used_by_indexes_kb
from sys.dm_db_xtp_table_memory_stats
where [object_id] = object_id('hekaton_table')

 

For the first test, we will introduce 200K rows of data in the col1 column with a high cardinality. All rows will have a distinct value.

 

declare @i int = 1;

while @i <= 200000
begin
    insert dbo.hekaton_table values (@i, 't_' + cast(@i as varchar(10)));
    set @i += 1;
end

 

Then we take a look at the result of all DMVs:

  • sys.dm_db_xtp_nonclustered_index_stats

 

billet_6_-_sys_dm_db_xtp_nonclustered_index_stats_3

 

  • sys.dm_db_xtp_memory_consumers

 

billet_6_-_sys_dm_db_xtp_nonclustered_index_stats_4

 

  • sys.dm_db_xtp_table_memory_stats

 

billet_6_-_sys_dm_db_xtp_table_kb_7

 

In the second test, we will introduce the same number of data rows in the col1 column, but with a low cardinality: all rows will have the same value. A quick sketch of the insert loop is shown below, followed by the results of the second test:
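This script is not part of the original post; a minimal sketch, assuming the table is dropped and recreated between the two tests and that a constant value is used for col1:

declare @i int = 1;

-- same loop as in the first test, but every row gets the same col1 value
while @i <= 200000
begin
    insert dbo.hekaton_table values (@i, 'constant_value');
    set @i += 1;
end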

  • sys.dm_db_xtp_nonclustered_index_stats

 

billet_6_-_sys_dm_db_xtp_nonclustered_index_stats_5

 

  • sys.dm_db_xtp_memory_consumers

 

billet_6_-_sys_dm_db_xtp_nonclustered_index_stats_6

 

  • sys.dm_db_xtp_table_memory_stats

 

billet_6_-_sys_dm_db_xtp_table_kb_8

 

Wow! Only one leaf page with 400 bytes is used during the second test versus 5363 pages allocated with 26485848 bytes during the first test. We retrieved the same total size for the in-memory data itself (29687 KB).

What’s the matter?

In fact, there is a big difference between traditional non-clustered indexes and Bw-Tree. Unlike traditional non-clustered indexes where there is a pointer per row regardless of the uniqueness of the index key, the Bw-Tree has only one pointer per unique index value. This changes the amount of used storage.
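An easy way to relate the leaf page consumption to the key distribution (my own check, not in the original post) is simply to count the distinct values of the index key:

-- one pointer per distinct col1 value in the Bw-Tree:
-- 200,000 distinct values in the first test, a single one in the second test
SELECT COUNT(DISTINCT col1) AS distinct_col1_values,
       COUNT(*)             AS total_rows
FROM dbo.hekaton_table;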

For a table stored in memory, this is an interesting optimization.

How about the data inserted during the second test?

Rows with the same value are grouped in the same row chain as shown below:

 

billet_6_-_nonclustered_range_index

 

The use of the sys.dm_db_xtp_nonclustered_index_stats DMV also introduces new page concepts. I advise you to have a look at the Microsoft research paper on the Bw-Tree.

As shown above, there are two other page type columns:

  • The internal pages that represent the top level of the Bw-Tree (root and non-leaf pages in a classical B-Tree structure)
  • The delta pages, which are "special pages" containing an operation code (insert, delete) and a memory value which is the memory address of the first row in a chain of records

These pages are part of a special mechanism that allows the page state to be updated incrementally in a latch-free manner.

Remember: with in-memory tables, locks and latches do not exist anymore! I will soon prepare another blog post dedicated to this interesting mechanism.

AWR Formatter: a Google Chrome plug-in for AWR reports

Sun, 2014-04-27 18:56

Each of us approaches the AWR report (Oracle Automatic Workload Repository) in a very different way, with our own reading process and our own method. In this blog, I propose to discover AWR Formatter, a tool which concentrates the information in a few clicks. It is a rather nice plug-in (AWR-Format.crx) for Google Chrome that takes an AWR report and makes it more readable at first glance.

 

AWR Formatter 1.6

AWR Formatter installs and runs only on the Google Chrome browser, for both standalone and RAC Oracle instances.

It is a tool: it does not replace ADDM as such and is not designed to tune your database.

Reminder: using AWR requires the DIAGNOSTIC+TUNING Pack license.
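A minimal sketch (my own addition) of how to check which pack level an instance is allowed to use; the value should be DIAGNOSTIC+TUNING before querying AWR data:

SQL> select name, value
  2  from v$parameter
  3  where name = 'control_management_pack_access';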

Its role is to let you read a report quickly and efficiently by accessing the various results in a few clicks.

 

Features of AWR Formatter

What can you do with AWR Formatter? Here are the most important features:

  • Dynamically convert to KB / MB / GB / TB: just click the orange text to cycle through the units.
  • Dynamically convert to Hours / Minutes / Seconds: just click the orange text to cycle through the units.
  • Display the common wait events and their definitions.
  • Display an HTML table allowing the data to be browsed and sorted.
  • Display an HTML table of the Top SQL by elapsed time, reads, gets and executions, with the possibility to obtain a Query Plan, a Color SQL and an AWR SQL Report.

The following tabs are available:

  • Overview
  • Top SQL Combined
  • Top Events
  • Exadata Stats
  • Memory
  • I/O Graphs
  • Observations

 

Overview

In the Overview tab, we find the main components of the instance:

b2ap3_thumbnail_synthe.jpg

 

Top SQL

The Top SQL tab displays the SQL queries with different filters:

 filtre_20140404-081013_1.png

In this table, a simple click sorts each column in ascending or descending order.

Furthermore, it is possible to filter each column using operators.

Here, we have deliberately set a filter on the elapsed(s) column in order to see all queries with an elapsed time > 500 seconds.

This way, we can pick up the SQL_ID in order to display:

  • Its formatted SQL_TEXT
  • The SQL plan syntax, the Color SQL and the AWR SQL Report

b2ap3_thumbnail_top-sql.jpg

The various tabs allow you to retrieve the SQL statement and run it in SQL*Plus.

 

Top Events

This tab displays the waits. As before, filters can be applied to the different columns.

We find the main event sections: Top 5 Timed Foreground Events, Time Model Statistics, Foreground Wait Class, Foreground Wait Events.

b2ap3_thumbnail_wait.jpg

I deliberately skip the Exadata Stats and Memory tabs, because they simply report on the reads/writes and memory usage.

 

I/O Graphs

This graph shows the I/O of the tablespaces; it is very explicit and quick to read. Clicking on a point directly gives the name of the tablespace and the related information.

In this example, we have an Average Read Time of 10.3 milliseconds for the SYSTEM tablespace.

b2ap3_thumbnail_tbs.jpg

In this example, we have the Average Reads per second for the ORCL_PART117 tablespace (orange dots), and we can also check the Average Writes per second (blue dots).

 

Observations

This view lists recommendations and informs us about them. We can jump to the corresponding section by clicking on Jump to Section.

Some observations are flagged with red icons providing information about selected parameters:

obs1.png

 

Installing the AWR plug-in for Chrome

The easiest way is to go to Tyler Muth's blog page. In the AWR category, you will find instructions to download and install the tool: http://tylermuth.wordpress.com

Conclusion

AWR Formatter summarizes the AWR report for you, installs very easily, and does not require any license.

Oracle TNS_ADMIN issues due to bad environment settings

Mon, 2014-04-14 18:11

Recently, I faced a TNS resolution problem at a customer site. The reason was a bad environment setting: the customer called the service desk because of a DBLINK pointing to the wrong database.

The users were supposed to be redirected to a development database, but the DBLINK was redirecting them to a validation database instead. The particularity of the environment is that the development and validation databases are running on the same server, but from different Oracle homes, each home having its own tnsnames.ora. Both tnsnames.ora files contain common alias names, but pointing to different databases. Not exactly a best practice, but this is not the topic here.

The first difficulty was reproducing the case: our service desk was not able to reproduce the situation until we understood that the customer was accessing the database remotely via a development tool (through the listener), while we were connected locally on the server.

Let me present the case with my environment.

First, this is the database link concerned by the issue:

 

SQL> select * from dba_db_links;
OWNER      DB_LINK              USERNAME                       HOST       CREATED
---------- -------------------- ------------------------------ ---------- ---------
PUBLIC     DBLINK               DBLINK                         MYDB       21-MAR-14

 

And this is the output when we try to display the instance name through the DBLINK, when connected locally:

 

SQL> select instance_name from v$instance@DBLINK;
INSTANCE_NAME
----------------
DB2

 

The user is redirected to the remote database, as expected. Now, let's see what happens when connecting through the SQL*Net layer:

 

[oracle@srvora01 ~]$ sqlplus system@DB1
SQL*Plus: Release 11.2.0.3.0 Production on Mon Mar 24 10:07:45 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
 
Enter password:
 
Connected to:
 
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
 
SQL> select instance_name from v$instance@DBLINK;
INSTANCE_NAME
----------------
DB1

 

Here we can see that the user is not redirected to the same database (for demonstration purposes, the DBLINK here resolves to the database itself).

The first thing to check is the TNS_ADMIN variable, if it exists:

 

[oracle@srvora01 ~]$ echo $TNS_ADMIN
/u00/app/oracle/product/11.2.0/db_3_0/network/admin

 

Here is the content of the tnsnames.ora file at that location:

 

[oracle@srvora01 ~]$ cat /u00/app/oracle/product/11.2.0/db_3_0/network/admin/tnsnames.ora
DB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = tcp)(HOST = srvora01)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DB1)
    )
  )
MYDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = tcp)(HOST = srvora01)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = DB2)
    )
  )

 

Clearly, we have a problem with TNS resolution. The local connection resolves the MYDB alias correctly, while the remote connection resolves the alias to a different database. There are two possible explanations:

  • The tnsnames.ora is not configured correctly: this is not the case, as you can see above
  • Another tnsnames.ora file exists somewhere on the server and is used by remote connections

To confirm that the second hypothesis is the right one, we can use the strace tool:

 

SQL> set define #
SQL> select spid from v$process p join v$session s on p.addr=s.paddr and s.sid=sys_context('userenv','sid');
SPID
------------------------
5578

 

SQL>  host strace -e trace=open -p #unix_pid & echo $! > .tmp.pid
Enter value for unix_pid: 5578
SQL> Process 5578 attached - interrupt to quit

 

SQL> select instance_name from v$instance@DBLINK;
open("/u00/app/oracle/product/11.2.0/db_3_0/network/admin/tnsnames.ora", O_RDONLY) = 8
open("/etc/host.conf", O_RDONLY)        = 8
open("/etc/resolv.conf", O_RDONLY)      = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 10
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 10
open("/etc/hostid", O_RDONLY)           = -1 ENOENT (No such file or directory)
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 10INSTANCE_NAME
----------------
DB2

 

The DBLINK is resolved using the file /u00/app/oracle/product/11.2.0/db_3_0/network/admin/tnsnames.ora.

Now, when connected remotely:

 

SQL> set define #
SQL> select spid from v$process p join v$session s on p.addr=s.paddr and s.sid=sys_context('userenv','sid');
SPID
------------------------
6838

 

SQL> host strace -e trace=open -p #unix_pid & echo $! > .tmp.pid
Enter value for unix_pid: 6838
SQL> Process 6838 attached - interrupt to quit

 

SQL> select instance_name from v$instance@DBLINK;
open("/u00/app/oracle/network/admin/tnsnames.ora", O_RDONLY) = 8
open("/etc/host.conf", O_RDONLY)        = 8
open("/etc/resolv.conf", O_RDONLY)      = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 8
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 9
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 9
open("/etc/hostid", O_RDONLY)           = -1 ENOENT (No such file or directory)
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 9INSTANCE_NAME
----------------
DB1

 

Here the DBLINK is resolved with the file /u00/app/oracle/network/admin/tnsnames.ora.

 

Two different tnsnames.ora files are used depending on the connection method! If we look at the content of the second tnsnames.ora, we have the explanation for our problem:

 

[oracle@srvora01 ~]$ cat /u00/app/oracle/network/admin/tnsnames.ora
MYDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = tcp)(HOST = srvora01)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = DB1)
    )
  )

 

It is not clearly documented by Oracle, but the database session can inherit the environment variables in three different ways:

  • When you connect locally to the server (no SQL*Net, no listener), the Oracle session inherits the client environment
  • When you connect remotely to a service statically registered on the listener, the Oracle session inherits the environment which started the listener
  • When you connect remotely to a service dynamically registered on the listener, the Oracle session inherits the environment which started the database

In our case, the database had been restarted with a wrong TNS_ADMIN value set, and the database then kept this value for remote connections. We can check this with the following method:

 

[oracle@srvora01 ~]$ ps -ef | grep pmon
oracle    3660     1  0 09:02 ?        00:00:00 ora_pmon_DB1
oracle    4006     1  0 09:05 ?        00:00:00 ora_pmon_DB2
oracle    6965  3431  0 10:44 pts/1    00:00:00 grep pmon

 

[oracle@srvora01 ~]$ strings /proc/3660/environ | grep TNS_ADMIN
TNS_ADMIN=/u00/app/oracle/network/admin

 

Note that we can also get the value of TNS_ADMIN from within the database by using the dbms_system.get_env procedure.
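As an illustration, here is a minimal sketch (dbms_system is an undocumented, SYS-owned package, so execute privileges on it are assumed) showing how the current session can display the value it actually sees:

SQL> -- bind variable to receive the value returned by dbms_system.get_env
SQL> variable tns_admin varchar2(512)
SQL> exec dbms_system.get_env('TNS_ADMIN', :tns_admin)
SQL> print tns_admin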

The solution was to restart the database with the correct TNS_ADMIN value:

 

[oracle@srvora01 ~]$ echo $TNS_ADMIN
/u00/app/oracle/product/11.2.0/db_3_0/network/admin

 

[oracle@srvora01 ~]$ sqlplus / as sysdba
 
SQL*Plus: Release 11.2.0.3.0 Production on Mon Mar 24 10:46:03 2014
 
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
 
 
Connected to:
 
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

 

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

 

SQL> startup
ORACLE instance started.
Total System Global Area 1570009088 bytes
Fixed Size                  2228704 bytes
Variable Size            1023413792 bytes
Database Buffers          536870912 bytes
Redo Buffers                7495680 bytes
Database mounted.
Database opened.

 

[oracle@srvora01 ~]$ ps -ef | grep pmon
oracle    4006     1  0 09:05 ?        00:00:00 ora_pmon_DB2
oracle    7036     1  0 10:46 ?        00:00:00 ora_pmon_DB1
oracle    7116  3431  0 10:46 pts/1    00:00:00 grep pmon

 

[oracle@srvora01 ~]$ strings /proc/7036/environ | grep TNS_ADMIN
TNS_ADMIN=/u00/app/oracle/product/11.2.0/db_3_0/network/admin

 

The value for TNS_ADMIN is now correct.

 

[oracle@srvora01 ~]$ sqlplus system@DB1
 
SQL*Plus: Release 11.2.0.3.0 Production on Mon Mar 24 10:47:21 2014
 
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
 
Enter password:
 
Connected to:
 
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

 

SQL> select instance_name from v$instance@DBLINK;
INSTANCE_NAME
----------------
DB2

 

Remote connections are now using the right tnsnames.ora.

I hope this will help you with your TNS resolution problems.

Oracle 12c Adaptive Plan & inflection point

Fri, 2014-04-11 08:25

The Oracle 12c Adaptive Plan feature was already presented by Nicolas Jardot at OOW 2013: Solving customer issues with the 12c Optimizer.

I recently had to answer several questions about its behavior at execution time. Maybe the term 'adaptive' is misleading: it's not that a join will stop and restart with another join method. Even with an adaptive plan, only one join method is ever applied. The feature simply defers a decision that was made at parse time in previous versions so that it is now made at execution time, after reading a few rows.
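As a quick, hedged illustration (a minimal sketch assuming a 12c database and a cursor that has just been executed in the current session), the full adaptive plan, including the join method that was finally discarded, can be displayed with the ADAPTIVE format modifier of DBMS_XPLAN:

SQL> -- plan lines belonging to the inactive part of the adaptive plan are marked with '-'
SQL> select * from table(dbms_xplan.display_cursor(format => '+ADAPTIVE'));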

In this posting, I will reproduce the kind of exercise we do in our Oracle 12c New Features workshop in order to show exactly what happens at execution time.

How to install and manage a Kerberos Server

Thu, 2014-04-10 19:04


For some time now, I have been working on how to set up a Single Sign-On (SSO) solution in my company. As a big fan of Open Source solutions, I have obviously proposed the implementation of a solution based on Kerberos. What I mean by that is a solution based on the true Kerberos, i.e. MIT Kerberos. Indeed, Kerberos was originally a research project at the Massachusetts Institute of Technology (MIT) in the early 1980s.

Before starting this kind of project, it's important to clearly define and have in mind the following points:

  • Architecture of the enterprise
  • Operating systems used by end users
  • Operating systems used by applications which must be kerberized
  • Is it difficult to kerberize these applications?

The answers to these questions provide insight into which types of solutions are possible. For example, if there are no restrictions on which operating system an end user can use (Windows, different Linux distributions, ...), then introducing a solution based on a Linux Kerberos could be a good idea. That's why, in this blog, I will first explain how to install a MIT Kerberos server. In the second part, I will focus on how to manage this server.

 

1. Install MIT Kerberos

It's important to note that the server and the client share the same release and that the MIT Kerberos server can only be installed on a Unix-like operating system. The Mac release has been available as part of Mac OS X since version 10.3 (the current release is Mavericks: 10.9). The Key Distribution Center (KDC) is the Kerberos server where all identities (users, computers and kerberized applications) will be stored.

For this installation, let's define the following properties/variables:

  • example.com = the DNS Domain
  • EXAMPLE.COM = the KDC REALM, which should be the DNS Domain in UPPERCASE. In case there is more than one KDC, all names must be unique and self-descriptive
  • kdc01oel.example.com = the FQDN of the KDC
  • 192.168.1.2 = the IP of kdc01oel.example.com

So let's begin the installation. Obviously, the first thing to do is to download the current release of the MIT Kerberos distribution for the target operating system. This can be done at the following URL: http://web.mit.edu/kerberos/dist/index.html. The current Linux release is krb5-1.12.1-signed.tar:

[root@oel opt]# wget http://web.mit.edu/kerberos/dist/krb5/1.12/krb5-1.12.1-signed.tar
--2014-04-01 14:00:28--  http://web.mit.edu/kerberos/dist/krb5/1.12/krb5-1.12.1-signed.tar
Resolving web.mit.edu... 23.58.214.151
Connecting to web.mit.edu|23.58.214.151|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11950080 (11M) [application/x-tar]
Saving to: “krb5-1.12.1-signed.tar”

100%[===============================================>] 11,950,080  1.52M/s   in 7.3s
2014-04-01 14:00:38 (1.56 MB/s) - “krb5-1.12.1-signed.tar” saved [11950080/11950080]
[root@oel opt]# tar -xvf krb5-1.12.1-signed.tar
krb5-1.12.1.tar.gz
krb5-1.12.1.tar.gz.asc

As you can see, this file is signed and you can (should) verify the integrity and authenticity of the software. This can be done, for example, using GNU Privacy Guard (it needs another file that can be found on the MIT Kerberos download page):

[root@oel opt]# gpg --verify krb5-1.12.1.tar.gz.asc

After that, just extract the MIT Kerberos source code and build it:

[root@oel opt]# tar -zxf krb5-1.12.1.tar.gz
[root@oel opt]# cd krb5-1.12.1/src/
[root@oel src]# ./configure
...
[root@oel src]# yum install *yacc*
...
[root@oel src]# make
...
[root@oel src]# make install
...

At this step, Kerberos should be installed properly and the binaries, libraries and the documentation should be under /usr/local. The default location is sufficient in almost all cases:

[root@oel src]# krb5-config --all
Version:     Kerberos 5 release 1.12.1
Vendor:      Massachusetts Institute of Technology
Prefix:      /usr/local
Exec_prefix: /usr/local

Now that Kerberos is installed properly, the next step is to configure it. This is done through a configuration file named krb5.conf:

[root@oel src]# vi /etc/krb5.conf

[libdefaults]
  default_realm = EXAMPLE.COM
  forwardable = true
  proxiable = true

[realms]
  EXAMPLE.COM = {
    kdc = kdc01oel.example.com:88
    admin_server = kdc01oel.example.com:749
    default_domain = example.com
  }

[domain_realm]
  .example.com = EXAMPLE.COM
  example.com = EXAMPLE.COM

[logging]
  kdc = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmin.log
  default = FILE:/var/log/krb5lib.log

To avoid hostname resolution issues, the file /etc/hosts must contain the fully qualified domain name of the server as well as the IP address:

[root@oel src]# vi /etc/hosts

127.0.0.1         localhost   kdc01oel
192.168.1.2       kdc01oel.example.com   kdc01oel

The next thing to do is to create the realm and the KDC database. Let's begin with the creation of the database parent folder:

[root@oel src]# cd /usr/local
[root@oel local]# mkdir /usr/local/var
[root@oel local]# mkdir /usr/local/var/krb5kdc
[root@oel local]# chmod 700 /usr/local/var/krb5kdc

The file krb5.conf (just above) is the generic Kerberos configuration file, but the KDC also has its own configuration file (kdc.conf). Create this file and populate it as follows:

[root@oel local]# cd /usr/local/var/krb5kdc/
[root@oel krb5kdc]# vi kdc.conf

[kdcdefaults]
  kdc_ports = 749,88

[realms]
  EXAMPLE.COM = {
    database_name = /usr/local/var/krb5kdc/principal
    admin_keytab = /usr/local/var/krb5kdc/kadm5.keytab
    acl_file = /usr/local/var/krb5kdc/kadm5.acl
    key_stash_file = /usr/local/var/krb5kdc/.k5.EXAMPLE.COM
    kdc_ports = 749,88
    max_life = 10h 0m 0s
    max_renewable_life = 7d 0h 0m 0s
  }

So let's create the Kerberos database using this configuration file:

[root@oel krb5kdc]# /usr/local/sbin/kdb5_util create -s
Loading random data
Initializing database '/usr/local/var/krb5kdc/principal' for realm 'EXAMPLE.COM',
master key name 'K/M@EXAMPLE.COM'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify:
[root@oel krb5kdc]#

If there is any error at this point, it is most likely due to a misconfiguration of the /etc/krb5.conf file or because Kerberos can't resolve the hostname (the /etc/hosts file isn't configured properly).

This finally concludes the first part about the installation of the MIT Kerberos server.


2. Manage the KDC

For this part, I assume that the KDC is set up according to what I've explained above. In the previous part, I only showed how to install the KDC, but it isn't running yet. So the first thing to do is to configure who will be able to connect to the KDC (that means obtaining a ticket) and with which permissions.

To enter the KDC administration console, use kadmin.local (local machine only):

[root@oel krb5kdc]# /usr/local/sbin/kadmin.local
Authenticating as principal root/admin@EXAMPLE.COM with password.
kadmin.local:

Once in kadmin.local, several commands can be used to manage the KDC. The following command lists them all:

kadmin.local:  ?
Available kadmin.local requests:

add_principal, addprinc, ank
                         Add principal
delete_principal, delprinc
                         Delete principal
modify_principal, modprinc
                         Modify principal
rename_principal, renprinc
                         Rename principal
change_password, cpw     Change password
get_principal, getprinc  Get principal
list_principals, listprincs, get_principals, getprincs
                         List principals
add_policy, addpol       Add policy
modify_policy, modpol    Modify policy
delete_policy, delpol    Delete policy
get_policy, getpol       Get policy
list_policies, listpols, get_policies, getpols
                         List policies
get_privs, getprivs      Get privileges
ktadd, xst               Add entry(s) to a keytab
ktremove, ktrem          Remove entry(s) from a keytab
lock                     Lock database exclusively (use with extreme caution!)
unlock                   Release exclusive database lock
purgekeys                Purge previously retained old keys from a principal
get_strings, getstrs     Show string attributes on a principal
set_string, setstr       Set a string attribute on a principal
del_string, delstr       Delete a string attribute on a principal
list_requests, lr, ?     List available requests.
quit, exit, q            Exit program.

So, for example, let's create two principals: one with administrator capabilities (xxx/admin) and another one without:

kadmin.local:  addprinc mpatou/admin
WARNING: no policy specified for mpatou/admin@EXAMPLE.COM; defaulting to no policy
Enter password for principal "mpatou/admin@EXAMPLE.COM":
Re-enter password for principal "mpatou/admin@EXAMPLE.COM":
Principal "mpatou/admin@EXAMPLE.COM" created.

kadmin.local:  addprinc mpatou
WARNING: no policy specified for mpatou@EXAMPLE.COM; defaulting to no policy
Enter password for principal "mpatou@EXAMPLE.COM":
Re-enter password for principal "mpatou@EXAMPLE.COM":
Principal "mpatou@EXAMPLE.COM" created.

With a new "listprincs", the two new principals should be displayed but for now, mpatou/admin has no administrative access because this account isn't declared in the access control list. In the ACL file, permissions can be defined by using the following characters:

  • A = Addition of users or policies into the KDC database
  • D = Deletion of users or policies from the KDC database
  • M = Modification of users or policies in the KDC database
  • C = Changing principals' passwords in the KDC database
  • I = Inquiries into the database, to list principal information
  • L = Listing of the principals contained in the database
  • * = Grants the user all of the above permissions

Moreover, the wildcard (*) can be used to match more than one user. For example, */admin will match all administrative accounts. Let's create the ACL file as defined in the KDC configuration file (kdc.conf):

kadmin.local:  exit
[root@oel krb5kdc]# vi /usr/local/var/krb5kdc/kadm5.acl
*/admin@EXAMPLE.COM        ADMCIL

So kadmin.local is used for local administration, but there is also a remote administration console: kadmin. This remote access can't be used yet because it needs a file that doesn't exist. This file (a "keytab") stores a principal and an encryption key derived from the principal's password. It can be used to log into Kerberos without being prompted for a password, which is why it is useful for kerberized applications.

[root@oel krb5kdc]# /usr/local/sbin/kadmin.local
Authenticating as principal root/admin@EXAMPLE.COM with password.
kadmin.local:  ktadd -k /usr/local/var/krb5kdc/kadm5.keytab kadmin/admin kadmin/changepw
Entry for principal kadmin/admin with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/admin with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/admin with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/admin with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/changepw with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/changepw with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/changepw with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
Entry for principal kadmin/changepw with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:/usr/local/var/krb5kdc/kadm5.keytab.
kadmin.local:  exit

The location of this kadm5.keytab file is also defined in the kdc.conf file. Now the KDC server processes are ready to start:

[root@oel krb5kdc]# /usr/local/sbin/krb5kdc
[root@oel krb5kdc]# /usr/local/sbin/kadmind

If there is no error, then the KDC should be running and ready to reply to any client with a valid principal. The easiest way to test it is to try to obtain a TGT (Ticket Granting Ticket) using the kinit command:

[root@oel krb5kdc]# cd /usr/local/bin
[root@oel bin]# klist
klist: Credentials cache file '/tmp/krb5cc_0' not found
[root@oel bin]# kinit mpatou
Password for mpatou@EXAMPLE.COM:
[root@oel bin]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: mpatou@EXAMPLE.COM

Valid starting       Expires              Service principal
04/03/2014 09:54:48  04/03/2014 19:54:48  krbtgt/EXAMPLE.COM@EXAMPLE.COM
    renew until 04/04/2014 09:54:47
[root@oel bin]# kdestroy
[root@oel bin]# klist
klist: Credentials cache file '/tmp/krb5cc_0' not found

The klist command can be used to list all existing tickets, whereas kdestroy removes them. The KDC is now fully operational, and some optional additional steps can be performed (e.g. setting up slave KDCs).

This finally concludes this blog about how to install a MIT Kerberos server. If you need more information about Kerberos (the MIT, Heimdal, or Active Directory implementation), I strongly suggest you read the book "Kerberos, The Definitive Guide" by Jason Garman. This book was for me the best source of knowledge on this subject.