Sometimes we have to create an index on ( trunc(date) ) because a SQL statement uses a predicate on it instead of giving a range from midnight to midnight. When you do that, you probably also keep the index on the plain column. That's two indexes to maintain for each DML operation. Do we really need both?
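As a minimal sketch (table and column names are illustrative), the two access paths look like this:

```sql
-- A function-based index serving the trunc() predicate directly:
create index demo_trunc_date_ix on demo_table ( trunc(date_col) );

-- This predicate can use the function-based index:
select * from demo_table where trunc(date_col) = date '2015-03-01';

-- The same rows expressed as a midnight-to-midnight range,
-- which a plain index on date_col can serve:
select * from demo_table
where date_col >= date '2015-03-01' and date_col < date '2015-03-02';
```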
Since version 12, ODA stores databases on ACFS volumes instead of directly on ASM. This slightly changes the way the files are managed and administered. This article presents how to multiplex your control files on ACFS.
Have you ever attempted to bulk import a SQL Server error log, for example to use the information inside a report? If so, you have probably wondered how to keep the data in the correct order in a query, because you cannot refer to any column in the table, and you may notice many records sharing the same date. Of course, there are workarounds, but that is not the subject of this blog. Instead, I would like to share an interesting discussion I had with a forum member about whether a simple SELECT statement without an ORDER BY clause guarantees that the SQL Server error log data comes back in order.
Let’s begin with the following script, which bulk imports data from a SQL Server error log file into a table:

CREATE TABLE ErrLog
(
    LogCol NVARCHAR(max) NULL
)
BULK INSERT ErrLog
FROM 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Log\ERRORLOG.1'
WITH
(
    FIRSTROW = 6,
    DATAFILETYPE = 'widechar'
)
You may notice that we use the FIRSTROW option to start at the 6th line and skip header information such as the following:

2015-03-07 13:57:41.43 Server Microsoft SQL Server 2014 - 12.0.2370.0 (X64) Jun 21 2014 15:21:00 Copyright (c) Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor)
In addition, using DATAFILETYPE = 'widechar' is mandatory in order to bulk import Unicode data.
After the bulk import, let’s take a look at the data inside the table. You will probably get the same kind of sample information as follows:
SELECT LogCol FROM ErrLog;
Comparing the order of records between the SQL Server error log file and the table suggests that the order is the same. At this point, we may wonder how to number the records inside the table without affecting their order. Indeed, numbering the records will allow us to control the order of the data with an ORDER BY clause. So let’s use the ROW_NUMBER() function to meet our requirement. Notice that I use an “artificial” ORDER BY clause inside the window function to avoid interfering with the original order in which the data is read from the table.
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS numline, LogCol FROM ErrLog;
At this point, the forum member told me that we cannot guarantee the order of data without an ORDER BY clause. Once again, we seem to get the same order as with the previous query, but can I trust it? I completely agree with this forum member and I tend to give the same advice. However, in this specific case the order seems to be correct. Why?
If you take a look at the first script, the first operation was creating a heap table. This detail is very important. The bulk insert operation then reads the file sequentially and inserts the data in page-allocation order; in other words, the insertion order is tied to the order of the pages recorded in the IAM page. So when we perform a table scan to number each row in the table (see the execution plan below), this operation is in fact performed by scanning the IAM page to find the extents that hold the table's pages.
As you know, the IAM represents extents in the same order as they exist in the data files, so in our case the table scan follows the same allocation order. We can verify this by catching the call stack of our query: we can see that SQL Server uses a particular function called AllocationOrderPageScanner for our heap table.
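If you want to see the allocation order yourself, one hedged way (using the undocumented function sys.dm_db_database_page_allocations, available since SQL Server 2012; signature and column names should be checked against your version) is:

```sql
-- Undocumented: list the pages allocated to the ErrLog heap in allocation order
SELECT allocated_page_file_id, allocated_page_page_id, page_type_desc
FROM sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('dbo.ErrLog'), 0, NULL, 'DETAILED')
WHERE is_allocated = 1
ORDER BY allocated_page_file_id, allocated_page_page_id;
```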
To be clear, I don’t claim that we can blindly trust the data order in all cases without specifying the ORDER BY clause. In fact, most of the time you will have to proceed differently depending on your context, because several factors defeat any ordering guarantee in the absence of an ORDER BY clause (bulk importing data from multiple files in parallel, reading data from a table with a b-tree structure under particular conditions, etc.).
Which DBA has never faced a performance problem caused by several blocked processes? Very few, I’m sure. Troubleshooting a blocking scenario is not always easy and may require some useful tools to simplify this hard task. A couple of months ago, I had to deal with this scenario at one of my customers. During some specific periods of the business day, they noticed that their application slowed down, and they asked me how to solve the issue.
Fortunately, SQL Server provides a useful feature to catch blocked processes: we just have to configure the “blocked process threshold (s)” server option. There are plenty of blogs that explain how to play with this parameter, so I let you perform your own investigation with your favourite search engine.
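For completeness, enabling it looks like this (the 5-second threshold is just an example value):

```sql
-- Emit a blocked process report for any task blocked longer than 5 seconds
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'blocked process threshold (s)', 5;
RECONFIGURE;
```

The reports can then be captured with a trace (event class 137) or an extended events session.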
Having a blocked process report is useful, but in such situations there are often a lot of processes blocking each other, and we have to find the real culprit in this can of worms. So my main concern was the following: how to extract information from the blocked process report and correlate all the blocked processes together. After some investigation I found a useful script written by Michael J S Swart here. Usually I prefer to write my own script, but I didn't have the time, and I must admit this script met my need perfectly. The original version provides the blocking hierarchy and the XML view of the issue, which is not bad because we have all the information needed to troubleshoot. However, my modification replaces this XML view with useful information in tabular format, to make the final result easier to read. Here is the modified version of the script:
CREATE PROCEDURE [dbo].[sp_blocked_process_report_viewer_dbi]
(
    @Trace nvarchar(max),
    @Type varchar(10) = 'FILE'
)
AS
SET NOCOUNT ON

-- Validate @Type
IF (@Type NOT IN('FILE', 'TABLE', 'XMLFILE'))
    RAISERROR ('The @Type parameter must be ''FILE'', ''TABLE'' or ''XMLFILE''', 11, 1)

IF (@Trace LIKE '%.trc' AND @Type <> 'FILE')
    RAISERROR ('Warning: You specified a .trc trace. You should also specify @Type = ''FILE''', 10, 1)

IF (@Trace LIKE '%.xml' AND @Type <> 'XMLFILE')
    RAISERROR ('Warning: You specified a .xml trace. You should also specify @Type = ''XMLFILE''', 10, 1)

CREATE TABLE #ReportsXML
(
    monitorloop nvarchar(100) NOT NULL,
    endTime datetime NULL,
    blocking_spid INT NOT NULL,
    blocking_ecid INT NOT NULL,
    blocking_bfinput NVARCHAR(MAX),
    blocked_spid INT NOT NULL,
    blocked_ecid INT NOT NULL,
    blocked_bfinput NVARCHAR(MAX),
    blocked_waitime BIGINT,
    blocked_hierarchy_string as CAST(blocked_spid as varchar(20)) + '.' + CAST(blocked_ecid as varchar(20)) + '/',
    blocking_hierarchy_string as CAST(blocking_spid as varchar(20)) + '.' + CAST(blocking_ecid as varchar(20)) + '/',
    bpReportXml xml not null,
    primary key clustered (monitorloop, blocked_spid, blocked_ecid),
    unique nonclustered (monitorloop, blocking_spid, blocking_ecid, blocked_spid, blocked_ecid)
)

DECLARE @SQL NVARCHAR(max);
DECLARE @TableSource nvarchar(max);

-- define source for table
IF (@Type = 'TABLE')
BEGIN
    -- everything input by users gets quoted
    SET @TableSource = ISNULL(QUOTENAME(PARSENAME(@Trace,4)) + N'.', '')
        + ISNULL(QUOTENAME(PARSENAME(@Trace,3)) + N'.', '')
        + ISNULL(QUOTENAME(PARSENAME(@Trace,2)) + N'.', '')
        + QUOTENAME(PARSENAME(@Trace,1));
END

-- define source for trc file
IF (@Type = 'FILE')
BEGIN
    SET @TableSource = N'sys.fn_trace_gettable(N' + QUOTENAME(@Trace, '''') + ', -1)';
END

-- load table or file
IF (@Type IN('TABLE', 'FILE'))
BEGIN
    SET @SQL = N'
        INSERT #ReportsXML(blocked_ecid, blocked_spid, blocked_bfinput, blocking_ecid, blocking_spid,
                           blocking_bfinput, blocked_waitime, monitorloop, bpReportXml, endTime)
        SELECT blocked_ecid, blocked_spid, blocked_inputbuffer, blocking_ecid, blocking_spid,
               blocking_inputbuffer, blocked_waitime,
               COALESCE(monitorloop, CONVERT(nvarchar(100), endTime, 120), ''unknown''),
               bpReportXml, EndTime
        FROM ' + @TableSource + N'
        CROSS APPLY ( SELECT CAST(TextData as xml) ) AS bpReports(bpReportXml)
        CROSS APPLY
        (
            SELECT monitorloop = bpReportXml.value(''(//@monitorLoop)'', ''nvarchar(100)''),
                   blocked_spid = bpReportXml.value(''(/blocked-process-report/blocked-process/process/@spid)'', ''int''),
                   blocked_ecid = bpReportXml.value(''(/blocked-process-report/blocked-process/process/@ecid)'', ''int''),
                   blocked_inputbuffer = bpReportXml.value(''(/blocked-process-report/blocked-process/process/inputbuf/text())'', ''nvarchar(max)''),
                   blocking_spid = bpReportXml.value(''(/blocked-process-report/blocking-process/process/@spid)'', ''int''),
                   blocking_ecid = bpReportXml.value(''(/blocked-process-report/blocking-process/process/@ecid)'', ''int''),
                   blocking_inputbuffer = bpReportXml.value(''(/blocked-process-report/blocking-process/process/inputbuf/text())'', ''nvarchar(max)''),
                   blocked_waitime = bpReportXml.value(''(/blocked-process-report/blocked-process/process/@waittime)'', ''bigint'')
        ) AS bpShredded
        WHERE EventClass = 137';
    EXEC (@SQL);
END

IF (@Type = 'XMLFILE')
BEGIN
    CREATE TABLE #TraceXML
    (
        id int identity primary key,
        ReportXML xml NOT NULL
    )

    SET @SQL = N'
        INSERT #TraceXML(ReportXML)
        SELECT col
        FROM OPENROWSET ( BULK ' + QUOTENAME(@Trace, '''') + N', SINGLE_BLOB ) as xmldata(col)';
    EXEC (@SQL);

    CREATE PRIMARY XML INDEX PXML_TraceXML ON #TraceXML(ReportXML);

    WITH XMLNAMESPACES
    (
        'http://tempuri.org/TracePersistence.xsd' AS MY
    ),
    ShreddedWheat AS
    (
        SELECT bpShredded.blocked_ecid, bpShredded.blocked_spid, bpShredded.blocked_inputbuffer,
               bpShredded.blocked_waitime, bpShredded.blocking_ecid, bpShredded.blocking_spid,
               bpShredded.blocking_inputbuffer, bpShredded.monitorloop,
               bpReports.bpReportXml, bpReports.bpReportEndTime
        FROM #TraceXML
        CROSS APPLY ReportXML.nodes('/MY:TraceData/MY:Events/MY:Event[@name="Blocked process report"]') AS eventNodes(eventNode)
        CROSS APPLY eventNode.nodes('./MY:Column[@name="EndTime"]') AS endTimeNodes(endTimeNode)
        CROSS APPLY eventNode.nodes('./MY:Column[@name="TextData"]') AS bpNodes(bpNode)
        CROSS APPLY
        (
            SELECT CAST(bpNode.value('(./text())', 'nvarchar(max)') as xml),
                   CAST(LEFT(endTimeNode.value('(./text())', 'varchar(max)'), 19) as datetime)
        ) AS bpReports(bpReportXml, bpReportEndTime)
        CROSS APPLY
        (
            SELECT monitorloop = bpReportXml.value('(//@monitorLoop)', 'nvarchar(100)'),
                   blocked_spid = bpReportXml.value('(/blocked-process-report/blocked-process/process/@spid)', 'int'),
                   blocked_ecid = bpReportXml.value('(/blocked-process-report/blocked-process/process/@ecid)', 'int'),
                   blocked_inputbuffer = bpReportXml.value('(/blocked-process-report/blocked-process/process/inputbuf/text())', 'nvarchar(max)'),
                   blocking_spid = bpReportXml.value('(/blocked-process-report/blocking-process/process/@spid)', 'int'),
                   blocking_ecid = bpReportXml.value('(/blocked-process-report/blocking-process/process/@ecid)', 'int'),
                   blocking_inputbuffer = bpReportXml.value('(/blocked-process-report/blocking-process/process/inputbuf/text())', 'nvarchar(max)'),
                   blocked_waitime = bpReportXml.value('(/blocked-process-report/blocked-process/process/@waittime)', 'bigint')
        ) AS bpShredded
    )
    INSERT #ReportsXML(blocked_ecid, blocked_spid, blocking_ecid, blocking_spid, monitorloop, bpReportXml, endTime)
    SELECT blocked_ecid, blocked_spid, blocking_ecid, blocking_spid,
           COALESCE(monitorloop, CONVERT(nvarchar(100), bpReportEndTime, 120), 'unknown'),
           bpReportXml, bpReportEndTime
    FROM ShreddedWheat;

    DROP TABLE #TraceXML
END

-- Organize and select blocked process reports
;WITH Blockheads AS
(
    SELECT blocking_spid, blocking_ecid, monitorloop, blocking_hierarchy_string
    FROM #ReportsXML
    EXCEPT
    SELECT blocked_spid, blocked_ecid, monitorloop, blocked_hierarchy_string
    FROM #ReportsXML
),
Hierarchy AS
(
    SELECT monitorloop, blocking_spid as spid, blocking_ecid as ecid,
           cast('/' + blocking_hierarchy_string as varchar(max)) as chain,
           0 as level
    FROM Blockheads
    UNION ALL
    SELECT irx.monitorloop, irx.blocked_spid, irx.blocked_ecid,
           cast(h.chain + irx.blocked_hierarchy_string as varchar(max)),
           h.level + 1
    FROM #ReportsXML irx
    JOIN Hierarchy h ON irx.monitorloop = h.monitorloop
        AND irx.blocking_spid = h.spid
        AND irx.blocking_ecid = h.ecid
)
SELECT ISNULL(CONVERT(nvarchar(30), irx.endTime, 120), 'Lead') as traceTime,
       SPACE(4 * h.level) + CAST(h.spid as varchar(20))
           + CASE h.ecid WHEN 0 THEN '' ELSE '(' + CAST(h.ecid as varchar(20)) + ')' END AS blockingTree,
       irx.blocked_waitime,
       bdp.last_trans_started as blocked_last_trans_started,
       bdp.wait_resource AS blocked_wait_resource,
       bgp.wait_resource AS blocking_wait_resource,
       bdp.[status] AS blocked_status,
       bgp.[status] AS blocking_status,
       bdp.lock_mode AS blocked_lock_mode,
       bdp.isolation_level as blocked_isolation_level,
       bgp.isolation_level as blocking_isolation_level,
       bdp.app AS blocked_app,
       DB_NAME(bdp.current_db) AS blocked_db,
       '-----> blocked statement' AS blocked_section,
       CAST('' + irx.blocked_bfinput + '' AS XML) AS blocked_input_buffer,
       CASE WHEN bdp.frame_blocked_process_xml IS NULL
            THEN CAST('' + irx.blocked_bfinput + '' AS XML)
            ELSE bdp.frame_blocked_process_xml
       END AS frame_blocked_process_xml,
       DB_NAME(bgp.current_db) AS blocking_db,
       bgp.app AS blocking_app,
       'blocking statement ----->' AS blocking_section,
       CAST('' + irx.blocking_bfinput + '' AS XML) AS blocking_input_buffer,
       CASE WHEN bgp.frame_blocking_process_xml IS NULL
            THEN CAST('' + irx.blocking_bfinput + '' AS XML)
            ELSE bgp.frame_blocking_process_xml
       END AS frame_blocking_process_xml,
       irx.bpReportXml
from Hierarchy h
left join #ReportsXML irx
    on irx.monitorloop = h.monitorloop
    and irx.blocked_spid = h.spid
    and irx.blocked_ecid = h.ecid
outer apply
(
    select T.x.value('(./process/@waitresource)', 'nvarchar(256)') AS wait_resource,
           T.x.value('(./process/@lasttranstarted)', 'datetime') as last_trans_started,
           T.x.value('(./process/@lockMode)', 'nvarchar(60)') as lock_mode,
           T.x.value('(./process/@status)', 'nvarchar(60)') as [status],
           T.x.value('(./process/@isolationlevel)', 'nvarchar(60)') as isolation_level,
           T.x.value('(./process/@currentdb)', 'int') as current_db,
           T.x.value('(./process/@clientapp)', 'nvarchar(200)') as app,
           cast(
               (select SUBSTRING(txt.text, (ISNULL(T.x.value('./@stmtstart', 'int'), 0) / 2) + 1,
                           ((CASE ISNULL(T.x.value('./@stmtend', 'int'), -1)
                                 WHEN -1 THEN DATALENGTH(txt.text)
                                 ELSE T.x.value('./@stmtend', 'int')
                             END - ISNULL(T.x.value('./@stmtstart', 'int'), 0)) / 2) + 1) + CHAR(13) AS statement_txt
                from bpReportXml.nodes('//blocked-process/process/executionStack/frame') AS T(x)
                cross apply sys.dm_exec_sql_text(T.x.value('xs:hexBinary(substring((./@sqlhandle), 3))', 'varbinary(max)')) AS txt
                for XML path('')) as xml) AS frame_blocked_process_xml
    from bpReportXml.nodes('//blocked-process') AS T(x)
) AS bdp
outer apply
(
    select T.x.value('(./process/@waitresource)', 'nvarchar(256)') AS wait_resource,
           T.x.value('(./process/@status)', 'nvarchar(60)') as [status],
           T.x.value('(./process/@isolationlevel)', 'nvarchar(60)') as isolation_level,
           T.x.value('(./process/@currentdb)', 'int') as current_db,
           T.x.value('(./process/@clientapp)', 'nvarchar(200)') as app,
           cast(
               (select SUBSTRING(txt.text, (ISNULL(T.x.value('./@stmtstart', 'int'), 0) / 2) + 1,
                           ((CASE ISNULL(T.x.value('./@stmtend', 'int'), -1)
                                 WHEN -1 THEN DATALENGTH(txt.text)
                                 ELSE T.x.value('./@stmtend', 'int')
                             END - ISNULL(T.x.value('./@stmtstart', 'int'), 0)) / 2) + 1) + CHAR(13) AS statement_txt
                from bpReportXml.nodes('//blocking-process/process/executionStack/frame') AS T(x)
                cross apply sys.dm_exec_sql_text(T.x.value('xs:hexBinary(substring((./@sqlhandle), 3))', 'varbinary(max)')) AS txt
                for XML path('')) as xml) AS frame_blocking_process_xml
    from bpReportXml.nodes('//blocking-process') AS T(x)
) AS bgp
order by h.monitorloop, h.chain

DROP TABLE #ReportsXML
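Once created, the procedure can be run against a trace file, a trace table, or an .xml trace, for example (the path below is hypothetical):

```sql
-- Read a server-side trace file that captured the blocked process reports
EXEC dbo.sp_blocked_process_report_viewer_dbi
     @Trace = N'C:\traces\blocked_process.trc',
     @Type  = 'FILE';
```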
Unfortunately I can’t show my customer’s context, so I will only show a sample from my own tests to explain how we can use this script. The generated result set is split into three main sections.
First section: Hierarchy blocked tree, lock resources and transaction isolation level
Let’s begin with the first section. You can see here the hierarchy tree and the blocking interactions that exist between the different processes. The picture above shows that process id = 72 is blocking process id = 73. In turn, process id = 73 is blocking other sessions (id = 75 and 77). Furthermore, process id = 74 is at the same level as process id = 73 and is also blocked by process id = 72. Finally, process id = 76 is blocked by process id = 74. A real can of worms, isn’t it?
Displaying the blocking hierarchy tree is very useful in this case. In addition, I added the transaction isolation level used by each process, the status of the processes, and the locks and resources related to the issue. As a reminder, this information is already in the blocked process report; my task consisted of extracting it into a tabular format. We will use all of it later in this blog post. For the moment, let’s focus on the first branch of the hierarchy, 72 -> 73 -> 75 -> 77, and the resource that all the concerned processes are hitting:
KEY: 6:72057594045595648 (089241b7b846), which we can split into three main parts:
6: the database id = 6 => AdventureWorks2012
72057594045595648: the hobt id of the partition, which gives us the schema, table and index as follows:

select s.name as [schema_name], o.name as table_name, i.name as index_name
from sys.partitions as p
join sys.objects as o on p.object_id = o.object_id
join sys.indexes as i on i.object_id = p.object_id and i.index_id = p.index_id
join sys.schemas as s on s.schema_id = o.schema_id
where p.hobt_id = 72057594045595648
Person.Person.PK_Person_BusinessEntityID is a clustered index that includes the BusinessEntityID column.
(089241b7b846): the lock resource value that identifies the index key in the table Person.Person locked by process id = 72. We may use the undocumented function %%lockres%% to locate the corresponding row in the table, as follows:
select BusinessEntityID from Person.Person where %%lockres%% = '(089241b7b846)'
At this point we know that the blocking process has started a transaction in the repeatable read isolation level and has not yet released the lock on the index key with value 14. This is why session id = 73 is still pending: it attempts to access the same resource by requesting an S lock.
Let’s continue with the next sections of the result set:
Second section: blocking and blocked input buffers and their related frames
This second part provides detailed information about the blocked statement, including the concerned application and database.
Likewise, the last part provides the same kind of information but for the blocking statement(s):
Now let’s correlate the information from the sections above. For example, if we look directly at the blocking input buffer of process id = 72, we discover the culprit, which is the following stored procedure:
<blockingInputBuffer> EXEC TestUpdatePersonNameStyle @NameStyle </blockingInputBuffer>
Next, the blocking frame identifies exactly the portion of code inside the stored procedure where the blocking issue has occurred:
WAITFOR DELAY '00:02:00';
OK, so the stored procedure has started an explicit transaction in the repeatable read isolation level and includes a WAITFOR DELAY command with a duration of 2 minutes. During this time, the different resources are still held by the transaction, because there is no commit or rollback and we are in the repeatable read isolation level. Let’s take a look at the stored procedure code:
ALTER PROCEDURE [dbo].[TestUpdatePersonNameStyle]
(
    @NameStyle BIT,
    @BusinessEntityID INT
)
AS
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRAN

SELECT PhoneNumber
FROM Person.PersonPhone
WHERE BusinessEntityID = @BusinessEntityID;

UPDATE Person.Person
SET NameStyle = @NameStyle
WHERE BusinessEntityID = @BusinessEntityID + 100;

WAITFOR DELAY '00:02:00';

ROLLBACK TRAN;
This confirms what we found in the first section: the repeatable read isolation level used by the blocking session. In fact, two different resources are held by the above transaction: the first (index key = 14, from the SELECT) and the second (index key = 14 + 100, from the UPDATE).
Now let’s switch to the blocked statement part. A quick look at the input buffer tells us that session id = 73 is trying to access the same resource as the UPDATE part of the blocking process. It confirms what we saw in the first section: process id = 73 is in the suspended state because it is trying to acquire an S lock on the concerned resource, which is not compatible with the X lock from the UPDATE statement of process id = 72.
SELECT * FROM Person.Person WHERE BusinessEntityID = 114;
I will not walk through every line of the result set, but let’s finish with process id = 74. Going back to the first section, we can see that session id = 74 is trying to acquire an X lock on the following resource:
KEY: 6:72057594045726720 (58e9f9de4ab6)
Let’s apply the same rule as earlier, and we can easily find the corresponding index key, this time on the table Person.PersonPhone.
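Applying the same lookup as before, with the new hobt id and lock resource value:

```sql
-- Identify the partition behind the second lock resource
select s.name as [schema_name], o.name as table_name, i.name as index_name
from sys.partitions as p
join sys.objects as o on p.object_id = o.object_id
join sys.indexes as i on i.object_id = p.object_id and i.index_id = p.index_id
join sys.schemas as s on s.schema_id = o.schema_id
where p.hobt_id = 72057594045726720;

-- Locate the locked row with the undocumented %%lockres%% function
select BusinessEntityID
from Person.PersonPhone
where %%lockres%% = '(58e9f9de4ab6)';
```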
Now let’s continue to the next sections and let’s take a look at the blocking frame:
WAITFOR DELAY '00:02:00';
The same situation as the first case. Finally, let’s take a look at the blocking input buffer:
BEGIN TRAN;

IF EXISTS(SELECT 1 FROM Person.Person WHERE BusinessEntityID = 14)
    DELETE FROM Person.PersonPhone WHERE BusinessEntityID = 14;

ROLLBACK TRAN;
This time it concerns an explicit transaction, but with a different isolation level: read committed. You can correlate with the first section by yourself. The blocking point concerns only the second part of the above query, as indicated by the blocked_lock column in the first section: process id = 74 is trying to acquire an X lock on a resource still held by process id = 72 (the SELECT statement in repeatable read isolation level).
The issue I faced with my customer was pretty similar: just replace the WAITFOR DELAY command with a series of other pieces of code that drastically deferred the transaction commit. In this case, having a precise view of the blocking tree and the other information directly readable in tabular format helped us save a lot of time in resolving the issue.
Various methods of table compression have been introduced over the releases. Some require a specific storage system, some require specific options, and some are only for static data. And it's not always very clear, for the simple reason that their names have changed.
Names changed for technical reasons (the ROW/COLUMN STORE precision came when columnar compression was introduced) or for marketing reasons (COMPRESS FOR OLTP suggested that the other - Exadata - compression levels might not be suited for OLTP).
Of course that brings a lot of ambiguity such as:
- HCC is called 'COLUMN STORE' even though it has nothing to do with the In-Memory column store
- COMPRESS ADVANCED is only one part of Advanced Compression Option
- EHCC (Exadata Hybrid Columnar Compression) is not only for Exadata
- COMPRESS FOR OLTP is no longer called that, but is still the only compression suitable for OLTP
- HCC Row-Level Locking is not for ROW STORE but for COLUMN STORE. It is suited for DML operations but is different from FOR OLTP. In any case, COLUMN STORE compression can be transformed to ROW STORE compression during updates. And that locking feature is licensed with the Advanced Compression Option, yet available on Exadata only...
- When do you need ACO (Advanced Compression Option) or not?
Let's make it clear here.
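As a starting point, here is how the current (12c) names map onto the syntax. This is a sketch; source_t is a placeholder table, and the storage requirements in the comments should be checked against your version's licensing documentation:

```sql
-- Basic compression: no extra option, effective for direct-path loads only
create table t_basic compress basic
as select * from source_t;

-- Advanced row compression, formerly "COMPRESS FOR OLTP"
-- (requires the Advanced Compression Option)
create table t_advanced row store compress advanced
as select * from source_t;

-- Hybrid Columnar Compression (specific storage: Exadata, ZFS Storage Appliance, ...)
create table t_hcc column store compress for query high
as select * from source_t;
```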
In the previous post I explained the performance issue encountered when using a generic query to deal with optional search criteria on multiple columns. The statement was shared by all executions, was marked as bind sensitive, but never became bind aware. Let's use the BIND_AWARE hint.
You have a multicriteria search screen on the EMPLOYEE table where you can enter an employee id, a department id, a manager id or a job id. Either you put the value you want to filter on, or you leave it null when you don't want to filter on it. How will you code that? You can build the query on the fly with dynamic SQL or use a generic query like this one:
SELECT * FROM employees
WHERE (job_id = NVL(:job_id, job_id))
AND (department_id = NVL(:department_id, department_id))
AND (manager_id = NVL(:manager_id, manager_id))
AND (employee_id = NVL(:employee_id, employee_id))

This is good for code maintainability, but a one-size-fits-all query will not be optimal for every case. Markus Winand (every database developer should read his book) describes the danger of that on his website: Use The Index, Luke
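As mentioned above, the BIND_AWARE hint can simply be added to the same generic query, so that the cursor is marked bind aware from the first execution instead of waiting for several executions of bind peeking:

```sql
SELECT /*+ BIND_AWARE */ *
FROM employees
WHERE (job_id = NVL(:job_id, job_id))
AND (department_id = NVL(:department_id, department_id))
AND (manager_id = NVL(:manager_id, manager_id))
AND (employee_id = NVL(:employee_id, employee_id));
```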
RAC is the most complex installation you can have for an Oracle database. A RAC DBA is involved not only in the database, but in storage, network and system as well. The application is involved too, to make sure the application service can follow the database service's high availability. RAC also brings every database skill to the highest level: a small contention on a single-instance database can become a big bottleneck in RAC.
But RAC is also fascinating. It offers the highest service availability: when correctly configured, you can stop a node without any impact on your users. It offers the highest scalability: you are not limited by the number of CPUs or the amount of memory that you can put in a server; just add a node. And RAC is not new: Oracle 6 was already able to open the same database from several instances. It was called Parallel Server.
Do you think it's impossible to learn and practice that kind of infrastructure when you don't already have one in your data center? No. You can install and practice RAC on your laptop. This is what RAC Attack! is about: at various events, experienced RAC Attack volunteers (ninjas) help you address any related issues and guide you through the setup process until you have a RAC on your laptop. Next month in Las Vegas is the IOUG event: COLLABORATE15. I'll be there as a speaker and I'm also very happy to help as a RAC Attack! ninja.
Here you can find all information about it:
Hope to see you there.
Enterprise Manager (Cloud Control for example) gathers a lot of metrics. You can display them from the GUI, but you can also query the SYSMAN views directly. Today, I wanted to get the history of free space in an ASM disk group for the previous week. Here is how I got it.
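The query is along the lines of the following sketch. The SYSMAN repository exposes hourly rollups in views such as MGMT$METRIC_HOURLY, but the target type, metric name and metric column below are assumptions to verify against your own repository (they vary between EM versions):

```sql
-- Sketch only: metric/target names are assumptions, check them in your repository
select rollup_timestamp,
       key_value as disk_group,
       average
from   sysman.mgmt$metric_hourly
where  target_type   = 'osm_instance'      -- ASM instance target type (assumption)
and    metric_name   = 'DiskGroup_Usage'   -- assumption
and    metric_column = 'usable_free_mb'    -- assumption
and    rollup_timestamp > sysdate - 7
order  by key_value, rollup_timestamp;
```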
Have you ever seen the following message while trying to validate your cluster configuration with your availability groups or FCIs on Windows Server 2012 R2?
Microsoft recommends adding a witness even if you have only two cluster members with dynamic weights. This recommendation may make sense given the new witness capabilities: indeed, Windows Server 2012 R2 improves quorum resiliency with the new dynamic witness behavior. However, we need to be careful with it, and I would like to say up front that I'm reluctant to recommend meeting this requirement with a minimal cluster configuration of only 2 nodes. In my experience, it is very common to implement SQL Server AlwaysOn availability group or FCI architectures with only two cluster nodes at customer sites. Let's look at the reason in this blog post.
First of all, let’s demonstrate why I don’t advise my customers to implement a witness by blindly following the Microsoft recommendation. In my case it consists of adding a file share witness to my existing lab environment, with two cluster nodes that use the dynamic weight behavior:
Now let’s introduce a file share witness (\\DC2\WINCLUST-01) as follows:
We may notice after introducing the FSW that the node weight configuration has changed:
The total number of votes equals 3 here because we are in a situation with an even number of cluster members plus the witness. As a reminder, we are supposed to benefit from the dynamic witness feature, according to the Microsoft documentation here.
In Windows Server 2012 R2, if the cluster is configured to use dynamic quorum (the default), the witness vote is also dynamically adjusted based on the number of voting nodes in current cluster membership. If there is an odd number of votes, the quorum witness does not have a vote. If there is an even number of votes, the quorum witness has a vote.
The quorum witness vote is also dynamically adjusted based on the state of the witness resource. If the witness resource is offline or failed, the cluster sets the witness vote to "0."
The last sentence drew my attention, so let’s introduce a failure of the FSW. In my case I will just turn off the share used by my WSFC as follows:
As expected, the file share witness has been moved from the online state to the failed state by the resource control manager:
At this point, according to the Microsoft documentation, we would expect the cluster to adjust the WitnessDynamicWeight property, but to my surprise this was not the case:
In addition, after taking a look at the cluster log, I noticed this sample among the log records:

000014d4.000026a8::2015/02/20-12:45:43.594 ERR [RCM] Arbitrating resource 'File Share Witness' returned error 67
000014d4.000026a8::2015/02/20-12:45:43.594 INFO [RCM] Res File Share Witness: OnlineCallIssued -> ProcessingFailure( StateUnknown )
000014d4.000026a8::2015/02/20-12:45:43.594 INFO [RCM] TransitionToState(File Share Witness) OnlineCallIssued-->ProcessingFailure.
000014d4.00001ea0::2015/02/20-12:45:43.594 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message
000014d4.000026a8::2015/02/20-12:45:43.594 ERR [RCM] rcm::RcmResource::HandleFailure: (File Share Witness)
000014d4.000026a8::2015/02/20-12:45:43.594 INFO [QUORUM] Node 1: PostRelease for ac9e0522-c273-4da8-99f5-3800637db4f4
000014d4.000026a8::2015/02/20-12:45:43.594 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message
000014d4.000026a8::2015/02/20-12:45:43.594 INFO [QUORUM] Node 1: quorum is not owned by anyone
000014d4.000026a8::2015/02/20-12:45:43.594 INFO [RCM] resource File Share Witness: failure count: 0, restartAction: 0 persistentState: 1.
000014d4.00001e20::2015/02/20-12:45:43.594 INFO [GUM] Node 1: executing request locally, gumId:281, my action: qm/set-node-weight, # of updates: 1
000014d4.000026a8::2015/02/20-12:45:43.594 INFO [RCM] numDependents is zero, auto-returning true
000014d4.00001e20::2015/02/20-12:45:43.594 WARN [QUORUM] Node 1: weight adjustment not performed. Cannot go below weight count 3 in a hybrid configuration with 2+ nodes
The last line (highlighted in red) is the most important. I guess that “hybrid configuration” here means my environment includes 2 cluster nodes plus one witness (given its type). An interesting thing to notice is a potential limitation of the dynamic witness behavior: the weight adjustment cannot be performed with fewer than two cluster nodes plus the witness. Unfortunately, I didn’t find any Microsoft documentation about this message. Is it a bug, a missing entry in the documentation, or have I overlooked something about the cluster behavior? At this point I can’t tell, and I hope to get a response from Microsoft soon. The only thing I can claim is that if I lose a cluster node, the cluster availability will be compromised. This issue is not specific to my lab environment; I have faced the same behavior several times at my customers.
Let’s demonstrate by issuing a shutdown of one of my cluster nodes. After a couple of seconds, the connection with my Windows failover cluster is lost, and here is what I found in the Windows event log:
As I said earlier, at this point, with a minimal configuration of two cluster nodes, I always recommend my customers to skip this warning. After all, having only two cluster members with dynamic quorum behavior is sufficient to get good quorum resiliency. Indeed, according to the Microsoft documentation, for the system to re-calculate the quorum correctly, a simultaneous failure of a majority of voting members must not occur (in other words, the failures of cluster members must be sequential), and with two cluster nodes we can only lose one node at a time in all cases.
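To make the sequential-failure argument concrete, here is a small model of majority voting with dynamic recalculation (a simplified sketch, not the actual cluster algorithm; in particular it omits the tie-breaker vote adjustment that lets a real cluster go from two votes down to one):

```python
def has_quorum(active_votes: int, total_votes: int) -> bool:
    """Majority rule: strictly more than half of the configured votes must be up."""
    return active_votes > total_votes // 2

def survive_sequential(total_votes: int, losses: int) -> bool:
    """Dynamic quorum: after each single loss, the configured vote count
    is recalculated down to the surviving set before the next failure."""
    active = total_votes
    for _ in range(losses):
        active -= 1                     # one voting member fails
        if not has_quorum(active, total_votes):
            return False
        total_votes = active            # dynamic quorum adjusts the vote count
    return True

# Losing 3 of 5 votes at once breaks quorum...
print(has_quorum(2, 5))                 # False
# ...but losing the same 3 votes one at a time, with a recalculation
# between each failure, keeps the cluster running.
print(survive_sequential(5, 3))         # True
```

This is why the documentation insists that failures must be sequential: each recalculation shrinks the majority threshold before the next failure occurs.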
What about more complex environments? Let’s take an FCI with 4 nodes (two cluster nodes in each datacenter) and a file share witness in the first datacenter. In contrast to the two-node scenario, if the file share witness fails in this case, the cluster correctly adjusts the overall weight configuration across both the cluster nodes and the witness. This is completely consistent with the message found above: "Cannot go below weight count 3".
The bottom line is that the dynamic witness feature is very useful, but you have to pay attention to its behavior in minimal configurations of only two cluster nodes, which may introduce unexpected results in some cases.
Happy cluster configuration!
I've written about 12c online statistics gathering in a UKOUG OracleScene article. My opinion about it is clear: you still need to gather stale statistics afterwards, or you will have missing, stale, and inconsistent object statistics. This post is about cases where online statistics gathering does not occur (and which are not documented) - which is another reason why we can't rely on it.

The case where it works
You can check the article to see how online statistics gathering works (or come to our 12c new features workshop, where we cover and practice all the 12c optimizer new features).
To do something different here, I'm showing how to trace it by activating the 0x10000 trace flag for dbms_stats:
SQL> connect demo/demo@//localhost/PDB1
Connected.
SQL> set serveroutput on
SQL> exec dbms_stats.set_global_prefs('TRACE',1+65536);

PL/SQL procedure successfully completed.

SQL> drop table DEMO;

Table dropped.

SQL> create table DEMO ( n number ) pctfree 99;

Table created.

SQL> insert /*+ append */ into DEMO select rownum from dual connect by 1000>=level;
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 1 1 1 to_char(count("N")) 100
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 2 1 1 substrb(dump(min("N"),16,0,64),1,240) 9
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 3 1 1 substrb(dump(max("N"),16,0,64),1,240) 17
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 1 1 1 to_char(count("N")) 100
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 2 1 1 substrb(dump(min("N"),16,0,64),1,240) 9
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 3 1 1 substrb(dump(max("N"),16,0,64),1,240) 17
DBMS_STATS: postprocess online optimizer stats gathering for DEMO.DEMO: save statis
DBMS_STATS: RAWIDX SELMAPPOS RES NNV NDV
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 1 1 1000 1000 0 2891 1000
DBMS_STATS: RAWIDX SELMAPPOS RES NNV NDV
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 2 2 Typ=2 Len=2: c1,2 NULL NULL
DBMS_STATS: RAWIDX SELMAPPOS RES NNV NDV
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 3 3 Typ=2 Len=2: c2,b NULL NULL
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 1 1 1 to_char(count("N")) 100
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 2 1 1 substrb(dump(min("N"),16,0,64),1,240) 9
DBMS_STATS: SELMAPPOS CLISTIDX INTCOL SELITEM GATHFLG
DBMS_STATS: ------------------------------------------------------------------------
DBMS_STATS: 3 1 1 substrb(dump(max("N"),16,0,64),1,240) 17

1000 rows created.
From the trace, we can see that online statistics gathering occurred for that direct-path load.
We can see it also in the execution plan:
SQL> select * from table(dbms_xplan.display_cursor('1k2r9n41c7xba'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------
SQL_ID 1k2r9n41c7xba, child number 0
-------------------------------------
insert /*+ append */ into DEMO select rownum from dual connect by 1000>=level

Plan hash value: 1600317434

---------------------------------------------------------------------------------
| Id | Operation                      | Name | Rows | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|  0 | INSERT STATEMENT               |      |      |     2 (100)|          |
|  1 |  LOAD AS SELECT                |      |      |            |          |
|  2 |   OPTIMIZER STATISTICS GATHERING |    |    1 |     2   (0)| 00:00:01 |
|  3 |    COUNT                       |      |      |            |          |
|  4 |     CONNECT BY WITHOUT FILTERING |    |      |            |          |
|  5 |      FAST DUAL                 |      |    1 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------
and statistics are there:
SQL> select last_analyzed,num_rows,blocks from user_tables where table_name='DEMO';

LAST_ANAL   NUM_ROWS     BLOCKS
--------- ---------- ----------
21-FEB-15       1000        179
Don't forget to set the trace off:
SQL> exec dbms_stats.set_global_prefs('TRACE',0);

PL/SQL procedure successfully completed.
Ok. That is the known case. Table statistics are there.
When you want to do some maintenance while being sure that the application is not connected, you start the database in restricted mode.
SQL> alter system enable restricted session;

System altered.
Then you can do your imports, reorgs, bulk loads, etc. and be sure that nobody will write to or read from the tables you are working on. Imagine you have tested the previous load and observed that the online gathered statistics are sufficient. Now you run the same in production, in restricted mode.
SQL> connect demo/demo@//localhost/PDB1
Connected.
SQL> set serveroutput on
SQL> exec dbms_stats.set_global_prefs('TRACE',1+65536);

PL/SQL procedure successfully completed.

SQL> drop table DEMO;

Table dropped.

SQL> create table DEMO ( n number ) pctfree 99;

Table created.

SQL> insert /*+ append */ into DEMO select rownum from dual connect by 1000>=level;

1000 rows created.
No trace related to online statistics gathering.
SQL> select * from table(dbms_xplan.display_cursor('1k2r9n41c7xba'));

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------
SQL_ID 1k2r9n41c7xba, child number 0
-------------------------------------
insert /*+ append */ into DEMO select rownum from dual connect by 1000>=level

Plan hash value: 1600317434

-------------------------------------------------------------------------------
| Id | Operation                     | Name | Rows | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|  0 | INSERT STATEMENT              |      |      |     2 (100)|          |
|  1 |  LOAD AS SELECT               |      |      |            |          |
|  2 |   COUNT                       |      |      |            |          |
|  3 |    CONNECT BY WITHOUT FILTERING |    |      |            |          |
|  4 |     FAST DUAL                 |      |    1 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------
no STATISTICS GATHERING operation.
SQL> select last_analyzed,num_rows,blocks from user_tables where table_name='DEMO';

LAST_ANAL   NUM_ROWS     BLOCKS
--------- ---------- ----------
and no statistics.
Because we can't see the STATISTICS GATHERING operation in the execution plan, I know that it's an optimizer decision made at compile time. I dumped the 10053 trace and got the following lines:
ONLINEST: Checking validity of online stats gathering
ONLINEST: Failed validity check: database not open, in restricted/migrate mode, suspended, readonly, instance not open or OCI not available.
So there are a few reasons why online statistics gathering does not occur that are not documented in "Restrictions for Online Statistics Gathering for Bulk Loads", and restricted mode is one of them.
Because the preceding line mentions OCI, I wanted to be sure that online statistics gathering occurs even when connected through thin JDBC, so I used the SQLcl beta from SQL Developer 4.1 Early Adopter. Note that I'm not in restricted session anymore.
sql.bat demo/demo@//192.168.78.113/pdb1

SQLcl: Release 4.1.0 Beta on Sat Feb 21 21:10:59 2015

Copyright (c) 1982, 2015, Oracle. All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

SQL> show jdbc
-- Database Info --
Database Product Name: Oracle
Database Product Version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
Database Major Version: 12
Database Minor Version: 1
-- Driver Info --
Driver Name: Oracle JDBC driver
Driver Version: 12.1.0.2.0
Driver Major Version: 12
Driver Minor Version: 1
Driver URL: jdbc:oracle:thin:@//192.168.78.113/pdb1
SQL> create table DEMO ( n number ) pctfree 99;

Table DEMO created.

SQL> insert /*+ append */ into DEMO select rownum from dual connect by level<=1000;

1000 rows created.

SQL> select last_analyzed,num_rows,blocks from user_tables where table_name='DEMO';

LAST_ANALYZED               NUM_ROWS     BLOCKS
--------------------------- ---------- ----------
21.02.15                          1000        100
Ok, no problem. I don't know what that 'OCI not available' refers to, but it works even through JDBC thin.
As I already said for other reasons, don't rely on online statistics gathering and always gather stale statistics afterwards. It's good to have, as it saves some of the work dbms_stats would do later, and there are cases where it is better than no statistics at all (when combined with GTT private statistics, for example). But don't rely on it.
I've been using Oracle SQL Developer 4.1 Early Adopter for a while and I like it. That version comes with a command line (in beta) whose goal is to be fully compatible with sqlplus, while running in Java and having a lot more features.
Because it connects with the thin Java driver by default, it doesn't use NLS_LANG. It's Java. It's Unicode. So here is how to set the language and character set with Java options.
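For example (a sketch: JAVA_TOOL_OPTIONS is a standard environment variable that any JVM reads at startup, so the sqlcl launcher should pick it up without modification; the locale and charset values below are only illustrative):

```shell
# Read by every JVM at startup, so sqlcl inherits these settings
# without editing its launch scripts. Adjust language/territory/charset
# to your needs.
export JAVA_TOOL_OPTIONS="-Duser.language=en -Duser.country=US -Dfile.encoding=UTF-8"
echo "$JAVA_TOOL_OPTIONS"
```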
For several months now, I've been working on some QlikView projects, which has been quite an interesting discovery for me. Generally, these projects are limited to the management of QlikView at the administrator level (installation, upgrade, configuration of the QlikView Management Console, aso...), but I was still able to accumulate some knowledge that I want to share with you today. In this blog entry, I will try to explain how to debug the QlikView Offline Service, how to properly configure access to remote Shared Folders, and how to enable Single Sign-On between QlikView and a third-party software. I will describe the required steps as best I can to avoid any problems.
I. QlikView Offline Service for QlikView 11.2 SR7 or below
In a complete QlikView environment that uses SSL (I don't know if it can happen without SSL), if you try to set up the QlikView Offline Service, you may face an issue where the Offline Service doesn't work at all. This happens even though the component was installed successfully and there are no errors in the QlikView log files. The issue comes from the fact that, by default, QlikView enforces FIPS compliance when using the Offline Service, which can cause problems depending on your enterprise network restrictions. After we reported this to the QlikView Support Team, they confirmed it was a bug and fixed it in the next QlikView version (11.2 SR8 and above). A simple workaround for this issue can be set up by following these steps:
- SSL must be properly configured
- The QlikView Offline Service must be properly installed
- Login to the Windows Server with any Administrator account
- Open the file: C:/Windows/Microsoft.NET/Framework64/v4.0.30319/Config/machine.config
- Find the line with: ˂runtime /˃
- Replace this line with:
˂runtime˃˂enforceFIPSPolicy enabled="false" /˃˂/runtime˃
- Save the file
- Open a command prompt as Administrator and execute the command: services.msc
- Restart all QlikView Services
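For reference, the resulting section of machine.config should look like the sketch below (enforceFIPSPolicy is a standard .NET runtime setting; everything else in the file stays untouched):

```xml
<configuration>
  <!-- ... other sections left unchanged ... -->
  <runtime>
    <!-- Stop the .NET runtime (and thus the QlikView Offline Service)
         from enforcing FIPS-validated cryptographic algorithms -->
    <enforceFIPSPolicy enabled="false" />
  </runtime>
</configuration>
```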
After doing so, you should be able to access your QlikView documents from a smartphone or a tablet and work offline.

II. Access to remote Shared Folders

As before, depending on your Windows Server GPOs, you may face some issues regarding access to files stored on a remote Shared Folder (accessed by the user who runs QlikView). By remote I mean another city, country, continent or whatever. This tip can help solve some Shared Folder access issues even if you aren't using QlikView; it's more of a Windows Server tip ;). Regarding QlikView, this issue can easily be spotted in the log file, because you will see something like this during a task execution:
The configuration I will show you below worked for me, but depending on your network restrictions it may not work as is. The important thing here is to understand each parameter and the consequences of this configuration:
- Login to the Windows Server with any Administrator account
- Open a command prompt as Administrator and execute the command: regedit
- Open: HKLM ˃ SYSTEM ˃ CurrentControlSet ˃ Services ˃ LanmanServer ˃ Parameters
- Set "EnableSecuritySignature" to 1
- Set "RequireSecuritySignature" to 1
- Open: HKLM ˃ SYSTEM ˃ CurrentControlSet ˃ Services ˃ LanmanWorkstation ˃ Parameters
- Set "EnableSecuritySignature" to 1
- Set "RequireSecuritySignature" to 0
- Reboot the Windows Server
As you can see, there are two different sections named "LanmanServer" and "LanmanWorkstation":
- LanmanServer controls the parameters of the current Windows Server when it acts as a server
- LanmanWorkstation controls the parameters of the current Windows Server when it acts as a client
For example, if you access a remote Shared Folder from the QlikView Windows Server, then you are acting as a client, and with this configuration you can access everything whatever the LanmanServer configuration of the Shared Folder's Windows Server is. Indeed, the local SecuritySignature is enabled but not required (Enable=1, Require=0; I will shorten this as "1-0"), which is the most generic case and covers all possible LanmanServer configurations of the remote host (three possibilities: 0-0, 1-0 or 1-1).
In the same way, if a user tries to access a Shared Folder on the QlikView Server, then the QlikView Server acts as a LanmanServer and the configuration taken into account is 1-1. This configuration can be changed, but if the LanmanWorkstation configuration of the user's laptop is 1-1, then the LanmanServer configuration needs to be 1-1 too, otherwise the user will not be able to access the Shared Folder of the QlikView Server. The 1-1 configuration is of course the most secure and therefore it's often (always?) chosen on the users' workstations. That's why it's generally a good idea to set the LanmanServer of the QlikView Server to 1-1 too.
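The enable/require combinations can be summarized with a small model (a simplified sketch of the signing negotiation, not Microsoft's actual implementation): a session only fails when one side requires signing while the other side has it disabled.

```python
from typing import NamedTuple

class SigningPolicy(NamedTuple):
    enable: bool   # EnableSecuritySignature
    require: bool  # RequireSecuritySignature

def can_connect(client: SigningPolicy, server: SigningPolicy) -> bool:
    """A session fails only when one side requires signing
    while the other side has it disabled."""
    if client.require and not server.enable:
        return False
    if server.require and not client.enable:
        return False
    return True

# The workstation policy discussed above: enabled but not required (1-0)
workstation = SigningPolicy(enable=True, require=False)

# It connects whatever the remote server's policy is:
for server in (SigningPolicy(False, False),    # 0-0
               SigningPolicy(True, False),     # 1-0
               SigningPolicy(True, True)):     # 1-1
    print(can_connect(workstation, server))    # True, True, True

# Whereas a client that requires signing cannot reach a 0-0 server:
print(can_connect(SigningPolicy(True, True), SigningPolicy(False, False)))  # False
```

With the 1-0 client policy, every server policy is reachable, which is exactly why it is the most generic client-side choice.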
It's really hard to explain this kind of stuff, but I hope I was clear enough!

III. SSO between QlikView and "X"
Again, this configuration isn't only related to QlikView, but it can be useful if you need, for example, to allow QlikView to automatically store some documents in another system "X", which can be a Document Management System (Alfresco, SharePoint, Documentum, aso...) or something else. You may not need to do this because it can be allowed by default in your enterprise, but it's generally good practice to restrict the SSO features on Windows Servers, and therefore this kind of configuration is often required. For this configuration, let's define X as a third-party software and https://x.domain.com as its URL.
From the Windows Server, if you try to access the real URL of your third-party software (e.g. for Alfresco Share it would be https://x.domain.com/share) and you get a pop-up window asking for credentials, then follow the steps below:
- Ensure that your Active Directory is properly configured for SSO (this is a very large topic and I will not describe it here)
- Login to the Windows Server with the account under which QlikView is running
- Open: Internet Explorer ˃ Internet Options ˃ Security ˃ Trusted Sites ˃ Sites
- Write: https://x.domain.com
- Click on: Add
- Write: about:blank (this step may not be mandatory)
- Click on: Add (this step may not be mandatory)
- Close the window
- Click on: Custom Level...
- Scroll down and find: User Authentication
- Set "Logon" to "Automatic logon with current username and password"
- Click on: OK
- Click on: Apply
- Click on: OK
- Restart Internet Explorer
- Ensure that the account under which QlikView is running has the proper permissions on the third party software
After that configuration, if you try again to access the real URL of your third-party software, the pop-up window should not be displayed and the login should succeed. Of course, the last step is important because the user running QlikView must have access to the third-party software, otherwise the SSO is useless...
This concludes this first approach of some QlikView Tips & Tricks. I'm sure that more will come soon but I will need to find some time to share that with you. I hope you will find this blog entry useful and don't hesitate to give me your feedback using the comments below!
We saw in my last blog how to install the SQL Server plugin for Nagios.
In this new blog, I will explain:
- how to create a wrapper to avoid the unset of the PERLLIB variable
- how to create and configure a configuration file by monitoring instance
- how to create an SQL Login for nagios
- how to subscribe to an alert
As we saw in the first Nagios blog, to be able to run the check_mssql_health plugin we have to unset the PERLLIB variable and export the library path. We unset the PERLLIB variables because they are set for the Oracle client and this creates conflicts: we force the check_mssql_health plugin to use the PERLLIB libraries of the OS instead of those of the Oracle client.
We will do that in a wrapper script:
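A minimal sketch of such a wrapper is shown below; the plugin and library paths are assumptions that must be adapted to your installation:

```shell
# Hypothetical paths -- adjust to your Nagios/FreeTDS installation.
cat > /tmp/check_mssql_health_wrapped <<'EOF'
#!/bin/bash
# Drop the Oracle client's Perl environment so the OS Perl libraries are used
unset PERLLIB
unset PERL5LIB
# Make sure the FreeTDS client libraries are found
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
exec /usr/local/nagios/libexec/check_mssql_health "$@"
EOF
chmod +x /tmp/check_mssql_health_wrapped
echo "wrapper created"
```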
Then we can test our wrapper successfully:
We have decided to create one configuration file per SQL Server instance to monitor.
This file is as follows:
The important things to see here are:
In the host definition line:
- mssql_sid with the name of the instance given in the freetds.conf file (see first nagios blog)
- mssql_usr with the user name used for the SQL Server connection
- address with the IP Address of the SQL Server host
In the command line:
- the call to the check_mssql_health_wrapped file
- the counter used in this call, here cpu-busy
- the warning threshold, fixed at 80% of CPU used
- the critical threshold, fixed at 90% of CPU used
In the service line:
- normal_check_interval which defines the interval to check the service under normal conditions, here 10 minutes
- retry_check_interval which determines the number of minutes to wait before scheduling a re-check when the service has changed to a non-OK state, here 2 minutes
- max_check_attempts: once the service has been retried max_check_attempts times without a change in its status, it reverts to being scheduled at the normal_check_interval rate and a mail is sent to the contact_groups, here 3 times
- contact_groups which will receive alerts
- notification_interval which determines how many minutes elapse between repeated alerts

This means that the cpu-busy counter will be checked every 10 minutes by Nagios. In case of a non-OK state, the next check will be scheduled after 2 minutes, and after 3 checks without change an alert will be sent to the contact group and another check will be scheduled 10 minutes later. If the status stays non-OK, the next alert will be sent after 6 hours.
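Putting the pieces together, such a per-instance configuration file could look like the sketch below (host names, addresses and paths are hypothetical, and the command assumes the wrapper described earlier; check the plugin documentation for the exact option names):

```
define host {
    use           generic-host
    host_name     sqlsrv01
    address       192.168.1.50              ; IP address of the SQL Server host
    _mssql_sid    SQLSRV01                  ; instance name from freetds.conf
    _mssql_usr    nagios_mssql_health       ; login used for the SQL Server connection
}

define command {
    command_name  check_mssql_cpu_busy
    command_line  /usr/local/nagios/libexec/check_mssql_health_wrapped --server $_HOSTMSSQL_SID$ --username $_HOSTMSSQL_USR$ --mode cpu-busy --warning 80 --critical 90
}

define service {
    use                    generic-service
    host_name              sqlsrv01
    service_description    CPU busy
    check_command          check_mssql_cpu_busy
    normal_check_interval  10       ; minutes between checks in OK state
    retry_check_interval   2        ; minutes before a re-check in non-OK state
    max_check_attempts     3        ; retries before notifications start
    contact_groups         dba-group
    notification_interval  360      ; re-notify every 6 hours while non-OK
}
```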
A lot of counters are available for this plugin; for a list, please go here.
An SQL login can be created to connect to the SQL Server instance and databases, to avoid using the sa user for example.
First create a windows domain account:
Execute the following script in an SSMS query window to create the new login and grant permissions:

USE [master]
CREATE LOGIN [DBITEST\nagios_mssql_health] FROM WINDOWS WITH DEFAULT_DATABASE=[master]
GRANT VIEW SERVER STATE to [DBITEST\nagios_mssql_health]
CREATE ROLE db_nagios_mssql_health
GRANT EXECUTE TO db_nagios_mssql_health
GRANT VIEW DATABASE STATE TO db_nagios_mssql_health
GRANT VIEW DEFINITION TO db_nagios_mssql_health
CREATE USER [DBITEST\nagios_mssql_health] FOR LOGIN [DBITEST\nagios_mssql_health]
EXEC sp_addrolemember 'db_nagios_mssql_health', [DBITEST\nagios_mssql_health]
Afterwards, execute the following script to grant permissions at the database level. Be careful: if a new database is created later, this script has to be run again for that new database.

execute sp_MSforeachdb 'use [?]
CREATE ROLE db_nagios_mssql_health
GRANT EXECUTE TO db_nagios_mssql_health
GRANT VIEW DATABASE STATE TO db_nagios_mssql_health
GRANT VIEW DEFINITION TO db_nagios_mssql_health
CREATE USER [DBITEST\nagios_mssql_health] FOR LOGIN [DBITEST\nagios_mssql_health]
EXEC sp_addrolemember ''db_nagios_mssql_health'', [DBITEST\nagios_mssql_health]'
To subscribe to an alert, new groups and contacts can be defined.
Those objects will be created in a contact configuration file which will be added to the config.cfg file of nagios with this line:
The contact configuration file has the following structure:
I hope this information will help you configure the SQL Server plugin for Nagios and give you service status details as in the picture below:
I don’t work often with SQL Server replication. The main reason is that the number of customers using replication is pretty low, and each time I have to deal with it, it's a good opportunity to improve my skills in this area. A couple of months ago I had to face an interesting issue with snapshot replication (yes, I was lucky ... this is not the most complex replication mechanism you have to deal with). My customer had a lot of databases (approximately 250) to replicate between two datacenters on the same continent.
The global architecture includes two servers that act as publishers, with a total of 250 articles (one table in one database each), and a central subscriber that gathers these 250 articles (250 databases with one table in each database). All articles use push subscriptions and all replication jobs run on the publishers. This replication process is part of a wider ETL process which delivers financial metric reports to the business users.
Before computing the financial metrics, we need to ensure that all database replications have completed correctly. Unfortunately, due to the customer's security context, we are stuck because we may not have access to the different publishers. So the main remaining question is: how do we ensure that we don't start the calculation of the financial metrics while database replication processes are still running, using only the subscriber side?
After reflection, I had a solution, but it is not perfect; I will explain why later in this blog post. My solution includes two things:
1- Recording the last time the concerned table was created. As a reminder, snapshot replication recreates the concerned articles on the subscriber.
2- Tracking bulk insert activity issued by the snapshot agent before performing the calculation of the financial metrics
Here is my script:
SELECT DB_NAME() as database_name,
       tb.name as table_name,
       tb.create_date as last_creation_date,
       txt.text as current_statement,
       txt.program_name as interface,
       txt.request_mode,
       txt.request_owner_type,
       txt.request_status
FROM sys.tables as tb
OUTER APPLY (
    SELECT db_name(r.database_id) as database_name,
           t.text,
           s.program_name,
           l.request_mode,
           l.request_status,
           l.request_owner_type
    FROM sys.dm_exec_requests AS r
    JOIN sys.dm_exec_sessions as s on r.session_id = s.session_id
    JOIN sys.dm_tran_locks as l on l.request_session_id = s.session_id
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE t.text LIKE 'insert bulk %' + tb.name + '%TABLOCK%'
      AND r.session_id <> @@SPID
      AND l.resource_type = 'OBJECT'
      AND l.resource_database_id = r.database_id
      AND l.resource_associated_entity_id = tb.object_id
) AS txt
WHERE tb.name = 'mytable'
I finally created a stored procedure that iterates over each concerned database and executes the above script in the context of that database. For my customer, I implemented an additional step which sends an email on a half-day basis. Notice also that I added a custom business rule that reports a replication issue if no replication has occurred during the last 6 hours. You can modify and adjust the script at your convenience.
SET NOCOUNT ON;

DECLARE @database_name SYSNAME;
DECLARE @sql NVARCHAR(MAX);

-- working table : alert_replication_monitoring
TRUNCATE TABLE msdb.dbo.alert_replication_monitoring;

-- for each concerned database we will verify if the t_replixxx table is updated
-- from snapshot replication
DECLARE C_DB CURSOR FAST_FORWARD FOR
SELECT name FROM sys.databases WHERE name LIKE 'repli_%';

OPEN C_DB;

FETCH NEXT FROM C_DB INTO @database_name;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'USE ' + QUOTENAME(@database_name) + '
    IF EXISTS (SELECT 1 FROM sys.tables WHERE name LIKE N''t_repli%'')
    BEGIN
        INSERT INTO msdb.dbo.alert_replication_monitoring
        SELECT DB_NAME() as database_name, tb.name as table_name, tb.create_date as last_creation_date,
               txt.text as current_statement, txt.program_name as interface,
               txt.request_mode, txt.request_owner_type, txt.request_status
        FROM sys.tables as tb
        OUTER APPLY (
            SELECT db_name(r.database_id) as database_name, t.text, s.program_name,
                   l.request_mode, l.request_status, l.request_owner_type
            FROM sys.dm_exec_requests AS r
            JOIN sys.dm_exec_sessions as s on r.session_id = s.session_id
            JOIN sys.dm_tran_locks as l on l.request_session_id = s.session_id
            CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
            WHERE t.text LIKE ''insert bulk %'' + tb.name + ''%TABLOCK%''
              AND r.session_id <> @@SPID
              AND l.resource_type = ''OBJECT''
              AND l.resource_database_id = r.database_id
              AND l.resource_associated_entity_id = tb.object_id
        ) AS txt
        WHERE tb.name LIKE ''t_repli%''
    END';

    EXEC sp_executesql @sql;

    FETCH NEXT FROM C_DB INTO @database_name;
END

CLOSE C_DB;
DEALLOCATE C_DB;

-- Check status of each database replication
SELECT *
FROM (
    SELECT database_name AS [database],
           table_name AS [table],
           last_creation_date AS [last synchro],
           DATEDIFF(MI, last_creation_date, CURRENT_TIMESTAMP) AS [time since the last update (minutes)],
           current_statement,
           CASE
               WHEN DATEDIFF(MI, last_creation_date, CURRENT_TIMESTAMP) <= 360 AND current_statement IS NOT NULL
                   THEN '1 - Replication in progress.'
               WHEN DATEDIFF(MI, last_creation_date, CURRENT_TIMESTAMP) <= 360 AND current_statement IS NULL
                   THEN '2 - Replication done.'
               WHEN DATEDIFF(MI, last_creation_date, CURRENT_TIMESTAMP) > 360 AND current_statement IS NOT NULL
                   THEN '3 - Replication in progress and takes longer than expected (6 hours).'
               ELSE '4 - No replication has occurred during the last six hours.'
           END replication_status
    FROM msdb.dbo.alert_replication_monitoring
) AS repli
ORDER BY [database], replication_status DESC;
Here is a picture of the script result while the snapshot agent is running...
... and when there is no activity, but we still track the last time the table was synchronized.
As I said earlier, this method has one main caveat:
We can only claim that a replication process is not running at a given time; we cannot know whether the replication process completed correctly or with errors.
Remember that this is more a workaround than a perfect solution.
Hope it helps! Please feel free to share your comments about this script!
There has been the announcement that the non-CDB architecture is deprecated, and the reaction that CDB is not yet stable.
Well, let's talk about the major issue I've encountered. Multitenant is there for consolidation. What is the major requirement of consolidation? It's availability. If you put all your databases on one server, managed by one instance, then you don't expect the failure of one database to impact the others.
When 12c was out (and even earlier, as we were beta testers) - in 12.1.0.1 - David Hueber encountered an important issue: when a SYSTEM datafile was lost, we could not recover it without stopping the whole CDB. That's bad, of course.
When Patchset 1 was out (and we were beta testers again) I tried to check whether that had been solved. I saw that they had introduced the undocumented "_enable_pdb_close_abort" parameter in order to allow a shutdown abort of a PDB. But it was worse: when I dropped a SYSTEM datafile, the whole CDB instance crashed immediately. I opened an SR and Bug 19001390 'PDB system tablespace media failure causes the whole CDB to crash' was created for it. All is documented in that blog post.
Now the bug status is: fixed in 12.1.0.2.1 (Oct 2014) Database Patch Set Update
Good. I've installed the latest PSU, which is 12.1.0.2.2 (Jan 2015), and I test the most basic recovery situation: loss of a non-system tablespace in one PDB.
Here it is:
RMAN> report schema;
Report of database schema for database with db_unique_name CDB
List of Permanent Datafiles
File Size(MB) Tablespace RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1 800 SYSTEM YES /u02/oradata/CDB/system01.dbf
3 770 SYSAUX NO /u02/oradata/CDB/sysaux01.dbf
4 270 UNDOTBS1 YES /u02/oradata/CDB/undotbs01.dbf
5 250 PDB$SEED:SYSTEM NO /u02/oradata/CDB/pdbseed/system01.dbf
6 5 USERS NO /u02/oradata/CDB/users01.dbf
7 490 PDB$SEED:SYSAUX NO /u02/oradata/CDB/pdbseed/sysaux01.dbf
11 260 PDB2:SYSTEM NO /u02/oradata/CDB/PDB2/system01.dbf
12 520 PDB2:SYSAUX NO /u02/oradata/CDB/PDB2/sysaux01.dbf
13 5 PDB2:USERS NO /u02/oradata/CDB/PDB2/PDB2_users01.dbf
14 250 PDB1:SYSTEM NO /u02/oradata/CDB/PDB1/system01.dbf
15 520 PDB1:SYSAUX NO /u02/oradata/CDB/PDB1/sysaux01.dbf
16 5 PDB1:USERS NO /u02/oradata/CDB/PDB1/PDB1_users01.dbf
List of Temporary Files
File Size(MB) Tablespace Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1 60 TEMP 32767 /u02/oradata/CDB/temp01.dbf
2 20 PDB$SEED:TEMP 32767 /u02/oradata/CDB/pdbseed/pdbseed_temp012015-02-06_07-04-28-AM.dbf
3 20 PDB1:TEMP 32767 /u02/oradata/CDB/PDB1/temp012015-02-06_07-04-28-AM.dbf
4 20 PDB2:TEMP 32767 /u02/oradata/CDB/PDB2/temp012015-02-06_07-04-28-AM.dbf
RMAN> host "rm -f /u02/oradata/CDB/PDB1/PDB1_users01.dbf";
host command complete
RMAN> alter system checkpoint;
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00601: fatal error in recovery manager
RMAN-03004: fatal error during execution of command
ORA-01092: ORACLE instance terminated. Disconnection forced
RMAN-03002: failure of sql statement command at 02/19/2015 22:51:55
ORA-03113: end-of-file on communication channel
Process ID: 19135
Session ID: 357 Serial number: 41977
ORACLE error from target database:
ORA-03114: not connected to ORACLE
Ok, but I have the PSU:
$ /u01/app/oracle/product/12102EE/OPatch/opatch lspatches
19769480;Database Patch Set Update : 12.1.0.2.2 (19769480)
Here is the alert.log:
Completed: alter database open
2015-02-19 22:51:46.460000 +01:00
Shared IO Pool defaulting to 20MB. Trying to get it from Buffer Cache for process 19116.
Dumping current patch information
Patch Id: 19769480
Patch Description: Database Patch Set Update : 12.1.0.2.2 (19769480)
Patch Apply Time: 2015-02-19 22:14:05 GMT+01:00
Bugs Fixed: 14643995,16359751,16870214,17835294,18250893,18288842,18354830,
2015-02-19 22:51:51.113000 +01:00
db_recovery_file_dest_size of 4560 MB is 18.72% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Setting Resource Manager plan SCHEDULER[0x4446]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager CDB plan DEFAULT_MAINTENANCE_PLAN via parameter
2015-02-19 22:51:54.892000 +01:00
Errors in file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_ckpt_19102.trc:
ORA-63999: data file suffered media failure
ORA-01116: error in opening database file 16
ORA-01110: data file 16: '/u02/oradata/CDB/PDB1/PDB1_users01.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Errors in file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_ckpt_19102.trc:
ORA-63999: data file suffered media failure
ORA-01116: error in opening database file 16
ORA-01110: data file 16: '/u02/oradata/CDB/PDB1/PDB1_users01.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
USER (ospid: 19102): terminating the instance due to error 63999
System state dump requested by (instance=1, osid=19102 (CKPT)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_diag_19090_20150219225154.trc
ORA-1092 : opitsk aborting process
2015-02-19 22:52:00.067000 +01:00
Instance terminated by USER, pid = 19102
You can see the bug number in the 'Bugs Fixed' list, and yet the instance still terminates after a media failure on a PDB datafile. That's bad news.
I've lost one datafile, and at the first checkpoint the whole CDB crashed. I'll have to open an SR again. For sure, consolidation through the multitenant architecture is not yet ready for sensitive production.
In the previous post we have seen how the OracleText index tables are maintained when new documents arrive: at sync, the new documents are read, up to the available memory, and the words are inserted into the $I table with their mapping information. Now we will see how removed documents are processed. We will not cover updates, as they are just a delete + insert.

Previous state
Here is the state from the previous post where I had those 3 documents:
- 'Hello World'
- 'Hello Moon, hello, hello'
- 'Hello Mars'
SQL> select * from DR$DEMO_CTX_INDEX$K;

     DOCID TEXTKEY
---------- ------------------
         1 AAAXUtAAKAAABWlAAA
         2 AAAXUtAAKAAABWlAAB
         3 AAAXUtAAKAAABWlAAC

The $R table maps the opposite navigation (docid to rowid): it stores a fixed-length array of ROWIDs indexed by docid, split into several lines:
SQL> select * from DR$DEMO_CTX_INDEX$R;

    ROW_NO DATA
---------- ------------------------------------------------------------
         0 00001752D000280000056940414100001752D00028000005694041420000

The $I table stores the tokens; the first 5 columns are indexed ($X) and the TOKEN_INFO blob stores the detailed location of each token:
SQL> select * from DR$DEMO_CTX_INDEX$I;

TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
---------- ---------- ----------- ---------- ----------- ----------
HELLO               0           1          1           1 008801
WORLD               0           1          1           1 008802
HELLO               0           2          3           2 0098010201
MARS                0           3          3           1 008802
MOON                0           2          2           1 008802

We have seen that the $I table can be fragmented for 3 reasons:
- Each sync inserts its tokens (instead of merging them with existing ones)
- TOKEN_INFO size is limited to fit in-row (we will see 12c new features later)
- Only tokens that fit in the allocated memory can be merged
SQL> select * from DR$DEMO_CTX_INDEX$N;

no rows selected

Delete
Do you remember how inserts are going to the CTXSYS.DR$PENDING table? Deletes are going to CTXSYS.DR$DELETE table:
SQL> delete from DEMO_CTX_FRAG where num=0002;

1 row deleted.

SQL> select * from CTXSYS.DR$DELETE;

DEL_IDX_ID DEL_IXP_ID  DEL_DOCID
---------- ---------- ----------
      1400          0          2

I've deleted docid=2 but the tokens are still there:
SQL> select * from DR$DEMO_CTX_INDEX$I;

TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
---------- ---------- ----------- ---------- ----------- ----------
HELLO               0           1          1           1 008801
WORLD               0           1          1           1 008802
HELLO               0           2          3           2 0098010201
MARS                0           3          3           1 008802
MOON                0           2          2           1 008802

as well as their mapping to the ROWID:
SQL> -- $R is for rowid - docid mapping (IOT)
SQL> select * from DR$DEMO_CTX_INDEX$R;

    ROW_NO DATA
---------- ------------------------------------------------------------
         0 00001752D000280000056940414100001752D00028000005694041420000

However, the $N table has been maintained to record that docid=2 has been removed:
SQL> select * from DR$DEMO_CTX_INDEX$N;

 NLT_DOCID N
---------- -
         2 U

This is the goal of the $N (Negative) table: it records the docids that should no longer be there and must be deleted at the next optimization (garbage collection).
From there, a search by words ('normal lookup') will give docids and rowids, and CTXSYS.DR$DELETE must be read in order to know that the document is not there anymore. It's an IOT, and the docid can be found with an index unique scan.
However, for the opposite way, having a ROWID and checking whether it contains some words ('functional lookup'), we need to know that the document is gone. In my case I deleted the row, but you may have updated the document instead, so the ROWID would still be there. There is no pending table for that: it is maintained immediately in the $K table:
SQL> select * from DR$DEMO_CTX_INDEX$K;

     DOCID TEXTKEY
---------- ------------------
         1 AAAXUtAAKAAABWlAAA
         3 AAAXUtAAKAAABWlAAC

The entry that addressed docid=2 has been deleted.

Commit
All those changes were done within the same transaction. Other sessions still see the old values and have no need to read CTXSYS.DR$DELETE. What I described above applies only to my session: the normal lookup reads the queuing table, and the functional lookup stops at $K. We don't have to wait for a sync to process CTXSYS.DR$DELETE; it's done at commit:
SQL> commit;

Commit complete.

SQL> select * from CTXSYS.DR$DELETE;

no rows selected

SQL> select * from DR$DEMO_CTX_INDEX$R;

    ROW_NO DATA
---------- ------------------------------------------------------------
         0 00001752D000280000056940414100000000000000000000000000000000

Of course we can't read the DATA column directly, but we can see that part of it has been zeroed. That $R table is definitely special: it's not stored in a relational way, and its maintenance is deferred to commit time.
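To make that zeroing concrete, here is a small Python sketch. The 14-byte slot width and the docid numbering are my assumptions read from the hex dumps above, not a documented format; the idea is simply that a zeroed slot marks a removed docid:

```python
ROWID_BYTES = 14  # assumed fixed slot width, from the hex dumps above

def deleted_docids(data_hex):
    """Return the docids whose $R slot has been zeroed.

    Assumes chunk 0 of the array, with docids starting at 1."""
    raw = bytes.fromhex(data_hex)
    zeroed = []
    for i in range(len(raw) // ROWID_BYTES):
        slot = raw[i * ROWID_BYTES:(i + 1) * ROWID_BYTES]
        if slot == b"\x00" * ROWID_BYTES:
            zeroed.append(i + 1)
    return zeroed

# the DATA value seen after the commit above
after_commit = "00001752D000280000056940414100000000000000000000000000000000"
print(deleted_docids(after_commit))  # → [2]
```

Running it on the pre-commit value returns an empty list, which matches what we observed: the slot for docid=2 is only zeroed at commit.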
But nothing has changed in $I, which still contains garbage (and sync does not change anything about that):
SQL> select * from DR$DEMO_CTX_INDEX$I;

TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
---------- ---------- ----------- ---------- ----------- ----------
HELLO               0           1          1           1 008801
WORLD               0           1          1           1 008802
HELLO               0           2          3           2 0098010201
MARS                0           3          3           1 008802
MOON                0           2          2           1 008802

And of course the $N row is still there to record the deleted docid:
SQL> select * from DR$DEMO_CTX_INDEX$N;

 NLT_DOCID N
---------- -
         2 U

Sync
I've not reproduced it here, but sync changes nothing: sync is for new documents, not for deleted ones.

Conclusion

What you need to remember here is:
- New documents are made visible through the OracleText index at sync
- Removed documents are made invisible immediately at commit
- New documents bring fragmentation
- Removed documents bring garbage
I plan to write several posts about OracleText indexes, a feature that in my opinion is not used enough. It's available in all editions and can index small text or large documents to search by words. When you create an OracleText index, a few tables are created to store the words and the association between those words and the table rows that contain the documents. I'll start by showing how document inserts are processed.

Create the table and index
I'm creating a simple table with a CLOB:

SQL> create table DEMO_CTX_FRAG (num number constraint DEMO_CTX_FRAG_PK primary key,txt clob);

Table created.

and a simple OracleText index on that column:
SQL> create index DEMO_CTX_INDEX on DEMO_CTX_FRAG(txt) indextype is ctxsys.context;

Index created.

That creates the following tables:
- DR$DEMO_CTX_INDEX$I which stores the tokens (e.g. words)
- DR$DEMO_CTX_INDEX$K which indexes the documents (docid) and links them to the table ROWID
- DR$DEMO_CTX_INDEX$R which stores the opposite navigation (get the ROWID from a docid)
- DR$DEMO_CTX_INDEX$N which stores the docids for deferred maintenance cleanup.
I'm inserting a row with some text in the CLOB column:

SQL> insert into DEMO_CTX_FRAG values (0001,'Hello World');

1 row created.

and I commit:
SQL> commit;

Commit complete.

And here is what we have in the OracleText tables:
SQL> select * from DR$DEMO_CTX_INDEX$K;

no rows selected

SQL> select * from DR$DEMO_CTX_INDEX$R;

no rows selected

SQL> select * from DR$DEMO_CTX_INDEX$I;

no rows selected

SQL> select * from DR$DEMO_CTX_INDEX$N;

no rows selected

Nothing is stored here yet, which means that we cannot find our newly inserted row with an OracleText search.
By default, the OracleText tables are maintained asynchronously.
The inserted row is referenced in a CTXSYS queuing table that stores the pending inserts:
SQL> select * from CTXSYS.DR$PENDING;

   PND_CID    PND_PID PND_ROWID          PND_TIMES P
---------- ---------- ------------------ --------- -
      1400          0 AAAXUtAAKAAABWlAAA 13-FEB-15 N

and we have a view over it:
SQL> select pnd_index_name,pnd_rowid,pnd_timestamp from ctx_user_pending;

PND_INDEX_NAME                 PND_ROWID          PND_TIMES
------------------------------ ------------------ ---------
DEMO_CTX_INDEX                 AAAXUtAAKAAABWlAAA 13-FEB-15

Synchronization
SQL> exec ctx_ddl.sync_index('DEMO_CTX_INDEX');

PL/SQL procedure successfully completed.

The queuing table has been processed:
SQL> select pnd_index_name,pnd_rowid,pnd_timestamp from ctx_user_pending;

no rows selected

and here is how that document is stored in our OracleText tables.
$K records one document (docid=1) and the table rowid that contains it:
SQL> select * from DR$DEMO_CTX_INDEX$K;

     DOCID TEXTKEY
---------- ------------------
         1 AAAXUtAAKAAABWlAAA

The $R table stores the docid -> rowid mapping in a non-relational way:
SQL> select * from DR$DEMO_CTX_INDEX$R;

    ROW_NO DATA
---------- ------------------------------------------------------------
         0 00001752D0002800000569404141

How is it stored? It's an array of fixed-length ROWIDs, so from a docid we can go directly to the right offset and read the ROWID. Because DATA is limited to 4000 bytes, there are several rows, but a docid determines the ROW_NO as well as the offset within DATA.
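Since the slots are fixed length, locating a ROWID from a docid is simple arithmetic. A minimal Python sketch of the idea (the 14-byte slot width is an assumption read from the hex dump above; the real internal format may differ):

```python
ROWID_BYTES = 14    # assumed slot width, from the hex dump above
CHUNK_BYTES = 4000  # DATA is limited to 4000 bytes to stay inline

def locate(docid):
    """Return (ROW_NO, byte offset in DATA) for a docid; docids start at 1."""
    slots_per_row = CHUNK_BYTES // ROWID_BYTES
    idx = docid - 1
    return idx // slots_per_row, (idx % slots_per_row) * ROWID_BYTES

print(locate(1))  # → (0, 0): first slot of ROW_NO 0
print(locate(2))  # → (0, 14): second slot of ROW_NO 0
```

This is why no relational lookup is needed: the docid alone determines both the row and the position inside the blob.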
$I stores the tokens (which are the words here, as we have TEXT tokens - type 0) as well as their location information:
SQL> select * from DR$DEMO_CTX_INDEX$I;

TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
---------- ---------- ----------- ---------- ----------- ----------
HELLO               0           1          1           1 008801
WORLD               0           1          1           1 008802

For each word it stores the range of docids that contain the word (TOKEN_FIRST and TOKEN_LAST are those docids), and TOKEN_INFO stores in a binary way the occurrences of the word within the documents (pairs of docid and offset within the document). It's a BLOB, but it is limited to 4000 bytes so that it is stored inline. This means that if a token is present in a lot of documents, several lines in $I will be needed, each covering a different range of docids. This has changed in 12c and we will see that in future blog posts.
Thus, we can have several rows for one token. This is the first cause of fragmentation: searching for documents that contain such a word will have to read several lines of the $I table. The $N table has nothing here because we synchronized only inserts, so there is nothing to clean up.
SQL> select * from DR$DEMO_CTX_INDEX$N; no rows selected
Now I will insert two more rows, which also contain the word 'hello'.
SQL> insert into DEMO_CTX_FRAG values (0002,'Hello Moon, hello, hello');

1 row created.

SQL> insert into DEMO_CTX_FRAG values (0003,'Hello Mars');

1 row created.

SQL> commit;

Commit complete.

And I synchronize:
SQL> exec ctx_ddl.sync_index('DEMO_CTX_INDEX');

PL/SQL procedure successfully completed.

So I now have 3 documents:
SQL> select * from DR$DEMO_CTX_INDEX$K;

     DOCID TEXTKEY
---------- ------------------
         1 AAAXUtAAKAAABWlAAA
         2 AAAXUtAAKAAABWlAAB
         3 AAAXUtAAKAAABWlAAC

The reverse mapping array has increased:
SQL> select * from DR$DEMO_CTX_INDEX$R;

    ROW_NO DATA
---------- ------------------------------------------------------------
         0 00001752D000280000056940414100001752D00028000005694041420000

And now the tokens:
- 'moon' is only in docid 2
- 'mars' is only in docid 3
- 'hello' is in 2 (token_count) documents, from docid 2 to docid 3 (token_first and token_last)
This is the other cause of fragmentation, coming from frequent syncs: each sync adds new rows. However, when multiple documents are processed in the same sync, only one $I entry per token is needed.
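The behavior can be sketched in a few lines of Python. This is a simplified model of the $I maintenance described above, not Oracle's actual code: each sync builds postings for its own documents only and appends its rows instead of merging them with existing ones.

```python
from collections import defaultdict

def sync(index_rows, docs):
    # one sync: build postings for the new docs only, then append one
    # $I-style row per token (TOKEN_TEXT, TOKEN_FIRST, TOKEN_LAST, TOKEN_COUNT)
    postings = defaultdict(list)
    for docid, text in docs:
        for word in set(text.upper().split()):
            postings[word].append(docid)
    for token, ids in sorted(postings.items()):
        index_rows.append((token, min(ids), max(ids), len(ids)))

rows = []
sync(rows, [(1, "Hello World")])                    # first sync
sync(rows, [(2, "Hello Moon"), (3, "Hello Mars")])  # second sync
print([r for r in rows if r[0] == "HELLO"])
# → [('HELLO', 1, 1, 1), ('HELLO', 2, 3, 2)]: two rows for one token
```

Two syncs produce two HELLO rows, matching the $I content shown earlier, while one sync of all three documents would have produced a single HELLO row covering docids 1 to 3.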
There is a third cause of fragmentation. We see here that TOKEN_INFO is larger for the HELLO entry covering docids 2 to 3 because there are several occurrences of the token. All of that must fit in memory when we synchronize. So it's good to synchronize when we have several documents (so that the common tokens are not too fragmented), but we also need enough memory. The default is 12M, which is usually too small; it can be increased with the 'index memory' parameter of the index. There is also a maximum, set by ctx_adm.set_parameter, for which the default (50M) is probably also too low.
Nothing yet in the $N table that we will see in the next post:
SQL> select * from DR$DEMO_CTX_INDEX$N;

no rows selected

Summary
The important point here is that inserted documents are visible only after synchronization, and synchronizing too frequently causes fragmentation. If you need to synchronize in real time (on commit) and you commit for each document inserted, then you will probably have to plan frequent index optimizations. If, on the other hand, you are able to synchronize only after inserting a lot of documents, then fragmentation is reduced, provided that you have enough memory to process all the documents in one pass.
The next post will be about deletes and updates.
Nagios is an IT Infrastructure monitoring solution which is able to monitor SQL Server instances with a specific plugin.
I have installed this plugin to test its functionality, and I will explain here how to do it.
FreeTDS is a set of libraries for Unix and Linux that allows your programs to natively talk to Microsoft SQL Server and Sybase databases.
First we will download this set of libraries. Go to the FreeTDS website and click on Stable Release:
The file freetds-stable.tgz will be downloaded.
Once the file is downloaded, go to the nagios download directory and uncompress the tgz file:
Go to the FreeTDS directory and list the files:
Prepare and check the libraries before compilation:
Now we will compile and install the FreeTDS libraries:
Please check that you don't have any errors.
If there are none, FreeTDS is now installed.
DBD::Sybase is a Perl module which works with the DBI module to provide access to Sybase databases.
To download this module go to http://search.cpan.org/~mewp/DBD-Sybase-1.15/Sybase.pm and click the download link:
When the file DBD-Sybase-1.15.tar.gz is downloaded, go to the download directory, uncompress the file, and move into the new directory:
We will now compile and install this new module:
The module is now installed. Check that you did not receive any errors.
Create a symbolic link:
We can now edit the freetds.conf file to:
- change the TDS version, following the security advice (follow the link and find the chapter Security Advice)
- add the SQL Server parameters: IP address, TCP/IP port, and instance name if it is not the default instance
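For illustration, a freetds.conf server entry typically looks like the following; the server name, IP address and port here are placeholders to adapt to your own instance, and the 'tds version' value follows the security advice mentioned above:

```
[SQLSERVER01]
        host = 192.168.1.50
        port = 1433
        tds version = 7.1
```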
The Perl module is installed.
Prerequisites are now installed and we have to install the SQL Server plugin.
First, we have to download the SQL Server plugin for Nagios. To do so, connect here and click the link below:
When the file is downloaded:
- go to the download directory
- list the files
- uncompress the file
- list the files again to see the new plugin directory
- change to the new directory
The next command to type is ./configure, which checks that all dependencies are present and writes a Makefile containing the build instructions.
Then we have to run make, which performs the compilation (it can be run as a normal user):
The next command to run is make install, which installs the plugin (it has to be run as root):
The SQL Server Nagios plugin is successfully installed.

Set environment
As user root, change the permissions on the file freetds.conf so that the nagios user is able to add a server/instance:
Add the library path /usr/local/freetds/lib to the file /etc/ld.so.conf:
As user nagios, export the FreeTDS libraries in the LD_LIBRARY_PATH environment variable:
Test the connection:
The connection failed due to a wrong PERLLIB path...
To resolve this error, unset the PERLLIB variable.
We force the check_mssql_health plugin to use the Perl libraries of the OS instead of those of the Oracle client by unsetting the PERLLIB variable set for the check_oracle_health plugin:
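The environment setup for the nagios user can be sketched as follows; the FreeTDS install path is an assumption matching the steps above:

```shell
# add the FreeTDS libraries to the library search path
export LD_LIBRARY_PATH=/usr/local/freetds/lib:$LD_LIBRARY_PATH

# drop the Oracle client's Perl libraries so the OS ones are used
unset PERLLIB
```

These lines can be added to the nagios user's profile so the settings survive a new login.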
Verification of the connection:
The next steps will be to test this plugin: checking which counters can be used, how to configure them, with which thresholds, and how to subscribe to alerts...
Those steps will be covered in a future blog post.
When you install SQL Server, you have to consider how to place the database files. At this point you will probably meet the well-known best practices and guidelines provided by Microsoft but are you really aware of the storage vendor guidelines? Let’s talk about it in this blog post.
A couple of weeks ago, I had the opportunity to implement a new SQL Server architecture on a NetApp storage model FAS 3140. The particularity here is that my customer wants to manage his backups with NetApp SnapManager for SQL Server. This detail is very important because in this case we have to meet some requirements concerning the storage layout. After reading the NetApp storage documents, I was ready to begin placing my database files.
First of all, we should know how SnapManager works. In fact, SnapManager performs SQL Server backups by using either a snapshot of the database files or streaming backups. Streaming backups concern only the system database files and the transaction log files. To achieve this task, SnapManager coordinates several components, like SnapDrive for Windows to control the storage LUNs and SQL Server to freeze IO by using the VSS framework. It is important to point out that without SnapManager, performing snapshots from a pure storage perspective is still possible, but unfortunately in this case the storage layer is not aware of the database file placement, and we may find in such a situation that a database is not consistent. Here is a simplified scheme of what happens during the SnapManager backup process:
SnapManager (1) --> VDI (2) - backup database .. with snapshot --> VSS (3) - database files IO freeze --> SnapDrive (4) - snapshot creation
So why do we have to take care of database file placement with NetApp storage? Well, according to our explanation above, when we perform snapshot backups the unit of work is in fact the volume (FlexVol is probably the better term here: a FlexVol is a kind of virtual storage pool used to manage physical storage efficiently, and it can include one or more LUNs). In other words, when a snapshot is performed by SnapManager, all related LUNs are considered together and all the concerned database files in a single volume are frozen when the snapshot is created. Notice that if a database is spread across several volumes, IO is frozen for all concerned volumes at the same time, but the snapshots are taken serially.
This concept is very important, and as database administrators we now also have to deal with these additional constraints (which I will describe below) when placing our database files, considering the storage layout and the RPO / RTO needs.
1- System database files must be placed on a dedicated volume including their respective transaction log files. Only streaming backups are performed for system databases
2- Tempdb database files must also be placed on a dedicated volume that will be excluded from SnapManager backup operations.
3- Otherwise, the placement of user databases depends on several factors, such as their size or the storage layout architecture in place. For example, we may encounter cases with small databases that share all their database files on the same volume. We may also encounter cases where we have to deal with the placement of large database files. Indeed, large databases often include different filegroups with several database files. In this case, spreading database files over different LUNs may be a performance best practice, and we have to design the storage with different LUNs that share the same dedicated volume. Also note that the database transaction log files have to be on a separate volume (remember that transaction log files are concerned by streaming backups).
4- In consolidated scenario, we may have several SQL Server instances on the same server with database files on the same volume or we may have dedicated volumes for each instance.
Regarding these constraints, keep in mind that your placement strategy may present some advantages but also some drawbacks:
For example, restoring a database on a dedicated volume from a snapshot will be quicker than using streaming backups, but at the same time we may quickly reach the maximum number of drive letters we can use if we dedicate one or several volumes per database. Imagine that you have 5 databases, each spread over 4 volumes (system, tempdb, data and logs)... I'll let you do the math. A possible solution is to replace the letter-oriented volume identification with mount points.
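The math is quick to do. A rough sketch, where the 4-volumes-per-database layout and the number of usable drive letters are assumptions for illustration:

```python
databases = 5
volumes_per_db = 4        # system, tempdb, data, logs
usable_drive_letters = 23 # 26 letters minus A:, B:, C:

volumes_needed = databases * volumes_per_db
print(volumes_needed)                          # → 20
print(volumes_needed <= usable_drive_letters)  # → True, but barely
# one more database (6 * 4 = 24 volumes) and you are out of letters
```

This is why mount points, which are not limited to 26 letters, are the usual way out of this constraint.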
At the same time, sharing volumes between several databases makes the in-place restore of only one database more difficult, because in this case we have to copy data from a mounted snapshot back into the active file system. And increasing the number of shared files on a single volume may cause timeout issues during the snapshot operation. According to the NetApp Knowledge Base, the maximum recommended number of databases on a single volume is 35. If you want to implement a policy that verifies that the number of databases is lower than this threshold, please see the blog post of my colleague Stéphane Haby. The number of files concerned by a snapshot creation, the size of the volumes and the storage performance are all factors to consider during database file placement.
After digesting all these considerations, let me briefly present my context: 3 SQL Server instances on a consolidated server. We designed the storage layout as follows:
One volume for the system databases, one for tempdb, one for the user database data files, one for the user database log files, and finally one volume dedicated to the snapinfo. The snapinfo volume is used by SnapManager to store streaming backups and some metadata.
But according to what I said earlier, my configuration seems to have some caveats. Indeed, restoring only one database in-place is possible in this case, but because we're actually sharing volumes between several databases and several SQL Server instances, we will force SnapManager to copy data from a mounted snapshot back into the file system. We may also face the timeout issue (10s, hard-coded into the VSS framework) during the database IO freeze step. Finally, we cannot exclude issues caused by the limit of 35 databases on a single volume, because I'm in a consolidated environment scenario. Fortunately, my customer is aware of all these limitations and is waiting for the upgrade of his physical storage layout. We will plan to redesign the final storage layout at that time. Until then, the number of databases should not be a problem.