Pythian Group

Love Your Data

Simplify Oracle Tracing with Creative Scripting

Fri, 2015-08-28 14:26

Running a SQL trace is something that all DBAs do to varying degrees. Let’s say you are working on optimizing a SQL statement, and experimenting with some different hints for indexes and optimizer directives. This kind of effort typically goes something like this:

  • modify the SQL statement
  • enable tracing
  • run the statement
  • disable tracing
  • disconnect
  • retrieve the trace file
  • use a profiler to process the trace file
    this might be Method-R mrskew, Oracle tkprof, or something of your own.
  • delete the trace file if no longer needed

That process is OK if all you need to do is look at a couple of trace files, but quickly becomes tedious for any serious optimization effort as there will be many iterations of this process.  This is the kind of job that just cries out for some simple automation.

Let’s walk through automating much of this process using sqlplus, ssh and some profiling tools.

First let’s consider the environment:

  • Oracle 11.2 database on a remote server
  • Workstation has 11.2 client software installed
  • ssh is set up for connecting to the oracle user on the database server
  • some profiling tools are available

Let’s get started with the script that is the subject of our ‘tuning’ effort.

-- sql2trace.sql
select * from dual;

As you can see there is not really going to be any tuning done in this article; it is all about the process.

The following script, tracefile_identifier_demo.sql, sets up the trace environment by collecting some information about the database host and the process owner, and then setting the tracefile_identifier parameter.  These values are then used to set sqlplus define variables.

-- tracefile_identifier_demo.sql

-- column variables to capture host, owner and tracefile name
col tracehost new_value tracehost noprint
col traceowner new_value traceowner noprint
col tracefile new_value tracefile noprint

set term off head off feed off

-- get oracle owner
select username traceowner from v$process where pname = 'PMON';

-- get host name
select host_name tracehost from v$instance;

-- set tracefile identifier
alter session set tracefile_identifier = 'MYTRACEFILE';

select value tracefile from v$diag_info where name = 'Default Trace File';

set term on head on feed on

-- do your tracing here
alter session set events '10046 trace name context forever, level 12';

-- run your SQL here

alter session set events '10046 trace name context off';

-- disconnect to ensure all trace data flushed
-- the disconnect must be done in the called script
-- otherwise the values of the defined vars are lost

-- now get the trace file, or other processing
--@@mrskew '&&traceowner@&&tracehost' '&&tracefile'
@@tkprof '&&traceowner@&&tracehost' '&&tracefile'

This article began as an idea to write about tracefile_identifier, hence the script name.

Most of this script is quite straightforward:

  • use column commands to initialize the define variables that capture host, process owner and tracefile name
  • collect the data
  • enable tracing
  • run the target script
  • disable tracing
  • call the tkprof.sql script to run tkprof

The interesting bit is found in tkprof.sql.

-- tkprof.sql

col ssh_target new_value ssh_target noprint
col scp_filename new_value scp_filename noprint

set term off feed off verify off echo off

select '&&1' ssh_target from dual;
select '&&2' scp_filename from dual;

set feed on term on verify on

host ssh &&ssh_target 'cat &&scp_filename' | tkprof /dev/stdin ./tkprof.out sort=exeqry sys=no
host cat ./tkprof.out

There are a couple of things to take notice of in tkprof.sql.  Did you notice the disconnect statement?  There are a couple of points of interest about that.  Prior to 11g it was necessary to disconnect from Oracle to ensure that all cursors were closed and all STAT and row source operation rows were written to the trace file.  Disconnecting the session is not necessary in Oracle 11g+.

Another interesting bit about this disconnect statement is its placement.  At first the disconnect statement was in the main script.  The problem was that the define variables would all lose their values prior to calling the tkprof.sql script, and so the call would fail.  That is why the disconnect command is in the called script.

Finally, the trace output is retrieved via ssh and piped to tkprof.  Notice that there is no need to actually copy the file; rather, the contents of the file are simply sent to STDOUT and piped to tkprof.

The tkprof command does not read from STDIN.  If for instance you try this: cat somefile | tkprof - ./tkprof.out sort=exeqry; tkprof will exit with an error that an input file is needed.  That problem is circumvented by using the file /dev/stdin.
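As a side note, the same pipeline can be assembled outside of sqlplus. The following is only a minimal Python sketch, not part of the original scripts: the function name build_tkprof_pipeline, the host oracle@dbhost and the trace file path are made up for illustration. It builds the command line used in tkprof.sql and quotes the arguments so that a path containing spaces would survive both the local and the remote shell.

```python
import shlex


def build_tkprof_pipeline(ssh_target, trace_file, out_file="./tkprof.out"):
    """Return a shell pipeline that streams a remote trace file into tkprof.

    Hypothetical helper for illustration; it only builds the command string,
    it does not run ssh or tkprof.
    """
    # The remote command is quoted once for the remote shell...
    remote_cmd = "cat " + shlex.quote(trace_file)
    # ...and the whole remote command is quoted again for the local shell.
    return (
        "ssh " + shlex.quote(ssh_target) + " " + shlex.quote(remote_cmd)
        + " | tkprof /dev/stdin " + shlex.quote(out_file)
        + " sort=exeqry sys=no"
    )


if __name__ == "__main__":
    # Example values only; substitute your own owner@host and trace file.
    print(build_tkprof_pipeline(
        "oracle@dbhost", "/u01/trace/orcl_ora_1234_MYTRACEFILE.trc"))
```

The resulting string could be passed to a shell, for example with subprocess.run(cmd, shell=True), on a workstation where ssh and tkprof are available.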

Put it all together and it looks like this:

11:34:11 JKSTILL@oravm > @tracefile_identifier_demo

Session altered.

Elapsed: 00:00:00.00


1 row selected.

Elapsed: 00:00:00.00

Session altered.

Elapsed: 00:00:00.00

TKPROF: Release - Development on Thu Aug 27 11:34:18 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Host key fingerprint is de:ad:be:ed:a2:d6:63:4b:rx:77:fd:1c:e1:36:2b:88
+--[ RSA 2048]----+
|                 |
|                 |
|                 |
|         .  .    |
|        S  +.    |
|        ..ox.o   |
|       o+.F.* o  |
|      99+o.o.= . |
|     .  |

TKPROF: Release - Development on Thu Aug 27 11:34:18 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Trace file: /dev/stdin
Sort options: exeqry
count    = number of times OCI procedure was executed
cpu      = cpu time in seconds executing
elapsed  = elapsed time in seconds executing
disk     = number of physical reads of buffers from disk
query    = number of buffers gotten for consistent read
current  = number of buffers gotten in current mode (usually for update)
rows     = number of rows processed by the fetch or execute call

SQL ID: a5ks9fhw2v9s1 Plan Hash: 272002086

select *

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.00       0.00          0          2          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.00       0.00          0          2          0           1

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 90
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  TABLE ACCESS FULL DUAL (cr=2 pr=0 pw=0 time=22 us cost=2 size=2 card=1)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  log file sync                                   1        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00

SQL ID: 06nvwn223659v Plan Hash: 0

alter session set events '10046 trace name context off'

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.00       0.00          0          0          0           0

Misses in library cache during parse: 0
Parsing user id: 90


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        2      0.00       0.00          0          2          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        6      0.00       0.00          0          2          0           1

Misses in library cache during parse: 0

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       3        0.00          0.00
  SQL*Net message from client                     3        0.00          0.00
  log file sync                                   1        0.00          0.00


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        0      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          3           1
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        1      0.00       0.00          0          0          3           1

Misses in library cache during parse: 0

    2  user  SQL statements in session.
    1  internal SQL statements in session.
    3  SQL statements in session.
Trace file: /dev/stdin
Trace file compatibility:
Sort options: exeqry
       1  session in tracefile.
       2  user  SQL statements in trace file.
       1  internal SQL statements in trace file.
       3  SQL statements in trace file.
       3  unique SQL statements in trace file.
     218  lines in trace file.
       0  elapsed seconds in trace file.

The same process was used to run the trace data through the Method-R mrskew command:

-- mrskew.sql

col ssh_target new_value ssh_target noprint
col scp_filename new_value scp_filename noprint

set term off feed off verify off echo off

select '&&1' ssh_target from dual;
select '&&2' scp_filename from dual;

set feed on term on verify on
host ssh &&ssh_target 'cat &&scp_filename' | mrskew

The results of calling mrskew.sql  rather than tkprof.sql:

CALL-NAME                    DURATION       %  CALLS      MEAN       MIN       MAX
---------------------------  --------  ------  -----  --------  --------  --------
SQL*Net message from client  0.003733   74.1%      3  0.001244  0.001004  0.001663
log file sync                0.001300   25.8%      1  0.001300  0.001300  0.001300
SQL*Net message to client    0.000008    0.2%      3  0.000003  0.000002  0.000003
PARSE                        0.000000    0.0%      2  0.000000  0.000000  0.000000
FETCH                        0.000000    0.0%      2  0.000000  0.000000  0.000000
CLOSE                        0.000000    0.0%      2  0.000000  0.000000  0.000000
EXEC                         0.000000    0.0%      2  0.000000  0.000000  0.000000
---------------------------  --------  ------  -----  --------  --------  --------
TOTAL (7)                    0.005041  100.0%     15  0.000336  0.000000  0.001663

These scripts can all be found at

If you have ideas about how to improve these, please feel free to clone the repo, make some changes and issue a pull request.

If you don’t know what all of that means, might I suggest this article?  Git for Beginners

The next time you have some tracing to do, why not give this method a try?  Doing so will save you time and make you more productive.


Categories: DBA Blogs

Pillars of PowerShell: SQL Server – Part 2

Fri, 2015-08-28 14:24

This is the seventh and final post in the series on the Pillars of PowerShell. The previous posts in the series are:

  1. Interacting
  2. Commanding
  3. Debugging
  4. Profiling
  5. Windows OS
  6. SQL Server – Part 1

In this final post I am going to touch on SQL Server Management Objects (SMO) with PowerShell. SMO is one of the most widely used methods and, to me, offers the most versatile way of working with SQL Server. It can be a bit tedious to work with, since you are now using raw .NET objects instead of cmdlets, but it offers so much more than SQLPS. In this post I am just going to touch on the basics of loading SMO, and how you can connect to an instance of SQL Server (or multiple). I am going to end by showing you a function I published a few years ago and still use fairly frequently.

Loading SMO

As with SQLPS, you have to load SMO into your PowerShell session before you can utilize it. SMO is what is referred to as an “assembly”, basically a collection of types and other objects that form a logical unit of functionality for interacting with various parts of SQL Server. With SQL Server 2012 and above you can import the SQLPS module and it will automatically import the associated version of SMO. However, because SQLPS loads more than just SMO, it can take time for that to complete before your script continues. Loading SMO directly, without all the overhead of the SQLPS module, can shave off some of that time. You will commonly see the following line of code used to load SMO into your session:



Generally this command is going to load the highest version registered in the GAC on your machine. In the screenshot you may see the version is “13.0.0”; this is from the SQL Server Management Studio preview (July 2015) that is installed on my machine. Now, with PowerShell things change over time, and using LoadWithPartialName is actually the version 1 method of loading SMO. This method is no longer supported, but still works for now. In PowerShell 2.0 a cmdlet called Add-Type was added to do this for you. If you were to just type in Add-Type ‘Microsoft.SqlServer.Smo’ when you have multiple versions installed, you are going to get an error similar to this:


In this situation you have to specify the assembly you want to load, so there is a bit more to doing this with SMO. You can load an assembly by specifying the file itself or by the assembly name along with 4 bits of information:

  1. Name
  2. Version
  3. Culture
  4. PublicKeyToken

To date, Microsoft uses the same Culture and PublicKeyToken on almost all of the assemblies that come out of Redmond. So the only thing lacking is the version, which is going to be in the format of a 4-part version number. If you have worked with SQL Server and you are familiar with the build numbers, you simply need to know that “10” is SQL Server 2008, “11” is SQL Server 2012, “12” is SQL Server 2014, and “13” is going to be SQL Server 2016. So, if I want to load the SQL Server 2012 SMO into my session I simply use this command:

Add-Type -AssemblyName "Microsoft.SqlServer.Smo, Version=, Culture=neutral, PublicKeyToken=89845dcd8080cc91"
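
To make the four parts of the assembly name concrete, here is a small Python sketch that only builds the string; it does not load anything. The helper name smo_assembly_name is made up for illustration, and filling the last three parts of the version number with zeros is an assumption on my part rather than something stated above, so check it against what is registered on your machine.

```python
# Major version numbers as listed in the text above.
SQL_SERVER_MAJOR = {"2008": 10, "2012": 11, "2014": 12, "2016": 13}


def smo_assembly_name(release, public_key_token="89845dcd8080cc91"):
    """Build a full assembly name for SMO (hypothetical helper)."""
    major = SQL_SERVER_MAJOR[release]
    # Assumption: the remaining three parts of the 4-part version are zero.
    version = "{}.0.0.0".format(major)
    return (
        "Microsoft.SqlServer.Smo, Version={}, Culture=neutral, "
        "PublicKeyToken={}"
    ).format(version, public_key_token)


if __name__ == "__main__":
    print(smo_assembly_name("2012"))
```

The string it prints has the same shape as the argument passed to Add-Type -AssemblyName.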

The first connection…

To connect to a single instance of SQL Server with Windows Authentication you can use the following:

$srvObject = New-Object Microsoft.SqlServer.Management.Smo.Server "MyServer"

Once you hit enter, it will make a connection to your instance and then the variable $srvObject will contain properties and methods that you can use to manipulate the server-level objects of your instance. If you recall from the previous pillars in this series, this is where Get-Member comes in real handy for exploring. As an example let’s say you wanted to get similar information to what SELECT @@VERSION returns in T-SQL. You simply need to know the properties that hold this information and pipe the object to select:

$srvObject | select Product, VersionString, Edition, OSVersion 

In PowerShell it is good to start out with the mindset “if I write it for one server, might as well write it to handle multiple”. What I mean by this is you get to the point of developing a script into a tool. If I wanted to turn the above bit of code into something I can reuse, and run for one instance or 50 instances it just takes a bit of work and you are there before you know it:

function Get-SqlVersion {
 param (
  # one or more SQL Server instance names
  [string[]]$server
 )
 $allServers = @()
 $props = @{ServerName="";Product="";Version="";Edition="";OSVersion=""}
 foreach ($s in $server) {
  # connect to the instance with Windows Authentication
  $srvObject = New-Object Microsoft.SqlServer.Management.Smo.Server $s

  # build one output object per instance
  $cserver = New-Object psobject -Property $props
  $cserver.ServerName = $s
  $cserver.Product = $srvObject.Product
  $cserver.Version = $srvObject.VersionString
  $cserver.Edition = $srvObject.Edition
  $cserver.OSVersion = $srvObject.OSVersion
  $allServers += $cserver
 }
 # return the collected objects to the pipeline
 $allServers
}

Now, don’t let this scare you, as it may look more complicated than it is. You could just put two lines inside the foreach loop that create your server object and then select the properties, and you would be done. It is best, though, when you start to write functions that the output of your function is an object. So that is the only additional step I take, using New-Object psobject to create a PowerShell object with the properties ServerName, Product, Version, Edition, and OSVersion. In the event you expand on this function in the future, and want to pipe this output to another cmdlet or custom bit of code, it will be in a more formal object type for you to work against.

Golden Nugget

One of the things I got annoyed with fairly quickly when troubleshooting an instance of SQL Server was having to search through the error log(s). You could be dealing with the default of 6 logs for an instance or up to 99 of them. Now there is some T-SQL code out there of people iterating through each log for you, but I just prefer to use PowerShell. I published this code on my personal blog back in December of 2014. You can find the write-up and code here: Search-SqlErrorLog. It will be good practice for you to try and understand it on your own, but I include help information just in case.

This is one of the few times I wrote a function that only works with one server at a time. You can do some one-liner tricks with the pipeline to easily call it for multiple servers:

"server1","server2" | foreach {Search-SqlErrorLog -server $_ -all -value "^backup"}

The output of this function provides the number of the log it was found in, the date, the process (if noted in the log), and the text found matching the value you provided (which can be a regular expression; the “^” means the start of the string):
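
If the “^” anchor is new to you, the following Python sketch shows the idea in isolation. It is not how Search-SqlErrorLog is implemented; the function name search_log_lines and the sample log lines are made up, and the case-insensitive matching is an assumption chosen to mirror how PowerShell's -match operator behaves by default.

```python
import re


def search_log_lines(lines, pattern):
    """Return (line_number, text) for every line matching the pattern."""
    # Assumption: match case-insensitively, like PowerShell's -match.
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if regex.search(text)
    ]


# Made-up sample lines, for illustration only.
sample = [
    "Backup database successfully processed 1 pages.",
    "Log was backed up.",
    "The backup job finished.",
]

if __name__ == "__main__":
    for number, text in search_log_lines(sample, "^backup"):
        print(number, text)
```

Only the first sample line is returned, because it is the only one where “backup” appears at the very start of the text; the third line contains the word, but not at the beginning.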


The End

I hope you learned something new in this series on PowerShell, and good scripting to you all.

Categories: DBA Blogs

Migration of Oracle Database to Amazon RDS using GoldenGate

Fri, 2015-08-28 14:15

Amazon RDS is a web service used to manage databases, like Oracle, in the cloud. Small- and medium-sized enterprises with databases of normal load, volume, and SLA, can certainly leverage the ease and cost efficiency Amazon RDS offers.

There are two other methods that are widely used to migrate databases with minimal downtime: Oracle Data Guard and Oracle GoldenGate. AWS RDS doesn’t support Data Guard, but luckily it does support Oracle GoldenGate. There are some version constraints though.

The following components are involved when migrating a database from on-premises to AWS RDS:

  • Source database on premises
  • Oracle GoldenGate Hub on EC2 instance
  • Target database on AWS RDS

Now there could be different topologies for the above 3 components, but we are just using this topology for simplicity. For details on this topology, refer to this very fine and simple Appendix: Using Oracle GoldenGate with Amazon RDS.

Roughly, the steps used to migrate an on-premises Oracle database to AWS RDS could be as follows:

  • Create the target database targetdb in AWS RDS with the same parameters as the source database sourcedb.
  • Create the same tablespaces on targetdb as exist on sourcedb.
  • Create the same non-default users on targetdb as exist on sourcedb.
  • Create the same non-default roles on targetdb as exist on sourcedb, and assign these roles to the users on targetdb.
  • Export data/objects from the non-default schemas of sourcedb as of a specific SCN.
  • Import the data/objects into targetdb.
  • Configure the GoldenGate Extract process on sourcedb; for configuration see this
  • Configure the GoldenGate Replicat processes on targetdb; for configuration see this
  • Set up the Oracle GoldenGate (GG) Hub on EC2; for configuration see this
  • Start the GG Extract process on sourcedb.
  • Start the GG Replicat process on targetdb, starting after that SCN, and let it run until it has caught up with all changes generated on sourcedb during the exp/imp.
  • Plan the cut-off time for applications to switch to the new AWS RDS database after stopping the Replicat process on targetdb.
  • Clean up sourcedb.

These are just the skeleton steps and need refining and proper planning. It’s always good to first thoroughly test such action plans. But as you can see, Oracle GoldenGate is a viable tool to migrate databases to AWS RDS. Pythian has a full range of skills, experience, and capabilities to oversee such migrations, as it’s our daily routine to use GoldenGate to do migrations. And yes, even if AWS RDS is a cloud service, you still need a DBA :)

Categories: DBA Blogs

Three Hidden Azure SQL Database Gotchas

Fri, 2015-08-28 13:35

Azure SQL Database is Microsoft’s Database as a Service (DBaaS) platform offering. It allows end users to leverage the power of SQL Server in the cloud without the expense and complexity of building a private infrastructure. Additionally, this offering simplifies database maintenance tasks while providing seamless high availability and disaster recovery capabilities.

Although DBaaS offerings are still crawling out of their infancy, with the correct planning and use cases, implementing an Azure SQL Database solution can be a relatively straightforward process. However, as this platform continues to mature, you can expect to encounter some “Ghosts in the Machine”. Hopefully this post will allow you to avoid some of these unexpected behaviors.

1. What’s in a name?

Azure SQL Servers all share the same public domain, and access is controlled through IP white-lists and user credentials. Until recently, Azure SQL Database dynamically allocated server names comprised of long random strings for security purposes and because each Azure server name must be unique globally. However, recently Microsoft provided the ability to allocate specific server names specified by the end user, i.e.

This feature is more than a welcome addition, particularly for organizations who wish to pre-configure connection strings for cloud implementations.

The hidden gotcha resides in the implementation of this feature. Once you create a server with a user defined name, the Azure cloud reserves that name for you within the Azure fabric. If for any reason you remove the server you will be unable to recreate the server using the same name for at least 5 days. When you attempt to recreate the server, you will receive the message “Specified server name is already used” as depicted below:


Microsoft is aware of this limitation; however, at this time, the only way to correct the situation is to contact Microsoft Support and have them remove the Azure fabric metadata manually.

Additionally, it should be noted that you can only specify a specific Azure SQL Database Server name in the preview portal. This feature is not available in the standard portal or via the New-AzureSqlDatabaseServer Cmdlet in PowerShell.

2. You can change the performance tier at any time, unless you can’t.

One of the fantastic benefits of leveraging Azure SQL Database is the ability to switch service tiers at any time, without service disruption in order to leverage pay per minute costing efficiencies.

Unfortunately, another hidden gotcha may rear its ugly head during the switching process. Organizations that utilize BCP processes against an Azure SQL instance need to be wary when performing a service level switch. BCP operations often simply “Hang” when switching between service levels. The only resolution for this issue is to terminate the process and re-initiate once the tier switch has been completed.

3. I know you’re there, but I can’t see you.

Just like all cloud offerings, Azure SQL Database continues to mature and improve. However, you need to be prepared for some management inconsistencies. The preview portal is aptly named and although some functions are only available within the preview portal, you may need to frequently revert to the standard portal for a more consistent experience.

As an example, I have a client who switched databases between standard and premium tiers and vice versa. These databases no longer display in the preview portal at all. However, they do appear correctly in the standard portal as shown in the CIA level of redacted screen captures below.



Categories: DBA Blogs

Trust and confidence from Pythian

Fri, 2015-08-28 13:25

Recently I “inherited” some new responsibilities at work. It’s not the first time during my 11 or so of the last 16 years at Pythian. Throughout my employ at Pythian, I have been continually given new titles based on new roles I have taken on. For me, besides the enjoyment I have been lucky to have at Pythian, this trust and confidence are two of the biggest contributors to one’s longevity with a company.

For Pythian and me, it all started one spring afternoon in about 1998. Paul and Steve had been doing the Pythian-thing for a year or more, and were looking for assistance getting “off-the-ground” so to speak. That endeavour was part of the reason for our new association and it’s been a magic carpet ride since. I did leave at one point for almost 6 years, but returned in early 2011. Between 1998 and 2011, the size of the company changed, but it was still the same old company.

I now manage the day-to-day operations of the consulting group and take pride in the work I do. Touché all you people out there in Pythian-land.

Categories: DBA Blogs

Creating an Oracle Database Cloud Service

Fri, 2015-08-28 13:10

Back in late June of 2015, Larry Ellison launched several public cloud services and one of those services was the public DBaaS. Today, I had the opportunity to try out this new service. This blog post will examine how to create it and how to connect to it with SQLcl. As with any cloud service, it all happens in the background, saving you from doing tedious configuration steps to start using your service.


In my case, it took about 30 mins from when I clicked on create service to start using my database.

So the first thing that you have to do, obviously, is access the Oracle Cloud My Services application.  If you do not currently have access, speak with your sales rep or cloud administrator, but remember that this application is not free. Once you have access, click on the Oracle Database Cloud Service link and the following page will come up. Click on “Create Service” :

Once you have done that, we need to choose the type of service we will solicit and the billing frequency. As I have talked about in previous posts, it all depends on your business needs and abilities. The difference here between choosing a “Cloud Service” and a “Cloud Service – Virtual Image” is that in the first option, the database and the database instance are created for you, whereas in the “Virtual Image“, you will need to create it yourself, so choose carefully. One of the good things that comes with the first option is that the cloud patching option comes with it, but in the “Virtual Image“, you have to do this yourself.

As of the writing of this post, Oracle offers two database versions – and I chose the latter.



In the Edition section, we get to choose the type of service we will get when choosing the Cloud Software Edition. Unlike the previous one, here we will choose the bells and whistles that you will be licensed to use in this database. I won’t include the differences between the two here, but you can view them in the PaaS section, under Database. In my case, I just chose the regular Enterprise Edition :

In the details section, we can set the characteristics of the database service. It is important to select the “Compute Shape” correctly as this is critical to your usage billing. It is also good to know that one OCPU (Oracle CPU) is equivalent to a 3.0 GHz 2012 Intel Xeon with HyperThreading Enabled. Also you will have to add a Public SSH key to access your compute node. You can learn how here: how to create one. This is where you will also set the usable storage, your system or administrator password for the database, the name of the SID, the version (in this case, you are using version, the name of the PDB. Last, but not least, you will choose your backup destination. In my case, I just chose a local, but you can choose the Oracle Database Backup Service if you have one.



Last, but not least, you will get a confirmation of the service you are about to create. I didn’t copy this particular screenshot when I created it, but here is a similar one, so you get the gist.


Once you click on create, you can select the service and see the details of the creation process, as well as some others, like the Public IP, Port, etc.

Once the DB and VM are allocated, you need to go back to the Oracle Cloud My Services application  and go to the Oracle Compute Cloud Service console. This is to enable the security rule that will allow us to connect to port 1521 for this DB.



In the page that comes up, go to the Network section, and you will see a set of Security Rules, which you will find disabled.

In my case, I enabled the “dbaas/test-orcl/db/ora_p2_dblistener” rule.


In this particular case – and I want to emphasize this – I am not concerned with security, so I also enabled the Security List for Inbound/Outbound Policy traffic.



Once I had done this, I was ready to connect to my DB via SQLcl, as I would connect to any other DB:

Renes-iMac:bin Rene$ ./sql system@***.***.****.****:1521:ORCL

SQLcl: Release RC on Fri Aug 21 11:41:42 2015

Copyright (c) 1982, 2015, Oracle. All rights reserved.

Password? (**********?) ************
Connected to:
Oracle Database 12c Enterprise Edition Release - 64bit Production
With the Oracle Label Security option 

SQL> select name from v$database;


SQL> set lines 200 pages 9999



---------- --------------- ---------

SQL> alter session set container=PDB1;

Session altered.


As you can see, it is quite easy to request a database service and start using it. You will have to start building your case to use the public cloud, but once you do, you can see that using your database in a cloud service is no different from using it on-premises.

Note– This was originally published on

Categories: DBA Blogs

Log Buffer #438: A Carnival of the Vanities for DBAs

Fri, 2015-08-28 12:08

This Log Buffer Edition covers Oracle, MySQL, and SQL Server blog posts from the last week.


Oracle:

Integrating Telstra Public SMS API into Bluemix

Adaptive Query Optimization in Oracle 12c : Ongoing Updates

First flight into the Oracle Mobile Cloud Service

Oracle 12C Problem with datapatch. Part 2, the “fix”

oracle applications r12 auto start on linux

SQL Server:

Email Formatted HTML Table with T-SQL

SQL Server 2016 – Introduction to Stretch Database

Soundex – Experiments with SQLCLR Part 3

An Introduction to Real-Time Communication with SignalR

Strange Filtered Index Problem


MySQL:

Announcing Galera Cluster 5.5.42 and 5.6.25 with Galera 3.12

doing nothing on modern CPUs

Single-threaded linkbench performance for MySQL 5.7, 5.6, WebScale and MyRocks

Identifying Insecure Connections

MyOraDump, Oracle dump utility, version 1.2

Categories: DBA Blogs

Log Buffer #437: A Carnival of the Vanities for DBAs

Fri, 2015-08-28 12:07

This Log Buffer Edition goes deep into the vistas of the database world and brings out a few of the good posts published during the week from Oracle, SQL Server, and MySQL.

Oracle:

Overriding Default Context-Sensitive Action Enablement

This is an alternative to if… then… else… elsif… end if when you want to use conditional statements in PL/SQL.

Achieving SAML interoperability with OAM OAuth Server

Release of BP02 for Oracle Identity Manager

IT Business Edge: Oracle Ties Mobile Security to Identity and Access Management

SQL Server:

How to render PDF documents using SQL CLR. Also a good introduction on creating SQL CLR functions.

What is DNX?

SQL Server Performance dashboard reports

Using Microsoft DiskSpd to Test Your Storage Subsystem

Connect to Salesforce Data as a Linked Server

MySQL:

Optimizing PXC Xtrabackup State Snapshot Transfer

Adding your own collation to MySQL

Monitoring your Amazon Aurora Databases using MONyog

How much could you benefit from MySQL 5.6 parallel replication?

MySQL checksum

The post Log Buffer #437: A Carnival of the Vanities for DBAs appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs

Are you ready to be a private cloud service provider?

Thu, 2015-08-20 20:35

When defining what a cloud service is, we need to know that it is not a technology per se, but an architectural and operational paradigm. It is a self-service computing environment offering the ability to create, consume, and pay for services. In this architecture, computing resources are elastically supplied from a shared pool and charged based on metered use, and service catalogs provide a menu of options and service levels.

According to IDC, “total cloud IT infrastructure spending (server, disk storage, and ethernet switch) will grow by 21% year over year to $32 billion in 2015, accounting for approximately 33% of all IT infrastructure spending, which will be up from about 28% in 2014. Private cloud IT infrastructure spending will grow by 16% year over year to $12 billion, while public cloud IT infrastructure spending will grow by 25% in 2015 to $21 billion.”

This means that growth for this architecture (private, public, or hybrid) will not stop for the foreseeable future, so we first need to understand what drives it and how to translate a current architecture into a 3rd platform architecture.

Source: Image from IDC 3rd Platform Study

The principles of a cloud architecture support the following necessary capabilities:

  • Resource pooling – The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.
  • Rapid elasticity – Services can be adjusted to suit each client’s needs without any changes being apparent to the client or end user.
  • On-demand self-service – Computing capabilities, such as server time and network storage, can be provisioned automatically as needed, without requiring human interaction with each service provider.
  • Measured service – Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer.
  • Broad network access – Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms.
Business Drivers

Cloud will not be a true fit for everybody or for every case. We need to understand and determine the business drivers before we implement a cloud architecture.

  1. Increase agility within our enterprise by providing:
    1. The ability to remove certain human procedures and have the end user be a self-service consumer
    2. A well-defined service catalog
    3. The capability to adapt to workload changes by provisioning or deprovisioning system resources
  2. Reduce enterprise costs by:
    1. Using shared system resources for our different applications and internal business divisions
    2. Being capable of determining the actual usage of system resources to show the benefit of our architecture
    3. Automating mundane and routine tasks
  3. Reduce enterprise risks by:
    1. Having greater control of the resources we have and how they are being used
    2. Having more unified security across our business
    3. Providing different levels of high availability to our enterprise
Service Catalog

The most critical part when defining any type of service is defining what it is that we are going to provide. Take McDonald’s for example. When we get to a counter, there is a well-defined catalog of what products we can consume in that establishment. It will be a certain type of hamburger and junk food. To put it more clearly, we can’t go into McDonald’s and order a pizza or Italian food, as that is not in their business or service catalog.

When defining our business enterprise service catalog, we need to define the What, as to what type of service we want to provide, what service levels we want to provide, what policies we are going to apply to the service, and what our capabilities are to provide it.

The business service catalog will translate into a technical enterprise catalog, defining every detail of how we will provide our business services. Here we need to define the How. How are we going to deploy the service? How are we going to provide the service levels? How are we going to apply the business policies and how are we going to manage our services?

As mentioned, this is not a technology but an architecture, and as with any architecture, we must first understand where we are in order to know where we are going. So we, in our current organization, first need to capture our existing assets, skills, and processes so that we can then validate the future state of our architecture.

Meter, Charge, and Optimize

Business consumers want to know what they are consuming and what it costs, even if they don’t actually want to pay for the service. Additionally, from an operational perspective, as different tenants start sharing the same piece of platform or infrastructure, there needs to be accountability on the usage, or else resources may be over-allocated. To mitigate this, we often meter the usage and optionally chargeback [or show back] the tenants. Though an IT organization may not actually charge back its LOBs, this provides a transparent mechanism to budget resources and optimize the cloud platform on an ongoing basis.
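As a rough illustration of metering and showback, the sketch below splits a shared platform cost across tenants in proportion to their metered usage. The tenant names, the CPU-hour figures, the monthly cost, and the showback function itself are made-up values and names for illustration only; a real platform would meter several resources and likely weight them differently.

```python
# Showback sketch: allocate a shared platform cost in proportion to metered usage.
# Tenant names, usage figures and the cost are hypothetical.

def showback(metered_usage, total_cost):
    """Return each tenant's share of total_cost, proportional to its usage."""
    total_usage = sum(metered_usage.values())
    if total_usage == 0:
        # Nothing was consumed, so nothing is attributed to any tenant.
        return {tenant: 0.0 for tenant in metered_usage}
    return {
        tenant: round(total_cost * usage / total_usage, 2)
        for tenant, usage in metered_usage.items()
    }

cpu_hours = {"finance": 500, "marketing": 300, "hr": 200}
report = showback(cpu_hours, total_cost=10000.0)
for tenant, cost in sorted(report.items()):
    print(f"{tenant:10s} {cost:10.2f}")
```

Even when no money changes hands, a report like this gives each line of business a transparent view of what it consumes.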


These are just a few points to be aware of if you want to become a private cloud provider, but this is also helpful for any cloud architecture, as we need to understand what drives the change, what it is we are going to provide, and how we are going to deliver and measure the services that we are providing.

Note– This was originally published on

The post Are you ready to be a private cloud service provider? appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs

Git for Beginners

Thu, 2015-08-20 20:04
git, simplified

Perhaps you’ve come across a great cache of publicly available SQL scripts that would be very useful in monitoring your databases, and these scripts are hosted on github.  Getting those scripts is as simple as clicking the Download button.

What if, however, you wish to contribute to the script library?

Or perhaps you would like to collaborate with coworkers on a project and want to host the files on github.

How do you get the files to your local server so that changes can be saved and pushed to the master repo?

Github is often the answer for that.

Some time ago github was probably considered by most IT folks as a tool for developers.  That has changed, as now git and github are popularly used to manage changes and allow collaboration on many kinds of projects that require file management.

If you are reading this blog, you are probably a DBA.  What better way to manage SQL scripts and allow others to contribute than with github?

Let’s simplify the use of git and make it usable for casual users. In other words, DBAs who want to access a SQL repo and don’t want to relearn git every time they need to access the repo.

The methods shown here are not the same ones that would be used by a team of developers. Typically developers would create a fork of a project, clone that fork, modify files, and then issue pull requests to the main repo owner. There would also be branches to the development tree, merging, etc.

For this demo, there will still be a need to fork your own copy of the repo, but that is as far as it will go at this time.

Read more about creating a fork:

In the spirit of keeping this simple, there will be no branching in this demo; I’ll only show the basics required to contribute to a project.

With simplicity as a goal, the following steps are to be performed in this demo:

  • Create a copy (fork) of the main repo in github
  • Clone the repo to a work environment (my linux workstation)
  • Add a file to the local repo
  • Commit the changes and push to my forked repo on github
  • Issue a ‘pull request’ asking the main repo admin to include my changes

So while it will be necessary to create a fork of the project, we won’t be dealing with branches off the mainline.


– you already have a github account

– git is installed on your laptop, server, whatever.

Git Repos

Two users will be used for this demo: jkstill and pytest.

The following repos will be used.

Main Repo:

Developer’s (you) repo:

The Main Repo is public, so you can run this demo using your own account if you like.

Fork the Repo

The following steps were performed by the pytest user on github.

Login to using a browser.

Navigate to

Click on the ‘Fork’ icon and follow any instructions; this should only take a few seconds.

After forking this repo as pytest, my browser was now directed to

ssh key setup

This only needs to be done once.

The following examples are for github user pytest.

The pytest account will be used to demonstrate the concepts. Later I will explain more about ssh usage as it pertains to github, but for now this is probably sufficient.

create a new ssh key for use with github
   ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa_pytest_github -C 'github'
add key to github account

While logged in to your github account in a browser, find the account settings icon.

The icon for account settings is in upper right corner of browser window.

Navigate to the Add SSH Key section.

account settings -> SSH Keys -> Add SSH Key

The key added will be the public key. So in this case, the contents of ~/.ssh/ would be pasted into the text box that appears when the Add SSH Key button is pushed.

authenticate to github – the ‘’ is required

Make sure to authenticate the key with github.

   ssh -i ~/.ssh/id_rsa_pytest_github -t

Here is a successful example:

> ssh -i ~/.ssh/id_rsa_github -t

Host key fingerprint is DE:AD:BE:EF:2b:00:2b:36:63:1b:56:4d:eb:df:a6:42

+--[ RSA 2048]----+
|        .        |
|       + .       |
|      . B .      |
|     o * +       |
|    Y * S        |
|   + O o . .     |
|    .   Z . o    |
|       . . t     |
|        . .      |
+-----------------+
PTY allocation request failed on channel 0
Hi pytest! You've successfully authenticated, but GitHub does not provide shell access.
Clone the REPO

Now you are ready to clone the newly forked repo to your workstation. At this point, it is assumed that git is already installed in your development environment. If git is not installed then you will need to install it.  There are many resources available for whichever platform you are working on; installation will not be covered here.

The following command will clone your forked copy of the repo in the current directory:

> git clone
Cloning into 'git-demo'...
remote: Counting objects: 7, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 7 (delta 0), reused 7 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), done.
Checking connectivity... done

> cd git-demo

> ls -la
total 20
drwxr-xr-x 3 jkstill dba 4096 Aug 18 15:45 .
drwxr-xr-x 4 jkstill dba 4096 Aug 18 15:45 ..
drwxr-xr-x 8 jkstill dba 4096 Aug 18 15:45 .git
-rw-r--r-- 1 jkstill dba  113 Aug 18 15:45 .gitignore
-rw-r--r-- 1 jkstill dba   47 Aug 18 15:45

Note: it is possible to use the ~/.ssh/config file to specify multiple ssh keys for use with git. This is useful when you may be using multiple accounts.

Because I do have multiple accounts, the command I actually used for this operation is shown below:

  git clone git-as-pytest:pytest/git-demo

You can read more about this in a later section of this article.

Now cd to the new repo:  cd git-demo

There should be two files and a directory as seen in the previous example.

Modify or add a script

Now you can modify a script or add a new script and then commit to your local repo.

In this case, we will add a script fra_config.sql to the local repo.

-- fra_config.sql
-- show location and size of FRA

col fra_location format a30
col fra_size format a16

select fra_location, fra_size from (
   select name, value
   from v$parameter2
   where name like 'db_recovery_file_dest%'
)
pivot ( max(value) for name in (
      'db_recovery_file_dest' as FRA_LOCATION,
      'db_recovery_file_dest_size' as FRA_SIZE
   )
)
/

Modified files can be seen with git status:

> git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#       fra_config.sql
nothing added to commit but untracked files present (use "git add" to track)

Now add the file to the list of those that should be tracked and check the status again:

> git add fra_config.sql

> git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#       new file:   fra_config.sql

As we are happy with the results, it is time to commit to the local repo:

> git commit -m 'Added the new file fra_config.sql'
[master 86eaf7c] Added the new file fra_config.sql
1 file changed, 18 insertions(+)
create mode 100644 fra_config.sql

> git status
# On branch master
# Your branch is ahead of 'origin/master' by 1 commit.
#   (use "git push" to publish your local commits)
nothing to commit, working directory clean

Shouldn’t we have put a date in that file? In the following, a date and time are added, the changes to the file are displayed, the file is added to the list of those to commit, and the commit is made:

> git diff fra_config.sql | cat
diff --git a/fra_config.sql b/fra_config.sql
index 03b98fd..37c58ac 100644
--- a/fra_config.sql
+++ b/fra_config.sql
@@ -1,6 +1,7 @@

-- fra_config.sql
-- show location and size of FRA
+-- jkstill 2015-08-18 16:03:00 PDT

col fra_location format a30
col fra_size format a16

> git add fra_config.sql

> git commit -m 'added timestamp'
[master 83afd35] added timestamp
1 file changed, 1 insertion(+)

> git status
# On branch master
# Your branch is ahead of 'origin/master' by 2 commits.
#   (use "git push" to publish your local commits)
nothing to commit, working directory clean

Committing can and should be done frequently, as the commit affects only the local repository.

This makes it possible to see (and retrieve) incremental changes to a file as you work on it.

Once you are satisfied with all changes, push the changes to the repo. Notice that git status knows that 2 commits have been performed locally that are not seen in the master repository.

Configure the Remote

Before pushing, there is a little more configuration work to do. While this method is not strictly necessary, it does simplify the use of git.

You will need to edit the file ~/.ssh/config; create it if it does not already exist.

Here’s my example file where a host git-as-pytest has been created. This host will be used to connect to github.

GSSAPIAuthentication no

Host git-as-pytest
  User git
  IdentityFile /home/jkstill/.ssh/id_rsa_pytest_github
  IdentitiesOnly yes

Now edit the file ./.git/config.  Find the section for remote “origin” and change the URL as seen in this example.

[core]
  repositoryformatversion = 0
  filemode = true
  bare = false
  logallrefupdates = true
[remote "origin"]
  #url =
  url = git-as-pytest:pytest/git-demo.git
  fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
  remote = origin
  merge = refs/heads/master

Now you should be able to push the changes to your forked repo on github:

> git push origin master
Counting objects: 7, done.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 787 bytes | 0 bytes/s, done.
Total 6 (delta 2), reused 0 (delta 0)
To git-as-pytest:pytest/git-demo.git
788e5b1..83afd35  master -> master

The changes to your files can be seen in your repo on

Issue a PULL request

Once you think the file or files are ready to be included in the master repository, you will issue a pull request to the admin of the master repo.

The repo admin can then pull the changes and examine them. Once it has been determined that the changes can be made to the master repo, the admin will push the changes.

Issuing the pull request

View the repo in your browser, press the ‘pull request’ icon and follow the instructions. This action will cause an email to be sent to the repo admin with a URL to view the pull request.   The admin can then examine and test the changes, and merge the pull request (if appropriate) into the mainline.

If the pull request results in your changes being merged, github will send you an email.

After the Pull request has been merged

Now other users can get the updates with the following commands

  git pull
  git status
  git commit

These commands will merge the repo from github with this one.

As there is the possibility of overwriting files you are working on, be sure this is the right thing to do.

Now that you have the basics, you can get started.

Please feel free to use  the repo to follow along with the steps shown here.

The post Git for Beginners appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs

Difference Between Oracle’s Table and Mongo’s Collection

Thu, 2015-08-20 11:44

Roughly speaking, the notion of ‘Tables’ in Oracle is similar to MongoDB’s ‘Collections’. They are NOT identical though. Before we examine the differences between Oracle’s Table and MongoDB’s Collection, let’s see what Table in Oracle and Collection in MongoDB are.

Table in Oracle:

A table in Oracle is made up of a fixed number of columns for any number of rows. Every row in a table has the same columns.

Collection in MongoDB:

A collection in MongoDB is made up of documents. The concept of Documents is similar to rows in a table, but it’s not identical. A document can have its own unique set of columns. In MongoDB, columns are called fields.

So in MongoDB, fields are defined at the document level (or we can say in Oracle lingo that columns are defined at the row level), whereas in Oracle the columns are defined at the table level.

That is the main difference between Oracle’s Table and Mongo’s Collection. There are other, subtler differences as well; for example, collections are schema-less, whereas a table in Oracle has to belong to a schema.

Example of an Oracle table:


Select * from EMP;

EMPID  NAME   CITY
-----  -----  ---------
1      Smith  Karachi
2      Adam   Lahore
3      Jim    Wah Cantt
4      Ken    Quetta

In the above example, the table is ‘EMP’, with 4 rows. All 4 rows have a fixed number of columns EMPID, NAME, and CITY.

Example of a MongoDB Collection:

db.EMP.insert({EMPID: '1', NAME: 'Smith', CITY: 'Karachi'})
db.EMP.insert({EMPID: '2', NAME: 'Adam', CITY: 'Wah Cantt', Designation: 'CTO'})
db.EMP.insert({EMPID: '3', NAME: 'Jim', Designation: 'Technician'})
db.EMP.insert({EMPID: '4', NAME: 'Ken'})

> db.EMP.find()

{ "_id" : ObjectId("55d44757283d7d463aec4cc1"), "EMPID" : "1", "NAME" : "Smith", "CITY" : "Karachi" }
{ "_id" : ObjectId("55d44757283d7d463aec4cc2"), "EMPID" : "2", "NAME" : "Adam", "CITY" : "Wah Cantt", "Designation" : "CTO" }
{ "_id" : ObjectId("55d44757283d7d463aec4cc3"), "EMPID" : "3", "NAME" : "Jim", "Designation" : "Technician" }
{ "_id" : ObjectId("55d44757283d7d463aec4cc4"), "EMPID" : "4", "NAME" : "Ken" }

In the above example, we first inserted 4 documents into the collection ‘EMP’. Notice that the 4 documents have different sets of fields. The db.EMP.find() command displays these documents.
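To make the contrast concrete outside of either database, here is a minimal Python sketch that models the EMP table as tuples sharing one column list and the EMP collection as dictionaries that each carry their own fields. It uses no Oracle or MongoDB driver; the variable names are chosen for illustration only.

```python
# Plain-Python model of the EMP examples: no Oracle or MongoDB driver is used.
TABLE_COLUMNS = ("EMPID", "NAME", "CITY")  # defined once, at the table level

table_rows = [
    ("1", "Smith", "Karachi"),
    ("2", "Adam", "Lahore"),
    ("3", "Jim", "Wah Cantt"),
    ("4", "Ken", "Quetta"),
]

collection = [
    {"EMPID": "1", "NAME": "Smith", "CITY": "Karachi"},
    {"EMPID": "2", "NAME": "Adam", "CITY": "Wah Cantt", "Designation": "CTO"},
    {"EMPID": "3", "NAME": "Jim", "Designation": "Technician"},
    {"EMPID": "4", "NAME": "Ken"},
]

# Every row has exactly the table's columns.
rows_uniform = all(len(row) == len(TABLE_COLUMNS) for row in table_rows)
print("all rows share the table columns:", rows_uniform)

# Each document carries its own field list.
field_sets = [sorted(doc) for doc in collection]
for doc, fields in zip(collection, field_sets):
    print(doc["EMPID"], fields)

# Union of every field seen anywhere in the collection.
all_fields = sorted(set().union(*collection))
print("fields seen across the collection:", all_fields)
```

An application reading such a collection has to cope with missing fields itself, for example with doc.get("CITY"), because nothing at the collection level guarantees that a field is present.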

Hope that helps……

The post Difference Between Oracle’s Table and Mongo’s Collection appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs

Log Buffer #436: A Carnival of the Vanities for DBAs

Fri, 2015-08-14 08:00

This Log Buffer Edition covers the top blog posts of the week from the Oracle, SQL Server and MySQL arenas.

Oracle:

  • Momentum and activity regarding the Data Act is gathering steam, and off to a great start too. The Data Act directs the Office of Management and Budget (OMB) and the Department of the Treasury (Treasury) to establish government-wide financial reporting data standards by May 2015.
  • RMS has a number of async queues for processing new item location, store add, warehouse add, item and po induction. We have seen rows stuck in the queues and needed to release the stuck AQ Jobs.
  • We have a number of updates to partitioned tables that are run from within pl/sql blocks which have either an execute immediate ‘alter session enable parallel dml’ or execute immediate ‘alter session force parallel dml’ in the same pl/sql block. It appears that the alter session is not having any effect as we are ending up with non-parallel plans.
  • Commerce Cloud, a new flexible and scalable SaaS solution built for the Oracle Public Cloud, adds a key new piece to the rich Oracle Customer Experience (CX) applications portfolio. Built with the latest commerce technology, Oracle Commerce Cloud is designed to ignite business innovation and rapid growth, while simplifying IT management and reducing costs.
  • Have you used R12: Master Data Fix Diagnostic to Validate Data Related to Purchase Orders and Requisitions?

SQL Server:

  • SQL Server 2016 Community Technology Preview 2.2 is available
  • What is Database Lifecycle Management (DLM)?
  • SSIS Catalog – Path to backup file could not be determined
  • SQL SERVER – Unable to Bring SQL Cluster Resource Online – Online Pending and then Failed
  • Snapshot Isolation Level and Concurrent Modification Collisions – On Disk and In Memory OLTP

MySQL:

  • A Better Approach to all MySQL Regression, Stress & Feature Testing: Random Coverage Testing & SQL Interleaving.
  • What is MySQL Package Verification? Package verification (Pkgver for short) refers to black box testing of MySQL packages across all supported platforms and across different MySQL versions. In Pkgver, packages are tested in order to ensure that the basic user experience is as it should be, focusing on installation, initial startup and rudimentary functionality.
  • With the rise of agile development methodologies, more and more systems and applications are built in series of iterations. This is true for the database schema as well, as it has to evolve together with the application. Unfortunately, schema changes and databases do not play well together.
  • MySQL replication is a process that allows you to easily maintain multiple copies of MySQL data by having them copied automatically from a master to a slave database.
  • In Case You Missed It – Breaking Databases – Keeping your Ruby on Rails ORM under Control.

The post Log Buffer #436: A Carnival of the Vanities for DBAs appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs

Thoughts on Google Cloud Dataflow

Thu, 2015-08-13 15:20

Google Cloud Dataflow is a data processing tool developed by Google that runs in the cloud. Dataflow is an easy to use, flexible tool that delivers completely automated scaling. It is deeply tied to the Google cloud infrastructure, making it very powerful for projects running in Google Cloud.

Dataflow is an attractive resource management and job monitoring tool because it automatically manages all of the Google Cloud resources, including creating and tearing down  Google Compute Engine resources, communicating with Google Cloud Storage, working with Google Cloud Pub/Sub, aggregating logs, etc.

Cloud Dataflow has the following major components:

SDK – The Dataflow SDK provides a programming model that simplifies/abstracts out the processing of large amounts of data. Dataflow only provides a Java SDK at the moment, which is a barrier for non-Java programmers. More on the programming model later.

Google Cloud Platform Managed Services – This is one of my favourite features in Dataflow. Dataflow manages and ties together components, such as Google Compute Engine, spins up and tears down VMs, manages BigQuery, aggregates logs, etc.

These two components can be used together to create jobs.

Being programmatic, Dataflow is extremely flexible. It works well for both batch and streaming jobs. Dataflow excels at high-volume computations and provides a unified programming model, which is very efficient and rather simple considering how powerful it is.

The Dataflow programming model simplifies the mechanics of large-scale data processing and abstracts out a lot of the lower level tasks, such as cluster management, adding more nodes, etc. It lets you focus on the logical aspect of your pipeline and not worry about how the job will run.

The Dataflow programming model consists of four major abstractions:

  • Pipelines – A pipeline represents a complete process on a dataset or datasets. The data could be brought in from external data sources. It could then have a series of transformation operations, such as filter, joins, aggregation, etc., applied to the data to give it meaning and to achieve its desired form. This data could be then written to a sink. The sink could be within the Google Cloud platform or external. The sink could even be the same as the data source.
  • PCollections – PCollections are datasets in the pipeline. PCollections could represent datasets of any size. These datasets could be bounded (fixed size – such as national census data) or unbounded (such as a Twitter feed or data from weather sensors). PCollections are the input and output of every transform operation.
  • Transforms – Transforms are the data processing steps in the pipeline. Transforms take one or more PCollections, apply some transform operations to those collections, and then output to a PCollection.
  • I/O Sinks and Sources – The Source and Sink APIs provide functions to read data into and write data out of collections. The sources act as the roots of the pipeline and the sinks are the endpoints of the pipeline. Dataflow has a set of built-in sinks and sources, but it is also possible to write custom sinks and sources.
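The four abstractions can be sketched conceptually in a few lines of plain Python. This is not the Dataflow SDK (which is Java-only at the time of writing) and it does none of the scaling or resource management; it only shows how a source, a chain of transforms over collections, and a sink fit together. All function names and the sample records are hypothetical.

```python
# Conceptual sketch only: plain Python, NOT the Dataflow SDK.
# All names here (read_source, run_pipeline, ...) are hypothetical.

def read_source():
    """Source: the root of the pipeline, producing the first collection."""
    return [
        {"store": "A", "amount": 12.5},
        {"store": "B", "amount": 3.0},
        {"store": "A", "amount": 7.5},
        {"store": "B", "amount": -1.0},  # bad record, filtered out below
    ]

def keep_valid(records):
    """Transform: a filter; one collection in, one collection out."""
    return [r for r in records if r["amount"] > 0]

def total_by_store(records):
    """Transform: group by store and aggregate the amounts."""
    totals = {}
    for r in records:
        totals[r["store"]] = totals.get(r["store"], 0.0) + r["amount"]
    return [{"store": s, "total": t} for s, t in sorted(totals.items())]

def run_pipeline(source, transforms, sink):
    """Pipeline: read from the source, apply transforms in order, write to the sink."""
    collection = source()
    for transform in transforms:
        collection = transform(collection)
    sink(collection)

results = []  # stands in for a sink such as a table or a file
run_pipeline(read_source, [keep_valid, total_by_store], results.extend)
print(results)
```

Each transform takes a collection and returns a new one, which is what lets the steps be chained, reordered, or reused without knowing how the job will eventually be executed.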

Dataflow is also planning to add integration for Apache Flink and Apache Spark. Adding Spark and Flink integration would be a huge feature since it would open up the possibilities to use MLlib, Spark SQL, and Flink machine-learning capabilities.

One of the use cases we explored was to create a pipeline that ingests streaming data from several POS systems using Dataflow’s streaming APIs. This data can be then joined with customer profile data that is ingested incrementally on a daily basis from a relational database. We can then run some filtering and aggregation operations on this data. Using the sink for BigQuery, we can insert the data into BigQuery and then run queries. What makes this so attractive is that in this whole process of ingesting vast amounts of streaming data, there was no need to set up clusters or networks or install software, etc. We stayed focused on the data processing and the logic that went into it.

To summarize, Dataflow is the only data processing tool that completely manages the lower level infrastructure. This removes several API calls for monitoring the load and spinning up and tearing down VMs, aggregating logs, etc., and lets you focus on the logic of the task at hand.  The abstractions are very easy to understand and work with, and the Dataflow API also provides a good set of built-in transform operations for tasks such as filtering, joining, grouping, and aggregation. Dataflow integrates really well with all components in the Google Cloud Platform; however, Dataflow does not have SDKs in any language besides Java, which is somewhat restrictive.

The post Thoughts on Google Cloud Dataflow appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs

Mongostat – A Nifty Tool for Mongo DBAs

Fri, 2015-08-07 12:08

One of a MongoDB DBA’s main tasks is to monitor the usage of the MongoDB system and its load distribution. This could be needed for proactive monitoring, troubleshooting during performance degradation, root cause analysis, or capacity planning.

Mongostat is a nifty tool that comes out of the box with MongoDB and provides a wealth of information in a nicely formatted, familiar way. If you have used vmstat, iostat, etc. on Linux, Mongostat should seem very familiar.

Mongostat dishes out statistics like counts of database operations by type (e.g. insert, query, update, delete, getmore). The vsize column  in Mongostat output shows the amount of virtual memory in megabytes used by the process. There are other very useful columns regarding network traffic, connections, queuing etc.

Following are some examples of running Mongostat.

[mongo@mongotest data]$ mongostat
insert query update delete getmore command flushes mapped  vsize    res faults qr|qw ar|aw netIn netOut conn     time
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:29
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:30
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:31
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:32
*0    *0     *0     *0       0     2|0       0 160.0M 646.0M 131.0M      0   0|0   0|0  133b    10k    1 12:47:33
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:34
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:35
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:36
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:37
*0    *0     *0     *0       0     2|0       0 160.0M 646.0M 131.0M      0   0|0   0|0  133b    10k    1 12:47:38

The following displays just 5 rows of output.

[mongo@mongotest data]$ mongostat -n 5
insert query update delete getmore command flushes mapped  vsize    res faults qr|qw ar|aw netIn netOut conn     time
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:45
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:46
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:47
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:48
*0    *0     *0     *0       0     2|0       0 160.0M 646.0M 131.0M      0   0|0   0|0  133b    10k    1 12:47:49
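For scripted monitoring, output like the above can be split into named fields. The sketch below is one possible approach, assuming the column layout shown in this post; parse_mongostat and megabytes are hypothetical helper names, and the sample text is copied from the output above rather than read from a live server.

```python
# Sketch: turn mongostat's tabular output into dictionaries.
# The sample text is copied from the post; parse_mongostat is a hypothetical helper.

SAMPLE = """\
insert query update delete getmore command flushes mapped  vsize    res faults qr|qw ar|aw netIn netOut conn     time
*0    *0     *0     *0       0     1|0       0 160.0M 646.0M 131.0M      0   0|0   0|0   79b    10k    1 12:47:45
*0    *0     *0     *0       0     2|0       0 160.0M 646.0M 131.0M      0   0|0   0|0  133b    10k    1 12:47:49
"""

def parse_mongostat(text):
    """Split the header and each row on whitespace and pair them up."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        values = line.split()
        if len(values) != len(header):
            continue  # skip anything that does not match the header layout
        rows.append(dict(zip(header, values)))
    return rows

def megabytes(value):
    """Convert a size such as '646.0M' to a float; only the M suffix is handled."""
    if not value.endswith("M"):
        raise ValueError(f"unexpected size format: {value}")
    return float(value[:-1])

rows = parse_mongostat(SAMPLE)
for row in rows:
    print(row["time"], "vsize MB:", megabytes(row["vsize"]), "conn:", row["conn"])
```

Whitespace splitting only works while no column value contains a space, which holds for the output shown here but is an assumption about other versions and options.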

In order to see full list of options:

[mongo@mongotest data]$ mongostat --help
mongostat <options> <polling interval in seconds>

Monitor basic MongoDB server statistics.

See for more information.

general options:
    --help                      print usage
    --version                   print the tool version and exit

verbosity options:
-v, --verbose                   more detailed log output (include multiple times for more verbosity, e.g. -vvvvv)
    --quiet                     hide all log output

connection options:
-h, --host=                     mongodb host to connect to (setname/host1,host2 for replica sets)
    --port=                     server port (can also use --host hostname:port)

authentication options:
-u, --username=                 username for authentication
-p, --password=                 password for authentication
    --authenticationDatabase=   database that holds the user's credentials
    --authenticationMechanism=  authentication mechanism to use

stat options:
    --noheaders                 don't output column names
-n, --rowcount=                 number of stats lines to print (0 for indefinite)
    --discover                  discover nodes and display stats for all
    --http                      use HTTP instead of raw db connection
    --all                       all optional fields
    --json                      output as JSON rather than a formatted table


Discover more about our expertise in Big Data.

The post Mongostat – A Nifty Tool for Mongo DBAs appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs

Log Buffer #435: A Carnival of the Vanities for DBAs

Fri, 2015-08-07 11:30

The sun of database technologies is shining through the cloud. Oracle, SQL Server, MySQL and various other databases are bringing forth some nifty offerings, and this Log Buffer Edition covers some of them.

Oracle:

  • How to create your own Oracle database merge patch.
  • Finally the work of a database designer will be recognized! Oracle has announced the Oracle Database Developer Choice Awards.
  • Oracle Documents Cloud Service R4: Why You Should Seriously Consider It for Your Enterprise.
  • Mixing Servers in a Server Pool.
  • Index compression–working out the compression number
  • My initial experience upgrading database from Oracle 11g to Oracle 12c (Part -1).

SQL Server:

  • The Evolution of SQL Server BI
  • Introduction to SQL Server 2016 Temporal Tables
  • Microsoft and Database Lifecycle Management (DLM): The DacPac
  • Display SSIS package version on the Control Flow design surface
  • SSAS DSV COM error from SSDT SSAS design Data Source View

MySQL:

  • If you run multiple MySQL instances on a Linux machine, chances are good that at one time or another, you’ve ended up connected to an instance other than what you had intended.
  • MySQL Group Replication: Plugin Version Access Control.
  • MySQL 5.7 comes with many changes. Some of them are better explained than others.
  • What Makes the MySQL Audit Plugin API Special?
  • Architecting for Failure – Disaster Recovery of MySQL/MariaDB Galera Cluster


Learn more about Pythian’s expertise in Oracle, SQL Server and MySQL.

The post Log Buffer #435: A Carnival of the Vanities for DBAs appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs

Partitioning in Hive Tables

Fri, 2015-08-07 11:03

Partitioning a large table is a general practice for a few reasons:

  • Improving query efficiency by avoiding the transfer and processing of unnecessary data.
  • Improving data lineage by isolating batches of ingestion, so if an ingestion batch fails for some reason and introduces some corrupted data, it’s safe to re-ingest the data.

That said, this practice often results in a table with a lot of partitions, which makes querying the full table, or a large part of it, a very slow operation. It also makes the Hive client executing the query “memory hungry”. This is mainly caused by how Hive processes a query. Before generating a query plan, the Hive client needs to read the metadata of all partitions. That means a lot of RPC round trips between the Hive client and the Hadoop namenode, as well as RDBMS transactions between the Hive client and the metastore. It’s a slow process and it also consumes a lot of memory. A simple experiment using Hive-0.12 shows that it takes around 50KB of heap space to store all the data structures for each partition. Below are two examples from a heap dump of a Hive client executing a query which touches 13k+ partitions.


[Image: Screen Shot 2015-08-05 at 11.24.16 pm]
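As a rough back-of-the-envelope check, the two figures quoted above (around 50KB per partition, 13k+ partitions) can be multiplied out. This Python sketch assumes the per-partition cost is uniform, which the experiment does not guarantee; the function name is hypothetical.

```python
def estimate_partition_heap_mb(partition_count, kb_per_partition=50):
    """Estimate the client heap (in MB) used for partition metadata alone.

    Assumption: every partition costs about the same amount of heap.
    """
    return partition_count * kb_per_partition / 1024


# 13,000 partitions at ~50KB each is roughly 635 MB of heap
print(f"{estimate_partition_heap_mb(13000):.0f} MB")
```

That is metadata only, before the client has done any other work for the query.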

We can set HADOOP_HEAPSIZE to a larger number to keep ourselves out of trouble. The value of HADOOP_HEAPSIZE is passed to the JVM as the -Xmx argument. But if we want to run multiple Hive queries at the same time on the same machine, we will run out of memory very quickly. Another thing to watch out for when increasing the heap size: if the parallel GC is used for the JVM, which is the default option for the Java server VM, and the maximum GC pause time isn’t set properly, a Hive client dealing with a lot of partitions will quickly grow its heap to the maximum and never shrink it back down.

Another potential problem of querying a large number of partitions is that Hive uses CombineHiveInputFormat by default, which instructs Hadoop to combine all input files which are smaller than the “split size” into splits. The algorithm used to do the combining is “greedy”. It bins larger files into splits first, then smaller ones. So the “last” couple of splits usually contain a huge number of small files (how many depends on how unevenly the sizes of the input files are distributed). As a result, the “unlucky” map tasks which get these splits will be very slow compared to other map tasks, and will consume a lot of memory to collect and process the metadata of their input files. Usually you can tell how bad the situation is by comparing the SPLIT_RAW_BYTES counters of the map tasks.
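The effect of “larger files first” can be illustrated with a simplified greedy packer. This Python sketch is not Hadoop's actual combining code (which also takes data locality into account); it only assumes files are sorted by size in descending order and packed into splits up to a size limit. The function name and the file sizes are made up for illustration.

```python
def combine_into_splits(file_sizes, split_size):
    """Greedily pack files into splits, largest files first.

    A new split is started whenever adding the next file would
    push the current split past split_size.
    """
    splits = []
    current, current_bytes = [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_bytes + size > split_size:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# Three large files plus one hundred tiny ones (sizes in MB, hypothetical).
files = [120, 100, 90] + [1] * 100
splits = combine_into_splits(files, split_size=128)
print([len(s) for s in splits])  # number of files in each split
```

With these inputs the first two splits hold a single file each, while the last two hold dozens of tiny files, which is the skew described above.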

A possible solution to this problem is creating two versions of the table: one partitioned, and one non-partitioned. The partitioned one is still populated the way it is now. The non-partitioned one can be populated in parallel with the partitioned one by using “INSERT INTO”. One disadvantage of the non-partitioned version is that it is harder to revise if corrupted data is found in it, because in that case the whole table has to be rewritten. However, starting with Hive 0.14, UPDATE and DELETE statements are allowed for tables stored in ORC format. Another possible problem of the non-partitioned version is that the table may contain a large number of small files on HDFS, because every “INSERT INTO” will create at least one file. As the number of files in the table increases, querying the table slows down. So a periodic compaction is recommended to decrease the number of files in the table. It can be done by simply executing “INSERT OVERWRITE SELECT * FROM” periodically. You need to make sure no other inserts are being executed at the same time, or data loss will occur.
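The two statements involved can be sketched as follows. This Python helper only builds the HiveQL text; the table names events_flat and events_batch are hypothetical, and running the statements (and making sure no concurrent inserts are in flight during compaction) is left to whatever client you use.

```python
import re

# Accept only simple identifiers so table names cannot inject extra SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _checked(name):
    if not _IDENTIFIER.match(name):
        raise ValueError("not a simple table name: %r" % name)
    return name


def append_statement(flat_table, batch_table):
    """HiveQL that appends one ingestion batch to the non-partitioned table."""
    return "INSERT INTO TABLE {0} SELECT * FROM {1}".format(
        _checked(flat_table), _checked(batch_table))


def compaction_statement(flat_table):
    """HiveQL that rewrites the table onto itself to merge small files."""
    name = _checked(flat_table)
    return "INSERT OVERWRITE TABLE {0} SELECT * FROM {0}".format(name)


print(append_statement("events_flat", "events_batch"))
print(compaction_statement("events_flat"))
```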

Learn more about Pythian’s expertise in Big Data.

The post Partitioning in Hive Tables appeared first on Pythian - Data Experts Blog.

Categories: DBA Blogs