Skip navigation.

Pythian Group

Syndicate content
Official Pythian Blog - Love Your Data
Updated: 10 hours 36 min ago

vCPU sharing in EC2: HVM to the rescue?

Fri, 2014-06-27 07:51

I’ve been doing some testing to clarify what a vCPU in Amazon Web Services actually is. Over the course of the testing, I experienced inconsistent results on a 2-thread test on a 4-vCPU m3.xlarge system, due to the mislabeling of the vCPUs as independent single-core processors by the Linux kernel. This issue manifests itself in a CPU-bound, multithreaded workload where there is idle CPU time.

My test environment used a paravirtualized (PV) kernel, which moves some of the virtualization logic into the Linux kernel, reducing the need for high-overhead hardware emulation. One drawback is that the kernel cannot be modified to, for example, resolve the CPU mislabeling. But there is an alternative: an HVM system relying on virtualization extensions in the CPU hardware and allowing custom kernels or even non-Linux operating systems to run. Historically the drawback has been a performance hit, though I read a very interesting post from Brendan Gregg’s blog, indicating that what’s called HVM in Amazon EC2 is actually a hybrid of PV and HVM, combining aspects of both. A test run by Phoronix on EC2 showed HVM performance on par with PV, and in some cases even better. So it definitely seems worth repeating my earlier tests on.

As before, I fire up an instance, but this time using the latest HVM Amazon Linux image:

$ ec2-describe-images ami-76817c1e -H
Type    ImageID Name    Owner   State   Accessibility   ProductCodes    Architecture    ImageType       KernelId        RamdiskId Platform        RootDeviceType  VirtualizationType      Hypervisor
IMAGE   ami-76817c1e    amazon/amzn-ami-hvm-2014.03.2.x86_64-ebs        amazon  available       public          x86_64  machine                           ebs     hvm     xen
BLOCKDEVICEMAPPING      /dev/xvda               snap-810ffc56   8
$ ec2-run-instances ami-76817c1e -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION     r-a4f480da      462281317311    default
INSTANCE        i-c5d5b6ef      ami-76817c1e                    pending marc-aws        0               m3.xlarge       2014-06-23T19:02:18+0000  us-east-1d                              monitoring-disabled                                     ebs                                       hvm     xen             sg-5fc61437     default

Checking in on CPUs:

[ec2-user@ip-10-145-187-117 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

It’s the same 2.6GHz E5-2670 processor, but is reported as a single-socket non-hyperthreaded quad-core processor. Not yet the dual-core hyperthreaded processing we’re getting though.

Time to run a few tests.

[ec2-user@ip-10-145-187-117 ~]$ taskset -pc 0 $$
pid 1768's current affinity list: 0-3
pid 1768's new affinity list: 0
[ec2-user@ip-10-145-187-117 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 18.1955 s, 119 MB/s
[ec2-user@ip-10-145-187-117 ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 36.4968 s, 59.5 MB/s
2170552320 bytes (2.2 GB) copied, 36.506 s, 59.5 MB/s

In the same range as with PV, but also 1-2% slower, meaning we’re seeing a small amount of HVM overhead. Let’s try across processors

[ec2-user@ip-10-145-187-117 ~]$ taskset -pc 0,1 $$
pid 1768's current affinity list: 0
pid 1768's new affinity list: 0,1
[ec2-user@ip-10-145-187-117 ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 27.8401 s, 78.0 MB/s
2170552320 bytes (2.2 GB) copied, 27.8398 s, 78.0 MB/s
[ec2-user@ip-10-145-187-117 ~]$ taskset -pc 0,2 $$
pid 1768's current affinity list: 0,1
pid 1768's new affinity list: 0,2
[ec2-user@ip-10-145-187-117 ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 18.1849 s, 119 MB/s
2170552320 bytes (2.2 GB) copied, 18.2014 s, 119 MB/s

Again, a tiny bit slower than with PV. To test variability, I’ll kick off 20 consecutive runs, and print a histogram of output:

[ec2-user@ip-10-145-187-117 ~]$ taskset -pc 0-3 $$
pid 1768's current affinity list: 0,2
pid 1768's new affinity list: 0-3
[ec2-user@ip-10-145-187-117 ~]$ for run in {1..20}; do
>  for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
>  wait
> done
[ec2-user@ip-10-145-187-117 ~]$ cat output | awk '/bytes/ {print $8,$9}' | sort -n | uniq -c
      1 113 MB/s
      3 114 MB/s
      4 115 MB/s
      6 116 MB/s
     10 117 MB/s
     10 118 MB/s
      6 119 MB/s

Running between 113 and 119 MB/s per thread: much less variability than before. In chart form:

Looking at “top”:

[ec2-user@ip-10-145-187-117 ~]$ cat > ~/.toprc <<-EOF
> RCfile for "top with windows"           # shameless braggin'
> Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
> Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
>         winflags=25913, sortindx=10, maxtasks=2
>         summclr=1, msgsclr=1, headclr=3, taskclr=1
> Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
>         winflags=62777, sortindx=0, maxtasks=0
>         summclr=6, msgsclr=6, headclr=7, taskclr=6
> Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
>         winflags=62777, sortindx=13, maxtasks=0
>         summclr=5, msgsclr=5, headclr=4, taskclr=5
> Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
>         winflags=62777, sortindx=4, maxtasks=0
>         summclr=3, msgsclr=3, headclr=2, taskclr=3
[ec2-user@ip-10-145-187-117 ~]$ top -b -n20 -U ec2-user

top - 20:31:51 up 28 min,  2 users,  load average: 1.37, 1.17, 0.63
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 22.9%us,  0.3%sy,  0.0%ni, 76.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 74.0%us,  3.0%sy,  0.0%ni, 23.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 59.7%us,  4.0%sy,  0.0%ni, 36.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 33.7%us,  2.7%sy,  0.0%ni, 63.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 1951 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:08.92 gzip
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:08.92 gzip                                                         

top - 20:31:54 up 28 min,  2 users,  load average: 1.37, 1.17, 0.63
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 72.3%us,  4.3%sy,  0.0%ni, 23.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 94.4%us,  5.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 21.3%us,  2.0%sy,  0.0%ni, 76.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:11.84 gzip
 1951 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:11.83 gzip                                                         

top - 20:31:57 up 28 min,  2 users,  load average: 1.34, 1.17, 0.64
Tasks:  82 total,   3 running,  79 sleeping,   0 stopped,   0 zombie
Cpu0  : 95.3%us,  4.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 92.4%us,  7.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 1951 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:14.74 gzip
 1953 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:14.75 gzip                                                         

top - 20:32:00 up 28 min,  2 users,  load average: 1.32, 1.17, 0.64
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 29.9%us,  1.7%sy,  0.0%ni, 68.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 63.0%us,  3.7%sy,  0.0%ni, 33.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 40.5%us,  2.3%sy,  0.0%ni, 57.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 55.3%us,  3.7%sy,  0.0%ni, 41.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 1951 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:17.66 gzip
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:17.67 gzip

We see that work is split between adjacent CPUs, but that the scheduler is doing a good job of keeping the adjacent CPUs near 100% usage between them.

So based on these tests, it looks like, even though the CPU is still mislabeled, HVM has almost entirely avoided the issue of variability due to shared-core scheduling, at the cost of a small reduction in overall throughput.

Categories: DBA Blogs

Log Buffer #377, A Carnival of the Vanities for DBAs

Fri, 2014-06-27 07:49

Improvement and progress are all about growth. Pythian just recently grew more by entering into an agreement to acquire as announced by Paul Vallee. This Log Buffer also adds one more edition to its growing history.


WebLogic 12.1.3 is a major step forward to becomes SOA Suite 12c, BPM Suite 12c and all the other Fusion Middleware Components will run on it!

The Internet of Things is changing everything. From the way you start your day in the morning, to the way products are manufactured and cities operate their services, machines all around you will collect data, share it with each other, and use it to recommend and even make decisions along the way.

Pivotal Cloud Foundry Installed lets create an ORG / USER to get started

An in-line sub process can best be thought of as a callable scope. It is a scope that is not part of the normal process flow, but rather a unit of encapsulated logic that can be called zero, one or multiple times from anywhere within the BPEL process

Do you need to create PDF reports from PL/SQL?

SQL Server:

Unique indexes are the database developer’s responsibility. Non-unique indexes can be more easily maintained directly on the production database by an automated process.

Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of software delivery and deployment by making build, integration and delivery into a routine.

When you are writing TSQL code there are times when you want to perform the same logic over and over again. To accomplish this you can used the WHILE keyword.

What you need for a Multi Subnet Configuration for AlwaysOn in SQL Server 2012

All the wonderful functionality that in-memory tables and natively compiled procedures provide in SQL Server 2014 is pretty cool. But, changes to core of the engine results in changes in things that we may have developed a level of comfort with. Grant Fritchey explains.


Optimistic updates for Edit Data operations in MySQL for Excel

Here are a few videos those of you who are just getting started with Sphinx may find useful.

Harnessing the power of master/slave clusters to operate data-driven businesses on MySQL

Every SELECT from your Python program may acquire a metadata lock!

Why %util number from iostat is meaningless for MySQL capacity planning

Categories: DBA Blogs

Welcome to Employees and Clients

Thu, 2014-06-26 11:29

Today, we announced that Pythian has entered into an agreement to acquire, itself the result of a recent merger between PalominoDB and DriveDev.

I want to start with a hearty welcome to the 40+ new esteemed collaborators joining our firm today. Simultaneously, I want to welcome’s valued clients to the Pythian family.

I am looking forward to cultivating a long-lasting collaboration and friendship with each one of you, many of whom I have already counted as friends for years.

To that point, I want to highlight my longstanding friendship and collaboration with Laine Campbell, the CEO of I first met Laine in 2007 and was impressed by her intelligence, her energy, her charisma and, most of all, her remarkable passion for doing the right thing by her team, her clients, and her community.

In February 2008, I sent Laine an email with the subject “Join us?”, the most important line of which was “I’m looking for a founder for a new office in the Bay Area.”

Laine was gracious in her reply: “At this point, I’m absolutely packed with long-term clients.  I’m quite comfortable with revenue and challenge and location.  I really am flattered you’d consider me for the position, but I’m going to have to pass.” That was only about a year after she had founded PalominoDB.

Laine and I have been friends ever since and have made a discipline of trading notes and advice about our respective businesses.

As we fast-forward six years to the present, Laine and her team have achieved what many might have thought impossible. Out of thin air, with no venture capital and in only eight short years, is a business eerily reminiscent of Pythian in 2008… a feat that took us 11 years.

Earlier this year, PalominoDB joined forces with DriveDev, itself a highly successful DevOps business transformation company founded in 2007 to create delivers a coherent and differentiated vision that helps transform businesses through breakthrough velocity, availability, security, performance, and cost.

In what has to be one of the longest corporate romances our niche has known, Laine reached out to me in May indicating that she’d like to accept my original offer and join forces with us. It was my turn to be flattered and go through a week’s soul searching.  I was not alone in the exercise. A lot of soul searching, strategic thinking, and sheer hard work has gone into this announcement today. By the end of our efforts, it became clear that joining forces would dramatically accelerate our ability to reshape the enterprise IT services landscape.

I would like to specifically thank Laine Campbell, Aaron Lee, and Vicki Vance as owners of for their courage, vision, and determination through these demanding weeks. On the Pythian side, I would like to especially thank  Andrew Waitman, without whom this deal would be impossible to contemplate, Alain Tardif and Siobhan Devlin, and the rest of the executive team at Pythian who’ve moved mountains on our behalf to make it real. I don’t want to forget to highlight as well the external support of Bob Ford at Kelly Santini and our financing partners.

We have months of hard work ahead of us integrating our businesses. It’s our goal and imperative to listen and learn from each other, and pick and choose the best parts of each respective business as we weave a coherent and integrated whole. This will be the first meaningful merger Pythian undertakes.

Together we are almost 350 strong and are home to the industry’s largest open-source database managed services capability. Together we will accelerate the adoption of Enterprise DevOps and help countless SaaS, retail, media, and online businesses leave their competitors in the dust. And that is a vision worth getting excited about.

Categories: DBA Blogs

Oracle Database: Query to List All Statistics Tables

Wed, 2014-06-25 08:00

If you were a big fan of manual database upgrade steps, perhaps you would have come across this step many times in your life while reading MOS notes, upgrade guides, etc.

Upgrade Statistics Tables Created by the DBMS_STATS Package
If you created statistics tables using the DBMS_STATS.CREATE_STAT_TABLE procedure, then upgrade these tables by executing the following procedure:

In my experience, I found the statistics tables can be created from Oracle rdbms version 8i. So this step became part of the database upgrade documents until now. I also noticed the structure of the statistics table was the same until 10gR2 version, but Oracle had modified the structure marginally on 11g and 12c versions.

I have been using this single query to list all statistics tables that exist on a database, which can be still used despite changes on the table structure.

SQL> select owner,table_name from dba_tab_columns where COLUMN_NAME=’STATID’ AND DATA_TYPE= ‘VARCHAR2′;

Though this is not a critical step, it is required as a part of the post upgrade. Here is the small action plan to run the required command to upgrade all statistics tables.

Connect as SYS database user and run these steps:
SQL> set pages 1000
SQL> set head off
SQL> set feedback off
SQL> spool /home/oracle/stattab_upg.sql
SQL> select ‘EXEC DBMS_STATS.UPGRADE_STAT_TABLE(”’||owner||”’,”’||table_name||”’);’ from dba_tab_columns where COLUMN_NAME=’STATID’ AND DATA_TYPE= ‘VARCHAR2′;
SQL> spool off
SQL> @/home/oracle/stattab_upg.sql
SQL> exit

Categories: DBA Blogs

Virtual CPUs with Amazon Web Services

Tue, 2014-06-24 15:41

Some months ago, Amazon Web Services changed the way they measure CPU capacity on their EC2 compute platform. In addition to the old ECUs, there is a new unit to measure compute capacity: vCPUs. The instance type page defines a vCPU as “a hyperthreaded core for M3, C3, R3, HS1, G2, and I2.” The description seems a bit confusing: is it a dedicated CPU core (which has two hyperthreads in the E5-2670 v2 CPU platform being used), or is it a half-core, single hyperthread?

I decided to test this out for myself by setting up one of the new-generation m3.xlarge instances (with thanks to Christo for technical assistance). It is stated to have 4 vCPUs running E5-2670 v2 processor at 2.5GHz on the Ivy Bridge-EP microarchitecture (or sometimes 2.6GHz in the case of xlarge instances).

Investigating for ourselves

I’m going to use paravirtualized Amazon Linux 64-bit for simplicity:

$ ec2-describe-images ami-fb8e9292 -H
Type    ImageID Name    Owner   State   Accessibility   ProductCodes    Architecture    ImageType       KernelId        RamdiskId Platform        RootDeviceType  VirtualizationType      Hypervisor
IMAGE   ami-fb8e9292    amazon/amzn-ami-pv-2014.03.1.x86_64-ebs amazon  available       public          x86_64  machine aki-919dcaf8                      ebs     paravirtual     xen
BLOCKDEVICEMAPPING      /dev/sda1               snap-b047276d   8

Launching the instance:

$ ec2-run-instances ami-fb8e9292 -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION     r-cde66bb3      462281317311    default
INSTANCE        i-b5f5a2e6      ami-fb8e9292                    pending marc-aws        0               m3.xlarge       2014-06-16T20:23:48+0000  us-east-1d      aki-919dcaf8                    monitoring-disabled                              ebs                                      paravirtual     xen             sg-5fc61437     default

The instance is up and running within a few minutes:

$ ec2-describe-instances i-b5f5a2e6 -H
Type    ReservationID   Owner   Groups  Platform
RESERVATION     r-cde66bb3      462281317311    default
INSTANCE        i-b5f5a2e6      ami-fb8e9292       ip-10-145-209-67.ec2.internal     running marc-aws        0               m3.xlarge       2014-06-16T20:23:48+0000        us-east-1d      aki-919dcaf8                      monitoring-disabled                   ebs                      paravirtual      xen             sg-5fc61437     default
BLOCKDEVICE     /dev/sda1       vol-1633ed53    2014-06-16T20:23:52.000Z        true

Logging in as ec2-user. First of all, let’s see what /proc/cpuinfo says:

[ec2-user@ip-10-7-160-199 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1

Looks like I got some of the slightly faster 2.6GHz CPUs. /proc/cpuinfo shows four processors, each with physical id 0 and core id 0. Or in other words, one single-core processor with 4 threads. We know that the E5-2670 v2 processor is actually a 10-core processor, so the information we see at the OS level is not quite corresponding.

Nevertheless, we’ll proceed with a few simple tests. I’m going to run “gzip”, an integer-compute-intensive compression test, on 2.2GB of zeroes from /dev/zero. By using synthetic input and discarding output, we can avoid effects of disk I/O. I’m going to combine this test with taskset comments to impose processor affinity on the process.

A simple test

The simplest case: a single thread, on processor 0:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0 $$
pid 1531's current affinity list: 0-3
pid 1531's new affinity list: 0
[ec2-user@ip-10-7-160-199 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 17.8837 s, 121 MB/s

With the single processor, we can process 121 MB/sec. Let’s try running two gzips at once. Sharing a single processor, we should see half the throughput.

[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 35.8279 s, 60.6 MB/s
2170552320 bytes (2.2 GB) copied, 35.8666 s, 60.5 MB/s
Sharing those cores

Now, let’s make things more interesting: two threads, on adjacent processors. If they are truly dedicated CPU cores, we should get a full 121 MB/s each. If our processors are in fact hyperthreads, we’ll see throughput drop.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,1 $$
pid 1531's current affinity list: 0
pid 1531's new affinity list: 0,1
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 27.1704 s, 79.9 MB/s
2170552320 bytes (2.2 GB) copied, 27.1687 s, 79.9 MB/s

We have our answer: throughput has dropped by a third, to 79.9 MB/sec, showing that processors 0 and 1 are threads sharing a single core. (But note that Hyperthreading is giving performance benefits here: 79.9 MB/s on a shared core is higher than then 60.5 MB/s we see when sharing a single hyperthread.)

Trying the exact same test, but this time, non-adjacent processors 0 and 2:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,2 $$
pid 1531's current affinity list: 0,1
pid 1531's new affinity list: 0,2
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 17.8967 s, 121 MB/s
2170552320 bytes (2.2 GB) copied, 17.8982 s, 121 MB/s

All the way up to full-speed, showing dedicated cores.

What does this all mean? Let’s go back to the Amazon’s vCPU definition

Each vCPU is a hyperthreaded core

As our tests have shown, a vCPU is most definitely not a core. It’s a half of a shared core, or one hyperthread.

A side effect: inconsistent performance

There’s another issue at play here too: the shared-core behavior is hidden from the operating system. Going back to /proc/cpuinfo:

[ec2-user@ip-10-7-160-199 ~]$ grep 'core id' /proc/cpuinfo
core id         : 0
core id         : 0
core id         : 0
core id         : 0

This means that the OS scheduler has no way of knowing which processors have shared cores, and can not schedule tasks around it. Let’s go back to our two-thread test, but instead of restricting it to two specific processors, we’ll let it run on any of them.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0-3 $$
pid 1531's current affinity list: 0,2
pid 1531's new affinity list: 0-3
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 18.041 s, 120 MB/s
2170552320 bytes (2.2 GB) copied, 18.0451 s, 120 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 21.2189 s, 102 MB/s
2170552320 bytes (2.2 GB) copied, 21.2215 s, 102 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 26.2199 s, 82.8 MB/s
2170552320 bytes (2.2 GB) copied, 26.22 s, 82.8 MB/s

We see throughput varying between 82 MB/sec and 120 MB/sec, for the exact same workload. To get some more performance information, we’ll configure top to run 10-second samples with per-processor usage information:

[ec2-user@ip-10-7-160-199 ~]$ cat > ~/.toprc <<-EOF
RCfile for "top with windows"           # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
        winflags=25913, sortindx=10, maxtasks=2
        summclr=1, msgsclr=1, headclr=3, taskclr=1
Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
        winflags=62777, sortindx=0, maxtasks=0
        summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
        winflags=62777, sortindx=13, maxtasks=0
        summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
        winflags=62777, sortindx=4, maxtasks=0
        summclr=3, msgsclr=3, headclr=2, taskclr=3
[ec2-user@ip-10-7-160-199 ~]$ top -b -n10 -U ec2-user
top - 21:07:50 up 43 min,  2 users,  load average: 0.55, 0.45, 0.36
Tasks:  86 total,   4 running,  82 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.7%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  1.4%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.3%hi,  0.0%si,  0.3%st
Cpu2  : 96.0%us,  4.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.7%hi,  0.0%si,  0.3%st

 1766 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:06.08 gzip
 1768 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:06.08 gzip

Here two non-adjacent CPUs are in use. But 3 seconds later, the processes are running on adjacent CPUs:

top - 21:07:53 up 43 min,  2 users,  load average: 0.55, 0.45, 0.36
Tasks:  86 total,   4 running,  82 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.3%us,  3.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 96.0%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.0%si,  0.3%st
Cpu3  :  0.3%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st

 1766 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:09.08 gzip
 1768 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:09.08 gzip

Although usage percentages are similar, we’ve seen earlier that throughput drops by a third when cores are shared, and we see varied throughput as the processes are context-switched between processors.

This type of situation arises where compute-intensive workloads are running, and when there are fewer processes than total CPU threads. And if only AWS would report correct core IDs to the system, this problem wouldn’t happen: the OS scheduler would make sure processes did not share cores unless necessary.

Here’s a chart summarizing the results:

aws-cpu Summing up

Over the course of the testing I’ve learned two things:

  • A vCPU in an AWS environment actually represents only half a physical core. So if you’re looking for equivalent compute capacity to, say, an 8-core server, you would need a so-called 4xlarge EC2 instance with 16 vCPUs. So take it into account in your costing models!
  • The mislabeling of the CPU threads as separate single-core processors can result in performance variability as processes are switched between threads. This is something the AWS and/or Xen teams should be able to fix in the kernel.

Readers: what has been your experience with CPU performance in AWS? If any of you has access to a physical machine running E5-2670 processors, it would be interesting to see how the simple gzip test runs.

Categories: DBA Blogs

Working in Pythian’s Advanced Technology Consulting Group

Mon, 2014-06-23 08:22

Before I joined Pythian, I had the fortune of having a lot of good jobs across various industries. My favorite jobs were the ones that were fast paced and required me to ramp up my skills on the fly while learning new technology. My least favorite jobs were the ones where my skills were not used and the challenges were few and far between. When I joined Pythian I hadn’t realized I found my first great job.

In April 2012, I joined Pythian’s Professional Consulting Group (PCG). The members of PCG represented some of the world’s leading data experts, but the name did not adequately represent the skills of the members. Members of PCG were experts in many complementary technologies and many, if not all, were quickly becoming experts in emerging technologies such as Big Data. Because of this, the Professional Consulting Group became the Advanced Technology Consulting Group (ATCG).

As a member of ATCG, my main responsibility is to deliver consulting services to our customers either on site or remotely. Examples of some of the work we might do include: troubleshooting performance problems, migrating databases into Exadata, setting up replication with Oracle GoldenGate, and data integration with numerous sources using Oracle Data Integrator. While all of the items I mentioned deal with Oracle technologies, ATCG also has members who specialize in Microsoft SQL Server and MySQL.

The services we provide to our customers do not stop at traditional database services, ATCG also delivers Big Data services using Hadoop. Examples of some of the Hadoop work I have been involved with include: installing and configuring Cloudera Hadoop, securing Hadoop with Kerberos, and troubleshooting performance. As you can see, ATCG has the opportunity to gain valuable experience across a broad range of technologies.

All of our projects begin with a call with the potential customer. Members of ATCG serve as a technical resource on the call. It is our responsibility to understand the customer’s issue and estimate the effort required to perform the work. Sometimes this can be challenging because the customer might not have a technical resource on their side who can articulately convey the issue. Even if there is a technical resource on the customer’s side, we have to be mindful to not alienate others on the call, so it is vitally important that we are able to convey our message in way everybody on the call can understand.

You might be thinking “I am not a salesperson!” and “I have never used some of these technologies.” You would not be alone. ATCG are not sales people, we simply assist Sales by providing our technical knowledge on a call. Imagine that you are speaking with your boss or customer about a problem or issue – It really is no different. Dealing with new technology is little different at Pythian from your current job; If you don’t understand something, you can talk to a few coworkers or research on the net at your current job. At Pythian we can reach out to 200+ coworkers and find quite a few who have experience with the technology in question. We can search our internal technical documents, which are quite vast as they detail all of the work we have done, and as a last resort we can search the net. At Pythian, you are never alone and you are never without resources.

There are times when we might not have a project to work on, a.k.a. downtime. During our downtime, we can build our knowledge of technologies that we have interest in or that we may need a refresher for. We can practice our new found knowledge assisting other teams. We can help build the Pythian knowledge base by posting blogs and contributing to our internal documentation.

The work in ATCG is very challenging and you are always learning something new, whether it is a new technology or a new way of thinking about a particular topic. Being bored or pigeonholed is not a problem in ATCG; we are involved in some of toughest problems and work with the latest technologies. And when we are not, we are in control of our workday so we can pursue interests in new and emerging database technologies.

Categories: DBA Blogs

Ambari Blueprints and One-Touch Hadoop Clusters

Fri, 2014-06-20 11:11

For those who aren’t familiar, Apache Ambari is the best open source solution for managing your Hadoop cluster: it’s capable of adding nodes, assigning roles, managing configuration and monitoring cluster health. Ambari is HortonWorks’ version of Cloudera Manager and MapR’s Warden, and it has been steadily improving with every release. As of version 1.5.1, Ambari added support for a declarative configuration (called a Blueprint) which makes it easy to automatically create clusters with many ecosystem components in the cloud. I’ll give an example of how to use Ambari Blueprints, and compare them with existing one-touch deployment methods for other distributions.

Why would I want that?

I’ve been working on improving the methodology used by the Berkeley Big Data Benchmark. Right now spinning up the clusters is a relatively manual process, where the user has to step through the web interfaces of Cloudera Manager and Ambari, copy-paste certificates and IPs, and assign roles to nodes. The benchmark runs on EC2 instances, so I’ve been focused on automatic ways to create clusters on Amazon:

  • Apache Whirr can create a Hadoop cluster (or a number of other Big Data technologies), including CDH5, MapR and HDP. Documentation is sparse, and there doesn’t appear to be support for installing ecosystem projects like Hive automatically.
  • Amazon EMR supports installing Hive and Impala natively, and other projects like Shark via bootstrap actions. These tend to be older versions which aren’t suitable for my purposes.
  • MapR’s distribution is also available on EMR, but I haven’t used that since the different filesystem (MapRFS vs. HDFS) would impact results.

Hive-on-Tez is only supported on HDP at the moment, so it’s crucial that I have a one-touch command to create both CDH5 clusters, but also HDP clusters. Ambari Blueprints provide a crucial piece of the solution.

The Blueprint

Blueprints themselves are just JSON documents you send to the Ambari REST API. Every Ambari Blueprint has two main parts: a list of “host groups”, and configuration.

Host Groups

Host groups are a set of machines with the same agents (“components” in Ambari terms) installed – a typical cluster might have host groups for the NameNode, SecondaryNameNode, ResourceManager, DataNodes and client nodes for submitting jobs. The small clusters I’m creating have a “master” node group with the NameNode, ResourceManager, and HiveServer components on a single server and then a collection of “slaves” running the NodeManager and DataNode components. Besides a list of software components to install, every host group has a cardinality. Right now this is a bit of a pain, since the cardinality is exact: your blueprint with 5 slave nodes must have 5 slaves Hopefully the developers will add an option for “many”, so we don’t have to generate a new blueprint for every different sized cluster.  Thanks to John from HortonWorks for a correction, cardinality is an optional hint which isn’t validated by Ambari. This wasn’t clear from the docs.

To provide a concrete example, the sample host groups I’m using look like this:

"host_groups" : [
 "name" : "master",
 "components" : [
 "name" : "NAMENODE"
 "name" : "HIVE_METASTORE"
 "name" : "HIVE_SERVER"
 "name" : "MYSQL_SERVER"
 "cardinality" : "1"
 "name" : "slaves",
 "components" : [
 "name" : "DATANODE"
 "name" : "HDFS_CLIENT"
 "name" : "NODEMANAGER"
 "name" : "YARN_CLIENT"
 "name" : "TEZ_CLIENT"
 "name" : "HIVE_CLIENT"
 "cardinality" : "5"

This host_groups describes a single node with all of the “master” components installed, and five slaves with just the DataNode, NodeManager and clients installed. Note that some components have depedencies: it’s possible to build an invalid blueprint which contains a HIVE_METASTORE but not a MYSQL_SERVER. The REST API provides appropriate error messages when such a blueprint is submitted.


Configuration allows you to override the defaults for any services you’re installing, and it comes in two varieties: global, and service-specific. Global parameters are required for different services: to my knowledge Nagios and Hive require global parameters to be specified – these parameters apply to multiple roles within the cluster, and the API will tell you if any are missing. Most cluster configuration (your typical core-site.xml, hive-site.xml, etc. parameters) can be overriden in the blueprint by specifying a configuration with the leading part of the file name, and then providing a map of the keys to overwrite. The configuration below provides a global variable that Hive requires, and it also overrides some of the default parameters in hive-site.xml. These changes will be propagated to the cluster as if you changed them in the Ambari UI.

"configurations": [
    "global": {
      "hive_metastore_user_passwd": "p"
    "hive-site": {
      "javax.jdo.option.ConnectionPassword": "p",
      "": "",
      "hive.execution.engine": "tez",
      "hive.exec.failure.hooks": "",
      "hive.exec.pre.hooks": "",
      "": ""

This config will override some parameters in hive-site.xml, as well as setting the metastore password to ‘p’. Note that you can specify more configuration files to override (core-site.xml, hdfs-site.xml, etc.) but each file must be it’s own object in the configurations array, similar to how global and hive-site are handled above.

Once you’ve specified the host groups and any configuration overrides, the Blueprint also needs a stack – the versions of software to install. Right now Ambari only supports HDP – see this table for the stack versions supported in each Ambari release. As a weird constraint, the blueprint name is inside the blueprint itself, along with the stack information. This name must be the same as the name you provide to the REST endpoint, for some reason. To upload a new blueprint to an Ambari server you can use:

$ curl -X POST -H 'X-Requested-By: Pythian' <ambari-host>/api/v1/blueprints/<blueprint name> -d @<blueprint file>

The X-Requested-By header is required, and as noted the blueprint name must match the file.

You can see the entire blueprint file from this example here, feel free to use it as a baseline for your cluster.

Creating the Cluster

Once you’ve written a blueprint with the services and configuration you want, you need to:

  • Create EC2 instances with the correct security groups
  • Install ambari-master on one, and ambari-agent on the others
  • Configure the agents to report to the master
  • Write a file mapping hosts to host groups
  • Push both files (the blueprint and the mapping) to the REST API

Fortunately, we have a Python script that can do that for you! This script will create a benchmarking cluster with a specific number of data nodes, an Ambari master and a separate Hadoop master. It can easily be modified to create multiple classes of machines, if you want to have more host groups than “master” and “slave”. The core of the script (the EC2 interaction and Ambari RPM installation) is stolen from based on work by Ahir Reddy from Databricks, with the Ambari Blueprints support added by yours truly.

If you’re curious about the host mapping file: it has the blueprint name, and an array of host names for every host_group. Corresponding to the example above, the cluster definition would be:

  "host_groups: [

You could replace “host-n” with the real domain names for your Amazon instances (use the internal ones!), and create a new cluster over those machines using:

$ curl -X POST -H 'X-Requested-By: Pythian' <ambari-host>/api/v1/clusters/<cluster name> -d @<mapping file>

Ambari Blueprints have some rough edges right now, but they provide a convenient way to deploy all of the services supported by the HDP stack. Watch this space for more posts about my effort to create a repeatable, one-touch, cross-distribution Hadoop SQL benchmark on EC2.

Categories: DBA Blogs

Log Buffer #376, A Carnival of the Vanities for DBAs

Fri, 2014-06-20 08:52

This Log Buffer Editions collects and present you various blog posts from Oracle, SQL Server and MySQL arena.


Oracle SOA Suite Achieves Significant Customer Adoption and Industry Recognition

The Perks of Integrated Business Planning

Why are we so excited about Oracle Database XE R2?

Skipping ACS ramp-up using a SQL Patch

Moving datafiles has always been a pain.

SQL Server:

Using DBCC DROPCLEANBUFFERS When Testing Performance

Data Mining Introduction Part 9: Microsoft Linear Regression

PowerShell One-Liners: Accessing, Handling and Writing Data

Stairway to SQL Server Security: Level 2, Authentication

Matthew Flatt was trying to measure the performance of a tool interacting with Microsoft Azure


Using UDFs for geo-distance search in MySQL

Amidst various blog postings on Docker, a security issue announced yesterday that detailed an exploit of Docker that makes it possible to do container breakout.

Discover the MySQL Central @ OpenWorld Content Catalog

Shinguz: Replication channel fail-over with Galera Cluster for MySQL

SQL queries preview before sent to server in MySQL for Excel

Categories: DBA Blogs

Service Reliability: How Systems Administration is Evolving – Part Two

Thu, 2014-06-19 08:06

In part one of this mini-series, I wrote about how the old ways of systems administration is broken and shared a story from my personal experiences. Today I’ll be talking about how it’s evolved with Site Reliability Engineering (also known as Service Reliability Engineering or SRE).

SRE is born

Interestingly, this is a concept I was pushing back around 2008-2010, but unfortunately my ideas fell on deaf ears.  How do Netflix, Facebook, Dropbox, and Google provide such reliable service? First I can tell you what they don’t do – they don’t throw more hardware or ram at the problem.

So how do you do it?

First everyone has to agree that service reliability is everyone’s problem.  It is the reason our jobs exist! Management must agree, because the way a company thinks and operates needs to change from the top down, and the focus needs to be on service reliability.

Operations and Development teams need to be aligned so that outages are everyone’s problem. Remember the story in my previous blog post? What do you think would have happened if I had forced the developers to reboot the servers every 12 hours, then every 24 hours? Would it have taken weeks or months to resolve the problem? Of course not.

You need your Ops team to know how to code, and understand development. I’m not saying they have to actually be on Dev team, but they need to understand coding, speak the language, and script their way out of issues. Most importantly, they must be on the hook if things go wrong.

At Google and Facebook, their developers are on call. If a release suddenly causes a significant amount of paging/failures, they don’t hire more Ops employees. The developers fill the gap until the issue is fixed, which clearly happens very quickly.

No one wants to be getting paged all night, especially developers. If they’re getting paged due to a bug in their software, you can bet the issue will be resolved in days, not weeks. Making developers responsible for the service reliability means they are required to think about failure in their software development – they’ll have to design in graceful failure modes, and expect their underlying infrastructure to be unreliable. If their little widget takes down the entire site, you can be sure they’re going to be up until it’s corrected.

The bottom line is that software and hardware come together to provide a service, and one isn’t more important than the other. Major wins in reliability can be realized if you align software design, automation, and hardware design.

At Pythian, we offer this as a managed service. Our team has the experience, background, and skills to help get you to this promise land of high reliable, scalable systems. We are building a team of Site Reliability Engineers to help companies benefit and transition to this new paradigm in systems administration, where service reliability is everyone’s problem.

Categories: DBA Blogs

Service Reliability: How Systems Administration is Evolving – Part One

Wed, 2014-06-18 07:52

The world of systems administration is changing, and it’s affecting everyone involved. Today’s blog post is the first of two in mini-series dedicated to Service Reliability: How Systems Administration is Evolving.

The days of formulas that tell us, You need “X” system admins, for every “Y” physical server, and every “Z” VMs are coming to a close. Even the world of IT management is changing.

Why? Because as scale increases, it’s simply impossible to continue at the pace we were at. Google saw this in the mid 2000’s and began the next evolution of systems administration. They recognized that there was no way they could scale up the way things had been managed for decades. In fact, Netflix came to a similar conclusion.

Interestingly, what Google did has nothing to do with technology, rather to do with the philosophy of systems administration. They started a new group which they originally called Production Engineering, and was later renamed Site Reliability Engineering, also known as Service Reliability Engineering or SRE.  At its core, SRE changes the fundamental thinking of IT management. It recognizes site reliability as everyone’s responsibility. Some might say that’s obvious, but in the past it wasn’t.

Worked fine in dev meme.jpg

The old way is broken

Most companies have two very separate and distinct groups. Operations and Development. Historically these two groups are highly siloed, and in some cases, do not get along very well. Why? It comes down to philosophy, really.

Operations folks are driven to ensure systems are up, secure, and reliable. Developers, on the other hand, are driven to create cool new features and applications. Here lies one of the biggest problems.

Years back I worked as an Operations Director, and had a counterpart on the development side who was the Software Engineering Director. We had just completed releasing a major update for one of our platforms, and very quickly we saw we had major issues. Our primary application servers (25+ physical boxes) were becoming unstable after about 12 hours of production load (I won’t go into why this happened, that’s a story for another day.) We quickly identified this, so the Ops team began rebooting these boxes in a rolling fashion. They were boxes that had some specialized hardware in them, and starting/stopping, then testing them took about 15-30 minutes each.  We had a team of about 5 people, which was not a 24/7 group. Clearly this caused significant pain for our Operations staff. We determined that part of the problem was a memory leak. Due to the nature of the release, rolling back simply was not an option.

The initial response I received was that we would just have to deal with it for now, as there were a few other pressing issues they wanted to resolve first. After many sleepless nights and lost weekends, we finally were able to get a update so the systems only needed to be rebooted daily, 7 days a week. It stayed this way for months.

Crazy, right?

But why? It was because the software team, and the management we both reported to, was far more interested in hitting deadlines for features, and new functionality – not how much sleep, or how many days off our Ops employees were getting. I was told on more than one occasion that high availability and recovery were Ops problems, not Development problems.

The core of this problem is simple. Development felt that service reliability was 100%  an Operations problem. Our new release takes 2x more ram? Add more ram to 100 servers! Our new application requires 20 new servers? Sure, with some work it could be cut down to 2-3, but just get the 20 servers. That’s easy!

Without naming names, has anyone else faced this issue? Comment below. Stay tuned for part two, where I’ll be discussing the birth of SRE, how it’s allowed systems administration to evolve, and how to achieve it.

Categories: DBA Blogs

Room for Improvement – Using DBMS_REDEFINITION with Spatial Data

Tue, 2014-06-17 07:53
Resizing a Column in a Table With a Spatial Column

Recently a client asked for help with using DBMS_REDEFINITION to modify a column in a table.

As the intent was to reduce the precision of a numeric column, simply modifying the column would not work, as there was already data in the column:

13:51:44 - jkstill@js01 SQL&gt; /
alter table t modify(id number(6,2))
ERROR at line 1:
ORA-01440: column to be modified must be empty to decrease precision or scale

The client had made a few attempts use DBMS_REDEFINITION, but ran into several issues with each attempt.

In this case the owner of the destination table was different than the owner of the source table.
As per My Oracle Support, this situation requires several permissions granted that may not normally be necessary for using DBMS_REDEFINITION.
However I have seen similar problems when the source and destination user are the same.

The version of database for the client is running on Solaris
These test are also performed on, but on Linux 5 rather than Solaris.

Most of the issues involve permissions that must be granted to the owner of the destination object.
This is true even if the job is being run as SYSDBA.
These errors were due to the table in question having a column of spatial data with a corresponding spatial index.

17:27:49 – sys@js01 SQL> desc spdata
Name Null? Type
———————————————– ——– ——————————–


The information in this article will be presented somewhat backwards to the way that is usually seen.
First I will show a working example of using DBMS_REDEFINITION to redefine the column.
Following that some of the problems will be highlighted, and then some technical references shown.

Doing so will make this article a little more user friendly I think.
If you are facing a similar issue then the parts you need the most are right up front.

A Working Example

Following are the steps to create the test data, setup DBMS_REDEFINITION and complete the process

Create the Test Data

The first step is to create a test user. The permissions granted to this user are quite important, as will be demonstrated later on.

-- redefdest_user.sql

create user redefdest identified by redefdest;

grant resource, connect, create session, unlimited tablespace to redefdest;

grant alter session to redefdest;

grant execute on dbms_redefinition to redefdest;

grant create any sequence  to redefdest;
grant create any table  to redefdest;
grant alter any table  to redefdest;
grant drop any table  to redefdest;
grant lock any table  to redefdest;
grant select any table to redefdest;
grant create any index to redefdest;
grant create any trigger  to redefdest;

Once this process is completed, remember to revoke any privileges that are not needed on a permanent basis.

Now let’s create a simple table with some spatial data:

-- tab_create.sql

create table redefdest.spdata
   id number(28,0) not null,
   radius number(16,8) not null,
   location mdsys.sdo_geometry not null

The next step is to setup the Geometry Metadata. This must be run as the owner of the SPDATA table.

-- insert_geo_data.sql

delete from user_sdo_geom_metadata where  table_name = 'SPDATA' and column_name = 'LOCATION';

insert into user_sdo_geom_metadata
(table_name, column_name, diminfo, srid)
   sdo_dim_array (
      sdo_dim_element ('Longitude', -180, 180, 0.005),
      sdo_dim_element ('Latitude', -90, 90, 0.005)
   8307 -- SRID - Spatial Reference IDentifier - see mdsys.cs_srs


select * from user_sdo_geom_metadata;

Now we can insert some test data into the table.

-- insert.sql

insert into spdata(id, radius, location)
   sdo_geometry(2001, 8307,
       sdo_point_type (floor(dbms_random.value(-180,180)),floor(dbms_random.value(-90,90)) , null),
       null, null
from (
select level id
from dual
connect by level <= 100
) data;


Now create indexes on the ID column and the spatial data column.

-- idx_create.sql

create index redefdest.spdata_id_idx on redefdest.spdata (id);

create index redefdest.spdata_location_idx on redefdest.spdata (location)
   indextype is mdsys.spatial_index  parameters ('SDO_DML_BATCH_SIZE=2000');

The goal of this excercise is to change the scale and precision of the RADIUS column.

That cannot be done directly via ALTER TABLE as any attempt to so will cause ORA-01440.

The first task is to create the interim table. This table will be created nearly identical to the source table. The RADIUS column in the interim table will have a smaller scale and precision than the source table.

The columns for the table SPDATA_INTERIM are all set as nullable.
The reason for the will be explained later on.

-- create_interim_table.sql

create table redefdest.spdata_interim
   id number(28,0),
   radius number(12,4),
   location mdsys.sdo_geometry

Now geo data for the interim table must be created. This is not the same script as seen previously. Though the script name is the same, it is stored in a different directory. The only difference from the previous script is the name of the table used – SPDATA_INTERIM rather than SPDATA.
As before, this SQL must be run as the owner of the table in question, SPDATA_INTERIM in this case.

-- insert_geo_data.sql

insert into user_sdo_geom_metadata
(table_name, column_name, diminfo, srid)
   sdo_dim_array (
      sdo_dim_element ('Longitude', -180, 180, 0.005),
      sdo_dim_element ('Latitude', -90, 90, 0.005)
   8307 -- SRID  - see mdsys.cs_srs


Now that the interim table has been created and the geo data inserted, the redefinition process can begin:

-- redefine.sql

  1  declare
  2  v_col_map varchar(2048) := 'ID ,RADIUS ,LOCATION';
  3  begin
  4     dbms_redefinition.start_redef_table (
  5              uname          => 'REDEFDEST'
  6             ,orig_table     => 'SPDATA'
  7             ,int_table      => 'SPDATA_INTERIM'
  8             ,col_mapping    => v_col_map
  9             ,options_flag   => dbms_redefinition.cons_use_rowid
 10             ,orderby_cols   => null
 11             ,part_name      => null
 12  );
 13* end;
17:34:51 - jkstill@js01 SQL> /

PL/SQL procedure successfully completed.

Once the refinition process has completed, the dependent objects can be created.

-- copy_dependent_objects.sql

Wrote file afiedt.buf

  1  declare
  2     v_number_of_errors number := 0;
  3  begin
  4     dbms_redefinition.copy_table_dependents(
  5             uname             => 'REDEFDEST'
  6             ,orig_table       => 'SPDATA'
  7             ,int_table        => 'SPDATA_INTERIM'
  8             ,copy_indexes     =>  dbms_redefinition.cons_orig_params
  9             ,copy_triggers    =>  true
 10             ,copy_constraints =>  true
 11             ,copy_privileges  =>  true
 12             ,ignore_errors    => false
 13             ,num_errors       => v_number_of_errors
 14             ,copy_statistics  => true
 15             ,copy_mvlog       => true
 16     );
 17     dbms_output.put_line('Number of Errors' || v_number_of_errors);
 18* END;
17:35:58 - jkstill@js01 SQL> /

PL/SQL procedure successfully completed.

And now finish the redefinition process, check the SPDATA.RADIUS column to ensure it is now NUBMER(12,4), and drop the interim table (now the original table)

-- finish_redef.sql

  1  begin
  2    dbms_redefinition.finish_redef_table (
  3      uname          => 'REDEFDEST'
  4     ,orig_table     => 'SPDATA'
  5     ,int_table      => 'SPDATA_INTERIM'
  6  );
  7* end;
17:35:59 - jkstill@js01 SQL> /

17:36:43 - jkstill@js01 SQL> alter table redefdest.spdata modify (id not null);

Table altered.

17:44:05 - jkstill@js01 SQL>  alter table redefdest.spdata modify (radius not null);
alter table redefdest.spdata modify (radius not null)
ERROR at line 1:
ORA-01442: column to be modified to NOT NULL is already NOT NULL

17:44:05 - jkstill@js01 SQL> alter table redefdest.spdata modify (location not null);

Table altered.

PL/SQL procedure successfully completed.

16:01:40 - redefdest@js02 SQL> desc spdata
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                        NOT NULL NUMBER(28)
 RADIUS                                             NUMBER(12,4)
 LOCATION                                  NOT NULL MDSYS.SDO_GEOMETRY

16:01:40 - redefdest@js02 SQL> select count(*) from spdata where radius is not null;


With all of the required permissions in place, everything works as expected.

Getting to that point however required reading a few Oracle Support notes and some experimentation.

There are several ways this process can fail if one of the requirements is not met.

Alter/Create/Drop Any Table

If the owner of the tables is missing any of ALTER/CREATE/DROP ANY TABLE, the process will fail when copying dependent objects.
This due to a bug in Oracle.

In this test the CREATE ANY TABLE privilege has been revoked, and the entire test re-run:

18:04:25 - sys@js01 SQL> revoke create any table from redefdest;

Revoke succeeded.

## Copy Dependents ##

ERROR at line 1:
ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine
ORA-13249: internal error in Spatial index: [mdidxrbd]
ORA-13249: Error in Spatial index: index build failed
ORA-13249: Error in R-tree: [mdrcrtscrt]
ORA-13231: failed to create index table [MDRT_190DB$] during R-tree creation
ORA-29400: data cartridge error
ORA-01031: insufficient privileges
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1155
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1885
ORA-06512: at line 4

Likewise if CREATE ANY SEQUENCE is revoked, the process will fail.
(CREATE ANY TABLE had already been re-granted)

18:12:23 - sys@js01 SQL> revoke create any sequence from redefdest;

Revoke succeeded.

## Copy Dependents ##

ERROR at line 1:
ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine
ORA-13249: internal error in Spatial index: [mdidxrbd]
ORA-13249: Error in Spatial index: index build failed
ORA-13249: Error in R-tree: [mdrcrtscrt]
ORA-13231: failed to create index table [MDRS_190F9$] during R-tree creation
ORA-29400: data cartridge error
ORA-01031: insufficient privileges
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1155
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1885
ORA-06512: at line 4

So each of these permission are required for this process to succeed, even when running as SYSDBA.

Permissions Required

The test objects created for this article are quite simple.
As such there are other potential errors that could occur due to the destination account not having all required permissions for a successful use of dbms_redefinition on a table with spatial data.
Before granting new privileges to an account, be sure to audit the current privileges.
That way you will know which privileges can be safely revoked when the table modifications are complete.

Constraint Issues

There are some bugs associated with DBMS_REDEFINITION.

The COPY_DEPENDENT_OBJECTS subprogram seemingly does not handle some constraints very well.

The original table SPDATA was created with all columns set to NOT NULL.

The interim table SPDATA_INTERIM was created with all columns set to NULL.

If the interim table is created with one or more columns as NOT NULL, the following error occurs:

## Copy Dependents ##
ERROR at line 1:
ORA-01442: column to be modified to NOT NULL is already NOT NULL
ORA-06512: at "SYS.DBMS_REDEFINITION", line 984
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1899
ORA-06512: at line 4

Part of the job of the procedure is to create check constraints as found on the original table.

That part of the process doesn’t seem to work quite correctly.

When the process has completed, the columns appear as NULLable to the sqlplus DESC command, but that is not the case.

18:21:15 - jkstill@js01 SQL> desc spdata
 Name								   Null?    Type
 ----------------------------------------------------------------- -------- --------------------------------------------
 ID									    NUMBER(28)
 RADIUS 								    NUMBER(12,4)

18:21:17 - jkstill@js01 SQL> insert into spdata values(null,null,null);
insert into spdata values(null,null,null)
ERROR at line 1:
ORA-01400: cannot insert NULL into ("REDEFDEST"."SPDATA"."ID")

18:21:35 - jkstill@js01 SQL> @check_cons

------------------------------ ------------------------------ - ------------------------------ --------
SPDATA			       SYS_C0018231		      C "ID" IS NOT NULL	       ENABLED
SPDATA			       SYS_C0018232		      C "RADIUS" IS NOT NULL	       ENABLED
SPDATA			       SYS_C0018233		      F "LOCATION" IS NOT NULL	       ENABLED

3 rows selected.

The output of the DESC command and the contents of DBA_CONSTRAINTS do not agree.

While it would be interesting to dig into the lower level reasons for why this is happening, it for now just an inconvience while there are more pressing matters to work on.


Following are the Oracle Notes considered when troubleshooting this issue.

Registering constraints

The DBMS_REDEFINITION.REGISTER_DEPENDENT_OBJECT procedure could be used to better deal with constraints, at least according to this Oracle Note:
I have not yet tried this.

In regard to desc table not showing constraints

This bug affects Oracle on Solaris, so it seems likely to be an issue for the client, and may well affect other platforms as well.

Direct privs granted to table owner

How to Re-Organize a Table Online (Doc ID 177407.1)
ORA-13233 and ORA-01031 When Creating a Spatial Index in Another Schema (Doc ID 251593.1)

Test Code:

Categories: DBA Blogs

Instant REST API For Any Database

Mon, 2014-06-16 07:48

Not so long ago, I was playing with ElasticSearch, which has the interesting characteristic of having a REST API as its primary interface. Sure, it’s a little more stilted and awkward than any native interface but, on the other hand, it’s a nice universal type of API. Any language that can make a http request can talk to it and, hey, bad comes to worse, even ‘curl’ will do. It would be kinda cool if other databases had such a web service.

And then I began to think…

Don’t we have DBIx::Class::Schema::Loader, which can connect to a database and auto-generate its DBIx::Class schema?

package MyDB;

use parent 'DBIx::Class::Schema::Loader'; 


# later on

my $schema = MyDB->connect( 'dbi:SQLite:foo.db' ); # boom, we have our schema

And once we have a DBIx::Class representation of a schema, can’t we introspect it and pretty much get everything there is to know about it?

use Data::Printer;

# get all the table names
my @tables = $schema->sources;

# and all the columns of all the tables
for my $table ( $schema->sources ) {
    say "Table $table";
    p $schema->source($table)->columns_info;

That is, that’s if we want to do it manually, considering that there’s already SQL::Translator that can do most of the job for us.

use SQL::Translator;

print SQL::Translator->new (
    parser      => 'SQL::Translator::Parser::DBIx::Class',
    parser_args => {
        dbic_schema => $schema,
    producer    => 'JSON',

Of course, since we are talking web service, we will want to pass everything back and forth using JSON, including database entries. Well, that’s hardly a problem if we use DBIx::Class::Helper::Row::ToJSON.

So it seems we have the database side covered. For the web framework? You’ll probably not be surprised to see me go with Dancer. Not only can we leverage the serializers and plugins like Dancer::Plugin::DBIC, but setting routes are ridiculously easy.

get '/_tables' => sub {
    return [ schema->sources ];

Even niftier: remember that Dancer routes are defined at runtime, so we can introspect that schema as much as we want and come up with any route we can dream of.

my @primary_key = schema->source($table)->primary_columns;
my $row_url = join '/', undef, $table, ( '*' ) x @primary_key;
 # GET ///
get $row_url => sub {
    my @ids = splat;
    return $schema->resultset($table)->find({
        zip @primary_key, @ids
 # GET /
get "/$table" => sub {
    my @things = $schema->resultset($table)->search({ params() })->all;
    return \@things;
 # create new entry
post "/$table" => sub {
    $schema->resultset($table)->create({ params() });

Added bonus: the way Dancer’s params() conglomerate parameters defined in the query string and in the serialized body of the request plays in our favor: simple queries can be passed directly via the url, and more complicated ones can be defined as JSON structures.

So, you put all of this together, and you obtain waack. All it needs is a dsn pointing to the right database (and credentials, if needed). To illustrate, let’s try with my Digikam SQLite database.

$ waack dbi:SQLite:digikam4.db
>> Dancer 1.3124 server 28914 listening on
>> Dancer::Plugin::DBIC (0.2100)
== Entering the development dance floor ...

And now, let’s fire up App::Presto as our REST client.

$ presto http://enkidu:3000

http://enkidu:3000> type application/json

First, we can retrieve all the table names.

http://enkidu:3000> GET /_tables

We can also get the whole schema.

http://enkidu:3000> GET /_schema
   "translator" : {
      "producer_args" : {},
      "show_warnings" : 0,
      "add_drop_table" : 0,
      "parser_args" : {
         "dbic_schema" : null
      "filename" : null,
      "no_comments" : 0,
      "version" : "0.11018",
      "parser_type" : "SQL::Translator::Parser::DBIx::Class",
      "trace" : 0,
      "producer_type" : "SQL::Translator::Producer::JSON"
   "schema" : {
      "tables" : {
         "ImageRelations" : {
            "options" : [],
            "indices" : [],
            "order" : "12",
            "name" : "ImageRelations",
            "constraints" : [
                  "type" : "UNIQUE",
                  "deferrable" : 1,
                  "name" : "subject_object_type_unique",
                  "on_delete" : "",
                  "reference_fields" : [],
                  "fields" : [
                  "match_type" : "",
                  "reference_table" : "",
                  "options" : [],
                  "expression" : "",
                  "on_update" : ""

Too much? We can get the columns of a single table.

http://enkidu:3000> GET /Tag/_schema
   "iconkde" : {
      "is_nullable" : 1,
      "data_type" : "text",
      "is_serializable" : 1
   "name" : {
      "is_serializable" : 1,
      "data_type" : "text",
      "is_nullable" : 0
   "id" : {
      "is_nullable" : 0,
      "data_type" : "integer",
      "is_auto_increment" : 1,
      "is_serializable" : 1
   "icon" : {
      "is_nullable" : 1,
      "data_type" : "integer",
      "is_serializable" : 1
   "pid" : {
      "is_serializable" : 1,
      "is_nullable" : 1,
      "data_type" : "integer"

Query that table, with a simple condition…

http://enkidu:3000> GET /Tag id=1
      "name" : "orchid",
      "icon" : null,
      "id" : 1,
      "pid" : 0,
      "iconkde" : null

… or with something a little more oomphie.

$ curl -XGET -H Content-Type:application/json --data '{"name":{"LIKE":"%bulbo%"}}' http://enkidu:3000/Tag
      "pid" : 1,
      "name" : "Bulbophyllum 'Melting Point'",
      "icon" : null,
      "id" : 32,
      "iconkde" : "/home/yanick/Pictures/My Plants/IMG_0461.JPG"
      "id" : 56,
      "iconkde" : "tag",
      "icon" : null,
      "pid" : 39,
      "name" : "Bulbophyllum ebergardetii"
      "name" : "bulbophyllum",
      "pid" : 564,
      "iconkde" : null,
      "id" : 565,
      "icon" : 0

Btw: I cheated for that last one. Presto doesn’t send body with GET requests. And Dancer doesn’t deserialize GET bodies either. Patches will be written tonight.

Anyway, back with the show. We can also select specific rows by primary keys.

http://enkidu:3000> GET /Tag/1
   "id" : 1,
   "iconkde" : null,
   "pid" : 0,
   "icon" : null,
   "name" : "orchid"

Create new rows.

http://enkidu:3000> POST /Tag '{"name":"nepenthes","pid":0}'
   "pid" : 0,
   "name" : "nepenthes",
   "iconkde" : null,
   "icon" : null,
   "id" : 569

And do updates.

http://enkidu:3000> PUT /Tag/569 '{"icon":"img.png"}'
   "icon" : "img.png",
   "iconkde" : null,
   "pid" : 0,
   "name" : "nepenthes",
   "id" : 569

Not too shabby, isn’t? Mostly considering that, if you look at the source of waack, you’ll see that it barely clock over 100 lines of code. Take a minute and let this sink in.

One hundred lines of code. For a universal database REST web service.

If that’s not standing on the shoulders of giants, then I don’t know what is.

Categories: DBA Blogs

Internet Scale Design: Part Two

Fri, 2014-06-13 08:01

In my previous blog post, I emphasized that internet scale design can be implemented for any type of company. Whether it’s a small, bootstrapped startup or a rapidly growing, well-funded tier 2. But if it’s suitable for that many companies, why isn’t everyone moving into the cloud? In my opinion, there are two reasons.

First, the model of utility computing doesn’t work for all business models. It is most effective in models where demand changes, where there are peaks and valleys for larger scale systems. It also works well as a way to get your startup or project off the ground with little-to-no capital investment. In the story I began in my previous blog post, the seasonality of their environment made them a perfect candidate.

The second is more of a people problem. In many companies,  IT leadership, SysAdmins, Developers, DBAs, and everyone else involved in service management, have been working with whatever technology stack that company has been using for years. It’s important to remember that most SysAdmins see their primary job as keeping things up and running, so we typically prefer working with things we know vs things we don’t.

If a C-level executive or VP returns from a conference about cloud, and issues a mandate that they need to “move everything to the cloud!” to remain “competitive” the SysAdmins will likely fail. Why? Not because they’re not smart enough, but because they simply don’t know enough about it.

While it would be ideal for the COO to say, “I want us to look into moving our platform into AWS, so I’m going to send you to get Amazon certified,” it rarely happens. Usually it sounds more like, “You’re smart, you’ll figure it out. Oh sure, you already have a full-time job keeping the lights on, but just squeeze it in when you can. We’ll need to see a POC by the end of the quarter.”

I don’t need to tell you how this ends ? it’s obvious. It will fail almost every time.

One of the amazing benefits to the Pythian model is that our teams are exposed to a wide variety of systems. We have built auto-scaling systems in AWS, OpenStack systems, VMWare systems, as well as legacy physical systems we support. Our teams are not relegated to whichever tech stack they happened to be stuck with for the last five years.

The bottom line here is that it doesn’t matter what kind of company you’re at – Whether it’s a small retailer, midsize tier 3, or larger tier 2, if you’re willing to sign on for the concept of site reliability engineering and commit to it, together we can accomplish some amazing things, all for a price you can afford.

Categories: DBA Blogs

Log Buffer #375, A Carnival of the Vanities for DBAs

Fri, 2014-06-13 07:57

What’s better; Watching Football World Cup or Reading Log Buffer? Yes, right Log Buffer, but please also take out sometime to watch the matches, as this tournament comes only once in 4 years. No? Ok, as you say. So read along then.


Alan Hargreaves reasons as Why you should Patch NTP.

This post examines the results of a recent database protection survey conducted by Database Trends and Applications (DBTA) Magazine.

Keep jobs active with screen command.

ORE Getting Connected: ore.connect and other commands.

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.4 : Transforming Data using Python & Hive Streaming

SQL Server:

Enabling and Configuring Reporting and Logging for Maintenance Plans in SQL Server 2012

The articles collected here will help you understand the theories and methodologies behind every stage of the database delivery pipeline, starting when database changes are checked in, and ending when they’re deployed to production.

Stairway to Database Source Control Level 2: Getting a Database into Source Control .

What does it take to become a database administrator, or what kinds of traits should I be looking for when I am hiring a DBA. Those traits can be summarized it two categories: Technical and Personal.

Display all errors thrown within a catch block by a restore statement in a stored procedure using extended events.


RHEL7 & the transition from MySQL to MariaDB: A first look.

FairWarning Privacy Monitoring Solutions Rely on MySQL to Secure Patient Data

MariaDB 5.5.38 Overview and Highlights

Recruiters Looking for MySQL DBAs and MySQL Developers

MariaDB Galera Cluster 10.0.11 now available

On-disk/block-level encryption for MariaDB

Categories: DBA Blogs

Internet Scale Design: Part One

Thu, 2014-06-12 07:52

Whether your company is a small, bootstrapped startup or a rapidly growing, well-funded tier 2 or 3 business, everyone dreams of having ‘Internet scale’ systems which are highly automated, anti-fragile, self-healing, and inexpensive to maintain.

The problem, however, is that those types of complex systems are only within the reach of well-funded companies who can afford to hire elite technical resources and expensive, leading edge technology systems, right?


Recently, I was working with one of my smaller retail clients. Their business is highly seasonal, doing the vast majority of their business during two separate two-month periods each year. During these periods, the demand on their system is extremely high, and downtime would be catastrophic because of the amount of their annual revenue generated during those times.

They run on a LAMP stack (Linux Apache MySQL PHP) and they’re hosted on a mix of shared VMs, and colocation equipment.  Their hosting costs are fixed year round, even though most of the year they don’t need two thirds of it.

They were beginning their annual budget review when we brought up what the next steps were for them. About a year ago, I began introducing the idea of Amazon Web Services (AWS) cloud, taking more of a site reliability engineering (SRE) approach to their business, so we scheduled a conference call with their Chief Financial Officer, a few VPs, and one of my SREs.

“We would like to move you 100% into the Amazon Cloud,” I said. “In doing so, we will automate all of your systems using image and configuration management, and set up detailed monitoring, graphics, and trending systems. We will work with your developers to redesign your applications to take advantage of the platform and its flexibility. The end result will be a system that is considerably more anti-fragile, runs in multiple regions, offers significantly faster recovery in the event of major failures, and can scale down to very little and up to larger scale in a moment’s notice.” We assured them that the systems could be self-healing and would require very little management.

The Sr. VP of Product Management’s response surprised me.

“Bill, this sounds awesome. But we’re a pretty small company, and we couldn’t possibly afford that sort of system.” I laughed a bit and responded, “Actually, not only can you afford it, but it will save you tens of thousands of dollars.”

I explained the benefits of auto-scaling and the cost savings it would bring to their business model. We discussed how those cost savings would fund the expenses of having Pythian build and manage the new service.

His response? “This is amazing – why isn’t everyone doing this?”

The answer is twofold. I’ll be sharing my insights in part two, so stay tuned…

Categories: DBA Blogs

Availability Group – Delay with Listener Connectivity After Failover

Wed, 2014-06-11 07:45

The beauty of working for multiple clients from different industries, is that you get exposed to a myriad of environment setup and configuration. Every company has its own standards for Network and Server configuration, as well different hardware vendors.  This introduces their own kinks and excitement to your everyday work – half of which you’ll likely not encounter if you are working in-house and using the same hardware.

The past week we encountered a rare and interesting issue with High Availability Group. The issue was two-fold, first it was initially not failing over automatically one one node though that was not as exciting as the second part – when it was able to failover correctly, our client was experiencing delays with the availability of the Listener Name outside its own sub-net  after failover, automatic or otherwise.  It is reachable within its own subnet but takes more than thirty minutes to be reachable outside of it even though the Failover happened smoothly and without error.

The first part was fairly straightforward. Checking on the cluster logs and event logs the automatic failover was throwing the error below when trying to failover on one of the nodes.

Cluster network name resource 'Listener_DNS_NAME' failed registration of one or more associated DNS name(s) for the following reason:
DNS operation refused.

Ensure that the network adapters associated with dependent IP address resources are configured with at least one accessible DNS server.

The error is as it says, the Computer object does not have the appropriate permissions on the Domain to register the DNS Name Resource for the Listener.  For the cluster  to perform this operation smoothly “Authenticated Users” should have read/write all permissions on the Computer Object for the cluster, its nodes and the Listener DNS Name. To do this, Log in to the Active Directory Server

  1. Open Active Directory Users and Computers.
  2. On the View menu, select Advanced Features.
  3. Right-click the object  and then click Properties.
  4. On the Security tab, click Advanced to view all of the permission entries that exist for the object.
  5. Verify that the Authenticated Users is in the list and has the permission to Read and Write All. Add the required permissions then Save the changes.

Now after doing that and testing the fail over, it is now encountering a different error, Kerberos-related one showed below.

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server ComputerName$. The target name used was HTTP/ This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (Domain.COM) is different from the client domain (Doamin.COM), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Ah, the often over-looked SPN. This should be part of your installation process – setting the SPN. To keep the story short and so we can get to the exciting part, you can refer here for the detailed instructions on how to configure the SPN for SQL Server.  Aside from registering the SPN for each of the Nodes as specified on the MSDN Link, You’ll also need to register the SPN for the Listener, as always 1433 is the port being used by your SQL Server:

setspn -A MSSQLSvc/ DOMAIN/SQLServiceAccount

This will enable Kerberos for the client connection to the Availability Group Listener and address the errors we received above. After configuring the SPN for the servers Automatic Fail over is now running smoothly, or so we thought.

The client came back to us that it was taking some time for the application to connect to the Listener Name. Checking on the cluster logs and SQL Server’s end, everything appears to be in order. No errors being thrown and Server is accessible. Now it get’s interesting. Ping test within the Database sub net is successful but ping test outside of it was timing out. It was able to connect though after a substantial amount of time.  After a few more test and checking the behavior is the same on both nodes. It takes more than thirty minutes for the Name to be reachable outside of the database sub net.  After involving the Network Admin we found out that  a MAC Address conflict is happening.  That’s our “Aha!” moment.  Windows 2003 servers and later issues a Gratuitous ARP (GARP) requests during failover. There are some switches/devices that does not forward Gratuitous ARP by default.   This causes the devices on the other end of the switch to not have the correct MAC address associated to the Name.  This causes the failure. It often corrects itself when the router detects the failures and do a broadcast and gets the correct value. That’s why it becomes accessible after some time. This KB details the issue. To address this, changes must be done on the configuration of the switches, you’ll need to check with your hard ware vendor for this.

Now, that could be the end of our problems, but after enabling the switch to forward GARP, we found out that the server itself is not sending a GARP request.  This is a server configuration issue and requires some Registry changes. Open the Registry for the server and locate the key below:


From there check if there is a key for ArpRetryCount, if there is make sure that the Value is not set to 0. The value could be between 0-3. After changing this and restarting the servers Everything works perfectly.

Last two issues are a bit rare, and something I wouldn’t have experienced if the client wasn’t using that particular hard ware and that particular standard configuration.

Categories: DBA Blogs

2014 Hadoop Summit Summary

Mon, 2014-06-09 07:41


Last week I was at the 2014 Hadoop Summit in San Jose, trying to keep abreast of the ever-changing Apache landscape: what projects are up-and-coming, what projects are ready for production, and most importantly which projects can solve problems for our clients. It was also a great chance to hear about real, production deployments – both in the halls and in some of the presentations. And Doug Cutting and Arun Murthy had an excellent panel about the future of the platform, both from a business and technology perspective.


Hadoop Security was an incredibly popular topic this year, reflecting the fact that Hadoop deployments are growing up and fulfilling their stated purpose: to consolidate organizations’ data and make it visible to everyone. “Visible to everyone” is a noble goal, but in reality PII and other sensitive information needs to be guarded, and access needs to be audited and limited. Apache Knox makes it possible to audit all user interactions with the cluster, tying user access to an existing identity management system. Cloudera Sentry provides fine-grained user permissions for Hive, HBase and Search, similar to existing RDBMSes. During the conference Cloudera also announced their acquisition of Gazzang, who make  a platform for key management and data encryption at rest in Hadoop (similar to Intel’s Project Rhino).

Booz Allen Hamilton also put on an excellent presentation about a real client system storing sensitive data on EMR using ephemeral storage – I strongly recommend looking at this as an example of what’s possible now, and also how difficult it is to implement cell or row-level security policies in Hadoop.


YARN is the new “data operating system” responsible for all computations running on your clutster. It handles container placement and resource allocation to allow multiple frameworks like MapReduce, Tez and Spark to co0exist on the same nodes without competing for resources. Applications can also be written to run directly on YARN, opening up the Hadoop cluster to support more general purpose tasks (Yahoo is apparently encoding video on YARN with low latency, although details were scarce), and making it easier for developers to provide distributed, fault-tolerant applications. Early adopters have been using YARN in production for a while, but now every major vendor is including it in their distribution, and features like the High-Availability ResourceManager (or “HARMful YARN”) are available.

Many talks from different sources (Twitter, Yahoo, HortonWorks) focused on different aspects of YARN: new features, production deployment hints, and the general architecture.


I thought with Spark becoming so popular and widely supported – in every major distribution – Spark Streaming would supplant Storm as the leading complex event processing engine. Visiting Hadoop Summit, however, it seems like Storm has plenty of momentum. It’s been ported to YARN to work seamlessly within your cluster, and multiple presentations demonstrated real-world systems running on Storm right now, as well as integrations with other technologies like R and Pig. Spark overall had nearly as many presentations, but these were more technical and theoretical: it might be another year before we see many presentations about Spark and Spark Streaming applications being deployed at scale.


Apache Falcon had two talks this summit, and it’s been incubating since last year. It caught my attention as an open-source project which is aiming to supplant existing proprietary tools. Falcon allows you to declaratively define ETL flows in terms of sources, sinks and transformations, and schedule them on a regular basis. Flows are monitored and idempotent, and late data can be handled according to user-defined rules. Right now the emphasis is on power: an XML config coordinates Hive, Pig, Oozie and distcp, but more user-friendly features like libraries of transformations and a web UI for visualizing flows will bring Falcon closer to the feature set of commerical ETL tools.

SQL on Hadoop

This space has settled down a lot since last year, when Stinger and Impala seemed to invade every track and time slot. Yahoo still put on a number of Hive-on-Tez architecture and performance reviews, and less established projects like Apache Tajo (incubating), BlinkDB, Actian’s Vortex and Facebook’s Presto made appearances. Even though performance has been increasing year over year, SQL-on-Hadoop engines are still wildly variable in their features and performance, and there aren’t any clear winners right now – new entrants still have a chance to make their mark. If you’re curious about choosing a SQL-on-Hadoop engine, check out my presentation this year surveying the landscape.

More to Watch

There were so many great presentations, it was hard to choose for every time slot. Once the videos are released I also recommend watching:

  • Ted Dunning’s stunningly simple anomaly detection
  • Jagane Sundar (of WanDisco) explaining Paxos
  • Koji Noguchi (of Yahoo) with tips about stabilising your Hadoop clusters

Were you at Hadoop Summit? What were your favourite presentations and what trends did you notice?

Categories: DBA Blogs

Log Buffer #374, A Carnival of the Vanities for DBAs

Fri, 2014-06-06 07:51

This high-spirited Log Buffer Edition covers some of the best blog posts of the previous week from Oracle, SQL Server and MySQL bloggers from across the planet.


GoldenGate 12c Trail Encryption and Credentials with Oracle Wallet.

Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution.

New Rapid Install StartCD for EBS 12.2 Now Available.

Oracle XE 11.2 for Windows 64 bit is here.

Oracle Accelerate is Oracle’s approach for providing simple to deploy, packaged, enterprise-class software solutions to growing midsize organizations through its network of expert partners.

SQL Server:

The ubiquity of databases and the potentially valuable information stored in them makes them attractive targets for people who want to steal data or harm its owner by tampering with it.

Who Truncated That Table & What Do I Do Now?

SQL Server Hardware Configuration Best Practices.

The DBA Team tackles data corruption.

T-SQL User-Defined Functions: Ten Questions You Were Too Shy To Ask.


Percona Replication Manager – Renaming Cluster Hostnames.

Performance impact of MySQL query cache on modern hardware.

How To Install Nginx With PHP5 (And PHP-FPM) And MySQL Support On CentOS 6.5.

Build a MySQL Fabric Farm in one step using AWS CloudFormation.

MariaDB 10 is a Sandbox killjoy?

Categories: DBA Blogs

A Look Back on SREcon14

Thu, 2014-06-05 07:57

The landscape of the Systems Administration industry is shifting. As the trend continues towards Infrastructure as a Service (IaaS), Software as a Service (SaaS) and other *aaS offerings, the traditional role of systems administrations is being challenged. While this traditional model will remain (well beyond its expiration date) in large enterprises who have invested heavily in their hardware, processes, and infrastructure more and more companies will be looking to maximize their investment and get the most business value out of their resources.

The industry is being called on with increasing frequency to reinvent systems administration to meet the needs and the demands of a modern business and technology stack.

The traditional way that we have been doing systems administration is on the decline as the desire to break down the old silos between operations and development to maximize business value and inter-group communication and collaboration force both sides to evolve new skills, and at the core adopt new philosophies.

One such philosophy is Site Reliability Engineering, or SRE for short.

Generally accepted to have started at Google, the SRE movement has now spread well beyond to other companies such as Dropbox, Netflix, Twitter, Facebook, LinkedIn, and others.

As my colleague Bill Lincoln will explain in an upcoming post, although this has started with internet scale organizations, SRE is a philosophy put into action that companies of all sizes can benefit from.

At its core, the prime directive of SRE is reliability of a service as a whole and this subtle, yet important paradigm shift is what is driving change within the Systems Administration and Software Development industries towards a place where both groups have a unified objective of reliability and the differences between SRE and SWE become subtle and fluid.

I have been a strong advocate for the SRE philosophy as a major emerging trend in the Systems Administration space with the Pythian Leadership and was thrilled to be able to attend the USENIX Site Reliability Engineering Conference (SRECon14) which was held on Friday, May 30, 2014 in Santa Clara California USA along with two of my colleagues from the Pythian Enterprise Infrastructure Services Group.

It was a single day, but from the first keynote delivered by Ben Treynor, Vice President, Engineering and Google Site Reliability Tsar, to the final Talk by Michael “Mikey” Dickerson on how Silicon Valley’s SREs saved, the information delivered was packed full of value, and a good amount of inspiration.

With a prime directive of “reliability” the talks delivered ran the entire lifecycle of an IT Service from Designing for Reliability, Deploying at Scale, Metrics and Monitoring for Reliability, Cascading Failure of a Service and Disaster Preparedness.

The call to action was also clear; You are absolutely within your rights to not like that it is happening, but there is no denying that change is coming. We (SysAdmins and Software Engineers) can choose to evolve, we can choose to challenge ourselves and “up our game” or we can run the very real risk of being left behind.

SRECon14 was a great success and I look forward to attending the event again. I would enthusiastically recommend it to my friends and colleagues who are in systems administration / software engineering roles.

Finally I end with a Quote Ben Treynor “Let the Packets flow, and the pager remain silent”

Categories: DBA Blogs

Expanding the Couchbase Collector for Diamond

Wed, 2014-06-04 08:24
The code

For the impatient ones, the couchbase collector can be found in github:

Couchbase Collector

Follow the instructions in the README file to get it installed under your diamond!


If you have been involved with metric collections at any point you may have heard of BrightCove’s Diamond.

Diamond is literally a little piece of diamond regarding metrics collection. With its modular architecture it is possible to collect metrics from a large collection of operating system and software components.

In addition to that, it is able to ship those metrics to a diverse range of trending software including Graphite, RRD or anything that supports StatsD.

While recently working with Couchbase, I needed to collect and ship metrics using Diamond; a github project was brought to my attention doing exactly that.

Unfortunately the author zooldk, has only one entry in the commit history listed as “Initial skeleton of collector” and the only statistic currently collected is itemCount from basicStats.

Luckily the python code is quite simple and straightforward, so I went ahead and extended it.

First let’s have an overview of the metrics you can expect to see in Graphite after installing the collector.

What did we achieve?

The target is to graph, as many of the useful Couchbase metrics as possible.

After installing the extended Couchbase Collector this is what we can expect to see in Graphite:


Here is a plot of memory used by Couchbase on my (memory starved) vm:

A bit of theory: (Data) Buckets and Couchbase cluster metrics

Couchbase is a NoSQL database using JSON for Documents. It is highly scalable and very easy to create a cluster.

For the sake of working on extending the above mentioned metrics collector, I installed the Couchbase server, community edition on two VMs.

My VMs have IP addresses: and

I mostly used the default parameters in the setup and installed both demo databases “beersample” and “gamesim-sample”.

My Couchbase user interface now looks like:

couchbase_basic_installation Metrics in Couchbase

Collecting metrics from Couchbase Buckets is as simple as executing a GET request:

For example:

$ curl -s

Now this is not very readable so let’s reformat it using Python’s JSON library. I am only pasting the output that is useful for metric collection.

$ curl -s | python -mjson.tool
    "basicStats": {
        "dataUsed": 49454080,
        "diskFetches": 0.0,
        "diskUsed": 50731634,
        "itemCount": 7303,
        "memUsed": 35406928,
        "opsPerSec": 0.0,
        "quotaPercentUsed": 33.76667785644531
    "name": "beer-sample",
    "nodes": [
            "clusterCompatibility": 131072,
            "clusterMembership": "active",
            "couchApiBase": "",
            "hostname": "",
            "interestingStats": {
                "cmd_get": 0.0,
                "couch_docs_actual_disk_size": 138325417,
                "couch_docs_data_size": 137479323,
                "couch_views_actual_disk_size": 637700,
                "couch_views_data_size": 616830,
                "curr_items": 7888,
                "curr_items_tot": 7889,
                "ep_bg_fetched": 0.0,
                "get_hits": 0.0,
                "mem_used": 99496472,
                "ops": 0.0,
                "vb_replica_curr_items": 1
            "mcdMemoryAllocated": 1501,
            "mcdMemoryReserved": 1501,
            "memoryFree": 932651008,
            "memoryTotal": 1968685056,
            "os": "x86_64-unknown-linux-gnu",
            "otpNode": "ns_1@",
            "ports": {
                "direct": 11210,
                "proxy": 11211
            "replication": 0.0,
            "status": "healthy",
            "systemStats": {
                "cpu_utilization_rate": 18.0,
                "mem_free": 932651008,
                "mem_total": 1968685056,
                "swap_total": 855629824,
                "swap_used": 112218112
            "thisNode": true,
            "uptime": "353144",
            "version": "2.2.0-837-rel-community"
    "quota": {
        "ram": 104857600,
        "rawRAM": 104857600

So what are interesting statistics to collect? The array basicStats sounds like a good candidate as it contains keys like:

'diskUsed', 'memUsed', 'diskFetches', 'quotaPercentUsed', 'opsPerSec', 'dataUsed', 'itemCount'

All of those sound great values to graph, so we will keep/collect them.

Then there is the quota object, showing ram which is useful to graph as well, so we keep this too.

Finally there is nodes which is an array. This object is an array because it includes statistics for each node forming the cluster. If the bucket does not occupy more than one nodes, there will be a single entry in this array.

In my setup, the gamesim-sample Bucket spans across two virtual machines, hence ‘nodes’ contains two items in its array corresponding to each vm.

Following I am showing side-by-side the keys used for each of nodes array members (note that this is for the gamesim-sample bucket):

nodes[0]                nodes[1]
====================    ====================
clusterCompatibility    clusterCompatibility 
clusterMembership       clusterMembership    
couchApiBase            couchApiBase         
hostname                hostname             
interestingStats        interestingStats     
mcdMemoryAllocated      mcdMemoryAllocated   
mcdMemoryReserved       mcdMemoryReserved    
memoryFree              memoryFree           
memoryTotal             memoryTotal          
os                      os                   
otpNode                 otpNode              
ports                   ports                
replication             replication          
status                  status               
systemStats             systemStats          
uptime                  uptime               
version                 version

thisNode is a boolean that helps us understand which array member corresponds to the machine we are querying.

In this case I got those stats from:


To determine exactly which stats refer to which node, the couchApiBase key can be used for more detail:



This further confirms that nodes[0] refers to my second vm ( and nodes[1] to the first vm.

Installing/Configuring the Couchbase collector on Diamond

Get the Couchbase Collector and copy it under:


Edit the python file and enter your IP/port/name of databag/username/password; mine looks like so:

class CouchBaseCollector(diamond.collector.Collector):

def get_default_config(self):
config = super(CouchBaseCollector, self).get_default_config()
‘host’: ‘localhost’,
‘port’: 8091,
‘path’: ‘beer-sample’,
‘username’: ‘Administrator’,
‘password’: ‘obfuscated’
return config

You will also need to create a config file under:


With the contents:

$ cat CouchBaseCollector.conf
enabled = True

Cluster Metrics

The collector has the intelligence to present only the nodes statistics that are applicable for the node it polls.
For clustered couchbase environments, every node will be running a diamond collector of it’s own.

This is how Graphite presents the two nodes of the cluster, corresponding to my two vm’s:

Categories: DBA Blogs