Pythian Group

Official Pythian Blog - Love Your Data

Availability Group – Delay with Listener Connectivity After Failover

Wed, 2014-06-11 07:45

The beauty of working for multiple clients from different industries is that you get exposed to a myriad of environment setups and configurations. Every company has its own standards for network and server configuration, as well as different hardware vendors. Each combination introduces its own kinks and excitement to your everyday work – half of which you would likely never encounter working in-house on the same hardware.

The past week we encountered a rare and interesting issue with an Availability Group. The issue was two-fold: first, it was initially not failing over automatically on one node, though that was not as exciting as the second part. When it was able to fail over correctly, our client experienced delays in the availability of the Listener name outside its own subnet after failover, automatic or otherwise. The Listener was reachable within its own subnet, but took more than thirty minutes to become reachable outside of it, even though the failover happened smoothly and without error.

The first part was fairly straightforward. Checking the cluster logs and event logs, the automatic failover was throwing the error below when trying to fail over to one of the nodes.

Cluster network name resource 'Listener_DNS_NAME' failed registration of one or more associated DNS name(s) for the following reason:
DNS operation refused.

Ensure that the network adapters associated with dependent IP address resources are configured with at least one accessible DNS server.

The error is as it says: the Computer Object does not have the appropriate permissions in the domain to register the DNS name resource for the Listener. For the cluster to perform this operation smoothly, "Authenticated Users" should have Read and Write All permissions on the Computer Objects for the cluster, its nodes, and the Listener DNS name. To grant these permissions, log in to the Active Directory server:

  1. Open Active Directory Users and Computers.
  2. On the View menu, select Advanced Features.
  3. Right-click the object  and then click Properties.
  4. On the Security tab, click Advanced to view all of the permission entries that exist for the object.
  5. Verify that Authenticated Users is in the list and has Read and Write All permissions. Add any missing permissions, then save the changes.

After doing that and testing the failover, we encountered a different, Kerberos-related error, shown below.

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server ComputerName$. The target name used was HTTP/ This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (Domain.COM) is different from the client domain (Doamin.COM), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Ah, the often-overlooked SPN. Setting the SPN should be part of your installation process. To keep the story short, and so we can get to the exciting part, you can refer here for detailed instructions on how to configure the SPN for SQL Server. Aside from registering the SPN for each of the nodes as specified in the MSDN link, you'll also need to register the SPN for the Listener (here assuming 1433 is the port used by your SQL Server):

setspn -A MSSQLSvc/ DOMAIN/SQLServiceAccount

This enables Kerberos for client connections to the Availability Group Listener and addresses the errors we received above. After configuring the SPNs for the servers, automatic failover was running smoothly. Or so we thought.
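The command above lost the listener name and port in formatting. As a fuller sketch, with a hypothetical listener FQDN, port, domain, and service account substituted in, the registration would look something like this:

```shell
# Hypothetical names for illustration only; substitute your own listener FQDN,
# SQL Server port, and the domain service account running SQL Server.
LISTENER_FQDN="aglistener.example.com"
SQL_PORT=1433
SVC_ACCOUNT="EXAMPLE\\svc_sqlserver"

# On a domain-joined machine with AD tools you would run the following
# (printed here rather than executed):
echo "setspn -A MSSQLSvc/${LISTENER_FQDN}:${SQL_PORT} ${SVC_ACCOUNT}"
```

The same form, with each node's FQDN in place of the listener's, covers the per-node registrations mentioned above.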

The client came back to us reporting that it was taking some time for the application to connect to the Listener name. Checking the cluster logs and SQL Server's end, everything appeared to be in order: no errors were being thrown and the server was accessible. Now it gets interesting. A ping test within the database subnet was successful, but a ping test from outside of it timed out, though it would eventually connect after a substantial amount of time. After a few more tests we confirmed the behavior was the same on both nodes: it took more than thirty minutes for the name to become reachable outside the database subnet.

After involving the network admin, we found that a MAC address conflict was occurring. That was our "Aha!" moment. Windows Server 2003 and later issue Gratuitous ARP (GARP) requests during failover, and some switches/devices do not forward Gratuitous ARP by default. This leaves the devices on the other side of the switch without the correct MAC address associated with the name, which causes the failure. The problem often corrects itself once the router detects the failures, performs a broadcast, and obtains the correct value; that is why the name becomes accessible after some time. This KB details the issue. To address it, the configuration of the switches must be changed; you'll need to check with your hardware vendor for this.

Now, that could have been the end of our problems, but after enabling the switches to forward GARP, we found that the server itself was not sending GARP requests. This is a server configuration issue and requires a registry change. Open the Registry on the server and locate the key below:


From there, check whether there is a value named ArpRetryCount; if there is, make sure it is not set to 0 (the value can be between 0 and 3). After changing this and restarting the servers, everything worked perfectly.
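The key itself was lost in formatting above; ArpRetryCount conventionally lives under the standard TCP/IP parameters key. A registry fragment with the default value of 3, shown for illustration (verify against your own servers before importing anything):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"ArpRetryCount"=dword:00000003
```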

The last two issues are a bit rare, and something I wouldn't have experienced if the client weren't using that particular hardware and that particular standard configuration.

Categories: DBA Blogs

2014 Hadoop Summit Summary

Mon, 2014-06-09 07:41


Last week I was at the 2014 Hadoop Summit in San Jose, trying to keep abreast of the ever-changing Apache landscape: what projects are up-and-coming, what projects are ready for production, and most importantly which projects can solve problems for our clients. It was also a great chance to hear about real, production deployments – both in the halls and in some of the presentations. And Doug Cutting and Arun Murthy had an excellent panel about the future of the platform, both from a business and technology perspective.


Hadoop Security was an incredibly popular topic this year, reflecting the fact that Hadoop deployments are growing up and fulfilling their stated purpose: to consolidate organizations’ data and make it visible to everyone. “Visible to everyone” is a noble goal, but in reality PII and other sensitive information needs to be guarded, and access needs to be audited and limited. Apache Knox makes it possible to audit all user interactions with the cluster, tying user access to an existing identity management system. Cloudera Sentry provides fine-grained user permissions for Hive, HBase and Search, similar to existing RDBMSes. During the conference Cloudera also announced their acquisition of Gazzang, who make a platform for key management and data encryption at rest in Hadoop (similar to Intel’s Project Rhino).

Booz Allen Hamilton also put on an excellent presentation about a real client system storing sensitive data on EMR using ephemeral storage – I strongly recommend looking at this as an example of what’s possible now, and also how difficult it is to implement cell or row-level security policies in Hadoop.


YARN is the new “data operating system” responsible for all computations running on your cluster. It handles container placement and resource allocation to allow multiple frameworks like MapReduce, Tez and Spark to coexist on the same nodes without competing for resources. Applications can also be written to run directly on YARN, opening up the Hadoop cluster to support more general-purpose tasks (Yahoo is apparently encoding video on YARN with low latency, although details were scarce), and making it easier for developers to provide distributed, fault-tolerant applications. Early adopters have been using YARN in production for a while, but now every major vendor is including it in their distribution, and features like the High-Availability ResourceManager (or “HARMful YARN”) are available.

Many talks from different sources (Twitter, Yahoo, HortonWorks) focused on different aspects of YARN: new features, production deployment hints, and the general architecture.


I thought with Spark becoming so popular and widely supported – in every major distribution – Spark Streaming would supplant Storm as the leading complex event processing engine. Visiting Hadoop Summit, however, it seems like Storm has plenty of momentum. It’s been ported to YARN to work seamlessly within your cluster, and multiple presentations demonstrated real-world systems running on Storm right now, as well as integrations with other technologies like R and Pig. Spark overall had nearly as many presentations, but these were more technical and theoretical: it might be another year before we see many presentations about Spark and Spark Streaming applications being deployed at scale.


Apache Falcon had two talks this summit, and it’s been incubating since last year. It caught my attention as an open-source project which is aiming to supplant existing proprietary tools. Falcon allows you to declaratively define ETL flows in terms of sources, sinks and transformations, and schedule them on a regular basis. Flows are monitored and idempotent, and late data can be handled according to user-defined rules. Right now the emphasis is on power: an XML config coordinates Hive, Pig, Oozie and distcp, but more user-friendly features like libraries of transformations and a web UI for visualizing flows will bring Falcon closer to the feature set of commercial ETL tools.

SQL on Hadoop

This space has settled down a lot since last year, when Stinger and Impala seemed to invade every track and time slot. Yahoo still put on a number of Hive-on-Tez architecture and performance reviews, and less established projects like Apache Tajo (incubating), BlinkDB, Actian’s Vortex and Facebook’s Presto made appearances. Even though performance has been increasing year over year, SQL-on-Hadoop engines are still wildly variable in their features and performance, and there aren’t any clear winners right now – new entrants still have a chance to make their mark. If you’re curious about choosing a SQL-on-Hadoop engine, check out my presentation this year surveying the landscape.

More to Watch

There were so many great presentations, it was hard to choose for every time slot. Once the videos are released I also recommend watching:

  • Ted Dunning’s stunningly simple anomaly detection
  • Jagane Sundar (of WanDisco) explaining Paxos
  • Koji Noguchi (of Yahoo) with tips about stabilising your Hadoop clusters

Were you at Hadoop Summit? What were your favourite presentations and what trends did you notice?

Categories: DBA Blogs

Log Buffer #374, A Carnival of the Vanities for DBAs

Fri, 2014-06-06 07:51

This high-spirited Log Buffer Edition covers some of the best blog posts of the previous week from Oracle, SQL Server and MySQL bloggers from across the planet.

Oracle:

GoldenGate 12c Trail Encryption and Credentials with Oracle Wallet.

Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution.

New Rapid Install StartCD for EBS 12.2 Now Available.

Oracle XE 11.2 for Windows 64 bit is here.

Oracle Accelerate is Oracle’s approach for providing simple to deploy, packaged, enterprise-class software solutions to growing midsize organizations through its network of expert partners.

SQL Server:

The ubiquity of databases and the potentially valuable information stored in them makes them attractive targets for people who want to steal data or harm its owner by tampering with it.

Who Truncated That Table & What Do I Do Now?

SQL Server Hardware Configuration Best Practices.

The DBA Team tackles data corruption.

T-SQL User-Defined Functions: Ten Questions You Were Too Shy To Ask.

MySQL:

Percona Replication Manager – Renaming Cluster Hostnames.

Performance impact of MySQL query cache on modern hardware.

How To Install Nginx With PHP5 (And PHP-FPM) And MySQL Support On CentOS 6.5.

Build a MySQL Fabric Farm in one step using AWS CloudFormation.

MariaDB 10 is a Sandbox killjoy?

Categories: DBA Blogs

A Look Back on SREcon14

Thu, 2014-06-05 07:57

The landscape of the Systems Administration industry is shifting. As the trend continues towards Infrastructure as a Service (IaaS), Software as a Service (SaaS) and other *aaS offerings, the traditional role of systems administration is being challenged. While this traditional model will remain (well beyond its expiration date) in large enterprises that have invested heavily in their hardware, processes, and infrastructure, more and more companies will be looking to maximize their investment and get the most business value out of their resources.

The industry is being called on with increasing frequency to reinvent systems administration to meet the needs and the demands of a modern business and technology stack.

The traditional way that we have been doing systems administration is on the decline. The desire to break down the old silos between operations and development, to maximize business value, inter-group communication, and collaboration, is forcing both sides to evolve new skills and, at the core, adopt new philosophies.

One such philosophy is Site Reliability Engineering, or SRE for short.

Generally accepted to have started at Google, the SRE movement has now spread well beyond to other companies such as Dropbox, Netflix, Twitter, Facebook, LinkedIn, and others.

As my colleague Bill Lincoln will explain in an upcoming post, although this has started with internet scale organizations, SRE is a philosophy put into action that companies of all sizes can benefit from.

At its core, the prime directive of SRE is the reliability of a service as a whole. This subtle yet important paradigm shift is driving change within the Systems Administration and Software Development industries, toward a place where both groups share a unified objective of reliability, and the differences between SRE and SWE become subtle and fluid.

I have been a strong advocate, with the Pythian leadership, for the SRE philosophy as a major emerging trend in the Systems Administration space, and was thrilled to attend the USENIX Site Reliability Engineering Conference (SRECon14), held on Friday, May 30, 2014 in Santa Clara, California, along with two of my colleagues from the Pythian Enterprise Infrastructure Services Group.

It was a single day, but from the first keynote, delivered by Ben Treynor, Vice President, Engineering and Google Site Reliability Tsar, to the final talk by Michael “Mikey” Dickerson on how Silicon Valley’s SREs saved HealthCare.gov, the information delivered was packed full of value and a good amount of inspiration.

With a prime directive of “reliability”, the talks covered the entire lifecycle of an IT service: designing for reliability, deploying at scale, metrics and monitoring for reliability, cascading failure of a service, and disaster preparedness.

The call to action was also clear: you are absolutely within your rights to not like that it is happening, but there is no denying that change is coming. We (sysadmins and software engineers) can choose to evolve, we can choose to challenge ourselves and “up our game”, or we can run the very real risk of being left behind.

SRECon14 was a great success and I look forward to attending the event again. I would enthusiastically recommend it to my friends and colleagues who are in systems administration / software engineering roles.

Finally, I end with a quote from Ben Treynor: “Let the packets flow, and the pager remain silent.”

Categories: DBA Blogs

Expanding the Couchbase Collector for Diamond

Wed, 2014-06-04 08:24
The code

For the impatient, the Couchbase collector can be found on GitHub:

Couchbase Collector

Follow the instructions in the README file to get it installed under your Diamond!


If you have been involved with metrics collection at any point, you may have heard of Brightcove’s Diamond.

Diamond is a little gem when it comes to metrics collection. Its modular architecture makes it possible to collect metrics from a wide range of operating system and software components.

In addition, it is able to ship those metrics to a diverse range of trending software, including Graphite, RRD, or anything that supports StatsD.

While recently working with Couchbase, I needed to collect and ship metrics using Diamond; a GitHub project doing exactly that was brought to my attention.

Unfortunately, the author, zooldk, has only one entry in the commit history, listed as “Initial skeleton of collector”, and the only statistic currently collected is itemCount from basicStats.

Luckily, the Python code is quite simple and straightforward, so I went ahead and extended it.

First let’s have an overview of the metrics you can expect to see in Graphite after installing the collector.

What did we achieve?

The target is to graph as many of the useful Couchbase metrics as possible.

After installing the extended Couchbase Collector this is what we can expect to see in Graphite:


Here is a plot of memory used by Couchbase on my (memory-starved) VM:

A bit of theory: (Data) Buckets and Couchbase cluster metrics

Couchbase is a NoSQL database that uses JSON for documents. It is highly scalable, and creating a cluster is very easy.

For the sake of extending the above-mentioned metrics collector, I installed Couchbase Server, Community Edition, on two VMs.

My VMs have IP addresses: and

I mostly used the default parameters in the setup, and installed both demo buckets, “beer-sample” and “gamesim-sample”.

My Couchbase user interface now looks like:

Metrics in Couchbase

Collecting metrics from Couchbase Buckets is as simple as executing a GET request:

For example:

$ curl -s
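The URL itself was lost above. The bucket-stats endpoint generally follows the pattern below; the host and bucket name here are placeholders:

```shell
# Hypothetical host and bucket name; Couchbase serves its REST API on port 8091.
CB_HOST="127.0.0.1"
BUCKET="beer-sample"
URL="http://${CB_HOST}:8091/pools/default/buckets/${BUCKET}"
echo "$URL"
# Then fetch with: curl -s "$URL"
```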

Now, this is not very readable, so let’s reformat it using Python’s JSON library. I am only pasting the output that is useful for metric collection.

$ curl -s | python -mjson.tool
{
    "basicStats": {
        "dataUsed": 49454080,
        "diskFetches": 0.0,
        "diskUsed": 50731634,
        "itemCount": 7303,
        "memUsed": 35406928,
        "opsPerSec": 0.0,
        "quotaPercentUsed": 33.76667785644531
    },
    "name": "beer-sample",
    "nodes": [
        {
            "clusterCompatibility": 131072,
            "clusterMembership": "active",
            "couchApiBase": "",
            "hostname": "",
            "interestingStats": {
                "cmd_get": 0.0,
                "couch_docs_actual_disk_size": 138325417,
                "couch_docs_data_size": 137479323,
                "couch_views_actual_disk_size": 637700,
                "couch_views_data_size": 616830,
                "curr_items": 7888,
                "curr_items_tot": 7889,
                "ep_bg_fetched": 0.0,
                "get_hits": 0.0,
                "mem_used": 99496472,
                "ops": 0.0,
                "vb_replica_curr_items": 1
            },
            "mcdMemoryAllocated": 1501,
            "mcdMemoryReserved": 1501,
            "memoryFree": 932651008,
            "memoryTotal": 1968685056,
            "os": "x86_64-unknown-linux-gnu",
            "otpNode": "ns_1@",
            "ports": {
                "direct": 11210,
                "proxy": 11211
            },
            "replication": 0.0,
            "status": "healthy",
            "systemStats": {
                "cpu_utilization_rate": 18.0,
                "mem_free": 932651008,
                "mem_total": 1968685056,
                "swap_total": 855629824,
                "swap_used": 112218112
            },
            "thisNode": true,
            "uptime": "353144",
            "version": "2.2.0-837-rel-community"
        }
    ],
    "quota": {
        "ram": 104857600,
        "rawRAM": 104857600
    }
}

So which statistics are interesting to collect? The basicStats object sounds like a good candidate, as it contains keys like:

'diskUsed', 'memUsed', 'diskFetches', 'quotaPercentUsed', 'opsPerSec', 'dataUsed', 'itemCount'

All of those sound like great values to graph, so we will keep/collect them.

Then there is the quota object, whose ram value is useful to graph as well, so we keep this too.
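The keep/collect decision above can be sketched as a small flattening helper. This is a hedged sketch, not the collector's actual code, and the sample dictionary is a trimmed, hypothetical version of the REST output:

```python
# Flatten the kept sections (basicStats and quota) into dotted metric names,
# the shape Diamond would publish to Graphite.

def flatten_stats(bucket_json):
    """Collect basicStats and quota entries under dotted metric names."""
    metrics = {}
    for key, value in bucket_json.get("basicStats", {}).items():
        metrics["basicStats.%s" % key] = value
    for key, value in bucket_json.get("quota", {}).items():
        metrics["quota.%s" % key] = value
    return metrics

# Trimmed, hypothetical sample of the bucket JSON shown earlier.
sample = {
    "basicStats": {"itemCount": 7303, "memUsed": 35406928, "opsPerSec": 0.0},
    "quota": {"ram": 104857600, "rawRAM": 104857600},
}

for name, value in sorted(flatten_stats(sample).items()):
    print(name, value)
```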

Finally, there is nodes, which is an array: it includes statistics for each node forming the cluster. If the bucket does not span more than one node, there will be a single entry in this array.

In my setup, the gamesim-sample bucket spans two virtual machines, hence nodes contains two items, one for each VM.

Below I am showing, side by side, the keys used for each of the nodes array members (note that this is for the gamesim-sample bucket):

nodes[0]                nodes[1]
====================    ====================
clusterCompatibility    clusterCompatibility 
clusterMembership       clusterMembership    
couchApiBase            couchApiBase         
hostname                hostname             
interestingStats        interestingStats     
mcdMemoryAllocated      mcdMemoryAllocated   
mcdMemoryReserved       mcdMemoryReserved    
memoryFree              memoryFree           
memoryTotal             memoryTotal          
os                      os                   
otpNode                 otpNode              
ports                   ports                
replication             replication          
status                  status               
systemStats             systemStats          
uptime                  uptime               
version                 version

thisNode is a boolean that helps us understand which array member corresponds to the machine we are querying.
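A collector can use this flag to report only the local node's statistics. A minimal sketch (the hostnames are hypothetical placeholders):

```python
# Use the thisNode flag to pick the entry in the nodes[] array that belongs
# to the machine being polled.

def pick_local_node(nodes):
    """Return the nodes[] entry whose thisNode flag is true, or None."""
    for node in nodes:
        if node.get("thisNode"):
            return node
    return None

nodes = [
    {"hostname": "vm-a:8091", "thisNode": False},
    {"hostname": "vm-b:8091", "thisNode": True},
]
print(pick_local_node(nodes)["hostname"])  # vm-b:8091
```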

In this case I got those stats from:


To determine exactly which stats refer to which node, the couchApiBase key can be used for more detail:



This further confirms that nodes[0] refers to my second vm ( and nodes[1] to the first vm.

Installing/Configuring the Couchbase collector on Diamond

Get the Couchbase Collector and copy it under:


Edit the Python file and enter your IP, port, bucket name, username, and password; mine looks like this:

class CouchBaseCollector(diamond.collector.Collector):

    def get_default_config(self):
        config = super(CouchBaseCollector, self).get_default_config()
        config.update({
            'host': 'localhost',
            'port': 8091,
            'path': 'beer-sample',
            'username': 'Administrator',
            'password': 'obfuscated',
        })
        return config

You will also need to create a config file under:


With the contents:

$ cat CouchBaseCollector.conf
enabled = True

Cluster Metrics

The collector has the intelligence to present only the node statistics that are applicable to the node it polls.
In clustered Couchbase environments, every node will be running a Diamond collector of its own.

This is how Graphite presents the two nodes of the cluster, corresponding to my two vm’s:

Categories: DBA Blogs

Pythian Named 2014 Global Outsourcing 100 Rising Star

Wed, 2014-06-04 07:50

Pythian has once again been named a 2014 Global Outsourcing 100® Rising Star by the International Association of Outsourcing Professionals® (IAOP®). The list is an annual ranking of the world’s best outsourcing service providers and highlights on-the-rise companies in the outsourcing space.

“As companies continue to accumulate more data than ever before, IT teams struggle to manage all of that information in-house,” said Rob Hamel, Vice President of Service Delivery at Pythian. “More businesses are rethinking their internal data infrastructure management processes and rightsourcing to find flexible, cost-effective solutions to optimize performance, reduce IT burdens, and improve productivity. Pythian brings together top-tier talent, proven processes, and a customer-centric business model to help our clients preserve the wellness of their valuable data infrastructures.”

See the full press release here.

Categories: DBA Blogs

Log Buffer #373, A Carnival of the Vanities for DBAs

Fri, 2014-05-30 07:22

There are many ways to express the knowledge learned, but nothing is better than a comprehensive and concise blog post. This Log Buffer Edition gathers such blog posts from across Oracle, SQL Server and MySQL.

Oracle:

It’s All About CPU But There Is NO CPU Bottleneck! What?

Understanding the Value of SOA

Java Magazine: Developer Tools and More

The Evolution of Oracle Forms Survey: Results are In!

Australian Oracle job market status update is out.

FAST Refresh of Materialized View Returns ORA-12004

SQL Server:

Learn how relationships work in Powerpivot workbooks and how you can build them using the DAX language.

Branching and Merging: Ten Pretty-Good Practices

Survey: Which new database platforms are you adopting?

Stairway to SQLCLR Level 3: Security (General and SAFE Assemblies)

Free eBook: SQL Server Execution Plans, Second Edition

Transparent Data Encryption (TDE) in SQL Server

MySQL:

MariaDB 10.0.11 Overview and Highlights.

Webinar-on-demand: Set up & operate real-time data loading into Hadoop

Using MySQL Sandbox to setup a MySQL Fabric cluster in development

How MySQL ‘queries’ and ‘questions’ are measured

From zero to full visibility of MySQL in 3 minutes with Percona Cloud Tools

There is one new feature in MySQL 5.6 that didn’t get the attention it deserved

Categories: DBA Blogs