
Pythian Group

Official Pythian Blog - Love Your Data

Offline Visualization of Azkaban Workflows

Mon, 2014-08-11 07:51

As mentioned in my past adventures, I’m often working with the workflow management tool ominously called Azkaban. Its foreboding name is not really deserved; it’s relatively straightforward to use, and offers a fairly decent workflow visualization. For that last part, though, there is a catch: to be able to visualize the workflow, you have to (quite obviously) upload the project bundle to the server. Mind you, it’s not that much of a pain, and could easily be managed by, say, a Gulp-fueled watch job. But still, it would be nice to tighten the feedback loop there, and be able to look at the graphs without having to go through the server at all.

Happily enough, all the information we need is available in the Azkaban job files themselves, and in a format that isn’t too hard to deal with. Typically, a job file will be called ‘foo.job’ and look like

type=command
command=echo "some command goes here"
dependencies=bar,baz

So what we need to do to figure out a whole workflow is to begin at its final job, and recursively walk down all its dependencies.

use 5.14.0; # the s///r modifier used below needs Perl 5.14 or later

use Path::Tiny;

sub create_workflow {
  my $job = path(shift);
  my $azkaban_dir = $job->parent;

  my %dependencies;

  my @files = ($job);

  while( my $file = shift @files ) {
    my $job = $file->basename =~ s/\.job//r;

    next if $dependencies{$job}; # already processed

    my @deps = map  { split /\s*,\s*/ }
               grep { s/^dependencies=\s*// }
                    $file->lines( { chomp => 1 } );

    $dependencies{$job} = \@deps;

    push @files, map { $azkaban_dir->child( $_.'.job' ) } @deps;
  }

  return %dependencies;
}
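For readers who don't speak Perl, the same walk can be sketched in Python. This is only an illustration of the technique, not part of the original script: the function names and the throwaway project built in `demo()` are made up for the example, and it assumes the simple `dependencies=a,b` line format shown above.

```python
import tempfile
from collections import deque
from pathlib import Path
from typing import Dict, List


def parse_dependencies(job_file: Path) -> List[str]:
    """Return the job names listed on the 'dependencies=' line, if any."""
    deps: List[str] = []
    for line in job_file.read_text().splitlines():
        if line.startswith("dependencies="):
            value = line.split("=", 1)[1]
            deps.extend(name.strip() for name in value.split(",") if name.strip())
    return deps


def create_workflow(final_job: Path) -> Dict[str, List[str]]:
    """Start at the final job and walk down through every dependency."""
    job_dir = final_job.parent
    graph: Dict[str, List[str]] = {}
    queue = deque([final_job])
    while queue:
        job_file = queue.popleft()
        name = job_file.stem
        if name in graph:  # already processed
            continue
        graph[name] = parse_dependencies(job_file)
        queue.extend(job_dir / f"{dep}.job" for dep in graph[name])
    return graph


def demo() -> Dict[str, List[str]]:
    """Build a hypothetical four-job project in a temp dir and walk it."""
    jobs = {"foo": "bar,baz", "bar": "baz, zero", "baz": "zero", "zero": ""}
    with tempfile.TemporaryDirectory() as tmp:
        for name, deps in jobs.items():
            body = 'type=command\ncommand=echo "hi"\n'
            if deps:
                body += f"dependencies={deps}\n"
            (Path(tmp) / f"{name}.job").write_text(body)
        return create_workflow(Path(tmp) / "foo.job")
```

Each key of the returned dictionary is a job, and each value is the list of jobs it depends on, which is exactly the shape needed to add one edge per dependency.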

Once we have that dependency graph, it’s just a question of drawing the little boxes and the little lines. Which, funnily enough, is a much harder job than one would expect, and better left to the pros. In this case, I decided to go with Graph::Easy, which can output both text and SVG.

use Graph::Easy;

my $graph = Graph::Easy->new;

while( my( $job, $deps ) = each %dependencies ) {
    $graph->add_edge( $_ => $job ) for @$deps;
}

print $graph->as_ascii;

And there we go. We put those two parts together in a small script, and we have a handy CLI workflow visualizer.

$ azkaban_flow.pl target/azkaban/foo.job

  +------------------------+
  |                        v
+------+     +-----+     +-----+     +-----+
| zero | --> | baz | --> | bar | --> | foo |
+------+     +-----+     +-----+     +-----+
               |                       ^
               +-----------------------+

Or, for the SVG-inclined,

$ azkaban_flow.pl -f=svg target/azkaban/foo.job

which gives us

[Screenshot: SVG rendering of the workflow graph]
Categories: DBA Blogs

12c: Fun with WITH!

Fri, 2014-08-08 11:30

Last night I couldn’t sleep, and what else are you going to do? I was thinking about Oracle stuff.

In version 12c, Oracle enhanced the WITH clause – traditionally used for sub-query factoring – to allow the declaration of functions and procedures. This can be (ab)used to create a very interesting scenario that is not very common in Oracle: reading data within the same SELECT statement, but from two different points in time. And the points in time are in the future, not in the past.

Let’s say I want to take a snapshot of the current SCN, and then another one 5 or 10 seconds after that. Traditionally we’d have to store that somewhere. What if I could take two snapshots – at different SCNs – using a single SELECT statement? Without creating any objects?

col value for a50
set lines 200 pages 99

with  
procedure t (secs in number, scn out varchar2)
  is
    pragma autonomous_transaction;
  begin
    dbms_lock.sleep(secs);
    select 'at ' || to_char(sysdate,'HH24:MI:SS') || ' SCN: ' 
                 || dbms_flashback.get_system_change_number 
      into scn 
      from dual;
  end;
function wait_for_it (secs in number) 
 return varchar2 is
    l_ret varchar2(32767);
  begin
    t(secs, l_ret);
    return l_ret;
  end;
select 1 as time, 'at ' || to_char(sysdate,'HH24:MI:SS') || ' SCN: ' 
                || dbms_flashback.get_system_change_number as value 
  from dual
union all
select 5, wait_for_it(5) from dual
union all
select 10, wait_for_it(5) from dual
/

And the result is:

      TIME VALUE
---------- --------------------------------------------------
         1 at 09:55:49 SCN: 3366336
         5 at 09:55:54 SCN: 3366338
        10 at 09:55:59 SCN: 3366339

 


We can clearly see that the SCNs are different, and that the times shown match the intervals we’ve chosen, 5 seconds apart. I think there could be some very interesting uses for this. What ideas can you folks come up with?

Categories: DBA Blogs

Log Buffer #383, A Carnival of the Vanities for DBAs

Fri, 2014-08-08 07:34

This Log Buffer Edition picks a few of the informative blog posts from the Oracle, SQL Server, and MySQL worlds.


Oracle:

g1gc logs – Ergonomics -how to print and how to understand

In Solaris 11.2, svcs gained a new option, “-L”.  The -L option allows a user to easily look at the most recent log events for a service.

ADF Thematic Map component from DVT library was updated in ADF 12c with marker zoom option and area layer styling

When cloning pluggable databases, Oracle also gives you the SNAPSHOT COPY clause to utilize storage system snapshot capabilities and save on storage space.

It is normal for bloggers including myself to post about the great things they have done.

SQL Server:

In six years Microsoft has come from almost zero corporate knowledge about how cloud computing works to it being an integral part of their strategy.

A brief overview of Columnstore index and its usage with an example.

The Road To Hell – new article from the DBA Team

Encryption brings data into a state which cannot be interpreted by anyone who does not have access to the decryption key, password, or certificates.

How to test what a SQL Server application would do in the past or in the future with date and time differences.

MySQL:

MySQL for Visual Studio 1.2.3 GA has been released

An approach to MySQL dynamic cross-reference query.

The MySQL replication and load balancing plugin for PHP, PECL/mysqlnd_ms, aims to make using a cluster of MySQL servers instead of a single server as transparent as possible.

Picking the Right Clustering for MySQL: Cloud-only Services or Flexible Tungsten Clusters? New webinar-on-demand.

Collation options for new MySQL schemas and tables created in MySQL for Excel

Categories: DBA Blogs

Alter Session Kill on Steroids

Wed, 2014-08-06 10:27

Perhaps you have encountered something like this: A session that is consuming too many resources needs to be killed. You locate the session and use ALTER SYSTEM KILL SESSION ‘SID,SERIAL#’ to kill the session. As you continue to monitor the database you find that the status of the session in v$session is ‘KILLED’, but the session does not go away. You also notice that the SERIAL# is continually changing.

Now you find there is no OS process associated with the session, but the session continues as PMON is unable to finish cleanup for the session. Usually when this happens, the session will be holding a lock. When that happens, the only method to release the lock is to bounce the database. There are some bugs that may be responsible for this problem, such as this one described by Oracle Support:

Pmon Spins While Cleaning Dead Process (Doc ID 1130713.1)

This particular bug affects Oracle 10.2.0.1 – 11.1.0.7. I have personally seen this same behavior happen on many versions of the database from 7.0 on. To avoid these hanging sessions, many DBAs have adopted the habit of first killing the OS process with an OS utility and then, if the session is still visible in v$session, issuing the ALTER SYSTEM KILL command.

The OS command used on Linux/Unix is usually ‘kill -9’. On Windows it is OraKill. This method usually avoids the problems encountered when killing a session that is holding a lock and processing DML.

I don’t know just what circumstances trigger this behavior, as I have never been able to reproduce it at will. When it does happen though, it is more than annoying as the only way to clear locks held by the recalcitrant session is to bounce the database.

Quite some time ago (at least as far back as Oracle 8i) Oracle introduced the new IMMEDIATE keyword to use with ALTER SYSTEM KILL SESSION. Using this keyword removes the need to use an OS command to kill a session – Oracle will do it for you! To test this I am using Oracle 10.2.0.4 on Oracle Linux 5.5. I have previously run these same tests in 11.2.0.3 with the same results. Had I access to an 8i or 9i database I would have run the tests there. To start with let’s see what happens when a session is killed without the immediate keyword.

Login to the session to be killed:

$ sqlplus scott/tiger@10gr2

Login as SYSDBA from another terminal and check for scott’s session:

SQL> l
  1  select
  2     s.username,
  3     s.sid,
  4     s.serial#,
  5     p.spid spid
  6  from v$session s, v$process p
  7  where s.username = 'SCOTT'
  8*    and p.addr = s.paddr
SQL> /

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 133         35 22870

1 row selected.

All that has happened at this point is that Oracle has made an internal call that has disconnected Scott’s session. (Tracing that operation is a different topic.) The process on the server has not been terminated. This can be seen by the following experiment:

Logon again as Scott.

In a SYSDBA session, check for Scott’s session:

 SQL> @scott

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 146         81 23678

Now check for the shadow process associated with scott’s session on the server:


[root@ora10gR2 tmp]# ps -fp 23678
UID        PID  PPID  C STIME TTY          TIME CMD
oracle   23678     1  0 16:56 ?        00:00:00 oraclejs01 (LOCAL=NO)

Kill the session and check the status:

SQL> alter system kill session '146,81';

SQL> l
  1  select
  2     s.username,
  3     s.sid,
  4     s.serial#,
  5     p.spid spid
  6  from v$session s, v$process p
  7  where s.username = 'SCOTT'
  8*    and p.addr = s.paddr
SQL>/

no rows selected

Check again on the server for the process:

[root@ora10gR2 tmp]# ps -fp 23678
UID        PID  PPID  C STIME TTY          TIME CMD
oracle   23678     1  0 16:56 ?        00:00:00 oraclejs01 (LOCAL=NO)

Interesting, isn’t it? We know the process is still alive on the server, but the session information is no longer associated with the process. This happens because Oracle has disconnected the session, which allows the process to continue until the sqlplus session is terminated. The session information is still available in v$session, but is no longer associated with a server process:

select
  2     s.username,
  3     s.status,
  4     s.sid,
  5     s.serial#
  6  from v$session s
  7* where s.username = 'SCOTT'
SQL>/

USERNAME                       STATUS          SID    SERIAL#
------------------------------ -------- ---------- ----------
SCOTT                          KILLED          146         81

1 row selected.

 1* select pid,spid from v$process where pid = 146
SQL>/

no rows selected

When exiting the Scott session, I can see that the session was killed:

SQL> exit
ERROR:
ORA-00028: your session has been killed

Let’s perform the experiment again, but this time use the IMMEDIATE keyword.

Logon as scott:

> sqlplus scott/tiger@10gr2

SQL*Plus: Release 11.2.0.3.0 Production on Tue Aug 5 17:18:53 2014

Logon as SYSDBA and check for the scott session:

SQL> @scott

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 146         83 23939

1 row selected.

Before killing scott’s session:

  • get my OS PID
  • enable 10046 trace

The OS PID will be used for strace on the SYSDBA session shadow process on the server.
The 10046 trace is so we can see what is happening in the strace output.

SQL> l
1 select
2 s.username,
3 s.sid,
4 s.serial#,
5 p.spid spid
6 from v$session s, v$process p
7 where s.username is not null
8 and p.addr = s.paddr
9 and userenv('SESSIONID') = s.audsid
10* order by username, sid
SQL>/

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SYS                                   145         65 23947

1 row selected.

SQL> alter session set events '10046 trace name context forever, level 12';

Session altered.

Now ssh to the database server, start strace on the SYSDBA shadow process, and check for Scott’s shadow process:

[root@ora10gR2 tmp]# strace -o 23947.strace -p 23947
^Z
[1]+ Stopped strace -o 23947.strace -p 23947
[root@ora10gR2 tmp]# bg
[1]+ strace -o 23947.strace -p 23947 &

[root@ora10gR2 tmp]# ps -p 23939
PID TTY TIME CMD
23939 ? 00:00:00 oracle

Now kill Scott’s session and exit the SYSDBA session:

SQL> alter system kill session '146,83' immediate;

System altered.

The strace command will now have exited on the server.

First check again for Scott’s session:

[root@ora10gR2 tmp]# ps -p 23939
PID TTY TIME CMD
[root@ora10gR2 tmp]#

So the Scott shadow process has terminated.

As the 10046 trace was enabled, the output to the oracle trace file will appear in the strace file, which allows searching for ‘alter system kill’ in the strace file.

From the strace file:

write(5, "alter system kill session '146,8"..., 44) = 44

Now searching for the PID of scott’s session 23939:

read(10, "23939 (oracle) S 1 23939 23939 0"..., 999) = 228
close(10) = 0
open("/proc/23939/stat", O_RDONLY) = 10
read(10, "23939 (oracle) S 1 23939 23939 0"..., 999) = 228
close(10) = 0
kill(23939, SIGKILL) = 0
kill(23939, SIGCONT) = 0
open("/proc/23939/stat", O_RDONLY) = 10
read(10, "23939 (oracle) Z 1 23939 23939 0"..., 999) = 178
close(10) = 0

From the previous text I can see that Oracle opened the status file for PID 23939.
Why it did so twice I am not sure.

What happens after that is the interesting part.

kill(23939, SIGKILL) = 0

That line means that the SIGKILL signal was successfully sent to Scott’s shadow process.

What does that mean? Run kill -l to get a list of signals:

kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

Notice that signal 9 (kill -9) is SIGKILL.

So when killing a session with ALTER SYSTEM KILL SESSION ‘SID,SERIAL#’ IMMEDIATE, Oracle is actually doing the kill -9 for you, and has been for many years now.
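The effect of signal 9 is easy to reproduce outside of Oracle. The following is a minimal Python sketch, assuming a Linux/Unix host; the helper name is made up for illustration, and the target is a harmless sleeping child process rather than a database shadow process.

```python
import signal
import subprocess
import sys


def kill_child_with_sigkill() -> int:
    """Start a sleeping child, send it SIGKILL, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)  # same signal as 'kill -9 <pid>'
    child.wait(timeout=10)
    # On POSIX, a negative return code -N means "terminated by signal N".
    return child.returncode


if __name__ == "__main__":
    print("SIGKILL is signal number", int(signal.SIGKILL))
    print("child return code:", kill_child_with_sigkill())
```

The child never gets a chance to clean up, which is the same reason the shadow process shows up as a zombie (state Z) in the /proc stat line right after the kill() call in the strace output.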

Though not shown here, this same test was also run with the session killed without the IMMEDIATE keyword, and there were no attempts to kill the process. This could also be inferred from the fact that the process was still running on the server up until the time the Scott sqlplus session was exited.
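One way to check an strace file for kill attempts is to filter it for kill() calls. The snippet below is a rough sketch that assumes the line format shown in the excerpt above; the function name and the sample list are illustrative only.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as: kill(23939, SIGKILL) = 0
KILL_CALL = re.compile(r"^kill\((\d+),\s*(SIG[A-Z0-9+\-]+)\)\s*=\s*(-?\d+)")


def find_kill_calls(strace_lines: Iterable[str]) -> List[Tuple[int, str, int]]:
    """Return (pid, signal name, return value) for every kill() call found."""
    calls = []
    for line in strace_lines:
        match = KILL_CALL.match(line.strip())
        if match:
            calls.append((int(match.group(1)), match.group(2), int(match.group(3))))
    return calls


# Sample lines copied from the strace excerpt in this post.
SAMPLE = [
    'open("/proc/23939/stat", O_RDONLY) = 10',
    'read(10, "23939 (oracle) S 1 23939 23939 0"..., 999) = 228',
    "close(10) = 0",
    "kill(23939, SIGKILL) = 0",
    "kill(23939, SIGCONT) = 0",
]

if __name__ == "__main__":
    for pid, sig, rc in find_kill_calls(SAMPLE):
        print(pid, sig, rc)
```

An empty result for a trace taken during a plain ALTER SYSTEM KILL SESSION would be consistent with the observation that no kill was attempted.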

Categories: DBA Blogs

Some Observations on Puppetrun with Foreman

Tue, 2014-08-05 11:50

After joining Pythian I was introduced to several configuration management systems, and Puppet was one of them. Foreman is a system management tool which can be integrated with Puppet to manage Puppet modules and to initiate Puppet runs on hosts from a web interface. This is very useful if you want to configure a large number of systems.

Puppet kick, which was previously used to initiate Puppet runs from Foreman, is now deprecated.

To initiate Puppet runs from the Foreman interface, I used mcollective. Mcollective can be used to execute parallel jobs on remote systems. There are 3 main components:

    Client – Connects to the mcollective server and sends commands.
    Server – Runs on all managed systems and executes commands.
    Middleware – A message broker such as activemq.

I used mcollective puppet module from Puppetlabs for my setup.

# puppet module install puppetlabs-mcollective

My setup includes the middleware (activemq) and the mcollective client on the Puppet server, and mcollective servers on all managed systems.

After the implementation, I found that Puppet runs from the Foreman web interface were failing for some servers.

I found the following in /var/log/foreman-proxy/proxy.log:

D, [2014-04-18T07:20:54.392392 #4256] DEBUG -- : about to execute: /usr/bin/sudo /usr/bin/mco puppet runonce -I server.pythian.com
W, [2014-04-18T07:20:56.167627 #4256] WARN -- : Non-null exit code when executing '/usr/bin/sudo /usr/bin/mco puppet runonce -I reg-app-02.prod.tprweb.net'
E, [2014-04-18T07:20:56.175034 #4256] ERROR -- : Failed puppet run: Check Log files

You can see that the mco command is trying to execute a Puppet run on server.pythian.com and failing. The mco command uses several subcommands called ‘applications’ to interact with all systems, and ‘puppet’ is one of them.

When running the command on the command line, I received the following:

# mco puppet runonce -I server.pythian.com

 | [ > ] 0 / 1
warn 2014/04/11 08:05:34: client.rb:218:in `start_receiver' Could not receive all responses. Expected : 1. Received : 0

Finished processing 0 / 1 hosts in 22012.79 ms

No response from:

server.pythian.com

I was able to ping the server.

When I ran ‘mco ping’, I found that the server with the issue was identified by its short hostname and the others by their FQDNs.

$ mco ping
server time=89.95 ms
server3.pythian.com time=95.26 ms
server2.pythian.com time=96.16 ms

So mcollective is exporting a short hostname when foreman is expecting an FQDN (Fully Qualified Domain Name) from this server.

Foreman takes the node name from the Puppet certificate name, and that name is used for filtering when sending mco commands.

Mcollective exports identity differently. From http://docs.puppetlabs.com/mcollective/configure/server.html#facts-identity-and-classes,

identity
The node’s name or identity. This should be unique for each node, but does not need to be.
Default: The value of Ruby’s Socket.gethostname method, which is usually the server’s FQDN.
Sample value: web01.example.com
Allowed values: Any string containing only alphanumeric characters, hyphens, and dots — i.e. matching the regular expression /\A[\w\.\-]+\Z/
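The allowed-values rule quoted above can be checked with a few lines of Python. This is only a sketch: the function name is made up, and the Python pattern approximates the Ruby expression from the documentation (Ruby’s \Z also tolerates a trailing newline, which this version rejects).

```python
import re

# Approximation of the documented Ruby pattern /\A[\w\.\-]+\Z/.
# re.ASCII keeps \w limited to [A-Za-z0-9_], as in Ruby.
IDENTITY_PATTERN = re.compile(r"[\w.\-]+", re.ASCII)


def is_valid_identity(identity: str) -> bool:
    """Return True if the string uses only the characters allowed for an identity."""
    return IDENTITY_PATTERN.fullmatch(identity) is not None


if __name__ == "__main__":
    for candidate in ("web01.example.com", "server", "web 01"):
        print(candidate, is_valid_identity(candidate))
```

Note that both a short hostname and an FQDN pass this check, so a mismatch between the two is not something mcollective itself would flag.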

I passed the FQDN as the identity on the servers using the mcollective module, which resulted in the following setting:

# cat /etc/mcollective/server.cfg |grep identity
identity = server.pythian.com

This allowed the command to run successfully and got ‘Puppet Run’ from Foreman working.

# mco puppet runonce -I server.pythian.com

 * [ ============================================================> ] 1 / 1

Now ‘mco ping’ looks good as well.

$ mco ping
server.pythian.com time=91.34 ms
server3.pythian.com time=91.23 ms
server2.pythian.com time=82.16 ms

Now let us check why this was happening.

The mcollective identity is taken from the Ruby function Socket.gethostname.

From the Ruby source code you can see that Socket.gethostname gets its value from gethostname().

./ext/socket/socket.c

#ifdef HAVE_GETHOSTNAME
/*
 * call-seq:
 *   Socket.gethostname => hostname
 *
 * Returns the hostname.
 *
 *   p Socket.gethostname #=> "hal"
 *
 * Note that it is not guaranteed to be able to convert to IP address using gethostbyname, getaddrinfo, etc.
 * If you need local IP address, use Socket.ip_address_list.
 */
static VALUE
sock_gethostname(VALUE obj)
{
#if defined(NI_MAXHOST)
#  define RUBY_MAX_HOST_NAME_LEN NI_MAXHOST
#elif defined(HOST_NAME_MAX)
#  define RUBY_MAX_HOST_NAME_LEN HOST_NAME_MAX
#else
#  define RUBY_MAX_HOST_NAME_LEN 1024
#endif

    char buf[RUBY_MAX_HOST_NAME_LEN+1];

    rb_secure(3);
    if (gethostname(buf, (int)sizeof buf - 1) < 0)
        rb_sys_fail("gethostname(3)");

    buf[sizeof buf - 1] = '\0';
    return rb_str_new2(buf);
}

gethostname is a glibc function which calls the uname system call and copies the value from the returned nodename.

So while Foreman uses the FQDN value which it collects from the Puppet certificate name, mcollective exports the hostname returned by gethostname().

Now let us see how gethostname() gives different values in different systems.

When passing the complete FQDN in the HOSTNAME parameter in /etc/sysconfig/network, we can see that Socket.gethostname returns the FQDN.

[root@centos ~]# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=centos.pythian.com

[root@centos ~]# hostname -v
gethostname()=`centos.pythian.com'
centos.pythian.com

[root@centos ~]# irb
1.9.3-p484 :001 > require 'socket'
=> true
1.9.3-p484 :002 > Socket.gethostname
=> "centos.pythian.com"
1.9.3-p484 :003 >

The system which was having the problem had the following configuration.

[root@centos ~]# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=server

[root@centos ~]# hostname -v
gethostname()=`server'
server

[root@centos ~]# irb
1.9.3-p484 :001 > require 'socket'
=> true
1.9.3-p484 :002 > Socket.gethostname
=> "server"
1.9.3-p484 :003 >

Here Ruby returns only the short hostname for Socket.gethostname. But the system had the following entry in /etc/hosts.

192.168.122.249 server.pythian.com server

This allowed the system to resolve the FQDN.

[root@centos ~]# hostname -f -v
gethostname()=`server'
Resolving `server' ...
Result: h_name=`server.pythian.com'
Result: h_aliases=`server'
Result: h_addr_list=`192.168.122.249'
server.pythian.com

From ‘man hostname’:

The FQDN of the system is the name that the resolver(3) returns for the host name.

Technically: The FQDN is the name gethostbyname(2) returns for the host name returned by gethostname(2). The DNS domain name is the part after the first dot.

As the resolver is able to resolve the hostname from /etc/hosts, Puppet is able to pick up the FQDN value for its certificate, which is later used by Foreman.
But mcollective exports the short hostname returned by gethostname().
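The same difference between the raw hostname and the resolved FQDN can be seen from any language that wraps these calls. Below is a small Python sketch using the standard socket module (not Ruby, and not part of mcollective); the function names are made up, and the exact string comparison is a simplifying assumption about how the identity filter behaves.

```python
import socket
from typing import Dict


def hostname_report() -> Dict[str, object]:
    """Compare the kernel hostname with the name the resolver returns for it."""
    short = socket.gethostname()  # wraps gethostname(), like Ruby's Socket.gethostname
    fqdn = socket.getfqdn()       # asks the resolver (/etc/hosts, DNS) for the full name
    return {"gethostname": short, "getfqdn": fqdn, "match": short == fqdn}


def identity_matches_certname(identity: str, certname: str) -> bool:
    """Simplified check: a filter on the certname only finds an identical identity."""
    return identity == certname


if __name__ == "__main__":
    print(hostname_report())
    print(identity_matches_certname("server", "server.pythian.com"))
```

On a host configured like the problem server, the two names in the report would differ, which is the mismatch Foreman ran into.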

To fix the issue on Red Hat based Linux distributions, we can try any of the following:

* Pass an FQDN in /etc/sysconfig/network as shown below.

# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=server.pythian.com

OR

* Use a short hostname as HOSTNAME, but make sure that it does not resolve to an FQDN in /etc/hosts or DNS (not really suggested).

OR

* Pass a short hostname or an FQDN as HOSTNAME, but make sure that there is an entry like the one below in /etc/hosts and that mcollective is exporting the FQDN as its identity.

192.168.122.249 server.pythian.com server
Categories: DBA Blogs

How to Configure an Azure Point-to-Site VPN – Part 1

Tue, 2014-08-05 06:24

This blog post is the first in a series of three which will demonstrate how to configure a Point-to-Site VPN step-by-step. Today’s post will teach you how to configure a virtual network and a dynamic routing gateway, and the following blog posts will demonstrate how to create the certificates, and how to configure the VPN client.

Nowadays we are opting to move parts of, or even entire systems to the cloud. In order to build a hybrid environment, we need to find a way to connect our enterprise/local network, also known as on-premises, and the cloud.

Currently, we have two options to connect Azure and On-Premises:

  1. Using a Point-to-Site VPN
  2. Using a Site-to-Site VPN

The first option, using a Point-to-Site VPN is the option I’ll be demonstrating. It is recommended when you need to connect only some servers of your network to Azure. On the other hand, the Site-to-Site VPN connects your entire on-premises network to Azure.

CONFIGURE A VIRTUAL NETWORK AND A DYNAMIC ROUTING GATEWAY

To start, connect to your Azure account (https://manage.windowsazure.com/) and click the “add” button in the bottom left corner.

    1. Now follow the options that you can see in the image, and create a custom virtual network: [screenshot]
    2. Fill in the virtual network name and the location where you want to create it.
    3. Check “Configure a Point-to-Site VPN” (the DNS server is an optional setting, used for name resolution between this virtual network and your on-premises network).
    4. Set the IP range accordingly, after verifying that this range does not overlap with your on-premises network.
    5. Click the “add gateway subnet” button and then the finish button (check mark).
    6. Now you need to wait a few minutes while the virtual network is being created.
    7. You will see a message like this when the process is done: [screenshot]
    8. At this stage, you will be able to see the network you created under the network section.
    9. Now we need to create a “Dynamic Routing Gateway”. To do this, click on the network you just created and go to the Dashboard.
    10. Click the “CREATE GATEWAY” button at the bottom of the page and confirm your intention by selecting “Yes”.
    11. It may take a few minutes. You will see the message “CREATING GATEWAY”, as shown in the image below: [screenshot]
    12. After successful creation, you will see the following: [screenshot]
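The overlap check in step 4 can be done by hand, but it is also easy to script. Here is a minimal sketch using Python’s standard ipaddress module; the function name and the address ranges in the usage example are hypothetical and should be replaced with your own.

```python
import ipaddress
from typing import Iterable, List


def overlapping_ranges(azure_cidr: str, on_prem_cidrs: Iterable[str]) -> List[str]:
    """Return the on-premises ranges that overlap the planned Azure range."""
    azure_net = ipaddress.ip_network(azure_cidr)
    return [
        cidr
        for cidr in on_prem_cidrs
        if azure_net.overlaps(ipaddress.ip_network(cidr))
    ]


if __name__ == "__main__":
    # Hypothetical ranges, for illustration only.
    clashes = overlapping_ranges("10.0.0.0/24", ["10.0.0.0/16", "192.168.1.0/24"])
    print("Overlapping on-premises ranges:", clashes)
```

An empty list means the planned virtual network range is safe to use alongside the listed on-premises networks.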

At this point, we are done with the Virtual Network creation. Now we can proceed to the certificate creation steps… Stay tuned for my next two posts.

Categories: DBA Blogs

Log Buffer #382, A Carnival of the Vanities for DBAs

Fri, 2014-08-01 07:41

Leading the way are the blogs which are acting as beacons of information guiding the way towards new vistas of innovation. This Log Buffer edition appreciates that role and presents you with a few of those blogs.

Oracle:

Is there any recommended duration after which Exalytics Server should be rebooted for optimal performance of Server?

GlassFish On the Cloud Consulting Services by C2B2

This introduction to SOA Governance series contains two videos. The first one explains SOA Governance and why we need it by using a case study. The second video introduces Oracle Enterprise Repository (OER), and how it can help with SOA Governance.

Oracle BI APPs provide two data warehouse generated fiscal calendars OOTB.

If you’re a community manager who’s publishing, monitoring, engaging, and analyzing communities on multiple social networks manually and individually, you need a hug.

SQL Server:

Spackle: Making sure you can connect to the DAC

Test-Driven Development (TDD) has a misleading name, because the objective is to design and specify that the system you are developing behaves in the ways that the customer expects, and to prove that it does so for the lifetime of the system.

Set a security standard across environments that developers can see and run, but not change.

Resilient T-SQL code is code that is designed to last, and to be safely reused by others. The goal of defensive database programming, the goal of this book, is to help you to produce resilient T-SQL code that robustly and gracefully handles cases of unintended use, and is resilient to common changes to the database environment.

One option to get notified when TempDB grows is to create a SQL Alert to fire a SQL Agent Job that will automatically send an email alerting the DBA when the Tempdb reaches a specific file size.

MySQL:

By default when using MySQL’s standard replication, all events are logged in the binary log and those binary log events are replicated to all slaves (it’s possible to filter out some schema).

Testing MySQL repository packages: how we make sure they work for you

If your project does not have something that you can adapt that quote to, odds are your testing is inadequate.

Compare and Synchronize with Updated Comparison Tools!

Beyond the FRM: ideas for a native MySQL Data Dictionary.

Categories: DBA Blogs

SQL Server and OS Error 1117, Error 9001, Error 823

Thu, 2014-07-31 08:32

Like other administrators, we DBAs lead lives that are full of adventure. At times, we encounter an issue which is very new to us, one that we have not faced in the past. Today, I will be writing about such a case. Not so long ago, at the beginning of June, I was having my morning tea when I got a page from a customer we do not normally receive pages from. While I was analyzing the error logs, I noticed several error lines like the ones below:

2014-06-07 21:03:40.57 spid6s Error: 17053, Severity: 16, State: 1.
LogWriter: Operating system error 21(The device is not ready.) encountered.
2014-06-07 21:03:40.57 spid6s Write error during log flush.
2014-06-07 21:03:40.57 spid67 Error: 9001, Severity: 21, State: 4.
The log for database 'SSCDB' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
2014-06-07 21:03:40.58 spid67 Database SSCDB was shutdown due to error 9001 in routine 'XdesRMFull::Commit'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.
2014-06-07 21:03:40.65 spid25s Error: 17053, Severity: 16, State: 1.
fcb::close-flush: Operating system error (null) encountered.
2014-06-07 21:03:40.65 spid25s Error: 17053, Severity: 16, State: 1.
fcb::close-flush: Operating system error (null) encountered.
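When an error log contains many such entries, a short script can summarize which error numbers and severities appear. The following Python sketch assumes the line layout of the excerpt above; the function names are made up for the example.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches header lines such as:
# 2014-06-07 21:03:40.57 spid67 Error: 9001, Severity: 21, State: 4.
ERROR_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2} (?P<spid>\S+)\s+"
    r"Error: (?P<error>\d+), Severity: (?P<severity>\d+), State: (?P<state>\d+)\."
)


def count_errors(lines: Iterable[str]) -> Counter:
    """Count how often each error number appears in the log."""
    counts: Counter = Counter()
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            counts[int(match.group("error"))] += 1
    return counts


def severe_errors(lines: Iterable[str], min_severity: int = 20) -> List[int]:
    """Return the error numbers whose severity is at or above the threshold."""
    found = []
    for line in lines:
        match = ERROR_LINE.match(line)
        if match and int(match.group("severity")) >= min_severity:
            found.append(int(match.group("error")))
    return found


# Header lines copied from the log excerpt in this post.
SAMPLE = [
    "2014-06-07 21:03:40.57 spid6s Error: 17053, Severity: 16, State: 1.",
    "2014-06-07 21:03:40.57 spid6s Write error during log flush.",
    "2014-06-07 21:03:40.57 spid67 Error: 9001, Severity: 21, State: 4.",
    "2014-06-07 21:03:40.65 spid25s Error: 17053, Severity: 16, State: 1.",
    "2014-06-07 21:03:40.65 spid25s Error: 17053, Severity: 16, State: 1.",
]

if __name__ == "__main__":
    print(count_errors(SAMPLE))
    print(severe_errors(SAMPLE))
```

For the excerpt above, this singles out error 9001 as the only entry with a severity of 20 or higher.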

I had never seen this kind of error in the past, so my next step was to check Google, which returned too many results. Two sites were worthwhile: the first, a Microsoft KB article, covers OS Error 1117, whereas the second, by Erin Stellato, talks about other errors such as Error 823 and Error 9001. Further, I checked the server details and found that this was exactly the issue here: the server was using a PVSCSI (Paravirtual SCSI) controller, as opposed to LSI, on the VMware host.

Resolving the issue

I had a call with the client and got his consent to restart the service. This was quick, and after it came back, I ran checkdb. “We are good!” I thought.

But wait. This was the temporary fix. Yes, you read that correctly. The issue actually lies with VMware; it’s a known issue according to a VMware KB article. To fix it, we’ll have to upgrade to vSphere 5.1, according to that article.

Please be advised that the first thing I did here was to apply the temporary fix; the root cause analysis came last, after the server was up and running fine.

photo credit: Andreas.  via photopin CC

Categories: DBA Blogs

Interesting Behavior of MaxCmdsInTran Parameter

Wed, 2014-07-30 10:06

I recently worked on a transactional replication issue and discovered some interesting behavior of the Log Reader Agent switch called MaxCmdsInTran, and wanted to share it with you.

Let’s take a look at the use of this switch by looking at the MSDN documentation below:

MaxCmdsInTran number_of_commands

Specifies the maximum number of statements grouped into a transaction as the Log Reader writes commands to the distribution database. Using this parameter allows the Log Reader Agent and Distribution Agent to divide large transactions (consisting of many commands) at the Publisher into several smaller transactions when applied at the Subscriber. Specifying this parameter can reduce contention at the Distributor and reduce latency between the Publisher and Subscriber. Because the original transaction is applied in smaller units, the Subscriber can access rows of a large logical Publisher transaction prior to the end of the original transaction, breaking strict transactional atomicity. The default is 0, which preserves the transaction boundaries of the Publisher.

However, I observed that an update on a primary key column is not split into multiple smaller transactions as described in the documentation.

Looking into this further reveals that it is probably the effect of a bounded update. A bounded update has to be processed as a whole: it sends all deletes followed by all inserts, so it can’t be broken into smaller transactions, as the Log Reader won’t know what would be a safe boundary.

The key difference lies in how updates are replicated when you update a PK column versus a non-PK column. Let’s take an example to look at this (in this example, C1 is the non-PK column and C2 is the PK column).

If you update the non-PK column, it is replicated as an update.

-- Updating non-PK column
begin tran My_Deferred_Update_1_Row
update T1 set c1 = 1 where C1=2
commit tran My_Deferred_Update_1_Row

-- Below is what gets added in msrepl_commands
exec sp_replshowcmds 1000

xact_seqno              command
0x0000016E000005330004  {CALL [dbo].[sp_MSupd_dbot1] (1,,2,0x01)}

What is bounded update?

However, updates on PK/clustered index columns are replicated as delete/insert pairs.

-- Updating unique column
begin tran My_Bounded_Update_2_Rows
update T1 set c2 = c2 + 1000
commit tran My_Bounded_Update_2_Rows

-- Below is what gets added in msrepl_commands
exec sp_replshowcmds 1000

xact_seqno              command
0x00000017000000B5000E  {CALL [dbo].[sp_MSdel_dboT1] (1)}
0x00000017000000B5000E  {CALL [dbo].[sp_MSdel_dboT1] (2)}
0x00000017000000B5000E  {CALL [dbo].[sp_MSins_dboT1] (1,3000,1)}
0x00000017000000B5000E  {CALL [dbo].[sp_MSins_dboT1] (2,1002,2)}

As you can see in the above case, when we update a PK/clustered index column, the updates are sent as deletes followed by inserts (this is called a bounded update). This is one single transaction, which is converted into delete and insert pairs. All deletes are sent first, followed by the inserts.

We cannot break this transaction (the PK update) apart, as that would cause the deletes (a few or all) to happen first and the inserts to happen in a separate transaction, breaking the transaction boundary. Splitting this operation into multiple transactions would cause inconsistency, and that is most probably the reason this switch won’t work in this situation.
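To make that reasoning concrete, here is a toy Python model of the behavior I observed. It is not SQL Server's actual implementation; the function name, its parameters, and the command labels are all made up for illustration. It splits a transaction's commands into batches of at most max_cmds, unless the transaction is a bounded update, in which case it stays whole.

```python
def split_transaction(commands, max_cmds, bounded_update=False):
    """Return the batches a (hypothetical) log reader would write.

    max_cmds plays the role of MaxCmdsInTran; 0 means "do not split".
    """
    if max_cmds <= 0 or bounded_update:
        # 0 preserves the Publisher's transaction boundary. A bounded update
        # (all deletes, then all inserts) has no safe split point, so it is
        # also kept as one unit in this model.
        return [list(commands)]
    return [list(commands[i:i + max_cmds])
            for i in range(0, len(commands), max_cmds)]


# A plain update of five rows is split into batches of two commands.
plain = split_transaction(["upd1", "upd2", "upd3", "upd4", "upd5"], max_cmds=2)

# A bounded update stays in one batch regardless of the switch.
bounded = split_transaction(["del1", "del2", "ins1", "ins2"], max_cmds=2,
                            bounded_update=True)
```

In this sketch the plain update ends up in three batches, while the bounded update remains a single batch of four commands.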

Why does replication send all deletes first and then all inserts, rather than delete/insert pairs in order?

Let’s assume table A contains two rows, unique column C1 values being 1 and 2.

Now user runs the following: update A set c1 = c1 + 1.

The log records will be like

LOP_BEGIN_UPDATE

Del 1

Ins 2

Del 2

Ins 3

LOP_END_UPDATE

And the commands posted in the distribution database will be like

{CALL [sp_MSdel_dboA] (1)}

{CALL [sp_MSdel_dboA] (2)}

{CALL [sp_MSins_dboA] (1,2)}

{CALL [sp_MSins_dboA] (2,3)}

But if it sent the updates directly, you’d see

Update A set c1 = 2

Update A set c1 = 3

In that case, the first update would fail, since c1 = 2 already exists. That’s why it deletes the rows first before inserting them back with the new values.
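The same argument can be replayed in a few lines of Python. This is only a simulation of a unique column held in memory, with function names of my own choosing; it says nothing about how the Subscriber actually applies commands.

```python
def row_by_row_updates(values):
    """Apply c1 = c1 + 1 one row at a time, checking uniqueness each step."""
    rows = list(values)
    for i, old in enumerate(rows):
        new = old + 1
        others = rows[:i] + rows[i + 1:]
        if new in others:
            raise ValueError("unique violation: c1 = %d already exists" % new)
        rows[i] = new
    return rows


def deletes_then_inserts(values):
    """Apply the same change as all deletes first, then all inserts."""
    rows = set(values)
    for old in values:
        rows.remove(old)            # Del 1, Del 2
    for old in values:
        new = old + 1
        if new in rows:
            raise ValueError("unique violation: c1 = %d already exists" % new)
        rows.add(new)               # Ins 2, Ins 3
    return sorted(rows)
```

With the two rows from the example, row_by_row_updates([1, 2]) raises on the very first row, while deletes_then_inserts([1, 2]) ends with the values 2 and 3.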

I would recommend looking at the option of publishing the stored procedure execution to avoid this kind of huge update, which will cause performance issues in replication.

Happy Reading!

 

 

Categories: DBA Blogs

In-Memory Column Store: 10046 May Be Lying to You!

Wed, 2014-07-30 07:46

The Oracle In-Memory Column Store (IMC) is a new database option available to Oracle Database Enterprise Edition (EE) customers. It introduces a new memory area housed in your SGA, which makes use of the new compression functionality brought by the Oracle Exadata platform, as well as the new column oriented data storage vs the traditional row oriented storage. Note: you don’t need to be running on Exadata to be able to use the IMC!

 

Part I – How does it work?

In this part we’ll take a peek under the hood of the IMC and check out some of its internal mechanics.

Let’s create a sample table which we will use for our demonstration:


create table test inmemory priority high
as
select a.object_name as name, rownum as rn,
sysdate + rownum / 10000 as dt
from all_objects a, (select rownum from dual connect by level <= 500)
/

Almost immediately upon creating this table, the w00? processes will wake up from sleeping on the event ‘Space Manager: slave idle wait’ and start their analysis to check out the new table. By the way, the sleep times for this event are between 3 and 5 seconds, so it’s normal if you experience a little bit of a delay.

The process that picks it up will then create a new entry in the new dictionary table compression$, such as this one:

SQL> exec pt('select ts#,file#,block#,obj#,dataobj#,ulevel,sublevel,ilevel,flags,bestsortcol, tinsize,ctinsize,toutsize,cmpsize,uncmpsize,mtime,spare1,spare2,spare3,spare4 from compression$');
TS# : 4
FILE# : 4
BLOCK# : 130
OBJ# : 20445
DATAOBJ# : 20445
ULEVEL : 5
SUBLEVEL : 9
ILEVEL : 1582497813
FLAGS :
BESTSORTCOL : -1
TINSIZE : 16339840
CTINSIZE :
TOUTSIZE : 9972219
CMPSIZE :
UNCMPSIZE :
MTIME : 13-may-2014 23:14:46
SPARE1 : 31
SPARE2 : 5256
SPARE3 : 571822
SPARE4 :



Plus, there is also a BLOB column in compression$, which holds the analyzer’s findings:


SQL> select analyzer from compression$;

ANALYZER
------------------------------------------------------------------------------
004B445A306AD5025A0000005A6B8E0200000300000000000001020000002A0000003A0000004A(output truncated for readability)


A quick check reveals that this is indeed our object:


SQL> exec pt('select object_name, object_type, owner from dba_objects where data_object_id = 20445');
OBJECT_NAME : TEST
OBJECT_TYPE : TABLE
OWNER : FOO
-----------------

PL/SQL procedure successfully completed.


And we can see the object is now stored in the IMC by looking at v$im_segments:

SQL> exec pt('select * from v$im_segments');
OWNER : FOO
SEGMENT_NAME : TEST
PARTITION_NAME :
SEGMENT_TYPE : TABLE
TABLESPACE_NAME : USERS
INMEMORY_SIZE : 102301696
BYTES : 184549376
BYTES_NOT_POPULATED : 0
POPULATE_STATUS : COMPLETED
INMEMORY_PRIORITY : HIGH
INMEMORY_DISTRIBUTE : AUTO
INMEMORY_DUPLICATE : NO DUPLICATE
INMEMORY_COMPRESSION : FOR QUERY LOW
CON_ID : 0
-----------------

PL/SQL procedure successfully completed.



Thus, we are getting the expected performance benefit of it being in the IMC:

SQL> alter session set inmemory_query=disable;

Session altered.

Elapsed: 00:00:00.01
SQL> select count(*) from test;

COUNT(*)
----------
4187500

Elapsed: 00:00:03.96
SQL> alter session set inmemory_query=enable;

Session altered.

Elapsed: 00:00:00.01
SQL> select count(*) from test;

COUNT(*)
----------
4187500

Elapsed: 00:00:00.13


So far, so good.


Part II – Execution Plans

There are some things we need to be aware of, though, when using the IMC in 12.1.0.2. One of them is that we can’t always trust the execution plans anymore.

Let’s go back to our original sample table and recreate it using the default setting of INMEMORY PRIORITY NONE.


drop table test purge
/

create table test inmemory priority none
as
select a.object_name as name, rownum as rn,
sysdate + rownum / 10000 as dt
from all_objects a, (select rownum from dual connect by level <= 500)
/



Now let’s see what plan we’d get if we were to query it right now:


SQL> explain plan for select name from test where name = 'ALL_USERS';

Explained.

SQL> @?/rdbms/admin/utlxpls

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------
Plan hash value: 1357081020

-----------------------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |   614 | 12280 |   811  (73)| 00:00:01 |
|*  1 |  TABLE ACCESS INMEMORY FULL| TEST |   614 | 12280 |   811  (73)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - inmemory("NAME"='ALL_USERS')
       filter("NAME"='ALL_USERS')

14 rows selected.


Okay, you might say now that EXPLAIN PLAN is only a guess. It’s not the real plan, and the real plan has to be different. And you would be right. Usually.

Watching the slave processes, there is no activity related to this table. Since its PRIORITY is NONE, it won’t be loaded into the IMC until it’s actually queried for the first or second time.

So let’s take a closer look then, shall we:

SQL> alter session set tracefile_identifier='REAL_PLAN';

Session altered.

SQL> alter session set events '10046 trace name context forever, level 12';

Session altered.

SQL> select name from test where name = 'ALL_USERS';



Now let’s take a look at the STAT line on that tracefile. Note: I closed the above session to make sure that we’ll get the full trace data.


PARSING IN CURSOR #140505885438688 len=46 dep=0 uid=64 oct=3 lid=64 tim=32852930021 hv=3233947880 ad='b4d04b00' sqlid='5sybd9b0c4878'
select name from test where name = 'ALL_USERS'
END OF STMT
PARSE #140505885438688:c=6000,e=10014,p=0,cr=2,cu=0,mis=1,r=0,dep=0,og=1,plh=1357081020,tim=32852930020
EXEC #140505885438688:c=0,e=58,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=32852930241
WAIT #140505885438688: nam='SQL*Net message to client' ela= 25 driver id=1650815232 #bytes=1 p3=0 obj#=20466 tim=32852930899
WAIT #140505885438688: nam='direct path read' ela= 13646 file number=4 first dba=21507 block cnt=13 obj#=20466 tim=32852950242
WAIT #140505885438688: nam='direct path read' ela= 2246 file number=4 first dba=21537 block cnt=15 obj#=20466 tim=32852953528
WAIT #140505885438688: nam='direct path read' ela= 1301 file number=4 first dba=21569 block cnt=15 obj#=20466 tim=32852955406

FETCH #140505885438688:c=182000,e=3365871,p=17603,cr=17645,cu=0,mis=0,r=9,dep=0,og=1,plh=1357081020,tim=32857244740
STAT #140505885438688 id=1 cnt=1000 pid=0 pos=1 obj=20466 op='TABLE ACCESS INMEMORY FULL TEST (cr=22075 pr=22005 pw=0 time=865950 us cost=811 size=12280 card=614)'



So that’s still the wrong one right there, and the STAT line even clearly shows that we’ve actually done 22005 physical reads, and therefore likely no in-memory scan, but a full scan from disk. There’s clearly a bug there with the execution plan reported, which is plain wrong.
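If you want to run this cross-check over a whole trace file rather than by eye, a small script will do. The sketch below is my own; the regular expressions are written against the excerpt shown above and are not an official description of the trace format. It sums the blocks reported by 'direct path read' waits and flags any STAT line that claims an INMEMORY operation while reporting physical reads.

```python
import re

# Patterns derived from the trace excerpt above (an assumption, not a spec).
WAIT_RE = re.compile(r"nam='direct path read'.*?block cnt=(\d+)")
STAT_RE = re.compile(r"op='(.*?) \(cr=(\d+) pr=(\d+) pw=(\d+)")


def summarize_trace(lines):
    """Collect direct path read blocks and (operation, pr) pairs."""
    summary = {"direct_read_blocks": 0, "stat": []}
    for line in lines:
        if line.startswith("WAIT"):
            match = WAIT_RE.search(line)
            if match:
                summary["direct_read_blocks"] += int(match.group(1))
        elif line.startswith("STAT"):
            match = STAT_RE.search(line)
            if match:
                summary["stat"].append((match.group(1), int(match.group(3))))
    return summary


def suspicious_operations(summary):
    """Operations reported as INMEMORY that still did physical reads."""
    return [op for op, pr in summary["stat"] if "INMEMORY" in op and pr > 0]


SAMPLE = [
    "WAIT #1: nam='direct path read' ela= 13646 file number=4 first dba=21507 block cnt=13 obj#=20466 tim=32852950242",
    "WAIT #1: nam='direct path read' ela= 2246 file number=4 first dba=21537 block cnt=15 obj#=20466 tim=32852953528",
    "WAIT #1: nam='direct path read' ela= 1301 file number=4 first dba=21569 block cnt=15 obj#=20466 tim=32852955406",
    "STAT #1 id=1 cnt=1000 pid=0 pos=1 obj=20466 op='TABLE ACCESS INMEMORY FULL TEST (cr=22075 pr=22005 pw=0 time=865950 us cost=811 size=12280 card=614)'",
]

sample_summary = summarize_trace(SAMPLE)
```

The sample lines are shortened copies of the trace above (the cursor number is abbreviated to #1). To use it on a real file, pass the open file object to summarize_trace instead of SAMPLE.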

Thus, be careful about using INMEMORY PRIORITY NONE, as you may not get what you expect. Since the PRIORITY NONE settings may also be overridden by any other PRIORITY settings, your data may get flushed out of the IMC, even though your execution plans will say otherwise. And I’m sure many of you know it’s often not slow response times on queries that make the phone ring hot. It’s inconsistent response times. This feature, if used inappropriately, will pretty much guarantee inconsistent response times.

Apparently, what we should be doing is size up the In Memory Column store appropriately, to hold the objects we actually need to be in there. And make sure they’re always in there by setting a PRIORITY of LOW or higher. Use CRITICAL and HIGH to ensure the most vital objects of the application are populated first.

There was one other oddity that I noticed while tracing the W00? processes.

Part III – What are you scanning, Oracle ?

The m000 process’ trace file reveals many back-to-back executions of this select:


PARSING IN CURSOR #140670951860040 len=104 dep=1 uid=0 oct=3 lid=0 tim=23665542991 hv=2910336760 ad='fbd06928' sqlid='24uqc4aqrhdrs'
select /*+ result_cache */ analyzer from compression$ where obj#=:1 and ulevel=:2



They all supply the same obj# bind value, which is our table’s object number. The ulevel values used vary between executions.


However, looking at the related WAIT lines for this cursor, we see:


WAIT #140670951860040: nam='direct path read' ela= 53427 file number=4 first dba=18432 block cnt=128 obj#=20445 tim=23666569746
WAIT #140670951860040: nam='direct path read' ela= 38073 file number=4 first dba=18564 block cnt=124 obj#=20445 tim=23666612210
WAIT #140670951860040: nam='direct path read' ela= 38961 file number=4 first dba=18816 block cnt=128 obj#=20445 tim=23666665534
WAIT #140670951860040: nam='direct path read' ela= 39708 file number=4 first dba=19072 block cnt=128 obj#=20445 tim=23666706469
WAIT #140670951860040: nam='direct path read' ela= 40242 file number=4 first dba=19328 block cnt=128 obj#=20445 tim=23666749431
WAIT #140670951860040: nam='direct path read' ela= 39147 file number=4 first dba=19588 block cnt=124 obj#=20445 tim=23666804243
WAIT #140670951860040: nam='direct path read' ela= 33654 file number=4 first dba=19840 block cnt=128 obj#=20445 tim=23666839836
WAIT #140670951860040: nam='direct path read' ela= 38908 file number=4 first dba=20096 block cnt=128 obj#=20445 tim=23666881932
WAIT #140670951860040: nam='direct path read' ela= 40605 file number=4 first dba=20352 block cnt=128 obj#=20445 tim=23666924029
WAIT #140670951860040: nam='direct path read' ela= 32089 file number=4 first dba=20612 block cnt=124 obj#=20445 tim=23666962858
WAIT #140670951860040: nam='direct path read' ela= 36223 file number=4 first dba=20864 block cnt=128 obj#=20445 tim=23667001900
WAIT #140670951860040: nam='direct path read' ela= 39733 file number=4 first dba=21120 block cnt=128 obj#=20445 tim=23667043146
WAIT #140670951860040: nam='direct path read' ela= 17607 file number=4 first dba=21376 block cnt=128 obj#=20445 tim=23667062232

… and several more.


Now, compression$ contains only a single row. Its total extent size is negligible as well:


SQL> select sum(bytes)/1024/1024 from dba_extents where segment_name = 'COMPRESSION$';

SUM(BYTES)/1024/1024
--------------------
.0625


So how come Oracle is reading so many blocks? Note that each of the above waits is a multi-block read of 124 or 128 blocks.
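To put a number on the mismatch, here is the arithmetic for just the thirteen waits listed above. The block counts are copied from the trace; the 8 KB block size is an assumption on my part, since the post does not state it.

```python
BLOCK_SIZE = 8192  # assumption: 8 KB database block size

# "block cnt" values from the thirteen WAIT lines shown above.
block_counts = [128, 124, 128, 128, 128, 124, 128, 128, 128, 124, 128, 128, 128]

total_blocks = sum(block_counts)
read_mb = total_blocks * BLOCK_SIZE / (1024 * 1024)

# Size of COMPRESSION$ as reported by dba_extents above.
segment_mb = 0.0625

print("blocks read: %d, about %.2f MB, versus a %.4f MB segment"
      % (total_blocks, read_mb, segment_mb))
```

Under that assumption, these few waits alone already cover far more data than the whole compression$ segment holds, and the trace contains several more of them.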

Let’s take a look at what Oracle is actually reading there:

begin
pt('select segment_name, segment_type, owner
from dba_extents where file_id = 4
and 18432 between block_id and block_id + blocks - 1');
end;
/

SEGMENT_NAME : TEST
SEGMENT_TYPE : TABLE
OWNER : FOO
-----------------

PL/SQL procedure successfully completed.

There’s our table again. Wait. What ?

There must be some magic going on underneath the covers here. In my understanding, a plain select against table A does not scan table B.

If I manually run the same select statement against compression$, I get totally normal trace output.

This reminds me of the good old:

SQL> select piece from IDL_SB4$;
ERROR:
ORA-00932: inconsistent datatypes: expected CHAR got B4



But I digress.

It could simply be a bug that results in these direct path reads being attributed to the wrong cursor. Or it could be intended, as it’s indeed this process’ job to analyze and load this table, and this way the resource usage it causes is instrumented and can be tracked?

Either way, to sum things up we can say that:

- Performance benefits can potentially be huge
- Oracle automatically scans and caches segments marked as INMEMORY PRIORITY LOW|MEDIUM|HIGH|CRITICAL (they don’t need to be queried first!)
- Oracle scans segments marked as INMEMORY PRIORITY NONE (the default) only after they’re accessed the second time – and they may get overridden by higher priorities
- Oracle analyzes the table and stores the results in compression$
- Based on that analysis, Oracle may decide to load one or the other column only into IMC, or the entire table, depending on available space, and depending on the INMEMORY clause used
- It’s the W00? processes using some magic to do this analysis and read the segment into IMC.
- This analysis is also likely to be triggered again, whenever space management of the IMC triggers again, but I haven’t investigated that yet.

Categories: DBA Blogs

Create Windows Service for Oracle RAC

Tue, 2014-07-29 08:08

It’s my first time on a RAC system for Windows, and I’m happy to learn something new to share.

I created a new service for a database (restoredb), only to find out the ORACLE_HOME for the service is “c:\oracle\product\10.2.0\asm_1”.

Any ideas as to what was wrong?

C:\dba_pythian>set oracle_home=C:\oracle\product\10.2.0\db_1

C:\dba_pythian>echo %ORACLE_HOME%
C:\oracle\product\10.2.0\db_1

C:\dba_pythian>oradim -NEW -SID restoredb -STARTMODE manual
Instance created.

C:\dba_pythian>env
 1 STOPPED agent11g1Agent                                    c:\oracle\app\11.1.0\agent11g
 2 STOPPED agent11g1AgentSNMPPeerEncapsulator                c:\oracle\app\11.1.0\agent11g\bin\encsvc.exe
 3 STOPPED agent11g1AgentSNMPPeerMasterAgent                 c:\oracle\app\11.1.0\agent11g\bin\agntsvc.exe
 4 RUNNING +ASM1                                             c:\oracle\product\10.2.0\asm_1
 5 RUNNING ClusterVolumeService                              C:\oracle\product\10.2.0\crs
 6 RUNNING CRS                                               C:\oracle\product\10.2.0\crs
 7 RUNNING CSS                                               C:\oracle\product\10.2.0\crs
 8 RUNNING EVM                                               C:\oracle\product\10.2.0\crs
 9 STOPPED JobSchedulerDWH1                                  c:\oracle\product\10.2.0\db_1
10 STOPPED JobSchedulerRMP1                                  c:\oracle\product\10.2.0\db_1
11 RUNNING OraASM10g_home1TNSListenerLISTENER_PRD-DB-10G-01  C:\oracle\product\10.2.0\asm_1
12 STOPPED OraDb10g_home1TNSListener                         c:\oracle\product\10.2.0\db_1
13 STOPPED ProcessManager                                    "C:\oracle\product\10.2.0\crs"
14 RUNNING DWH1                                              c:\oracle\product\10.2.0\db_1
15 RUNNING RMP1                                              c:\oracle\product\10.2.0\db_1
16 RUNNING agent12c1Agent                                    C:\agent12c\core\12.1.0.4.0
17 RUNNING restoredb                                         c:\oracle\product\10.2.0\asm_1
18 STOPPED JobSchedulerrestoredb                             c:\oracle\product\10.2.0\asm_1

Checking the PATH variable shows that the ASM home is listed before the DB home.

C:\\dba_pythian>set
Path=C:\oracle\product\10.2.0\asm_1\bin;C:\oracle\product\10.2.0\db_1\bin;C:\WINDOWS\system32;C:\WINDOWS
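Why this matters: a bare command name is resolved by walking the PATH entries in order, and the first match wins. The Python sketch below is a simplified model of that lookup (real Windows resolution also considers the current directory and PATHEXT). The function names are mine, and the set of existing files is an assumption: I am presuming each home ships its own oradim.exe, which is consistent with what happened above.

```python
import ntpath


def first_on_path(exe, path_value, exists):
    """Return the first PATH entry that contains exe, or None."""
    for directory in path_value.split(";"):
        if not directory:
            continue
        candidate = ntpath.join(directory, exe)
        if exists(candidate):
            return candidate
    return None


PATH_VALUE = (r"C:\oracle\product\10.2.0\asm_1\bin;"
              r"C:\oracle\product\10.2.0\db_1\bin;"
              r"C:\WINDOWS\system32;C:\WINDOWS")

# Assumption for illustration: both Oracle homes contain an oradim.exe.
FAKE_FILES = {
    r"c:\oracle\product\10.2.0\asm_1\bin\oradim.exe",
    r"c:\oracle\product\10.2.0\db_1\bin\oradim.exe",
}


def fake_exists(path):
    # Windows paths are case-insensitive, so compare in lower case.
    return path.lower() in FAKE_FILES


resolved = first_on_path("oradim.exe", PATH_VALUE, fake_exists)
```

With this PATH, the lookup stops at the asm_1 entry, no matter what ORACLE_HOME is set to in the session.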

Create the database service by specifying the full path to oradim from the DB home:

C:\dba_pythian>oradim -DELETE -SID restoredb
Instance deleted.

C:\dba_pythian>env
 1 STOPPED agent11g1Agent                                    c:\oracle\app\11.1.0\agent11g
 2 STOPPED agent11g1AgentSNMPPeerEncapsulator                c:\oracle\app\11.1.0\agent11g\bin\encsvc.exe
 3 STOPPED agent11g1AgentSNMPPeerMasterAgent                 c:\oracle\app\11.1.0\agent11g\bin\agntsvc.exe
 4 RUNNING +ASM1                                             c:\oracle\product\10.2.0\asm_1
 5 RUNNING ClusterVolumeService                              C:\oracle\product\10.2.0\crs
 6 RUNNING CRS                                               C:\oracle\product\10.2.0\crs
 7 RUNNING CSS                                               C:\oracle\product\10.2.0\crs
 8 RUNNING EVM                                               C:\oracle\product\10.2.0\crs
 9 STOPPED JobSchedulerDWH1                                  c:\oracle\product\10.2.0\db_1
10 STOPPED JobSchedulerRMP1                                  c:\oracle\product\10.2.0\db_1
11 RUNNING OraASM10g_home1TNSListenerLISTENER_PRD-DB-10G-01  C:\oracle\product\10.2.0\asm_1
12 STOPPED OraDb10g_home1TNSListener                         c:\oracle\product\10.2.0\db_1
13 STOPPED ProcessManager                                    "C:\oracle\product\10.2.0\crs"
14 RUNNING DWH1                                              c:\oracle\product\10.2.0\db_1
15 RUNNING RMP1                                              c:\oracle\product\10.2.0\db_1
16 RUNNING agent12c1Agent                                    C:\agent12c\core\12.1.0.4.0

C:\dba_pythian>dir C:\oracle\product\10.2.0\db_1\BIN\orad*
 Volume in drive C has no label.
 Volume Serial Number is D4FE-B3A8

 Directory of C:\oracle\product\10.2.0\db_1\BIN

07/08/2010  10:01 AM           121,344 oradbcfg10.dll
07/20/2010  05:20 PM             5,120 oradim.exe
07/20/2010  05:20 PM             3,072 oradmop10.dll
               3 File(s)        129,536 bytes
               0 Dir(s)  41,849,450,496 bytes free

C:\dba_pythian>C:\oracle\product\10.2.0\db_1\BIN\oradim.exe -NEW -SID restoredb -STARTMODE manual
Instance created.

C:\dba_pythian>env
 1 STOPPED agent11g1Agent                                    c:\oracle\app\11.1.0\agent11g
 2 STOPPED agent11g1AgentSNMPPeerEncapsulator                c:\oracle\app\11.1.0\agent11g\bin\encsvc.exe
 3 STOPPED agent11g1AgentSNMPPeerMasterAgent                 c:\oracle\app\11.1.0\agent11g\bin\agntsvc.exe
 4 RUNNING +ASM1                                             c:\oracle\product\10.2.0\asm_1
 5 RUNNING ClusterVolumeService                              C:\oracle\product\10.2.0\crs
 6 RUNNING CRS                                               C:\oracle\product\10.2.0\crs
 7 RUNNING CSS                                               C:\oracle\product\10.2.0\crs
 8 RUNNING EVM                                               C:\oracle\product\10.2.0\crs
 9 STOPPED JobSchedulerDWH1                                  c:\oracle\product\10.2.0\db_1
10 STOPPED JobSchedulerRMP1                                  c:\oracle\product\10.2.0\db_1
11 RUNNING OraASM10g_home1TNSListenerLISTENER_PRD-DB-10G-01  C:\oracle\product\10.2.0\asm_1
12 STOPPED OraDb10g_home1TNSListener                         c:\oracle\product\10.2.0\db_1
13 STOPPED ProcessManager                                    "C:\oracle\product\10.2.0\crs"
14 RUNNING DWH1                                              c:\oracle\product\10.2.0\db_1
15 RUNNING RMP1                                              c:\oracle\product\10.2.0\db_1
16 RUNNING agent12c1Agent                                    C:\agent12c\core\12.1.0.4.0
17 RUNNING restoredb                                         c:\oracle\product\10.2.0\db_1
18 STOPPED JobSchedulerrestoredb                             c:\oracle\product\10.2.0\db_1

C:\dba_pythian>
Categories: DBA Blogs

How SQL Server Browser Service Works

Tue, 2014-07-29 08:07

Some of you may wonder what role the SQL Server Browser service plays for a SQL Server instance. In this blog post, I’ll provide an overview of how SQL Server Browser plays a crucial role in connectivity, and look at its internals by capturing Network Monitor output during connection attempts in different scenarios.

Here is an executive summary of the connectivity flow:   ExecutiveWorkflow

 

Here is another diagram to explain the SQL Server connectivity status for Named & Default instance under various scenarios:

 DefaultvsNamed

Network Monitor output for connectivity to Named instance when SQL Browser is running:

In the diagram below, we can see that a UDP request over port 1434 was sent from a local machine (client) to the SQL Server machine (server), and the response came from the server’s port 1434 over UDP to the client port with the list of instances and the port on which each is listening:
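For the curious, the exchange can be reproduced from a script. The sketch below follows my reading of the SQL Server Resolution Protocol specification listed in the references: the client sends a one-byte request type (0x04 for a single named instance) followed by the null-terminated instance name, and the server answers with 0x05, a two-byte little-endian length, and a semicolon-separated list of key/value pairs. Treat the message layout as an assumption to verify against the spec; the function names and the sample reply are mine.

```python
import socket

SSRP_PORT = 1434          # UDP port SQL Server Browser listens on
CLNT_UCAST_INST = 0x04    # request: details of one named instance
SVR_RESP = 0x05           # response marker


def build_request(instance_name):
    """Request type byte followed by the null-terminated instance name."""
    return bytes([CLNT_UCAST_INST]) + instance_name.encode("ascii") + b"\x00"


def parse_response(data):
    """Turn 'key;value;key;value;;' into a dict."""
    if len(data) < 3 or data[0] != SVR_RESP:
        raise ValueError("not a Browser response")
    length = int.from_bytes(data[1:3], "little")
    tokens = data[3:3 + length].decode("ascii").split(";")
    return {key: value
            for key, value in zip(tokens[0::2], tokens[1::2]) if key}


def query_browser(host, instance_name, timeout=2.0):
    """Ask the Browser service on host where instance_name listens."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(instance_name), (host, SSRP_PORT))
        data, _ = sock.recvfrom(65535)
    return parse_response(data)


# A made-up reply, used only to exercise the parser offline.
_body = (b"ServerName;SRV1;InstanceName;INST1;IsClustered;No;"
         b"Version;11.0.2100.60;tcp;50123;;")
SAMPLE_REPLY = bytes([SVR_RESP]) + len(_body).to_bytes(2, "little") + _body
sample_info = parse_response(SAMPLE_REPLY)
```

If the Browser service is stopped, query_browser simply times out with socket.timeout, which matches the unanswered requests in the next capture.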

image003

 

Network Monitor output for connectivity to Named instance when SQL Browser is stopped/disabled:

We can see that the client sends 5 requests, none of which gets a response from UDP 1434 on the server, so connectivity will never be established to the named instance.

 image004

 

Network Monitor output for connectivity to Named instance with port number specified in connection string & SQL Browser is stopped/disabled:

There is no call made to the server’s 1434 port over UDP; instead, the connection is made directly to the TCP port specified in the connection string.

image005

Network Monitor output for connectivity to Default instance when SQL Browser running:

We can see that no calls were made to the server’s 1434 port over UDP, on which SQL Server Browser is listening.

 image006

 

Network Monitor output for connectivity to Default instance which is configured to listen on a port other than the default 1433 when SQL Browser is running:

 We can see that connectivity failed after multiple attempts because client assumes that default instance of SQL Server always listens on TCP port 1433.

You can refer to the blog post below for some workarounds to handle this situation:

http://blogs.msdn.com/b/dataaccesstechnologies/archive/2010/03/03/running-sql-server-default-instance-on-a-non-default-or-non-standard-tcp-port-tips-for-making-application-connectivity-work.aspx

image007

References:

SQL Server Browser Service - http://msdn.microsoft.com/en-us/library/ms181087.aspx

Ports used by SQL Server and Browser Service - http://msdn.microsoft.com/en-us/library/ms175483.aspx

SQL Server Resolution Protocol Specification - http://msdn.microsoft.com/en-us/library/cc219703(v=prot.10).aspx

Thanks for reading!

 

Categories: DBA Blogs

TechTalk v5.0 – The Age of Big Data with Alex Morrise

Mon, 2014-07-28 14:06

Who: Hosted by Blackbird, with a speaking session by Alex Morrise, Chief Data Scientist at Pythian.

What: TechTalk presentation, beer, wine, snacks and Q&A

Where: Blackbird HQ – 712 Tehama Street (corner of 8th and Tehama) San Francisco, CA

When: Thursday July 31, 2014 from 6:00-8:00 PM

How: RSVP here!

TechTalk v5.0 welcomes to the stage, Alex Morrise, Chief Data Scientist at Pythian. Alex previously worked with Idle Games, Quid, and most recently Beats Music where he led the development of an adaptive, contextual music recommendation server.  Alex earned a PhD in Theoretical Physics from UC Santa Cruz.

This edition of TechTalk will be based on how the age of big data allows statistical inference on an unprecedented scale. Inference is the process of extracting knowledge from data, many times uncovering latent variables unifying seemingly diverse pieces of information. As data grows in complexity and dimension, visualization becomes increasingly difficult. How do we represent complex data to discover implicit and explicit relationships? We discuss how to Visualize Inference in some interesting data sets that uncover topics as diverse as the growth of technology, social gaming, and music.

You won’t want to miss this event, so be sure to RSVP.

 

Categories: DBA Blogs

Unexpected Shutdown Caused by ASR

Mon, 2014-07-28 13:45

In the past few days I had two incidents and an outage that lasted just a few minutes. However, any outage in a production environment translates directly into cost. The server had the outage because it failed over and then failed back about 4 to 5 times in 15 minutes. I was holding the pager, and so became involved in investigating the root cause of this failing over and failing back. The events in the SQL Server error logs did not give me any clue as to what was happening or why, so I looked at the System log in the Windows Event Viewer. I thought, “Maybe I have something there!”

There were two events that came to my attention:

Event Type:      Error
Event Source:    EventLog
Event Category:  None
Event ID:        6008
Date:            7/24/2014
Time:            1:14:12 AM
User:            N/A
Computer:        SRV1
Description:
The previous system shutdown at 1:00:31 AM on 7/24/2014 was unexpected.

Event Type:      Information
Event Source:    Server Agents
Event Category:  Events
Event ID:        1090
Date:            7/24/2014
Time:            1:15:16 AM
User:            N/A
Computer:        SRV1
Description:
System Information Agent: Health: The server is operational again.  The server has previously been shutdown by the Automatic Server Recovery (ASR) feature and has just become operational again.

 

 

The errors are closely related to a feature called Automatic Server Recovery (ASR), which is configured on the server and comes with the hardware; in our case, an HP Blade ProLiant server. There are already some resources and threads discussing similar topics. Most hardware vendors make similar software with similar functionality available for their servers.

In my case, my understanding was that maybe the firmware was out of date and required updating, or the servers were aged. I sent my findings to the customer with an incident report. In a couple of hours I had a reply, and the feedback was just what I was expecting: the hardware was aged. This may be the case for you as well when you see a message in Event Viewer that reads “System Information Agent: Health: The server is operational again.  The server has previously been shutdown by the Automatic Server Recovery (ASR) feature and has just become operational again.”  Go check with your system administrator. The root cause of this unexpected shutdown may not be SQL Server but rather the system itself. Please keep in mind that this is only one of the possible reasons, and certainly not the only one.

References:

Automatic System Recovery

 

Categories: DBA Blogs

Logging for Slackers

Mon, 2014-07-28 07:41

When I’m not working on Big Data infrastructure for clients, I develop a few internal web applications and side projects. It’s very satisfying to write a Django app in an afternoon and throw it on Heroku, but there comes a time when people actually start to use it. They find bugs, they complain about downtime, and suddenly your little side project needs some logging and monitoring infrastructure. To be clear, the right way to do this would be to subscribe to a SaaS logging platform, or to create some solution with ElasticSearch and Kibana, or just use Splunk. Today I was feeling lazy, and I wondered if there wasn’t an easier way.

Enter Slack

Slack is a chat platform my team already uses to communicate – we have channels for different purposes, and people subscribe to keep up to date about Data Science, our internal Hadoop cluster, or a bunch of other topics. I already get notifications on my desktop and my phone, and the history of messages is visible and searchable for everyone in a channel. This sounds like the ideal lazy log repository.

Slack offers a rich REST API where you can search, work with files, and communicate in channels. They also offer an awesome (for the lazy) Incoming WebHooks feature – this allows you to POST a JSON message with a secret token, which is posted to a pre-configured channel as a user you can configure in the web UI. The hardest part of setting up a new WebHook was choosing which emoji would best represent application errors – I chose a very sad smiley face, but the devil is also available.

The Kludge

Django already offers the AdminEmailHandler, which emails log messages to the admins listed in your project. I could have created a mailing list, added it to the admins list, and let people subscribe. They could then create a filter in their email to label the log messages. That sounds like a lot of work, and there wouldn’t be a history of the messages except in individual recipients’ inboxes.

Instead, I whipped up this log handler for Django which will post the message (and a stack trace, if possible) to your Slack endpoint:

from logging import Handler
import json
import traceback

import requests


class SlackLogHandler(Handler):
    """Log handler that posts each record to a Slack Incoming WebHook."""

    def __init__(self, logging_url="", stack_trace=False):
        Handler.__init__(self)
        self.logging_url = logging_url
        self.stack_trace = stack_trace

    def emit(self, record):
        message = '%s' % (record.getMessage())
        # Append the stack trace only when requested and available.
        if self.stack_trace and record.exc_info:
            message += '\n' + ''.join(traceback.format_exception(*record.exc_info))
        # Post every record, with or without a stack trace.
        requests.post(self.logging_url, data=json.dumps({"text": message}))

There you go: install the requests library, generate an Incoming WebHook URL at api.slack.com, stick the SlackLogHandler in your Django logging configuration, and your errors will be logged to the Slack channel of your choice. Stack traces are optional – I’ve also been using this to post hourly reports of active users, etc. to the channel under a different username.
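For those hourly reports you don't need the log handler at all. Here is a rough sketch using only the standard library instead of requests; the helper names are mine. It also assumes the webhook accepts "username" and "icon_emoji" overrides in the JSON payload, which depends on how the hook is set up, so check your own configuration before relying on it.

```python
import json
import urllib.request


def build_payload(text, username=None, icon_emoji=None):
    """JSON body for the webhook; the overrides are optional."""
    payload = {"text": text}
    if username:
        payload["username"] = username      # assumed to be honored by the hook
    if icon_emoji:
        payload["icon_emoji"] = icon_emoji  # e.g. ":bar_chart:"
    return json.dumps(payload)


def post_report(webhook_url, text, username="usage-report"):
    """POST a one-off message to the Incoming WebHook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(text, username=username).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling post_report('<SuperSecretWebHookURL>', '42 active users in the last hour') from a scheduled job would then put the report in the same channel as the errors.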

For reference, here’s a log configuration for the Django settings.py. Now go write some code, you slacker.

LOGGING = {
    'version':1,
    'disable_existing_loggers':False,
    'handlers': {
        'console': {
            'level':'DEBUG',
            'class':'logging.StreamHandler',
        },
        'slack-error': {
            'level':'ERROR',
            'class':'SlackLogHandler.SlackLogHandler',
            'logging_url':'<SuperSecretWebHookURL>',
            'stack_trace':True
        }
    },
    'loggers': {
        'django': {
            'level': 'INFO',
            'handlers': ['console', 'slack-error']
        }
    }
}
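
The routing in that configuration can be tried out without Django or Slack. The sketch below makes a few assumptions: CollectingHandler is a hypothetical stand-in that stores messages instead of posting them, the URL is a placeholder, and the handler is wired in with the "()" factory key of logging.config.dictConfig rather than the 'class' key, so that everything fits in one file.

```python
import logging
import logging.config


class CollectingHandler(logging.Handler):
    """Hypothetical stand-in for SlackLogHandler: keeps messages in a list."""

    def __init__(self, logging_url="", stack_trace=False):
        super().__init__()
        self.logging_url = logging_url
        self.stack_trace = stack_trace
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "slack-error": {
            "()": CollectingHandler,  # factory; remaining keys become kwargs
            "level": "ERROR",
            "logging_url": "https://example.invalid/webhook",  # placeholder
            "stack_trace": True,
        },
    },
    "loggers": {
        "django": {
            "level": "INFO",
            "handlers": ["slack-error"],
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING)
logger = logging.getLogger("django")
logger.info("routine message")   # below the handler's ERROR level: dropped
logger.error("something broke")  # reaches the handler

handler = logger.handlers[0]
print(handler.messages)
```

Only the ERROR record reaches the handler, which mirrors how the 'slack-error' handler keeps routine INFO messages out of the channel.
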
Categories: DBA Blogs

Log Buffer #381, A Carnival of the Vanities for DBAs

Fri, 2014-07-25 08:22

The rhythm of blog posts regarding database technology has remained consistent throughout the week. A few of those posts have been plucked by this Log Buffer Edition for your pleasure.

Oracle:

Sayan has shared a Standalone sqlplus script for plans comparing.

Gartner Analysis: PeopleSoft Update Manager Delivers Significant Improvements to the Upgrade Tools and Processes.

Timely blackouts, of course, are essential to keeping the numbers up and (more importantly) preventing Target Down notifications from being sent out.

Are you experiencing analytics pain points?

Bug with xmltable, xmlnamespaces and xquery_string specified using bind variable.

SQL Server:

SQL Server 2012 introduced columnstore indexes, which can immensely improve the performance of OLAP queries.

Restoring the SQL Server Master Database Even Without a Backup.

There are times when you need to write T-SQL code that creates specific T-SQL code and executes it. When you do this you are creating dynamic T-SQL code.

A lot of numbers that we use every day, such as bank card numbers, identification numbers, and ISBN codes, have check digits.

SQL-only ETL using a bulk insert into a temporary table (SQL Spackle).

MySQL:

How MariaDB makes Stored Procedures usable.

DBaaS, OpenStack and Trove 101: Introduction to the basics.

MySQL Fabric is a tool included in MySQL Utilities that helps you manage your MySQL instances.

Showing all available MySQL data types when creating a new table with MySQL for Excel.

Why TokuDB hates Transparent HugePages.

Categories: DBA Blogs

Happy System Administrator Appreciation Day

Fri, 2014-07-25 07:59

Today is our day. July 25, 2014 marks the 15th annual System Administrator Appreciation Day. On this day we pause and take a moment to forget the impossible tasks, nonexistent budgets, and often unrealistic timelines to say thank you to the people who keep everything working: system administrators.

So much of what has become a part of everyday life, from doing our jobs to playing games online, shopping, and connecting with friends and family around the world, is possible in large part thanks to the tireless efforts of the system administrators who are in the trenches every hour of every day of the year, keeping the tubes clear and the packets flowing. The fact that technology has become so commonplace in our lives, and more often than not “just works”, has afforded us the luxury of forgetting (or not even knowing) the immense infrastructure complexity that the system administrator works with to deliver the services we have come to rely on.

SysAdmin Appreciation Day started 15 years ago thanks to Ted Kekatos. According to Wikipedia, “Kekatos was inspired to create the special day by a Hewlett-Packard magazine advertisement in which a system administrator is presented with flowers and fruit-baskets by grateful co-workers as thanks for installing new printers. Kekatos had just installed several of the same model printers at his workplace.” Ever since then, SysAdmin Appreciation Day has been celebrated on the last Friday in July.

At Pythian, I have the privilege of being part of the Enterprise Infrastructure Services group.  We are a SysAdmin dream team of the best of the best, from around the globe. Day in and day out, our team is responsible for countless servers, networks, and services that millions of people use every day.

To all my colleagues and to anyone who considers themselves a SysAdmin, regardless of which flavour – thank you, and know that you are truly doing work that matters.

Categories: DBA Blogs

Exploring Options of Using RMAN Configure to Simplify Backup

Thu, 2014-07-24 14:06

I am a simple person who likes simple things, especially RMAN backup implementation.

I have yet to understand why so many RMAN backup implementations do not use the configure command. If you have a good explanation, please share.

Examples for RMAN configure command

configure device type disk parallelism 2 backup type to compressed backupset;
configure channel device type disk format '/oradata/backup/%d_%I_%T_%U' maxopenfiles 1;
configure channel 1 device type disk format '/oradata/backup1/%d_%I_%T_%U' maxopenfiles 1;
configure archivelog deletion policy to backed up 2 times to disk;
configure backup optimization on;

Do you know if backup is using parallelism?
Where is the backup to?
Is the backup to tape?

RMAN> show all;

RMAN configuration parameters for database with db_unique_name SAN are:
CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
CONFIGURE BACKUP OPTIMIZATION ON;
CONFIGURE DEFAULT DEVICE TYPE TO DISK;
CONFIGURE CONTROLFILE AUTOBACKUP ON;
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/oradata/backup/%d_%F.ctl';
CONFIGURE DEVICE TYPE DISK PARALLELISM 2 BACKUP TYPE TO COMPRESSED BACKUPSET;
CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT   '/oradata/backup/%d_%I_%T_%U' MAXOPENFILES 1;
CONFIGURE CHANNEL 1 DEVICE TYPE DISK FORMAT   '/oradata/backup1/%d_%I_%T_%U' MAXOPENFILES 1;
CONFIGURE MAXSETSIZE TO UNLIMITED; # default
CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default
CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 2 TIMES TO DISK;
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/u01/app/oracle/product/11.2.0/dbhome_1/dbs/snapcf_san.f'; # default

RMAN>
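
Because the persistent configuration is plain text, those three questions can also be answered by a script. The sketch below is one assumed way to do it: summarize is a hypothetical helper, and the sample text is a subset of the show all output above rather than anything read from a live RMAN session.

```python
import re

# A subset of the "show all" output from this post.
SHOW_ALL = """\
CONFIGURE DEFAULT DEVICE TYPE TO DISK;
CONFIGURE DEVICE TYPE DISK PARALLELISM 2 BACKUP TYPE TO COMPRESSED BACKUPSET;
CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT   '/oradata/backup/%d_%I_%T_%U' MAXOPENFILES 1;
CONFIGURE CHANNEL 1 DEVICE TYPE DISK FORMAT   '/oradata/backup1/%d_%I_%T_%U' MAXOPENFILES 1;
"""


def summarize(show_all_text):
    """Pull device type, parallelism and channel formats out of the text."""
    summary = {"default_device": None, "parallelism": None, "formats": []}
    for line in show_all_text.splitlines():
        m = re.match(r"CONFIGURE DEFAULT DEVICE TYPE TO (\w+);", line)
        if m:
            summary["default_device"] = m.group(1)
        m = re.match(r"CONFIGURE DEVICE TYPE \w+ PARALLELISM (\d+)", line)
        if m:
            summary["parallelism"] = int(m.group(1))
        m = re.match(r"CONFIGURE CHANNEL (?:\d+ )?DEVICE TYPE \w+ FORMAT\s+'([^']+)'", line)
        if m:
            summary["formats"].append(m.group(1))
    return summary


summary = summarize(SHOW_ALL)
print(summary)
```

For this configuration the summary reports a DISK default device (so not tape), a parallelism of 2, and the two destination formats.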

Simple RMAN script.

set echo on;
connect target;
show all;
backup incremental level 0 check logical database filesperset 1 tag "fulldb"
plus archivelog filesperset 8 tag "archivelog";

Simple RMAN run.

$ rman @simple.rman

Recovery Manager: Release 11.2.0.4.0 - Production on Thu Jul 24 11:12:19 2014

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

RMAN> set echo on;
2> connect target;
3> show all;
4> backup incremental level 0 check logical database filesperset 1 tag "fulldb"
5> plus archivelog filesperset 8 tag "archivelog";
6>
echo set on

connected to target database: SAN (DBID=2792912513)

using target database control file instead of recovery catalog
RMAN configuration parameters for database with db_unique_name SAN are:
CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
CONFIGURE BACKUP OPTIMIZATION ON;
CONFIGURE DEFAULT DEVICE TYPE TO DISK;
CONFIGURE CONTROLFILE AUTOBACKUP ON;
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/oradata/backup/%d_%F.ctl';
CONFIGURE DEVICE TYPE DISK PARALLELISM 2 BACKUP TYPE TO COMPRESSED BACKUPSET;
CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT   '/oradata/backup/%d_%I_%T_%U' MAXOPENFILES 1;
CONFIGURE CHANNEL 1 DEVICE TYPE DISK FORMAT   '/oradata/backup1/%d_%I_%T_%U' MAXOPENFILES 1;
CONFIGURE MAXSETSIZE TO UNLIMITED; # default
CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default
CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 2 TIMES TO DISK;
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/u01/app/oracle/product/11.2.0/dbhome_1/dbs/snapcf_san.f'; # default


Starting backup at 2014-JUL-24 11:12:21
current log archived
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=20 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=108 device type=DISK
channel ORA_DISK_1: starting compressed archived log backup set
channel ORA_DISK_1: specifying archived log(s) in backup set
input archived log thread=1 sequence=326 RECID=337 STAMP=853758742
channel ORA_DISK_1: starting piece 1 at 2014-JUL-24 11:12:24
channel ORA_DISK_1: finished piece 1 at 2014-JUL-24 11:12:25
piece handle=/oradata/backup1/SAN_2792912513_20140724_8dpe6koo_1_1 tag=ARCHIVELOG comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 2014-JUL-24 11:12:25

Starting backup at 2014-JUL-24 11:12:25
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting compressed incremental level 0 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00003 name=/oradata/SAN/datafile/o1_mf_undotbs1_9oqwsjk6_.dbf
channel ORA_DISK_1: starting piece 1 at 2014-JUL-24 11:12:26
channel ORA_DISK_2: starting compressed incremental level 0 datafile backup set
channel ORA_DISK_2: specifying datafile(s) in backup set
input datafile file number=00008 name=/oradata/SAN/datafile/o1_mf_user_dat_9wvp8s78_.dbf
channel ORA_DISK_2: starting piece 1 at 2014-JUL-24 11:12:26
channel ORA_DISK_1: finished piece 1 at 2014-JUL-24 11:13:01
piece handle=/oradata/backup1/SAN_2792912513_20140724_8epe6koq_1_1 tag=FULLDB comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:35
channel ORA_DISK_1: starting compressed incremental level 0 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00001 name=/oradata/SAN/datafile/o1_mf_system_9oqwr5tm_.dbf
channel ORA_DISK_1: starting piece 1 at 2014-JUL-24 11:13:04
channel ORA_DISK_1: finished piece 1 at 2014-JUL-24 11:13:29
piece handle=/oradata/backup1/SAN_2792912513_20140724_8gpe6kpu_1_1 tag=FULLDB comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:25
channel ORA_DISK_1: starting compressed incremental level 0 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/oradata/SAN/datafile/o1_mf_sysaux_9oqwrv2b_.dbf
channel ORA_DISK_1: starting piece 1 at 2014-JUL-24 11:13:30
channel ORA_DISK_1: finished piece 1 at 2014-JUL-24 11:13:45
piece handle=/oradata/backup1/SAN_2792912513_20140724_8hpe6kqp_1_1 tag=FULLDB comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
channel ORA_DISK_1: starting compressed incremental level 0 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00005 name=/oradata/SAN/datafile/o1_mf_ggs_data_9or2h3tw_.dbf
channel ORA_DISK_1: starting piece 1 at 2014-JUL-24 11:13:45
channel ORA_DISK_1: finished piece 1 at 2014-JUL-24 11:13:48
piece handle=/oradata/backup1/SAN_2792912513_20140724_8ipe6kr9_1_1 tag=FULLDB comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:03
channel ORA_DISK_1: starting compressed incremental level 0 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00006 name=/oradata/SAN/datafile/o1_mf_testing_9rgp1q31_.dbf
channel ORA_DISK_1: starting piece 1 at 2014-JUL-24 11:13:49
channel ORA_DISK_1: finished piece 1 at 2014-JUL-24 11:13:52
piece handle=/oradata/backup1/SAN_2792912513_20140724_8jpe6krc_1_1 tag=FULLDB comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:03
channel ORA_DISK_2: finished piece 1 at 2014-JUL-24 11:14:44
piece handle=/oradata/backup/SAN_2792912513_20140724_8fpe6koq_1_1 tag=FULLDB comment=NONE
channel ORA_DISK_2: backup set complete, elapsed time: 00:02:18
Finished backup at 2014-JUL-24 11:14:44

Starting backup at 2014-JUL-24 11:14:44
current log archived
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting compressed archived log backup set
channel ORA_DISK_1: specifying archived log(s) in backup set
input archived log thread=1 sequence=327 RECID=338 STAMP=853758885
channel ORA_DISK_1: starting piece 1 at 2014-JUL-24 11:14:46
channel ORA_DISK_1: finished piece 1 at 2014-JUL-24 11:14:47
piece handle=/oradata/backup1/SAN_2792912513_20140724_8kpe6kt6_1_1 tag=ARCHIVELOG comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 2014-JUL-24 11:14:47

Starting Control File Autobackup at 2014-JUL-24 11:14:48
piece handle=/oradata/backup/SAN_c-2792912513-20140724-05.ctl comment=NONE
Finished Control File Autobackup at 2014-JUL-24 11:14:55

Recovery Manager complete.

-----

$ ls -l backup*
backup:
total 501172
-rw-r-----. 1 oracle oinstall 505167872 Jul 24 11:14 SAN_2792912513_20140724_8fpe6koq_1_1
-rw-r-----. 1 oracle oinstall   8028160 Jul 24 11:14 SAN_c-2792912513-20140724-05.ctl

backup1:
total 77108
-rw-r-----. 1 oracle oinstall   237056 Jul 24 11:12 SAN_2792912513_20140724_8dpe6koo_1_1
-rw-r-----. 1 oracle oinstall  1236992 Jul 24 11:12 SAN_2792912513_20140724_8epe6koq_1_1
-rw-r-----. 1 oracle oinstall 39452672 Jul 24 11:13 SAN_2792912513_20140724_8gpe6kpu_1_1
-rw-r-----. 1 oracle oinstall 34349056 Jul 24 11:13 SAN_2792912513_20140724_8hpe6kqp_1_1
-rw-r-----. 1 oracle oinstall  2539520 Jul 24 11:13 SAN_2792912513_20140724_8ipe6kr9_1_1
-rw-r-----. 1 oracle oinstall  1073152 Jul 24 11:13 SAN_2792912513_20140724_8jpe6krc_1_1
-rw-r-----. 1 oracle oinstall    67072 Jul 24 11:14 SAN_2792912513_20140724_8kpe6kt6_1_1

If this does not hit the nail on the head, then I don’t know what will.

Imagine someone, maybe me or yourself, accidentally deleting archivelogs.

RMAN> delete noprompt archivelog all;

using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=108 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=20 device type=DISK
RMAN-08138: WARNING: archived log not deleted - must create more backups
archived log file name=/oradata/SAN/archivelog/arc_845895297_1_326.dbf thread=1 sequence=326
RMAN-08138: WARNING: archived log not deleted - must create more backups
archived log file name=/oradata/SAN/archivelog/arc_845895297_1_327.dbf thread=1 sequence=327

RMAN>
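
The refusal follows from simple counting: in the run above, sequences 326 and 327 were each backed up once, while the policy asks for two backups. The sketch below is a deliberately simplified model of that rule, not RMAN’s implementation; deletable is a hypothetical function, and a policy of NONE is modelled as requiring zero backups.

```python
def deletable(backup_counts, required_backups):
    """Return the log sequences whose backup count satisfies the policy."""
    return sorted(seq for seq, count in backup_counts.items()
                  if count >= required_backups)


# Each archived log appears in exactly one backup set in the run above.
backup_counts = {326: 1, 327: 1}

with_policy = deletable(backup_counts, 2)     # BACKED UP 2 TIMES TO DISK
without_policy = deletable(backup_counts, 0)  # policy NONE, modelled as 0
print(with_policy, without_policy)
```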

-----

RMAN> configure archivelog deletion policy to none;

old RMAN configuration parameters:
CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 2 TIMES TO DISK;
new RMAN configuration parameters:
CONFIGURE ARCHIVELOG DELETION POLICY TO NONE;
new RMAN configuration parameters are successfully stored

RMAN> delete noprompt archivelog all;

released channel: ORA_DISK_1
released channel: ORA_DISK_2
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=108 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=20 device type=DISK
List of Archived Log Copies for database with db_unique_name SAN
=====================================================================

Key     Thrd Seq     S Low Time
------- ---- ------- - --------------------
337     1    326     A 2014-JUL-24 11:04:17
        Name: /oradata/SAN/archivelog/arc_845895297_1_326.dbf

338     1    327     A 2014-JUL-24 11:12:21
        Name: /oradata/SAN/archivelog/arc_845895297_1_327.dbf

deleted archived log
archived log file name=/oradata/SAN/archivelog/arc_845895297_1_326.dbf RECID=337 STAMP=853758742
deleted archived log
archived log file name=/oradata/SAN/archivelog/arc_845895297_1_327.dbf RECID=338 STAMP=853758885
Deleted 2 objects


RMAN>

Will you be using configure for your next RMAN implementation?

Categories: DBA Blogs

Using SaltStack for Configuration Management

Wed, 2014-07-23 12:40

In my last blog post I mentioned that SaltStack is a fully featured configuration management solution, but we never looked into using the tool in that way. Today we will begin to explore some basic examples of configuration management with SaltStack. We will look at two aspects of configuration management: installing a package and managing a service.

The scenario

A great repeatable task that can be automated with configuration management, and one faced by many systems administrators, is adding capacity to an existing front-end webserver pool.

Without a configuration management solution, you generally have to rely on an install document that is maintained by your systems administration team. One of those admins gets the job of preparing the new box, and follows the steps in that document to install all of the required packages and configure all of the required services to make that box a “webserver”.

This method introduces a high potential for human error. The person following the document might miss step #17 on page 3, and you end up with a webserver in the pool that delivers content to your users in a strange and inconsistent way.  Depending on the maturity of your infrastructure, you also may or may not have the tools in place to even identify that the webserver is acting strangely due to this misconfiguration until clients begin to complain that your service delivers an unreliable experience.

From a resourcing point of view, this task can tie up two people: the person doing the box install, and a second person to “QA” the box after the install is done, to catch the fact that the first person missed step #17 on page 3.

Using a configuration management tool, you define what your box should look like (a model) at a higher, abstracted level, and the tool knows what is required to bring the server in line with its desired state. The tool does not need to be told that on a RedHat-based system you use “yum” to install a package and on Debian systems you use “apt”. As the operator, you just say that the system needs to have the package, and the tool takes it from there.
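
The idea of that abstraction can be sketched in a few lines. This is an illustration only and not how Salt is implemented: INSTALLERS and install_command are hypothetical names, and the package name is just an example.

```python
# Map an OS family to the command that installs a package on it.
INSTALLERS = {
    "RedHat": ["yum", "install", "-y"],
    "Debian": ["apt-get", "install", "-y"],
}


def install_command(os_family, package):
    """Return the command line that would install `package` on this family."""
    try:
        base = INSTALLERS[os_family]
    except KeyError:
        raise ValueError(f"unsupported OS family: {os_family}") from None
    return base + [package]


# The operator names the package; the tool picks the package manager.
redhat_cmd = install_command("RedHat", "httpd")
debian_cmd = install_command("Debian", "httpd")
print(redhat_cmd)
print(debian_cmd)
```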

By modelling your systems, the tool can then provide accurate repeatability of the task of bringing your systems into line with the defined specifications of the model. And while this does shift the responsibility for eliminating human error to the model itself, once the model has been tested and validated, each subsequent execution is done programmatically and consistently.

Using SaltStack to install a package and manage a service

The first thing that we will need to do is tell the salt master that we would like to start using it for configuration management. We do this by uncommenting, or adding the following to our /etc/salt/master config:


file_roots:
  base:
    - /srv/salt

As root, make a “salt” subdirectory in the /srv directory.

mkdir -p /srv/salt

Everything else, from this point forward will be written under the assumption that you are working in the /srv/salt dir.

Salt formulas

In Salt, the set of instructions, or “model”, that you define is known as a formula. Salt uses YAML (via PyYAML) as its configuration syntax. The first thing that we need to define is a base formula called “top.sls”.


base:
  '*':
    - motd
  'web*':
    - apache
    - webserver

This tells salt that all boxes should have the motd formula and that minions with hostnames starting with “web” should also get the apache and webserver formulas.
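
The matching in that top file can be imitated with shell-style globs. The sketch below is a rough model rather than Salt’s own matcher; TOP, formulas_for and the minion names are hypothetical.

```python
from fnmatch import fnmatchcase

# The "base" environment of the top file, as a plain dictionary.
TOP = {
    "*": ["motd"],
    "web*": ["apache", "webserver"],
}


def formulas_for(minion_id):
    """Collect the formulas of every target pattern that matches the minion."""
    formulas = []
    for pattern, names in TOP.items():
        if fnmatchcase(minion_id, pattern):
            formulas.extend(names)
    return formulas


web_formulas = formulas_for("web01")  # matches '*' and 'web*'
db_formulas = formulas_for("db01")    # matches '*' only
print(web_formulas, db_formulas)
```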

Our Apache formula (apache.sls) is very basic for the purposes of this post:


httpd:
  pkg:
    - installed
  service:
    - running
    - require:
      - pkg: httpd

This tells the minion that it needs to install the package named httpd (remember, the minion knows how to do this), that the service should be running, and that the service has a dependency on the package being installed. That is to say, you can’t manage the service unless the package that provides that service is also there.
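
A require relationship is essentially an ordering constraint. As a small illustration (an assumed model, not Salt’s scheduler), the dependency can be written as a graph and sorted so that requirements come first; the "pkg:httpd" and "service:httpd" labels are made up for this sketch, and graphlib needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each state maps to the set of states it requires.
REQUIRES = {
    "service:httpd": {"pkg:httpd"},
    "pkg:httpd": set(),
}

# static_order() yields every state after the states it depends on.
order = list(TopologicalSorter(REQUIRES).static_order())
print(order)
```

The package state comes out ahead of the service state, which is the order the minion has to follow.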

When we apply the formula, you can see that the minion receives the instruction. The minion installs the package and its dependent packages. Then it starts the service.


[root@ip-10-0-0-170 salt]# salt '*' state.sls apache

ip-10-0-0-171.ec2.internal:
----------
ID: httpd
Function: pkg.installed
Result: True
Comment: The following packages were installed/updated: httpd.
Changes:
----------
apr:
----------
new:
1.5.0-2.11.amzn1
old:

apr-util:
----------
new:
1.4.1-4.14.amzn1
old:

apr-util-ldap:
----------
new:
1.4.1-4.14.amzn1
old:

httpd:
----------
new:
2.2.27-1.2.amzn1
old:

httpd-tools:
----------
new:
2.2.27-1.2.amzn1
old:

mailcap:
----------
new:
2.1.31-2.7.amzn1
old:

----------
ID: httpd
Function: service.running
Result: True
Comment: Started Service httpd
Changes:
----------
httpd:
True

Summary
------------
Succeeded: 2
Failed: 0
------------
Total: 2

On subsequent runs, you can see that the package is already installed and the service is already running.


[root@ip-10-0-0-170 salt]# salt '*' state.sls apache

ip-10-0-0-171.ec2.internal:
----------
ID: httpd
Function: pkg.installed
Result: True
Comment: Package httpd is already installed
Changes:
----------
ID: httpd
Function: service.running
Result: True
Comment: The service httpd is already running
Changes:

Summary
------------
Succeeded: 2
Failed: 0
------------
Total: 2

If either were not true, for example if I were to go onto the box and stop the service:


[root@ip-10-0-0-171 ~]# service httpd stop

Stopping httpd: [ OK ]
[root@ip-10-0-0-171 ~]#

The next salt run would start the service again, bringing the box back into compliance with my defined model.


ip-10-0-0-171.ec2.internal:

----------
ID: httpd
Function: pkg.installed
Result: True
Comment: Package httpd is already installed
Changes:
----------
ID: httpd
Function: service.running
Result: True
Comment: Started Service httpd
Changes:
----------
httpd:
True

Summary
------------
Succeeded: 2
Failed: 0
------------
Total: 2
[root@ip-10-0-0-171 ~]# service httpd status
httpd (pid 2493) is running...
[root@ip-10-0-0-171 ~]#
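
The three runs above follow one pattern: compare the box with the model and change only what differs. A toy version of that loop, with a dictionary standing in for the box and a hypothetical apply_state function, looks like this:

```python
def apply_state(box):
    """Bring the box in line with the model and return the changes made."""
    changes = []
    if not box["httpd_installed"]:
        box["httpd_installed"] = True
        changes.append("installed httpd")
    if not box["httpd_running"]:
        box["httpd_running"] = True
        changes.append("started httpd")
    return changes


box = {"httpd_installed": False, "httpd_running": False}

first = apply_state(box)      # fresh box: install, then start
second = apply_state(box)     # already compliant: nothing to do
box["httpd_running"] = False  # someone runs `service httpd stop`
third = apply_state(box)      # only the service is started again
print(first, second, third)
```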

This becomes a powerful auditing tool that lets you quickly ensure that all boxes of a specific type match each other, and it eliminates the above-mentioned problem of missing step #17 on page 3 of your install doc. With the heavy lifting of this task moved from human operators to the tool, and knowing that each node will be built identically to the others, you can now scale up much more quickly in response to your changing business needs. A task that previously could take a few days is now done in minutes.

Categories: DBA Blogs

How To Correlate Oracle Database Transaction with GoldenGate

Mon, 2014-07-21 13:19

So there I was, troubleshooting a GoldenGate issue, and was puzzled as to why GoldenGate transactions were not seen from the Oracle database.

I had the transaction XID correct; however, I was filtering for ACTIVE status on the Oracle side, which was causing the issue: as the test case below shows, the session holding the open transaction reports INACTIVE.

Please allow me to share a test case so that you don’t get stumped like I did.

Identify current log and update table

ARROW:(SOE@san):PRIMARY> select max(sequence#)+1 from v$log_history;

MAX(SEQUENCE#)+1
----------------
             196

ARROW:(SOE@san):PRIMARY> update INVENTORIES set QUANTITY_ON_HAND=QUANTITY_ON_HAND-10 where PRODUCT_ID=171 and WAREHOUSE_ID=560;

1 row updated.

ARROW:(SOE@san):PRIMARY>

From GoldenGate, find transactions that have been open for at least 10 minutes

$ ./ggsci

Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.21 18343248 OGGCORE_11.2.1.0.0OGGBP_PLATFORMS_140404.1029_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Apr  4 2014 15:18:36

Copyright (C) 1995, 2014, Oracle and/or its affiliates. All rights reserved.



GGSCI (arrow.localdomain) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     ESAN        00:00:00      00:00:05
EXTRACT     STOPPED     PSAN_LAS    00:00:00      68:02:14
REPLICAT    STOPPED     RLAS_SAN    00:00:00      68:02:12


GGSCI (arrow.localdomain) 2> send esan, status

Sending STATUS request to EXTRACT ESAN ...


EXTRACT ESAN (PID 2556)
  Current status: Recovery complete: At EOF

  Current read position:
  Redo thread #: 1
  Sequence #: 196
  RBA: 5861376
  Timestamp: 2014-07-21 10:52:59.000000
  SCN: 0.1653210
  Current write position:
  Sequence #: 7
  RBA: 1130
  Timestamp: 2014-07-21 10:52:52.621948
  Extract Trail: /u01/app/ggs01/dirdat/ss



GGSCI (arrow.localdomain) 3> send esan, showtrans duration 10m

Sending showtrans request to EXTRACT ESAN ...


Oldest redo log file necessary to restart Extract is:

Redo Log Sequence Number 196, RBA 4955152

------------------------------------------------------------
XID:                  3.29.673
Items:                1
Extract:              ESAN
Redo Thread:          1
Start Time:           2014-07-21:10:41:41
SCN:                  0.1652053 (1652053)
Redo Seq:             196
Redo RBA:             4955152
Status:               Running


GGSCI (arrow.localdomain) 4>

Note that Redo Seq: 196 matches the log sequence that was current when the update was performed in the Oracle database.
Also, note XID: 3.29.673
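
The XID is three numbers joined by dots, the same way trans.sql at the end of this post builds it from xidusn, xidslot and xidsqn in v$transaction. A small sketch of going back and forth between the two forms follows; Xid, parse_xid and format_xid are hypothetical names.

```python
from typing import NamedTuple


class Xid(NamedTuple):
    """Transaction ID parts, named after the v$transaction columns."""
    xidusn: int   # undo segment number
    xidslot: int  # slot number
    xidsqn: int   # sequence number


def parse_xid(text):
    """Split a GoldenGate-style XID such as '3.29.673' into its parts."""
    usn, slot, sqn = (int(part) for part in text.split("."))
    return Xid(usn, slot, sqn)


def format_xid(xid):
    """Join the parts back together, as trans.sql does with ||'.'||."""
    return f"{xid.xidusn}.{xid.xidslot}.{xid.xidsqn}"


xid = parse_xid("3.29.673")
print(xid, format_xid(xid))
```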

Let’s find the transaction from the database and notice that the XID matches between GoldenGate and the Oracle database.

ARROW:(SYS@san):PRIMARY> @trans.sql

START_TIME           XID              STATUS          SID    SERIAL# USERNAME           STATUS   SCHEMANAME         SQLID              CHILD
-------------------- ---------------- -------- ---------- ---------- ------------------ -------- ------------------ ------------- ----------
07/21/14 10:41:39    3.29.673         INACTIVE        105          9 SOE                INACTIVE SOE                6cmmk52wfnr7r          0

ARROW:(SYS@san):PRIMARY> @xplan.sql
Enter value for sqlid: 6cmmk52wfnr7r
Enter value for child: 0
SQL_ID  6cmmk52wfnr7r, child number 0
-------------------------------------
update INVENTORIES set QUANTITY_ON_HAND=QUANTITY_ON_HAND-10 where
PRODUCT_ID=171 and WAREHOUSE_ID=560

Plan hash value: 2141863993

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT   |              |       |       |     3 (100)|          |
|   1 |  UPDATE            | INVENTORIES  |       |       |            |          |
|*  2 |   INDEX UNIQUE SCAN| INVENTORY_PK |     1 |    14 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PRODUCT_ID"=171 AND "WAREHOUSE_ID"=560)


20 rows selected.

ARROW:(SYS@san):PRIMARY>

For fun, switch logfiles and perform another update.

ARROW:(MDINH@san):PRIMARY> select max(sequence#)+1 from v$log_history;

MAX(SEQUENCE#)+1
----------------
             196

ARROW:(MDINH@san):PRIMARY> alter system switch logfile;

System altered.

ARROW:(MDINH@san):PRIMARY> /

System altered.

ARROW:(MDINH@san):PRIMARY> /

System altered.

ARROW:(MDINH@san):PRIMARY> /

System altered.

ARROW:(MDINH@san):PRIMARY> select max(sequence#)+1 from v$log_history;

MAX(SEQUENCE#)+1
----------------
             200

ARROW:(MDINH@san):PRIMARY> update SOE.INVENTORIES set QUANTITY_ON_HAND=QUANTITY_ON_HAND-10 where PRODUCT_ID=170;

883 rows updated.

ARROW:(MDINH@san):PRIMARY>

Check GoldenGate transactions to find 2 open transactions, one from Redo Seq: 196 and one from Redo Seq: 200

GGSCI (arrow.localdomain) 1> send esan, showtrans

Sending SHOWTRANS request to EXTRACT ESAN ...


Oldest redo log file necessary to restart Extract is:

Redo Log Sequence Number 196, RBA 4955152

------------------------------------------------------------
XID:                  3.29.673
Items:                1
Extract:              ESAN
Redo Thread:          1
Start Time:           2014-07-21:10:41:41
SCN:                  0.1652053 (1652053)
Redo Seq:             196
Redo RBA:             4955152
Status:               Running


------------------------------------------------------------
XID:                  4.20.516
Items:                883
Extract:              ESAN
Redo Thread:          1
Start Time:           2014-07-21:11:03:20
SCN:                  0.1654314 (1654314)
Redo Seq:             200
Redo RBA:             5136
Status:               Running


GGSCI (arrow.localdomain) 2>
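
If the showtrans output needs to be compared with the database in a script, the key and value layout is easy to pick apart. The parser below is a sketch under the assumption that each transaction block starts with a line of dashes, as in the output above; parse_showtrans is a hypothetical helper and the sample is abbreviated from this post.

```python
SHOWTRANS = """\
------------------------------------------------------------
XID:                  3.29.673
Items:                1
Start Time:           2014-07-21:10:41:41
Redo Seq:             196
Status:               Running

------------------------------------------------------------
XID:                  4.20.516
Items:                883
Start Time:           2014-07-21:11:03:20
Redo Seq:             200
Status:               Running
"""


def parse_showtrans(text):
    """Return one dict per transaction block in the showtrans output."""
    transactions = []
    current = None
    for line in text.splitlines():
        if line.startswith("-----"):
            current = {}
            transactions.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only; timestamps contain colons too.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return transactions


transactions = parse_showtrans(SHOWTRANS)
for trans in transactions:
    print(trans["XID"], trans["Redo Seq"], trans["Items"])
```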

Let’s kill the SOE user’s transaction.

ARROW:(SYS@san):PRIMARY> @trans.sql

START_TIME           XID              STATUS          SID    SERIAL# USERNAME           STATUS   SCHEMANAME         SQLID              CHILD
-------------------- ---------------- -------- ---------- ---------- ------------------ -------- ------------------ ------------- ----------
07/21/14 10:41:39    3.29.673         INACTIVE        105          9 SOE                INACTIVE SOE                6cmmk52wfnr7r          0
07/21/14 11:03:19    4.20.516         INACTIVE         18         53 MDINH              INACTIVE MDINH              a5qywm8993bqg          0

ARROW:(SYS@san):PRIMARY> @xplan.sql
Enter value for sqlid: a5qywm8993bqg
Enter value for child: 0
SQL_ID  a5qywm8993bqg, child number 0
-------------------------------------
update SOE.INVENTORIES set QUANTITY_ON_HAND=QUANTITY_ON_HAND-10 where
PRODUCT_ID=170

Plan hash value: 1060265186

------------------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT  |                |       |       |    28 (100)|          |
|   1 |  UPDATE           | INVENTORIES    |       |       |            |          |
|*  2 |   INDEX RANGE SCAN| INV_PRODUCT_IX |   900 | 12600 |     4   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PRODUCT_ID"=170)


20 rows selected.

ARROW:(SYS@san):PRIMARY> alter system kill session '105,9' immediate;

System altered.

ARROW:(SYS@san):PRIMARY> @trans.sql

START_TIME           XID              STATUS          SID    SERIAL# USERNAME           STATUS   SCHEMANAME         SQLID              CHILD
-------------------- ---------------- -------- ---------- ---------- ------------------ -------- ------------------ ------------- ----------
07/21/14 11:03:19    4.20.516         INACTIVE         18         53 MDINH              INACTIVE MDINH              a5qywm8993bqg          0

ARROW:(SYS@san):PRIMARY>

Verify transaction from killed session is removed from GoldenGate

GGSCI (arrow.localdomain) 1> send esan, status

Sending STATUS request to EXTRACT ESAN ...


EXTRACT ESAN (PID 2556)
  Current status: Recovery complete: At EOF

  Current read position:
  Redo thread #: 1
  Sequence #: 200
  RBA: 464896
  Timestamp: 2014-07-21 11:06:40.000000
  SCN: 0.1654584
  Current write position:
  Sequence #: 7
  RBA: 1130
  Timestamp: 2014-07-21 11:06:37.435383
  Extract Trail: /u01/app/ggs01/dirdat/ss



GGSCI (arrow.localdomain) 2> send esan, showtrans

Sending SHOWTRANS request to EXTRACT ESAN ...


Oldest redo log file necessary to restart Extract is:

Redo Log Sequence Number 200, RBA 5136

------------------------------------------------------------
XID:                  4.20.516
Items:                883
Extract:              ESAN
Redo Thread:          1
Start Time:           2014-07-21:11:03:20
SCN:                  0.1654314 (1654314)
Redo Seq:             200
Redo RBA:             5136
Status:               Running


GGSCI (arrow.localdomain) 3>

-- trans.sql
-- Lists open transactions with the owning session and its current (or previous) SQL.
set lines 200 pages 1000
col xid for a16
col username for a18
col schemaname for a18
col osuser for a12
select t.start_time, t.xidusn||'.'||t.xidslot||'.'||t.xidsqn xid, s.status,
s.sid,s.serial#,s.username,s.status,s.schemaname,
-- fall back to the previous statement when the session is idle (sql_id is null)
nvl(s.sql_id,s.prev_sql_id) sqlid, nvl2(s.sql_id,s.sql_child_number,s.prev_child_number) child
from v$transaction t, v$session s
where s.saddr = t.ses_addr
order by t.start_time
;

Categories: DBA Blogs