
Feed aggregator

Does Linux tell the Gilgamesh story of hacker culture?

Sean Hull - Wed, 2015-10-28 10:08
Is the command line still essential? Was Stephenson right about Linux? It’s been a while since I read Stephenson’s essay on Linux. It’s one of those pieces that’s so well written, we need to go back to it now & then. Join 28,000 others and follow Sean Hull on twitter @hullsean. This quote caught … Continue reading Does Linux tell the Gilgamesh story of hacker culture? →

Trace Files -- 5.2 : Interpreting the SQL Trace Summary level

Hemant K Chitale - Wed, 2015-10-28 09:08
Picking up the same SQL Trace file from my previous post, I run the (well-known) utility tkprof on it.

[oracle@ora11204 Desktop]$ tkprof /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_3039.trc My_Query.PRF sys=no

TKPROF: Release - Development on Wed Oct 28 22:53:46 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.

[oracle@ora11204 Desktop]$
[oracle@ora11204 Desktop]$ cat My_Query.PRF

TKPROF: Release - Development on Wed Oct 28 22:53:46 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.

Trace file: /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_3039.trc
Sort options: default

count = number of times OCI procedure was executed
cpu = cpu time in seconds executing
elapsed = elapsed time in seconds executing
disk = number of physical reads of buffers from disk
query = number of buffers gotten for consistent read
current = number of buffers gotten in current mode (usually for update)
rows = number of rows processed by the fetch or execute call

SQL ID: dc03x7r071fvn Plan Hash: 0


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 0 0.00 0.00 0 0 0 0
Execute 1 0.00 0.03 0 0 0 1
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 1 0.00 0.03 0 0 0 1

Misses in library cache during parse: 0
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 43

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 1 0.00 0.00
SQL*Net message from client 1 11.45 11.45

SQL ID: 7c1rnh08dp922 Plan Hash: 3580537945

select count(*)

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.01 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.00 0.01 1 1 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.01 0.03 1 1 0 1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 43
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max) Row Source Operation
---------- ---------- ---------- ---------------------------------------------------
1 1 1 SORT AGGREGATE (cr=1 pr=1 pw=0 time=14407 us)
107 107 107 INDEX FULL SCAN EMP_EMAIL_UK (cr=1 pr=1 pw=0 time=14577 us cost=1 size=0 card=107)(object id 16404)

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2 0.00 0.00
Disk file operations I/O 1 0.00 0.00
db file sequential read 1 0.00 0.00
SQL*Net message from client 2 13.27 13.27

SQL ID: 9wuhwhad81d36 Plan Hash: 0


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 1
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.00 0.00 0 0 0 1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 43



call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 2 0.00 0.02 0 0 0 0
Execute 3 0.00 0.03 0 0 0 2
Fetch 2 0.00 0.01 1 1 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 7 0.01 0.06 1 1 0 3

Misses in library cache during parse: 2
Misses in library cache during execute: 1

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 3 0.00 0.00
SQL*Net message from client 3 13.27 24.73
db file sequential read 1 0.00 0.00
Disk file operations I/O 1 0.00 0.00


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 0 0.00 0.00 0 0 0 0
Execute 43 0.00 0.00 0 0 0 0
Fetch 90 0.00 0.01 3 165 0 68
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 133 0.00 0.01 3 165 0 68

Misses in library cache during parse: 0

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file sequential read 3 0.01 0.01

3 user SQL statements in session.
12 internal SQL statements in session.
15 SQL statements in session.
Trace file: /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_3039.trc
Trace file compatibility:
Sort options: default

1 session in tracefile.
3 user SQL statements in trace file.
12 internal SQL statements in trace file.
15 SQL statements in trace file.
15 unique SQL statements in trace file.
284 lines in trace file.
24 elapsed seconds in trace file.

[oracle@ora11204 Desktop]$

With the "sys=no" command-line flag, I have excluded reporting the recursive (SQL) calls although the summary statistics on them do appear.  There were a total of 43 executions of recursive calls.  The SQL*Net message from client is a wait of 13.27seconds (which is normally considered as an "idle wait").  We see this also in the previous post on the raw trace as the wait event before Tracing is disabled.
The only significant wait event was "db file sequential read" which is less than 1centisecond so is reported as 0.00second.  However, from the previous post, we can see that the raw trace file shows a wait time of 6,784 microseconds
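If the trace file had held more statements, tkprof could also sort the report so that the most expensive calls come first. A sketch, re-using the same trace file (the output file name and the choice of sort keys -- elapsed time on execute and on fetch -- are mine, not from the original run):

[oracle@ora11204 Desktop]$ tkprof /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_3039.trc \
  My_Query_sorted.PRF sys=no sort=exeela,fchela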

Categories: DBA Blogs

Trace Files -- 5.1 : Reading an SQL Trace

Hemant K Chitale - Wed, 2015-10-28 08:53
Here's a short introduction to reading an SQL Trace.

First I execute these in my sqlplus session :

SQL> connect hr/oracle

PL/SQL procedure successfully completed.

SQL> select count(*) from employees;



PL/SQL procedure successfully completed.

SQL> select value from v$diag_info where name = 'Default Trace File';



Now, I extract selected portions of the trace file.

The header of the trace file gives me information about the platform and session/module :

Trace file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_3039.trc
Oracle Database 11g Enterprise Edition Release - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/orcl
System name: Linux
Node name: ora11204
Release: 2.6.39-400.17.1.el6uek.x86_64
Version: #1 SMP Fri Feb 22 18:16:18 PST 2013
Machine: x86_64
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 27
Unix process pid: 3039, image: oracle@ora11204 (TNS V1-V3)

*** 2015-10-28 22:19:42.291
*** SESSION ID:(141.7) 2015-10-28 22:19:42.291
*** CLIENT ID:() 2015-10-28 22:19:42.291
*** SERVICE NAME:(SYS$USERS) 2015-10-28 22:19:42.291
*** MODULE NAME:(SQL*Plus) 2015-10-28 22:19:42.291
*** ACTION NAME:() 2015-10-28 22:19:42.291

The trace file tells me that I am running 64bit Oracle on a specific Linux (UEK) kernel on a host called "ora11204".  This is useful information if I have to provide the trace file to Oracle Support.  (True, it doesn't list all the Oracle patches that may have been applied).
It then identifies my Service Name and Module (and Action if I had set it with DBMS_APPLICATION_INFO).

Next, the trace file tells me when and how I have enabled tracing.

PARSING IN CURSOR #140174891232936 len=73 dep=0 uid=43 oct=47 lid=43 tim=1446041982290677 hv=3228613492 ad='99446cc8' sqlid='dc03x7r071fvn'
EXEC #140174891232936:c=1000,e=31608,p=0,cr=0,cu=0,mis=1,r=1,dep=0,og=1,plh=0,tim=1446041982290670
WAIT #140174891232936: nam='SQL*Net message to client' ela= 4 driver id=1650815232 #bytes=1 p3=0 obj#=13246 tim=1446041982291265

It tells me that I have used DBMS_SESSION to enable tracing.
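The PL/SQL calls themselves are not reproduced in the listing above, but they would have been something along these lines (a sketch only; the waits and binds parameter values are my assumption, not taken from the trace):

SQL> exec DBMS_SESSION.SESSION_TRACE_ENABLE(waits=>TRUE, binds=>FALSE);
SQL> select count(*) from employees;
SQL> exec DBMS_SESSION.SESSION_TRACE_DISABLE;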

That cursor is then closed.

*** 2015-10-28 22:19:53.745
WAIT #140174891232936: nam='SQL*Net message from client' ela= 11453710 driver id=1650815232 #bytes=1 p3=0 obj#=13246 tim=1446041993745004
CLOSE #140174891232936:c=0,e=38,dep=0,type=0,tim=1446041993745175

We can identify the CURSOR Number (140174891232936 -- this numbering format was introduced in 11g, if I am correct) that was closed.  This number is NOT the SQL_ID (which is dc03x7r071fvn -- yes, even PLSQL blocks and procedures have SQL_IDs).  This number is NOT the SQL Hash Value either (the SQL Hash Value is present as hv=3228613492).

After that we start seeing the recursive (SYS) calls executed during the parsing of the actual user query (the query on EMPLOYEES).

PARSING IN CURSOR #140174891988920 len=202 dep=1 uid=0 oct=3 lid=0 tim=1446041993745737 hv=3819099649 ad='997c03e0' sqlid='3nkd3g3ju5ph1'
select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null
EXEC #140174891988920:c=0,e=50,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=2853959010,tim=1446041993745733
FETCH #140174891988920:c=999,e=88,p=0,cr=3,cu=0,mis=0,r=1,dep=1,og=4,plh=2853959010,tim=1446041993745957
CLOSE #140174891988920:c=0,e=11,dep=1,type=3,tim=1446041993746025
PARSING IN CURSOR #140174891955728 len=493 dep=1 uid=0 oct=3 lid=0 tim=1446041993746138 hv=2584065658 ad='997a6590' sqlid='1gu8t96d0bdmu'
select t.ts#,t.file#,t.block#,nvl(t.bobj#,0),nvl(,0),t.intcols,nvl(t.clucols,0),t.audit$,t.flags,t.pctfree$,t.pctused$,t.initrans,t.maxtrans,t.rowcnt,t.blkcnt,t.empcnt,t.avgspc,t.chncnt,t.avgrln,t.analyzetime,t.samplesize,t.cols,,nvl(,1),nvl(t.instances,1),t.avgspc_flb,t.flbcnt,t.kernelcols,nvl(t.trigflag, 0),nvl(t.spare1,0),nvl(t.spare2,0),t.spare4,t.spare6,ts.cachedblk,ts.cachehit,ts.logicalread from tab$ t, tab_stats$ ts where t.obj#= :1 and t.obj# = ts.obj# (+)
EXEC #140174891955728:c=0,e=30,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=3526770254,tim=1446041993746134
FETCH #140174891955728:c=0,e=53,p=0,cr=4,cu=0,mis=0,r=1,dep=1,og=4,plh=3526770254,tim=1446041993746225
CLOSE #140174891955728:c=0,e=8,dep=1,type=3,tim=1446041993746254

I have shown only the first two recursive calls.  These are identified by "dep=1" (the parent call by the user, as we shall see, is at depth level 0) and also by uid=0, lid=0 (schema id and privilege userid being 0 for SYS).  len=493 is the length of the SQL statement.

The elapsed time in microseconds is the "e=" value, while the CPU time is "c=", also in microseconds.  The "p=" value is Physical Reads, "cr=" is Consistent Gets, "cu=" is Current Gets, and "r=" is the Number of Rows.  "og=" shows the Optimizer Goal (og=4 is "Choose").  "tim=" is an ever-increasing timestamp (i.e. to calculate the elapsed time between two SQL calls you can subtract the first "tim" value from the second).
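For example, taking the EXEC (tim=1446041993745733) and FETCH (tim=1446041993745957) calls of the first recursive cursor shown above:

1446041993745957 - 1446041993745733 = 224 microseconds between the two calls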

Following these two are a number of other recursive SQL calls that I shall skip over until I come to my user SQL.

PARSING IN CURSOR #140174891232936 len=30 dep=0 uid=43 oct=3 lid=43 tim=1446041993782992 hv=282764354 ad='99446f28' sqlid='7c1rnh08dp922'
select count(*) from employees
PARSE #140174891232936:c=8998,e=37717,p=3,cr=165,cu=0,mis=1,r=0,dep=0,og=1,plh=3580537945,tim=1446041993782979
EXEC #140174891232936:c=0,e=69,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3580537945,tim=1446041993783188
WAIT #140174891232936: nam='SQL*Net message to client' ela= 5 driver id=1650815232 #bytes=1 p3=0 obj#=13246 tim=1446041993783286
WAIT #140174891232936: nam='Disk file operations I/O' ela= 76 FileOperation=2 fileno=4 filetype=2 obj#=16404 tim=1446041993790690
WAIT #140174891232936: nam='db file sequential read' ela= 6784 file#=4 block#=211 blocks=1 obj#=16404 tim=1446041993797581
FETCH #140174891232936:c=7000,e=14409,p=1,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=3580537945,tim=1446041993797727
STAT #140174891232936 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=1 pr=1 pw=0 time=14407 us)'
STAT #140174891232936 id=2 cnt=107 pid=1 pos=1 obj=16404 op='INDEX FULL SCAN EMP_EMAIL_UK (cr=1 pr=1 pw=0 time=14577 us cost=1 size=0 card=107)'
WAIT #140174891232936: nam='SQL*Net message from client' ela= 486 driver id=1650815232 #bytes=1 p3=0 obj#=16404 tim=1446041993819989
FETCH #140174891232936:c=0,e=4,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,plh=3580537945,tim=1446041993820032
WAIT #140174891232936: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=16404 tim=1446041993820057

The "dep=0" and "uid" and "lid" not being 0 indicate that this is a User SQL.  "oct" is the Oracle Command Type.
Here in addition to information on the PARSE call, the EXEC call, the individual WAITs and the FETCH call, we can also see Row Source Statistics indicated by the STAT lines.
The Parse was 37,717 microseconds and the FETCH time was 14,409 microseconds.  (This is a very quick-running SQL but, since it had never been parsed in this instance, the Parse time exceeded the FETCH time.)  The "mis=1" for the Parse indicates a Miss in the Library Cache -- so Oracle had to do a Hard Parse.
I would look at EXEC for INSERT/UPDATE/DELETE statements but here for a simple SELECT, I look at the FETCH time.
For the 'db file sequential read' wait of 6,784 microseconds, we can also see the File Number ("file#"), the Block ID ("block#"), the number of Blocks (1 for this wait event) and the Object ID ("obj#").

The STAT lines have additional information about the position ("pos")  and parent id ("pid") in the Execution Plan.  The Object ID is indicated by "obj" and the operation by "op". STAT lines show the Consistent Gets ("cr"), Physical Reads ("pr") , the Time ("time") in microseconds, the cost ("cost") and expected cardinality ("card") for each step of the Execution Plan. Note that the expected cardinality is "card" but the actual count of rows is "cnt".

Next, the cursor is closed, and tracing disabled.

*** 2015-10-28 22:20:07.096
WAIT #140174891232936: nam='SQL*Net message from client' ela= 13276436 driver id=1650815232 #bytes=1 p3=0 obj#=16404 tim=1446042007096511
CLOSE #140174891232936:c=0,e=19,dep=0,type=0,tim=1446042007096668
PARSING IN CURSOR #140174891232936 len=48 dep=0 uid=43 oct=47 lid=43 tim=1446042007097875 hv=2592126054 ad='96fff810' sqlid='9wuhwhad81d36'
PARSE #140174891232936:c=999,e=1129,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=0,tim=1446042007097870
EXEC #140174891232936:c=0,e=313,p=0,cr=0,cu=0,mis=0,r=1,dep=0,og=1,plh=0,tim=1446042007098285

Note how the CURSOR ID gets reused by the next SQL if the previous SQL cursor was closed.  Thus "140174891232936" was used by the PLSQL call that enabled Tracing, then closed, then by the user SQL query on EMPLOYEES, then closed, then by the PLSQL call that disabled Tracing.
(The recursive SQLs in between had different CURSOR IDs).

(As an observation : Notice how the "obj#" doesn't get cleared even when the next wait is an "SQL*Net message to client" or "SQL*Net message from client")

Categories: DBA Blogs


Floyd Teter - Wed, 2015-10-28 08:34
I've noticed a trend lately.  In working with various organizations in the early stages of evaluating SaaS, I'm hearing vigorous defense of limitations. "We can't go to the cloud because our business is so unique."  "We can't consider cloud because our data is too complex to migrate." "We can't entrust our data to a 3rd party."  While there are plenty of additional reasons, I'm sure you've noticed the two important words forming the trend:  "We can't".

One of my favorite authors is Richard Bach.  Yeah, the guy who wrote "Jonathan Livingston Seagull", "Illusions", and "Travels with Puff".  More evidence that I'm an old hippie at heart.  Bach deals with the metaphysical and the spiritual.  It can be some rather deep and mind-bending stuff.  But he also throws out some pearls that stick with the reader.  One of his pearls that stuck with me: "Argue for your limitations and, sure enough, they're yours."  Meaning that those who vigorously defend their limitations rarely move forward in significant ways.  It's the opposing force to innovation, disruption and improvement.

If you're part of an organization considering a move to SaaS, the strategic factors to weigh involve elements like building value through improving the balance sheet and/or lowering operational costs; increasing product or service share and/or quality; increasing agility through reductions in business process or development cycle times.  In other words, "better faster cheaper".  If SaaS delivers for you in those areas, then the limitations simply become challenges to be dealt with on the road to achieving the value offered by SaaS.

"Argue for your limitation and, sure enough, they're yours."  When you begin to hear those two words, "We can't", that's exactly what you're doing.  Don't do it.  Step back and change your perspective.

Oracle OpenWorld 2015 : Tuesday

Tim Hall - Wed, 2015-10-28 07:29

The day started in the normal way, with a quick blog post about the previous day and a visit to the gym.

The original plan for the day was to hit the demo grounds again. I popped into OakTable World for a quick chat with a few folks and ended up staying for quite a while. I watched some of the Ted-style talks, specifically Tim Gorman, Jonathan Lewis and Martin Klier. I then got chatting to some folks outside, before heading back in to see Gwen Shapira do a session on Kafka.

Whilst I was there I got to film a few “.com” clips for my videos, with the funniest setup being Tanel Poder. He saw me filming some other folks and just launched in, not knowing what was going on, and struck a pose. It took a bit of prompting before he realised he had to say something. You’ve got to love the enthusiasm. :)

I got to admire Connor’s t-shirt and, most importantly, I got to meet up with my dad!

From there I headed off to the demo grounds, where I inevitably ended up at the SQL Developer stand, speaking to Kris Rice, and who should turn up but Connor McDonald. :)

From the demo grounds I went to grab some food with Connor, then I headed back to the hotel to crash out.

It was a good day, which goes to prove my point: you’ve just got to go with the flow when you are at OOW. Plans are good, but don’t worry if they don’t work out.




FMW & PaaS Webcasts Series For Partners in November

Are you able to attend Oracle OpenWorld sessions? Even if you are, you might not have enough time to visit all the sessions. Join us for the FY16 Oracle Cloud Platform and FMW Webcast Series...

We share our skills to maximize your revenue!
Categories: DBA Blogs

Tuesday OakTable World – brain fried!

Bobby Durrett's DBA Blog - Tue, 2015-10-27 19:08

Instead of going to the normal OpenWorld events today I went to OakTable World.  Now my brain is fried from information overload. :)

It started at 8 am with a nice talk about hash joins and Bloom filters.  Toon Koppelaars had some nice moving graphics showing how Bloom filters work.  I’ve studied Bloom filters before but I’m not sure I understood them with the clarity that I had after this talk.

Then I did my talk at 9 am.  The best part for me was that we had a number of questions.  I ended up skipping several slides because of time but I felt like we helped people get what they wanted out of it by having the questions and discussion.  In retrospect my talk could have used more of an introduction to Delphix itself for this audience but I think we covered the essentials in the end.

Next Kellyn Pot’Vin-Gorman did more of a community service type of talk which was a change of pace.  She had a Raspberry Pi project which was a stuffed bear that would take your picture and post it on Twitter.  It was an example of the type of project that kids could do to get them interested in computer technology.

My brain began to turn to mush with Marco Gralike’s talk on XML and the 12c In-Memory Column Store.  I’m sure I absorbed something but I’m not that familiar with Oracle’s XML features.  Still, at least I know that there are in-memory features for XML, which I can file away for the future.

Several amusing 10-minute Ted talks followed.  Most notable to me was Jonathan Lewis’ talk about how virtual columns and constraints on virtual columns could improve cardinality estimates and thus query performance.

Cary Millsap talked about a variety of things, including what he covered in his book.  He shared how he and Jeff Holt were hacking into what I assume is the C standard library to diagnose database performance issues, which was pretty techy.

Gwen Shapira’s talk on Kafka was a departure from the Oracle database topics but it was interesting to hear about this sort of queuing or logging service.  Reminds me in some ways of GGS and Tibco that we use at work but I’m sure it has different features.

Alex Gorbachev gave a high level overview of Internet of Things architectures.  This boiled down to how to connect many possibly low power devices to something that can gather the information and use it in many ways.

Lastly, we went back to the Oracle database and my brain slowly ground to a halt listening to Chris Antognini’s talk on Adaptive Dynamic Sampling.  I had studied this for my OCP but it has slowly leaked out of my brain and by 4 pm I wasn’t 100% efficient.  But, I got a few ideas about things that I can adjust when tuning this feature.

Anyway, brief overview.  I’m back to normal OpenWorld tomorrow but it was all OakTable today.  It was a good experience and I appreciated the chance to speak as well as to listen.


Categories: DBA Blogs

Connecting Hadoop and Oracle

Tanel Poder - Tue, 2015-10-27 18:06

Here are the slides from yesterday’s OakTableWorld presentation. They also include a few hints about what our hot new venture Gluent is doing (although bigger announcements come later this year).

[direct link]

Also, if you are at Oracle OpenWorld right now, my other presentation about SQL Monitoring in 12c is tomorrow at 3pm in Moscone South 103. See you there!


NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!


Forays into Kafka – Enabling Flexible Data Pipelines

Rittman Mead Consulting - Tue, 2015-10-27 17:46

One of the defining features of “Big Data” from a technologist’s point of view is the sheer number of tools and permutations at one’s disposal. Do you go Flume or Logstash? Avro or Thrift? Pig or Spark? Foo or Bar? (I made that last one up). This wealth of choice is wonderful because it means we can choose the right tool for the right job each time.

Of course, we need to establish that we have indeed chosen the right tool for the right job. But here’s the paradox. How do we easily work out if a tool is going to do what we want of it and is going to be a good fit, without disturbing what we already have in place? Particularly if it’s something that’s going to be part of an existing Productionised data pipeline, inserting a new tool partway through what’s there already is going to risk disrupting that. We potentially end up with a series of cloned environments, all diverging from each other, and not necessarily comparable (not to mention the overhead of the resources to host it all).

The same issue arises when we want to change the code or configuration of an existing pipeline. Bugs creep in, ideas to enhance the processing that you’ve currently got present themselves. Wouldn’t it be great if we could test these changes reliably and with no risk to the existing system?

This is where Kafka comes in. Kafka is very useful for two reasons:

  1. You can use it as a buffer for data that can be consumed and re-consumed on demand
  2. Multiple consumers can all pull the data, independently and at their own rate.

So you take your existing pipeline, plumb in Kafka, and then as and when you want to try out additional tools (or configurations of existing ones) you simply take another ‘tap’ off the existing store. This is an idea that Gwen Shapira put forward in May 2015 and really resonated with me.

I see Kafka sitting right on that Execution/Innovation demarcation line of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

Kafka enables us to build a pipeline for our analytics that breaks down into two phases:

  1. Data ingest from source into Kafka, simple and reliable. Fewest moving parts as possible.
  2. Post-processing. Batch or realtime. Uses Kafka as source. Re-runnable. Multiple parallel consumers:
    • Productionised processing into Event Engine, Data Reservoir and beyond
    • Adhoc/loosely controlled Data Discovery processing and re-processing

These two steps align with the idea of “Obtain” and “Scrub” that Rittman Mead’s Jordan Meyer talked about in his BI Forum 2015 Masterclass on Data Discovery:

So that’s the theory – let’s now look at an example of how Kafka can enable us to build a more flexible and productive data pipeline and environment.

Flume or Logstash? HDFS or Elasticsearch? … All of them!

Mark Rittman wrote back in April 2014 about using Apache Flume to stream logs from the Rittman Mead web server over to HDFS, from where they could be analysed in Hive and Impala. The basic setup looked like this:

Another route for analysing data is through the ELK stack. It does a similar thing – streams logs (with Logstash) in to a data store (Elasticsearch) from where they can be analysed, just with a different set of tools with a different emphasis on purpose. The input is the same – the web server log files. Let’s say I want to evaluate which is the better mechanism for analysing my log files, and compare the two side-by-side. Ultimately I might only want to go forward with one, but for now, I want to try both.

I could run them literally in parallel:

The disadvantage with this is that I have twice the ‘footprint’ on my data source, a Production server. A principle throughout all of this is that we want to remain light-touch on the sources of data. Whether a Production web server, a Production database, or whatever – upsetting the system owners of the data we want is never going to win friends.

An alternative to running in parallel would be to use one of the streaming tools to load data in place of the other, i.e.


The issue with this is I want to validate the end-to-end pipeline. Using a single source is better in terms of load/risk to the source system, but less so for validating my design. If I’m going to go with Elasticsearch as my target, Logstash would be the better fit source. Ditto HDFS/Flume. Both support connectors to the other, but using native capabilities always feels to me a safer option (particularly in the open-source world). And what if the particular modification I’m testing doesn’t support this kind of connectivity pattern?

Can you see where this is going? How about this:

The key points here are:

  1. One hit on the source system. In this case it’s flume, but it could be logstash, or another tool. This streams each line of the log file into Kafka in the exact order that it’s read.
  2. Kafka holds a copy of the log data, for a configurable time period. This could be days, or months – up to you and depending on purpose (and disk space!). A sketch of how this retention can be set is shown just after this list.
  3. Kafka is designed to be distributed and fault-tolerant. As with most of the boxes on this logical diagram it would be physically spread over multiple machines for capacity, performance, and resilience.
  4. The eventual targets, HDFS and Elasticsearch, are loaded by their respective tools pulling the web server entries exactly as they were on disk. In terms of validating end-to-end design we’re still doing that – we’re just pulling from a different source.
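Picking up on point 2, retention is configurable per topic. The following sketch would keep the apache_logs topic for roughly 30 days. The 30-day figure is an arbitrary choice of mine, the path wildcards the Kafka version used elsewhere in this post, and the option names are as per the 0.8.x tooling:

$ /opt/kafka_2.10-*/bin/ --zookeeper bigdatalite:2181 \
  --alter --topic apache_logs --config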

Another massively important benefit of Kafka is this:

Sooner or later (and if you’re new to the tool and code/configuration required, probably sooner) you’re going to get errors in your data pipeline. These may be fatal and cause it to fall in a heap, or they may be more subtle and you only realise after analysis that some of your data’s missing or not fully enriched. What to do? Obviously you need to re-run your ingest process. But how easy is that? Where is the source data? Maybe you’ll have a folder full of “.processed” source log files, or an HDFS folder of raw source data that you can reprocess. The issue here is the re-processing – you need to point your code at the alternative source, and work out the range of data to reprocess.

This is all eminently do-able of course – but wouldn’t it be easier just to rerun your existing ingest pipeline and just rewind the point at which it’s going to pull data from? Minimising the amount of ‘replumbing’ and reconfiguration to run a re-process job vs. new ingest makes it faster to do, and more reliable. Each additional configuration change is an opportunity to mis-configure. Each ‘shadow’ script clone for re-running vs normal processing is increasing the risk of code diverging and stale copies being run.

The final pipeline in this simple example looks like this:

  • The source server logs are streamed into Kafka, with a permanent copy up onto Amazon’s S3 for those real “uh oh” moments. Kafka, in a sandbox environment with a ham-fisted sysadmin, won’t be bullet-proof. Better to recover a copy from S3 than have to bother the Production server again. This is something I’ve put in for this specific use case, and wouldn’t be applicable in others.
  • From Kafka the web server logs are available to stream, as if natively from the web server disk itself, through Flume and Logstash.

There’s a variation on a theme of this, that looks like this:

Instead of Flume -> Kafka, and then a second Flume -> HDFS, we shortcut this and have the same Flume agent that is pulling from source also write to HDFS. Why have I not put this as the final pipeline? Because of this:

Let’s say that I want to do some kind of light-touch enrichment on the files, such as extracting the log timestamp in order to partition my web server logs in HDFS by the date of the log entry (not the time of processing, because I’m working with historical files too). I’m using a regex_extractor interceptor in Flume to determine the timestamp from the event data (log entry) being processed. That’s great, and it works well – when it works. If I get my regex wrong, or the log file changes date format, the house of cards comes tumbling down. Now I have a mess, because my nice clean ingest pipeline from the source system now needs fixing and re-running. As before, of course it is possible to write this cleanly so that it doesn’t break, etc etc, but from the point of view of decoupling operations for manageability and flexibility it makes sense to keep them separate (remember the Obtain vs Scrub point above?).

The final note on this is to point out that technically we can implement the pipeline using a Kafka Flume channel, which is a slightly neater way of doing things. The data still ends up in the S3 sink, and available in Kafka for streaming to all the consumers.

Kafka in Action

Let’s take a look at the configuration to put the above theory into practice. I’m running all of this on Oracle’s BigDataLite 4.2.1 VM which includes, amongst many other goodies, CDH 5.4.0. Alongside this I’ve installed into /opt :

  • apache-flume-1.6.0
  • elasticsearch-1.7.3
  • kafka_2.10-
  • kibana-4.1.2-linux-x64
  • logstash-1.5.4
The Starting Point – Flume -> HDFS

First, we’ve got the initial Logs -> Flume -> HDFS configuration, similar to what Mark wrote about originally:

source_agent.sources = apache_server  
source_agent.sources.apache_server.type = exec  
source_agent.sources.apache_server.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_server.batchSize = 1  
source_agent.sources.apache_server.channels = memoryChannel

source_agent.channels = memoryChannel  
source_agent.channels.memoryChannel.type = memory  
source_agent.channels.memoryChannel.capacity = 100

## Write to HDFS  
source_agent.sinks = hdfs_sink  
source_agent.sinks.hdfs_sink.type = hdfs  
source_agent.sinks.hdfs_sink.channel = memoryChannel  
source_agent.sinks.hdfs_sink.hdfs.path = /user/oracle/incoming/rm_logs/apache_log  
source_agent.sinks.hdfs_sink.hdfs.fileType = DataStream  
source_agent.sinks.hdfs_sink.hdfs.writeFormat = Text  
source_agent.sinks.hdfs_sink.hdfs.rollSize = 0  
source_agent.sinks.hdfs_sink.hdfs.rollCount = 10000  
source_agent.sinks.hdfs_sink.hdfs.rollInterval = 600

After running this

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \
--conf-file flume_website_logs_02_tail_source_hdfs_sink.conf

we get the logs appearing in HDFS and can see them easily in Hue:
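If you prefer the command line to Hue, a quick listing of the sink directory configured above shows the same files arriving (just an illustrative check on my part, not something from the original setup):

$ hdfs dfs -ls /user/oracle/incoming/rm_logs/apache_log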

Adding Kafka to the Pipeline

Let’s now add Kafka to the mix. I’ve already set up and started Kafka (see here for how), and Zookeeper’s already running as part of the default BigDataLite build.
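For reference, starting the broker itself against the stock config that ships with Kafka looks something like this (a sketch; the version part of the path is wildcarded, and the default is assumed):

$ /opt/kafka_2.10-*/bin/ -daemon \
  /opt/kafka_2.10-*/config/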

First we need to define a Kafka topic that is going to hold the log files. In this case it’s called apache_logs:

$ /opt/kafka_2.10- --zookeeper bigdatalite:2181 \
--create --topic apache_logs  --replication-factor 1 --partitions 1

Just to prove it’s there and we can send/receive messages on it I’m going to use the Kafka console producer/consumer to test it. Run these in two separate windows:

$ /opt/kafka_2.10- \
--broker-list bigdatalite:9092 --topic apache_logs

$ /opt/kafka_2.10- \
--zookeeper bigdatalite:2181 --topic apache_logs

With the Consumer running, enter some text, any text, in the Producer session and you should see it appear almost immediately in the Consumer window.

Now that we’ve validated the Kafka topic, let’s plumb it in. We’ll switch the existing Flume config to use a Kafka sink, and then add a second Flume agent to do the Kafka -> HDFS bit, giving us this:

The original flume agent configuration now looks like this:

source_agent.sources = apache_log_tail  
source_agent.channels = memoryChannel  
source_agent.sinks = kafka_sink

source_agent.sources.apache_log_tail.type = exec  
source_agent.sources.apache_log_tail.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_log_tail.batchSize = 1  
source_agent.sources.apache_log_tail.channels = memoryChannel

source_agent.channels.memoryChannel.type = memory  
source_agent.channels.memoryChannel.capacity = 100

## Write to Kafka  
source_agent.sinks.kafka_sink.channel = memoryChannel  
source_agent.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink  
source_agent.sinks.kafka_sink.batchSize = 5  
source_agent.sinks.kafka_sink.brokerList = bigdatalite:9092  
source_agent.sinks.kafka_sink.topic = apache_logs

Restart the console consumer from above so that you can see what’s going into Kafka, and then run the Flume agent. You should see the log entries appearing soon after. Remember that the console consumer is just one consumer of the logs – when we plug in the Flume consumer to write the logs to HDFS we can opt to pick up all of the entries in Kafka, completely independently of what we have or haven’t consumed in the console consumer.

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \ 
--conf-file flume_website_logs_03_tail_source_kafka_sink.conf

[oracle@bigdatalite ~]$ /opt/kafka_2.10- \
--zookeeper bigdatalite:2181 --topic apache_logs
- - [06/Sep/2015:08:08:30 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; - free monitoring service;"
- - [06/Sep/2015:08:08:35 +0000] "HEAD /blog HTTP/1.1" 301 - "" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
- - [06/Sep/2015:08:08:35 +0000] "GET /blog/ HTTP/1.0" 200 145999 "-" "Mozilla/5.0 (compatible; monitis - premium monitoring service;"
- - [06/Sep/2015:08:08:36 +0000] "HEAD /blog/ HTTP/1.1" 200 - "" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
- - [06/Sep/2015:08:08:44 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; - free monitoring service;"
- - [06/Sep/2015:08:08:58 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; monitis - premium monitoring service;"
- - [06/Sep/2015:08:08:58 +0000] "GET / HTTP/1.1" 200 36946 "-" "Echoping/6.0.2"

Set up the second Flume agent to use Kafka as a source, and HDFS as the target just as it was before we added Kafka into the pipeline:

target_agent.sources = kafkaSource  
target_agent.channels = memoryChannel  
target_agent.sinks = hdfsSink 

target_agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource  
target_agent.sources.kafkaSource.zookeeperConnect = bigdatalite:2181  
target_agent.sources.kafkaSource.topic = apache_logs  
target_agent.sources.kafkaSource.batchSize = 5  
target_agent.sources.kafkaSource.batchDurationMillis = 200  
target_agent.sources.kafkaSource.channels = memoryChannel

target_agent.channels.memoryChannel.type = memory  
target_agent.channels.memoryChannel.capacity = 100

## Write to HDFS  
target_agent.sinks.hdfsSink.type = hdfs  
target_agent.sinks.hdfsSink.channel = memoryChannel  
target_agent.sinks.hdfsSink.hdfs.path = /user/oracle/incoming/rm_logs/apache_log  
target_agent.sinks.hdfsSink.hdfs.fileType = DataStream  
target_agent.sinks.hdfsSink.hdfs.writeFormat = Text  
target_agent.sinks.hdfsSink.hdfs.rollSize = 0  
target_agent.sinks.hdfsSink.hdfs.rollCount = 10000  
target_agent.sinks.hdfsSink.hdfs.rollInterval = 600

Fire up the agent:

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent -n target_agent \
-f flume_website_logs_04_kafka_source_hdfs_sink.conf

and as the website log data streams in to Kafka (from the first Flume agent) you should see the second Flume agent sending it to HDFS and evidence of this in the console output from Flume:

15/10/27 13:53:53 INFO hdfs.BucketWriter: Creating /user/oracle/incoming/rm_logs/apache_log/FlumeData.1445954032932.tmp

and in HDFS itself:

Play it again, Sam?

All we’ve done to this point is add Kafka into the pipeline, ready for subsequent use. We’ve not changed the nett output of the data pipeline. But, we can now benefit from having Kafka there, by re-running some of our HDFS load without having to go back to the source files. Let’s say we want to partition the logs as we store them. But, we don’t want to disrupt the existing processing. How? Easy! Just create another Flume agent with the additional configuration in it to do the partitioning.

target_agent.sources = kafkaSource  
target_agent.channels = memoryChannel  
target_agent.sinks = hdfsSink

target_agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource  
target_agent.sources.kafkaSource.zookeeperConnect = bigdatalite:2181  
target_agent.sources.kafkaSource.topic = apache_logs  
target_agent.sources.kafkaSource.batchSize = 5  
target_agent.sources.kafkaSource.batchDurationMillis = 200  
target_agent.sources.kafkaSource.channels = memoryChannel  
target_agent.sources.kafkaSource.groupId = new  
target_agent.sources.kafkaSource.kafka.auto.offset.reset = smallest  
target_agent.sources.kafkaSource.interceptors = i1

target_agent.channels.memoryChannel.type = memory  
target_agent.channels.memoryChannel.capacity = 1000

# Regex Interceptor to set timestamp so that HDFS can be written to partitioned  
target_agent.sources.kafkaSource.interceptors.i1.type = regex_extractor  
target_agent.sources.kafkaSource.interceptors.i1.serializers = s1  
target_agent.sources.kafkaSource.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer  
target_agent.sources.kafkaSource.interceptors.i1.serializers.s1.name = timestamp  
# Match this format logfile to get timestamp from it:  
# - - [06/Apr/2014:03:38:07 +0000] "GET / HTTP/1.1" 200 38281 "-" "Pingdom.com_bot_version_1.4_("  
target_agent.sources.kafkaSource.interceptors.i1.regex = (\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s\\+\\d{4})  
target_agent.sources.kafkaSource.interceptors.i1.serializers.s1.pattern = dd/MMM/yyyy:HH:mm:ss Z  

## Write to HDFS  
target_agent.sinks.hdfsSink.type = hdfs  
target_agent.sinks.hdfsSink.channel = memoryChannel  
target_agent.sinks.hdfsSink.hdfs.path = /user/oracle/incoming/rm_logs/apache/%Y/%m/%d/access_log  
target_agent.sinks.hdfsSink.hdfs.fileType = DataStream  
target_agent.sinks.hdfsSink.hdfs.writeFormat = Text  
target_agent.sinks.hdfsSink.hdfs.rollSize = 0  
target_agent.sinks.hdfsSink.hdfs.rollCount = 0  
target_agent.sinks.hdfsSink.hdfs.rollInterval = 600

The important lines of note here (as highlighted above) are:

  • the regex_extractor interceptor which determines the timestamp of the log event, then used in the hdfs.path partitioning structure (a quick sanity check of the regex is shown just after this list)
  • the groupId and kafka.auto.offset.reset configuration items for the kafkaSource.
    • The groupId ensures that this flume agent’s offset in the consumption of the data in the Kafka topic is maintained separately from that of the original agent that we had. By default it is flume, and here I’m overriding it to new. It’s a good idea to specify this explicitly in all Kafka flume consumer configurations to avoid complications.
    • kafka.auto.offset.reset tells the consumer that if no existing offset is found (which it won’t be, if the groupId is a new one) to start from the beginning of the data rather than the end (which is what it will do by default).
    • Thus if you want to get Flume to replay the contents of a Kafka topic, just set the groupId to an unused one (eg ‘foo01’, ‘foo02’, etc) and make sure that kafka.auto.offset.reset is smallest
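As mentioned above, it’s worth sanity-checking that the timestamp regex really does match the log format before relying on the interceptor. A quick ad-hoc check of my own (not from the original post), using the sample line from the comment in the config and the same pattern un-escaped for grep:

$ echo '- - [06/Apr/2014:03:38:07 +0000] "GET / HTTP/1.1" 200 38281 "-" "Pingdom.com_bot_version_1.4_("' | \
  grep -oP '\d{2}/[a-zA-Z]{3}/\d{4}:\d{2}:\d{2}:\d{2}\s\+\d{4}'
06/Apr/2014:03:38:07 +0000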

Now run it (concurrently with the existing flume agents if you want):

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent -n target_agent \
-f flume_website_logs_07_kafka_source_partitioned_hdfs_sink.conf

You should see a flurry of activity (or not, depending on how much data you’ve already got in Kafka), and some nicely partitioned apache logs in HDFS:

Crucially, the existing flume agent and non-partitioned HDFS pipeline stay in place and functioning exactly as they were – we’ve not had to touch them. We could then run the two side-by-side until we’re happy the partitioning is working correctly and then decommission the first. Even at this point we have the benefit of Kafka, because we just turn off the original HDFS-writing agent – the new “live” one continues to run, it doesn’t need reconfiguring. We’ve validated the actual configuration we’re going to use for real; we’ve not had to mock it up with simulated data sources that then need re-plumbing prior to real use.
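To see the two consumer groups progressing through the topic independently, the offset checker that ships with the 0.8.x Kafka tools can report each group’s position. A sketch, with the Kafka path wildcarded as before (‘flume’ being the default group of the original agent, ‘new’ the group we set above):

$ /opt/kafka_2.10-*/bin/ \
  --zkconnect bigdatalite:2181 --group flume --topic apache_logs
$ /opt/kafka_2.10-*/bin/ \
  --zkconnect bigdatalite:2181 --group new --topic apache_logs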

Clouds and Channels

We’re going to evolve the pipeline a bit now. We’ll go back to a single Flume agent writing to HDFS, but add in Amazon’s S3 as the target for the unprocessed log files. The point here is not so much that S3 is the best place to store log files (although it is a good option), but to demonstrate a secondary method of keeping your raw data available without impacting the source system. It also fits nicely with using the Kafka flume channel to tighten the pipeline up a tad:

Flume can write directly to Amazon’s S3 service using Hadoop’s S3N filesystem support. You need to have already set up your S3 ‘bucket’, and have the appropriate AWS Access Key ID and Secret Key. To get this to work I added these credentials to /etc/hadoop/conf.bigdatalite/core-site.xml (I tried specifying them inline with the flume configuration but with no success):
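The snippet itself isn’t reproduced here, but the standard Hadoop s3n credential properties look like this (the property names are the stock Hadoop ones, the values are obviously placeholders, and they go inside the <configuration> element):

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>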


Once you’ve set up the bucket and credentials, the original flume agent (the one pulling the actual web server logs) can be amended:

source_agent.sources = apache_log_tail  
source_agent.channels = kafkaChannel  
source_agent.sinks = s3Sink

source_agent.sources.apache_log_tail.type = exec  
source_agent.sources.apache_log_tail.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_log_tail.batchSize = 1  
source_agent.sources.apache_log_tail.channels = kafkaChannel

## Write to Kafka Channel  
source_agent.channels.kafkaChannel.type = org.apache.flume.channel.kafka.KafkaChannel  
source_agent.channels.kafkaChannel.topic = apache_logs  
source_agent.channels.kafkaChannel.brokerList = bigdatalite:9092  
source_agent.channels.kafkaChannel.zookeeperConnect = bigdatalite:2181

## Write to S3  
source_agent.sinks.s3Sink.channel = kafkaChannel  
source_agent.sinks.s3Sink.type = hdfs  
source_agent.sinks.s3Sink.hdfs.path = s3n://rmoff-test/apache  
source_agent.sinks.s3Sink.hdfs.fileType = DataStream  
source_agent.sinks.s3Sink.hdfs.filePrefix = access_log  
source_agent.sinks.s3Sink.hdfs.writeFormat = Text  
source_agent.sinks.s3Sink.hdfs.rollCount = 10000  
source_agent.sinks.s3Sink.hdfs.rollSize = 0  
source_agent.sinks.s3Sink.hdfs.batchSize = 10000  
source_agent.sinks.s3Sink.hdfs.rollInterval = 600

Here the source is the same as before (server logs), but the channel is now Kafka itself, and the sink S3. Using Kafka as the channel has the nice benefit that the data is now already in Kafka; we don’t need it as an explicit target in its own right.

Restart the source agent using this new configuration:

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \
--conf-file flume_website_logs_09_tail_source_kafka_channel_s3_sink.conf

and you should get the data appearing on both HDFS as before, and now also in the S3 bucket:

Didn’t Someone Say Logstash?

The premise at the beginning of this exercise was that I could extend an existing data pipeline to pull data into a new set of tools, as if from the original source, but without touching that source or the existing configuration in place. So far we’ve got a pipeline that is pretty much as we started with, just with Kafka in there now and an additional feed to S3:

Now we’re going to extend (or maybe “broaden” is a better term) the data pipeline to add Elasticsearch into it:

Whilst Flume can write to Elasticsearch given the appropriate extender, I’d rather use a tool much closer to Elasticsearch in origin and direction – Logstash. Logstash supports Kafka as an input (and an output, if you want), making the configuration ridiculously simple. To smoke-test the configuration just run Logstash with this configuration:

input {  
        kafka {  
                zk_connect => 'bigdatalite:2181'  
                topic_id => 'apache_logs'  
                codec => plain {  
                        charset => "ISO-8859-1"  
                }
                # Use both the following two if you want to reset processing  
                reset_beginning => 'true'  
                auto_offset_reset => 'smallest'
        }
}

output {  
        stdout {codec => rubydebug }  
}
A few things to point out in the input configuration:

  • You need to specify the plain codec (assuming your input from Kafka is plain text). The default codec for the Kafka plugin is json, and Logstash does NOT like trying to parse plain text as json, as I found out:
 - - [06/Sep/2015:08:08:30 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; - free monitoring service;" {:exception=>#<NoMethodError: undefined method `[]' for 37.252:Float>, :backtrace=>["/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/event.rb:73:in `initialize'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-codec-json-1.0.1/lib/logstash/codecs/json.rb:46:in `decode'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-1.0.0/lib/logstash/inputs/kafka.rb:169:in `queue_event'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-1.0.0/lib/logstash/inputs/kafka.rb:139:in `run'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:177:in `inputworker'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:171:in `start_input'"], :level=>:error}

  • As well as specifying the codec, I needed to specify the charset. Without this I got \\u0000\\xBA\\u0001 at the beginning of each message that Logstash pulled from Kafka

  • Specifying reset_beginning and auto_offset_reset tell Logstash to pull everything in from Kafka, rather than starting at the latest offset.

When you run the configuration file above you should see a stream of messages to your console of everything that is in the Kafka topic:

$ /opt/logstash-1.5.4/bin/logstash -f logstash-apache_10_kafka_source_console_output.conf

The output will look like this – note that Logstash has added its own special @version and @timestamp fields:

       "message" => " - - [09/Oct/2015:04:13:23 +0000] \"GET /wp-content/uploads/2014/10/JFB-View-Selector-LowRes-300x218.png HTTP/1.1\" 200 53295 \"\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",  
      "@version" => "1",  
    "@timestamp" => "2015-10-27T17:29:06.596Z"  

Having proven the Kafka-Logstash integration, let’s do something useful – get all those lovely log entries streaming from source, through Kafka, enriched in Logstash with things like geoip, and finally stored in Elasticsearch:

input {  
        kafka {  
                zk_connect => 'bigdatalite:2181'  
                topic_id => 'apache_logs'  
                codec => plain {  
                        charset => "ISO-8859-1"  
                }
                # Use both the following two if you want to reset processing  
                reset_beginning => 'true'  
                auto_offset_reset => 'smallest'  
        }
}

filter {  
        # Parse the message using the pre-defined "COMBINEDAPACHELOG" grok pattern  
        grok { match => ["message","%{COMBINEDAPACHELOG}"] }

        # Ignore anything that's not a blog post hit, characterised by /yyyy/mm/post-slug form  
        if [request] !~ /^\/[0-9]{4}\/[0-9]{2}\/.*$/ { drop{} }

        # From the blog post URL, strip out the year/month and slug  
        #     year  => 2015  
        #     month =>   02  
        #     slug  => obiee-monitoring-and-diagnostics-with-influxdb-and-grafana  
        grok { match => [ "request","\/%{NUMBER:post-year}\/%{NUMBER:post-month}\/(%{NUMBER:post-day}\/)?%{DATA:post-slug}(\/.*)?$"] }

        # Combine year and month into one field  
        mutate { replace => [ "post-year-month" , "%{post-year}-%{post-month}" ] }

        # Use GeoIP lookup to locate the visitor's town/country  
        geoip { source => "clientip" }

        # Store the date of the log entry (rather than now) as the event's timestamp  
        date { match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]}  
}

output {  
        elasticsearch { host => "bigdatalite"  index => "blog-apache-%{+YYYY.MM.dd}"}  
}

Make sure that Elasticsearch is running and then kick off Logstash:

$ /opt/logstash-1.5.4/bin/logstash -f logstash-apache_01_kafka_source_parsed_to_es.conf

Nothing will appear to happen on the console:

log4j, [2015-10-27T17:36:53.228]  WARN: org.elasticsearch.bootstrap: JNA not found. native methods will be disabled.  
Logstash startup completed

But in the background Elasticsearch will be filling up with lots of enriched log data. You can confirm this through the useful kopf plugin to see that the Elasticsearch indices are being created:

and directly through Elasticsearch’s RESTful API too:

$ curl -s -XGET http://bigdatalite:9200/_cat/indices?v|sort  
health status index                  pri rep docs.count docs.deleted store.size
yellow open   blog-apache-2015.09.30   5   1      11872            0       11mb           11mb  
yellow open   blog-apache-2015.10.01   5   1      13679            0     12.8mb         12.8mb  
yellow open   blog-apache-2015.10.02   5   1      10042            0      9.6mb          9.6mb  
yellow open   blog-apache-2015.10.03   5   1       8722            0      7.3mb          7.3mb

And of course, the whole point of streaming the data into Elasticsearch in the first place – easy analytics through Kibana:


Kafka is awesome :-D

We’ve seen in this article how Kafka enables the implementation of flexible data pipelines that can evolve organically without requiring system rebuilds to implement or test new methods. It allows the data discovery function to tap in to the same source of data as the more standard analytical reporting one, without risking impacting the source system at all.

The post Forays into Kafka – Enabling Flexible Data Pipelines appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

EDUCAUSE and Robot Tutors In The Sky: When investors are your main customers

Michael Feldstein - Tue, 2015-10-27 15:46

By Phil HillMore Posts (377)

Yippie i ohhh ohh ohh
Yippie i aye ye ye
Robot tutors in the sky

Before I head out to Indianapolis for the EDUCAUSE conference, I keep thinking back to a comment someone made in response to Michael’s description of Knewton marketing as “selling snake oil”. I can’t find the exact quote, but the gist was:

This is what happens when you start to see VCs as your main customers.

This viewpoint could be applied well beyond Knewton, as they have successfully parlayed their marketing hype into raising more than $100 million to date (I suspect with another round in the works, based on the aggressive marketing). Martin Weller has a post out today looking back at the MOOC investment mania and lessons learned, such as “Don’t go cheap – they won’t respect you” and “Big rhetoric wins – allied with the fear factor”. The post is somewhat tongue-in-cheek and cynical in nature . . . but spot on.

Update: Ray Henderson shared a recent WSJ story about Chegg and how they are suffering from trying to increase market valuation by a “ratchet”.

Tech startups eager to land sky-high valuations from investors might want to heed the cautionary tale of Chegg Inc., the textbook rental service whose stock has languished since its IPO in 2013.

In a candid interview, an early investor in Chegg revealed how the company gunned for the highest possible valuation in several funding rounds ahead of its public offering. Chegg in exchange granted venture capitalists a favorable term called a “ratchet” that guaranteed the share price in the IPO would be higher than what they paid.

The move backfired. When Chegg went public, it was motivated to set an IPO price that met the terms of the covenant, or Chegg would have to pay the difference in shares to the early investors. The stock plummeted on the first day of trading and hasn’t recovered.

The entire ed tech market finds itself in the interesting position where it is easier to raise large sums of money from VCs or private equity or strategic buyers than it is to establish real business models with paying customers.

On one hand:

  • Ed Tech private investment (seed, angel, VC, private equity) has hit an all-time high of $3.76 billion for the first 9 months of 2015, according to Ambient Insight; and
  • M&A activity in ed tech is even higher, with $6.8 billion in Q3 of 2015 alone, according to Berkery Noyes.

On the other hand:

  • In the LMS market, Blackboard is laying off staff and their owners are trying to find an exit, and D2L has hit a plateau despite massive investment. Instructure, while set for a half-billion+ IPO later this year, has yet to set concrete plans to become profitable, and they are by far the hottest company in this market.
  • In the MOOC market, Coursera is just now getting to a repeatable revenue model, yet that is likely $20 million per year or less.
  • Other than ALEKS and MyLabs (owned by McGraw-Hill and Pearson), it is unlikely that any of the adaptive software providers have yet become profitable.
  • Etc, etc.

I am not one to argue against investment in ed tech, and I do think ed tech has growing potential when properly applied to help improve educational experiences and outcomes. However, there is a real danger when it is much easier for an extended period of time for companies to raise private investment or get bought out at high multiples than it is to establish real revenue models with end user customers – mostly institutions. The risk is that the VCs and private equity funders become the main customers and company marketing and product plans center on pleasing investors more than educators and students.

Knewton has fallen into this trap (although at $100 million+ you could argue it is not a trap from their perspective), as have many others.

What is needed in the market is for more focus to be applied to companies finding and simply delighting customers. This is a balance, as there is a trap on the other side of just supporting the status quo. But the balance right now is heavily tilted towards pleasing investors.

This is one of the main issues I plan to watch for at the EDUCAUSE conference – how much the company messages and products are targeted at educators and students vs. how much they are targeted at investors.

The post EDUCAUSE and Robot Tutors In The Sky: When investors are your main customers appeared first on e-Literate.

ADF 12.2.1 Responsive UI Improvements

Andrejus Baranovski - Tue, 2015-10-27 15:30
ADF 12.2.1 provides much better responsive UI support compared to the previous version, ADF 12.1.3. Previously we were using CSS media queries to hide/show facets. This worked, but it was not great from a performance point of view: the same region was duplicated into different facets, both were loaded into memory, but only one was displayed. ADF 12.2.1 comes with a new tag, af:matchMediaBehavior, which uses a CSS media query to detect the screen size and updates a layout component property accordingly. There is no need to use different facets anymore; we can update properties directly.

This is how it works; check the video below. Two blocks implemented with ADF regions are re-arranged into a top-down layout when the screen size becomes too narrow to render the left-right layout:

Here is an example of a panel splitter layout component with af:matchMediaBehavior tags:

This tag has three properties: propertyName, matchedPropertyValue and mediaQuery. The layout component property to override is defined through propertyName. The new value for that property is defined by matchedPropertyValue. The CSS media query defines the condition under which the layout component property value should be changed based on the screen size (for example, when the screen is narrower than a defined width):

Two ADF regions are displayed left to right when the screen is wide enough:

The same two regions are re-arranged to display in a top-down layout:

Download sample application -

Oracle Fusion Middleware 12c ( now Generally Available

Oracle Fusion Middleware is the leading business innovation platform for the enterprise and the cloud. It enables enterprises to create and run agile, intelligent business applications while...

We share our skills to maximize your revenue!
Categories: DBA Blogs

EM12c Agent generating heapDump_*.hprof

Arun Bavera - Tue, 2015-10-27 13:43
By default, the EM12c Agent runs with -XX:+HeapDumpOnOutOfMemoryError enabled. We need to either point the heap dumps at a different directory or disable them before they fill up the Agent partition.
  --  -XX:-HeapDumpOnOutOfMemoryError   or  -XX:HeapDumpPath=/home/oracle/logs/
You can put these entries in AGENT_INST/sysman/config/s_jvm_options.opt
The Agent has an auto tuning feature which increases Xmx automatically whenever it runs out of memory and dumps the heap.
We disabled the dump by putting the entry -XX:-HeapDumpOnOutOfMemoryError in the file AGENT_INST/sysman/config/s_jvm_options.opt and then running emctl stop agent; emctl start agent.
Also, if you think the agent memory auto tuning is not working (for example, you see frequent restarts or failures to start due to OutOfMemory errors), you can disable it.
To disable auto tuning, change the following property from true to false:
#Enable auto tuning out of the box
enableAutoTuning=false

Change the default Xmx
# These are the optional Java flags for the agent
agentJavaDefines=-Xmx512M -XX:MaxPermSize=96M
Note: This is not recommended by the Oracle Dev team.

Categories: Development

Oracle OpenWorld 2015 : Monday

Tim Hall - Tue, 2015-10-27 11:21

Monday started with a trip to the gym, where I met Scott Spendolini. At the end of the session, we were sitting on bikes next to each other chatting, whilst pedalling at an incredibly slow rate. After getting cleaned up, we headed over to Lori’s Diner and ate more calories than we had burned at the gym. :)

From there we headed down to the conference. I spent some time chatting to folks at the OTN Lounge, where I met one of my former colleagues, Ian MacDonald. He had just come out of an Oracle Forms 12c session, and I had a bunch of questions to ask too, so we headed down to the demo grounds to find the Oracle Forms stand, where we then spent ages talking to Michael Ferrante about life, the universe and everything Forms related. :)

As I mentioned the other day, the installation and configuration of Forms and Reports has changed in 12c. During my first run through I noticed the Web Tier that links everything together was present in the domain, but not configured during the process. I was curious if I had done something wrong, if it was expected behaviour or if it was an implied statement of direction. I guess the web tier is surplus to requirements for many people if they are fronting their infrastructure with a reverse proxy or a load balancer. It turned out to be expected behaviour, and we discussed the configuration of the web tier, which is very simple. Just amend a couple of files and copy them to the “moduleconf” directory under the OHS instance. Happy days.

We also got a demo of the installation of the Forms Builder on Windows, which no longer needs a WebLogic installation, making it a much smaller footprint for developer machines. Our developers still use Forms 10g Builder. We then take the finished forms, move them to the server and recompile to 11gR2. It’s a pain, but simpler than putting Forms Builder 11gR2 on their PCs. If we can move to 12c Forms, they should be able to use the latest builder again. :)

From there I moved on to the SQL Developer demo stand, where I got to speak to Kris Rice and Jeff Smith, who are always good value. While I was there Jagjeet Singh, Sanjay Kumar and Baljeet Bhasin came up to say hello to me, which was really nice. Of course, I filmed them doing a group “.com”… :)

After that I did a tour of the exhibition stands looking for things of interest. I used the GoPro to film a walk around some of the exhibition. I’ll see if I can make a little montage out of that…

Next, I went back to the OTN Lounge and spoke to a whole bunch of people, and filmed a load of “.com” cameos for forthcoming YouTube videos. :)

Then it was the weary walk back to the hotel, where I crashed for the night.

I think tomorrow may well be another demo grounds day…





Oracle OpenWorld 2015 : Monday was first posted on October 27, 2015 at 6:21 pm.
©2012 "The ORACLE-BASE Blog". Use of this feed is for personal non-commercial use only. If you are not reading this article in your feed reader, then the site is guilty of copyright infringement.

Pythian More Innovative Than GE?

Pythian Group - Tue, 2015-10-27 08:41


GE is known as an innovative company and, according to Forbes, is one of the world’s most valuable brands. In late summer they made headlines with the announcement that they were retiring performance management reviews.

No big deal? Think again. GE has built a reputation on Jack Welch’s rigid performance management programs, cutting the bottom 10 percent of underperformers each year.

First, I applaud GE for the bold move. Any company-wide change in an organization the size of GE’s would be challenging. Second, what took you so long? In 2011, Pythian acted against the crowd and ditched the prevailing zeitgeist by implementing a performance feedback program.

At the time, with approximately one hundred employees and a brand new HR team, we were just beginning to establish our HR programs. Like many small companies, we did not have a structured performance management program. We were growing rapidly and identified a need to provide employees with useful feedback and career discussions.

Ratings, rankings and bell curves, “Oh My!” We didn’t even consider them. We designed a program that removed standard performance management elements like numerical rankings. Our program was created to facilitate formal, organized feedback discussions, in a comfortable environment, between an employee and their manager. The idea was to base the discussion on each team member’s career aspirations and journey. The first steps of the career journey begin during new hire orientation. Every six months thereafter, we schedule time to sit down and have focused career feedback discussions. During these discussions, goals are established, recent successes reviewed, progress on goals updated, and challenges chronicled along with suggestions for overcoming them. Furthermore, career aspirations and plans for professional development are discussed and established.

The feedback program is constantly evolving and improving to meet the changing needs and scale of the company.  Of course we listen to employee feedback about the program and implement changes after a review of the suggestions. Change can be difficult for people. Initially, employees more accustomed to traditional performance management were hesitant, but they quickly responded to the easy and relaxed format of our program.

Regular feedback is key. We encourage two-way feedback: up and/or down, across the organization, in real time, all the time. We are always working to improve our programs, processes and ideas, “upping our game” as a company. We believe it’s important to build a culture of constant feedback. A culture of two-way feedback built on trust and transparency is a team effort by all members of the Pythian Team.

During orientation, I enjoy encouraging and empowering all new employees with permission to ask their leaders for feedback at any time. I encourage them not to wait to share what’s going well and to disclose where they need extra support or further direction. In my own team meetings, I ask what I could be doing more of and less of, and how I can be a better communicator and leader. I create a safe environment for the team to provide feedback so we can collectively improve.

It will be interesting to see if GE’s announcement encourages more companies to re-evaluate their approach to Performance Management systems and encourage more effective dialogue and feedback discussions with their employees.

Categories: DBA Blogs

FREE live webinar : Learn Oracle Apps 12.2 for DBAs and Apps DBAs

Online Apps DBA - Tue, 2015-10-27 08:41


Do you know enough about the features of Oracle E-Business Suite from 11i to 12? Are you interested in becoming an Oracle Apps DBA?

Join us for our live webinar on Thursday (29th October, 2015) at 6:30 – 7:30 PM PST / 9:30 – 10:30 PM EST, or Friday (30th October, 2015) at 7:00 – 8:00 AM IST, to learn about the architecture of E-Business Suite, highlighting the updates from 11i to 12 and the key points that everyone must know to become an Apps DBA.

To become an Oracle Apps DBA, the most important thing is to be very familiar with the Oracle Applications/E-Business Suite architecture.

Below you can see the changes in the architecture of EBS R12.2


  • Introduction of Fusion Middleware 11g
  • Replacing Oracle Containers for Java (OC4J) 10g with WebLogic Server 11g
  • Uses Oracle Application Server 10g R3 ( for Forms & Reports
  • Default Database 11gR2 ( Latest is (11g Family)
  • Oracle JSP Compiler is replaced by WebLogic JSP Compiler in R12 version 12.2.
  • Dual File System from R12.2
  • Online Patching – Almost zero Downtime

To know more about the Oracle Apps DBA profession, click on the button below to register for our webinar.


Click here to register for the FREE Webinar

The post FREE live webinar : Learn Oracle Apps 12.2 for DBAs and Apps DBAs appeared first on Oracle Trainings for Apps & Fusion DBA.

Categories: APPS Blogs

Are frameworks a shortcut to slow bloated code?

Sean Hull - Tue, 2015-10-27 08:30
I was reading one of my favorite blogs again, Todd Hoff’s High Scalability. He has an interesting weekly post format called “Quotable Quotes”. I like them because they’re often choice quotes that highlight some larger truth. Join 32,000 others and follow Sean Hull on twitter @hullsean. 1. Chasing bloody frameworks One that caught my eye … Continue reading Are frameworks a shortcut to slow bloated code? →

SQL Server Row Level Security

Pythian Group - Tue, 2015-10-27 08:27


Row Level Security (RLS) has been implemented in SQL Server 2016, for both on-premises instances and v12 of Azure SQL Database.

The problem this solves is: a company with multiple applications accessing sensitive data in one or more tables.

How do you ensure the data being read or written is only the data that a given login is authorized to see? In the past, this has been accomplished with a complicated series of views or functions, and there’s no guarantee a bug or malicious user wouldn’t be able to bypass those measures. With Row Level Security, the filtering holds no matter what privileges you have (including sysadmin) or how you try to access the data.

How it Works

Row Level Security has two options: You can either FILTER the rows or BLOCK the operation entirely. The BLOCK functionality is not yet implemented in CTP 2.4, but the FILTER logic works like a charm.

The steps are very simple:
1 – Figure out what you’re going to associate with your users and data. You will need to create some link between your data and a login’s or user’s properties: something that will allow the engine to say This Row is ok for This User.

2 – Create a Function defining the relationship between users and the data.

3 – Create a Security Policy for the function and table(s). You can use the same policy on multiple tables or views.

Once the Security Policy has been created, every query or DML operation on the tables or views you’re filtering will automatically have the function applied to the WHERE or HAVING clause. You can see the filter working by reviewing the execution plan as well. SQL Server will generate the Plan Hash value with the filtering logic in place. This allows Plan Re-Use with Row Level Security, and it’s a big improvement over Oracle’s implementation which doesn’t do this (as of Oracle 10g, the last time I worked with it) :-). See the bottom of this post for an example of the same query with RLS turned on & off.

What is particularly nice about these policies is that the data is filtered regardless of the user’s privileges. A sysadmin or other superuser who disables the policy is just an Audit log review away from having to explain what they were doing.
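
For reference, toggling a policy on or off is a single piece of DDL, which is part of why a superuser who does so is easy to catch in an audit review. A minimal sketch, using the policy created in the walkthrough below:

-- Disable the filter (auditable DDL):
ALTER SECURITY POLICY sales.RestrictCreditCardToBusinessEntity
WITH (STATE = OFF);

-- Re-enable it:
ALTER SECURITY POLICY sales.RestrictCreditCardToBusinessEntity
WITH (STATE = ON);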

Row Level Security Walk Through

This is an example of setting up an RLS system for the Credit Card data in the AdventureWorks database. After this is completed, only users associated with a Business Entity in the Person.Person table will be able to see or update any credit card information, and the data they can touch will be limited to just their business.

Step 1: Add user_name column to Person.Person table

In this example, I’m associating the user_name() function’s value for each login with the BusinessEntityID. Of course, you can use any value you want, as long as you can access it from a SELECT statement in a Schema-Bound function. This means many system tables are off-limits.

USE AdventureWorks
GO

ALTER TABLE person.person
ADD UserName nvarchar(128) NULL;
GO

-- Associate some person.person rows with a login too.
UPDATE person.person
SET UserName = 'Business1'
WHERE BusinessEntityID IN (301, 303, 305);

Step 2: Create Users to Test

I’m just creating a login named Business1 to demonstrate this. Note that the user has db_owner in AdventureWorks

USE [master]
GO
-- The login must exist first; the password below is just a placeholder.
CREATE LOGIN [business1] WITH PASSWORD = N'<a strong password>';
GO
USE [AdventureWorks]
GO
CREATE USER [business1] FOR LOGIN [business1];
GO
ALTER ROLE [db_owner] ADD MEMBER [business1];
GO

Step 3: Create Function to Filter Data

This function finds all credit cards for the user_name() running the query. Any values not returned by this function will be inaccessible to this user.

CREATE FUNCTION [Sales].[fn_FindBusinessCreditCard] (@CreditCardID INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT 1 AS result
FROM person.person p INNER JOIN
     sales.PersonCreditCard pcc ON p.BusinessEntityID = pcc.BusinessEntityID INNER JOIN
     sales.CreditCard cc ON pcc.CreditCardID = cc.CreditCardID
WHERE cc.CreditCardID = @CreditCardID AND
      p.UserName = user_name();

Step 4: Create a Security Policy

This creates a security policy on the Sales.CreditCard table.

CREATE SECURITY POLICY sales.RestrictCreditCardToBusinessEntity
ADD FILTER PREDICATE sales.fn_FindBusinessCreditCard(CreditCardID)
ON Sales.CreditCard
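
Before testing, it can be useful to confirm the policy is registered and enabled. A quick sketch against the RLS catalog views (available in the SQL Server 2016 CTPs):

-- List security policies and the predicates they apply.
SELECT sp.name AS policy_name,
       sp.is_enabled,
       spr.predicate_definition,
       OBJECT_NAME(spr.target_object_id) AS target_table
FROM sys.security_policies sp
INNER JOIN sys.security_predicates spr ON spr.object_id = sp.object_id;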

Step 5: Test Away

For all of the following examples, you should be logged in as the Business1 user, who can only see 3 credit cards. In reality, there are 19,118 rows in that table.

-- Will return three records
SELECT pcc.*, cc.*
FROM Sales.PersonCreditCard pcc INNER JOIN
     Sales.CreditCard cc ON cc.CreditCardID = pcc.CreditCardID;

-- Will only update three records
UPDATE Sales.CreditCard
SET ExpYear = '2020';

These are the execution plans for the above query with Row Level Security turned on and off:

Turned On (and missing an index…):

Execution Plan With Security Turned On.

Turned Off:

Execution Plan With Security Turned Off.
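
If you want to see the filtering from a single sysadmin session, impersonation is a quick check. A minimal sketch, assuming the business1 user created in Step 2:

EXECUTE AS USER = 'business1';
SELECT COUNT(*) FROM Sales.CreditCard;  -- 3: only the rows tied to Business1
REVERT;

SELECT COUNT(*) FROM Sales.CreditCard;  -- 0: no matching UserName for this principal,
                                        -- even for dbo/sysadmin, because the predicate
                                        -- filters on user_name()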


Discover more about our expertise in SQL Server.

Categories: DBA Blogs

A Cassandra Consistency Use Case

Pythian Group - Tue, 2015-10-27 08:07


I recently completed a project where I worked with a company using Cassandra to keep metadata about objects stored in an object store. The application keeps track of individual objects as rows within a partition based on user id. In an effort to save space, there is also a mechanism to track duplicate references to the objects in another table. Object writes take place as background activity, and the time it takes to complete those writes is invisible to the application’s end users. The time it takes to retrieve an object, though, is very visible to the end user. The keyspace was defined with a network topology strategy across two data centers (actual data centers, about 50 ms apart) with a replication factor of 3 in each.

Initially the application was set up to use consistency ONE for both writes and reads. This seemed to be working okay until we started doing failure testing, at which point objects would come up missing due to the delay in pushing hinted handoffs from one node to another. A simple solution was to make all writes and reads LOCAL_QUORUM. In fact, doing so resolved pretty much all of the testing errors, but at a much increased latency on both reads and writes, about 3 times longer than with consistency ONE. Even so, the latencies were deemed acceptable since they were still well under the anticipated network latencies outside of the data centers, which is what the users would be seeing.

Could we have done better than that though?

The writes are a background activity not visible to the end user, so the increased write latency is probably reasonable there. The read latency, however, is visible to the user. There is an option which guarantees finding the stored object references while still keeping the latency to a minimum. This is what I propose: the default read consistency is set back to ONE, and most of the time a read to Cassandra will find the object reference, as was clear in the initial testing. But if a read returns no object reference, a second read is issued using LOCAL_QUORUM. This way most reads, more than 99% of them, are satisfied with the much lower-latency consistency of ONE, only occasionally needing the second read. This can be extended further to a full QUORUM read if the LOCAL_QUORUM read also comes back empty.

It is important to note that this approach only works if there are no row versions, i.e. rows only exist or do not exist. It does not work if a row may have different versions over time, as you might have if the row were updated rather than just inserted and then later deleted. It is also important to note that it’s possible to find a deleted row this way. For this use case, these qualifications are not issues.


Discover more about our expertise in Cassandra.

Categories: DBA Blogs