Skip navigation.

Rittman Mead Consulting

Syndicate content
Delivering Oracle Business Intelligence
Updated: 10 hours 45 min ago

Using the ELK Stack to Analyse Donor’s Choose Data

Sat, 2015-04-25 13:35

Donor’s Choose is an online charity in America through which teachers can post details of projects that need funding and donors can give money towards them. The data from the charity since it began in 2000 is available to download freely here in several CSV datasets. In this article I’m going to show how to use the ELK stack of data discovery tools from Elastic to easily import some data (the donations dataset) and quickly start analysing it to produce results such as this one:

I’m assuming you’ve downloaded and unzipped Elasticsearch, Logstash and Kibana and made Java available if not already. I did this on a Mac, but the tools are cross-platform and should work just the same on Windows and Linux. I’d also recommend installing Kopf, which is an excellent plugin for the management of Elasticsearch.

CSV Data Ingest with Logstash

First off we’re going to get the data in to Elasticsearch using Logstash, after which we can do some analysis using Kibana.

To import the data with Logstash requires a configuration file which in this case is pretty straightforward. We’ll use the file input plugin, process it with the csv filter, set the date of the event to the donation timestamp (rather than now), cast a few fields to numeric, and then output it using the elasticsearch plugin. See inline comments for explanation of each step:

input {  
    file {  
        # This is necessary to ensure that the file is  
        # processed in full. Without it logstash will default  
        # to only processing new entries to the file (as would  
        # be seen with a logfile for a live application, but  
        # not static data like we're working with here)  
        start_position  => beginning  
        # This is the full path to the file to process.  
        # Wildcards are valid.  
        path =>  ["/hdd/ELK/data/opendata/opendata_donations.csv"]  

filter {  
        # Process the input using the csv filter.  
        # The list of column names I took manually from the  
        # file itself  
        csv {separator => ","  
                columns => ["_donationid","_projectid","_donor_acctid","_cartid","donor_city","donor_state","donor_zip","is_teacher_acct","donation_timestamp","donation_to_project","donation_optional_support","donation_total","dollar_amount","donation_included_optional_support","payment_method","payment_included_acct_credit","payment_included_campaign_gift_card","payment_included_web_purchased_gift_card","payment_was_promo_matched","via_giving_page","for_honoree","donation_message"]}

        # Store the date of the donation (rather than now) as the  
        # event's timestamp  
        # Note that the data in the file uses formats both with and  
        # without the milliseconds, so both formats are supplied  
        # here.  
        # Additional formats can be specified using the Joda syntax  
        # (  
        date { match => ["donation_timestamp", "yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd HH:mm:ss"]}  
        # ------------
        # Cast the numeric fields to float (not mandatory but makes for additional analysis potential)
        mutate {
        convert => ["donation_optional_support","float"]
        convert => ["donation_to_project","float"]
        convert => ["donation_total","float"]

output {  
        # Now send it to Elasticsearch which here is running  
        # on the same machine.  
        elasticsearch { host => "localhost" index => "opendata" index_type => "donations"}  

With the configuration file created, we can now run the import:

./logstash-1.5.0.rc2/bin/logstash agent -f ./logstash-opendata-donations.conf

This will take a few minutes, during which your machine CPU will rocket as logstash processes all the records. Since logstash was originally designed for ingesting logfiles as they’re created it doesn’t actually exit after finishing processing the file, but you’ll notice your machine’s CPU return to normal, at which point you can hit Ctrl-C to kill logstash.

If you’ve installed Kopf then you can see at a glance how much data has been loaded:

Or alternatively query the index using Elasticsearch’s API directly:

curl -XGET 'http://localhost:9200/opendata/_status?pretty=true'

    "opendata" : {  
      "index" : {  
        "primary_size_in_bytes" : 3679712363,  
      "docs" : {  
        "num_docs" : 2608803,

Note that Elasticsearch will take more space than the source data (in total the 1.2Gb dataset ends up taking c.5Gb)

Data Exploration with Kibana

Now we can go to Kibana and start to analyse the data. From the Settings page of Kibana add the opendata index that we’ve just created:

Go to Discover and if necessary click the cog icon in the top right to set the index to opendata. The time filter defaults to the last 15 minutes only, and if your logstash has done its job right the events should have the timestamp of the actual donation, so you need to click on the time filter in the very top right of the screen to change time period to, for example, Previous year. Now you should see a bunch of data:

Click the toggle on one of the events to see the full data for it, including things like the donation amount, the message with the donation, and geographical details of the donor. You can find details of all the fields on the Donor’s Choose website here.

Click on the fields on the left to see a summary of the data within, showing very easily that within that time frame and sample of 500 records:

  • two thirds of donations were in the 10-100 dollar range
  • four-fifths included the optional donation towards the running costs of Donor’s Choose.

You can add fields into the table itself (which by default just shows the complete row of data) by clicking on add for the fields you want:

Let’s save this view (known as a “Search”), since it can be used on a Dashboard later:

Data Visualisation with Kibana

One of my favourite features of Kibana is its ability to aggregate data at various dimensions and grains with ridiculous ease. Here’s an example: (click to open full size)

Now let’s amend that chart to show the method of donation, or the donation amount range, or both: (click to open full size)

You can also change the aggregation from the default “Count” (in this case, number of donations) to other aggregations including sum, median, min, max, etc. Here we can compare cheque (check) vs paypal as a payment method in terms of amount given:

Kibana Dashboards

Now let’s bring the visualisations together along with the data table we saw in the the Discover tab. Click on Dashboard, and then the + icon:

Select the visualisations that you’ve created, and then switch to the Searches tab and add in the one that you saved earlier. You’ve now got a data table showing all currently selected data, along with various summaries on it.

You can rearrange the dashboard by dragging each box around to suit. Once you’ve got the elements of the dashboard in place you can start to drill into your data further. To zoom in on a time period click and drag a selection over it, and to filter on a particular data item (for example, state in the “Top ten states” visualisation) click on it and accept the prompt at the top of the screen. You can also use the freetext search at the top of the screen (this is valid on the Discover and Visualize pages too) to search across the dataset, or within a given field.

Example Analysis

Let’s look at some actual data analyses now. One of the most simple is the amount given in donations over time, split by amount given to project and also as the optional support amount:

One of the nice things about Kibana is the ability to quickly change resolution in a graph’s time frame. By default a bar chart will use an “Auto” granularity on the time axis, updating as you zoom in and out so that you always see an appropriate level of aggregation. This can be overridden to show, for example, year-on-year changes:

You can also easily switch the layout of the chart, for example to show the percentage of the two aggregations relative to each other. So whilst the above chart shows the optional support amount increasing by the year, it’s actually remaining pretty much the same when taken as a percentage of the donations overall – which if you look into the definition of the field (“we encourage donors to dedicate 15% of each donation to support the work that we do.“) makes a lot of sense

Analysis based on text in the data is easy. You can use the Terms sub-aggregation, where here we can see the top five states in terms of donation amount, California consistently being the top of the table.

Since the Terms sub-aggregation shows the Top-x only, you can’t necessarily judge the importance of those values in relation to the rest of the data. To do this more specific analysis you can use the Filters sub-aggregation to use free-form searches to create buckets, such as here to look at how much those from NY and CA donated, vs all other states. The syntax is field:value to include it, and -field:value to negate it. You can string these expressions together using AND and OR.

A lot of the analysis generally sits well in the bar chart visualisation, but the line chart has a role to play too. Donations are grouped according to the value range (<10, between 10 and 100, > 100), and these plot out nicely when considering the number of donations made (rather than total value). Whilst the total donation in a time period is significant, so is the engagement with the donors hence the number of donations made is important to analyse:

As well as splitting lines and bars, you can split charts themselves, which works well when you want to start comparing multiple dimensions without cluttering up a single chart. Here’s the same chart as previously but split out with one line per instance. Arguably it’s clearer to understand, and the relative values of the three items can be better seen here than in the clutter of the previous chart:

Following on from this previous graph, I’m interested in the spike in mid-value ($10-$100) donations at the end of 2011. Let’s pull the graph onto a dashboard and dig into it a bit. I’ve saved the visualisation and brought it in with the saved Search (from the Discover page earlier) and an additional visualisation showing payment methods for the donations:

Now I can click and drag the time frame to isolate the data of interest and we see that the number of donations jumps eight-fold at this point:

Clicking on one of the data points drills into it, and we eventually see that the spike was attributable to the use of campaign gift cards, presumably issued with a value > $10 and < $100.



The simplicity described in this article comes at a cost, or rather, has its limits. You may well notice fields in the input data such as “_projectid”, and if you wanted to relate a donation to a given project you’d need to go and look that project code up manually. There’s no (easy) way of doing this in Elasticsearch – whilst you can easily bring in all the project data too and search on projectid, you can’t display the two (project and donation) alongside each other (easily). That’s because Elasticsearch is a document store, not a relational database. There are some options discussed on the Elasticsearch blog for handling this, none of which to my mind are applicable to this kind of data discovery (but Elasticsearch is used in a variety of applications, not just as a data store for Kibana, so in others cases it is more relevant). Given that, and if you wanted to resolve this relationship, you’d have to go about it a different way, maybe using the linux join command to pre-process the files and denormalise them prior to ingest with logstash. At this point you reach the “right tool/right job” decision – ELK is great, but not for everything :-)


If you need to reload the data (for example, when building this I reprocessed the file in order to define the numerics as such, rather than the default string), you need to :

  • Drop the Elasticsearch data:
    curl -XDELETE 'http://localhost:9200/opendata'
  • Remove the “sincedb” file that logstash uses to record where it last read from in a file (useful for tailing changing input files; not so for us with a static input file)
    rm ~/.sincedb*

    (better here would be to define a bespoke sincedb path in the file input parameters so we could delete a specific sincedb file without impacting other logstash processing that may be using sincedb in the same path)
  • Rerun the logstash as above


Categories: BI & Warehousing

BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack

Fri, 2015-04-24 06:18

I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).

OBIEE Regression Testing

In Atlanta I’ll be talking about Smarter Regression testing for OBIEE. The topic of Regression Testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need to have a good way of rapidly testing changes made. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.

In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.

If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.

Data Discovery with the ELK Stack

My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets with data originating in Hadoop, and more recently have been analysing twitter activity using it. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:

The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:

By its nature of log streaming and full text search, ELK also lends itself well to near real time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.

The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!

Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.

Categories: BI & Warehousing

Data Integration Tips: ODI 12.1.3 – Convert to Flow

Thu, 2015-04-16 13:23

The many who have already upgraded Oracle Data Integrator from the 11g version to 12c probably know about this great feature called “convert to flow”. If not, well…here you go!

First, a little background on why I think this is an excellent bit of functionality. The ODI Upgrade Assistant will convert objects from 11g to 12c and it does a pretty decent job of it. When converting Interfaces, the upgrade process creates a Mapping in ODI 12c by taking the logical “mapping” layout and loading it into a Dataset object. I assumed the reason was because it wasn’t easy to convert an Interface directly to a full on flow-based mapping, which you typically would develop in ODI 12.1.3 rather than using the limited Dataset (only joins, filters, and lookups allowed). After the upgrade, you would then be stuck with loads of mappings that are not using the latest flow-based features and components.


Now, in ODI 12.1.3, we have the ability to convert our Dataset into the standard ODI 12c flow based components within the Mapping. With a right-click on the Dataset component, we can see the “Convert to Flow” option.


Select Convert to Flow and accept the warning that our Mapping will be changed forever…and boom! No more Dataset!

This is great for my individual Mappings, but now I want to convert my migrated Reusable Mapping Datasets to flow based components.


Wait, what? No option to Convert to Flow! It looks like the Reusable Mappings (which were upgraded from my ODI 11g Temporary Interfaces) cannot be converted to flow for some reason. Hmm… Well, let’s finish converting my Datasets to flow based components for the rest of my 342 upgraded Mappings…one-by-one. Yikes! Actually, we can find a better way to do this. Time to get Groovy with the ODI SDK!

Using Groovy, I can create a simple script to loop through all of my mappings, find the dataset, and call the convertToFlow function on that dataset component. Here’s a look at the guts of the script.

for (mapping in mappingsList){
  for (component in componentsList){

    java.util.List convertIssues = new ArrayList()
    blnConvert = 1
    try {
      blnConvert = component.convertToFlow(convertIssues)
      if (blnConvert) {
        for (item in convertIssues) {
          out.println item.toString()
    } catch (Exception e) {
    out.println e;
    out.println mapping.getName() + " had a dataset converted to flow."

Just remember to pass the results list object as a parameter to the convertToFlow call (and make sure the List object is properly instantiated as an ArrayList – as I was humbly reminded by David Allan via Twitter!). Once completed, you should be able to open each mapping and see that the dataset has been removed and only flow-based components exist.

Excellent, now we’ve completed our conversion in no time at all. But wait, what about those Reusable Mappings? Remember, we don’t have the right-click menu option to convert to flow as we did with the standard Mapping. Well, let’s see what our friend the ODI SDK has to say about that!

With a slight tweak to the code, replacing Mapping classes with ReusableMapping classes, we can batch convert our Reusable Mapping dataset components to flow based components in an instant. The reason it works via the API is due to the inheritance of the ReuseableMapping class. It inherits the same component methods from the interface oracle.odi.domain.mapping.IMapComponentOwner, which in turn have the same methods and functions, such as convertToFlow, as we had available in the Mapping class. I’m not quite sure why ODI Studio doesn’t expose “Convert to Flow” in the Reusable Mappings, but I’m sure it’s a simple fix we’ll see in an ODI 12c release down the road.

So there you have it, another Data Integration Tip from Rittman Mead – this time, a little help post-migration from ODI 11g to ODI 12c. If you would like more details on how Rittman Mead can help your migration of Oracle Data Integrator to the latest version, send us a note at We’d love to help!


Categories: BI & Warehousing

Oracle Data Integrator Enterprise Edition Advanced Big Data Option Part 1- Overview and install

Mon, 2015-04-13 14:54

Oracle recently announced Oracle Data Integrator Enterprise Edition Advanced Big Data Options as part of the new release of ODI. It includes various great new functionalities to work on an Hadoop ecosystem. Let’s have a look at the new features and how to install it on Big Data Lite 4.1 Virtual Machine.

Note that some of these new features, for example Pig and Spark support and use of Oozie, requires the new ODI EE Advanced Big Data Option license on-top of base ODI EE.

Pig and Spark support

So far ODI12c allowed us to use Hive for any Hadoop-based transformation. With this new release, we can now use Pig and Spark as well. Depending on the use case, we can choose which technology will give better performance and switch from one to another with very few changes. That’s the beauty of ODI – all you need is to do is create the logical dataflow in your mapping and choose your technology. There is no need to be a Pig Latin expert or a PySpark ninja, all of this will be generated for you! These two technologies are now available in the Topology, along with the Hadoop Data Server to define where lies the Data. You can also see some Loading Knowledge Modules for Pig and Spark.

Pig and Spark in ODI

Pig, as Mark wrote before, is a dataflow language. It makes it really appropriate with the new “flow paradigm” introduced in ODI 12c. The idea is to write a data pipeline in Pig Latin. That code will undercover create MapReduce jobs that will be executed.

Quoting Mark one more time, Spark is a cluster processing framework that can be used in different programming languages, the two most common being Python and Scala. It allows to do operation like filters, joins and aggregates. All of this can be done in-memory which can provides way better performance over MapReduce. The ODI team choose to use Python as a programming language for Spark so the Knowledge Modules will use PySpark.

New Hive Driver and LKMs

This release also brings significant improvements to the existing Hive technology. A new driver as been introduced under the name DataDirect Apache Hive JDBC Driver. It is actually the Weblogic Hive JDBC driver which aims at improving the performance and the stability.

New Hive Driver

New Knowledges Modules are introduced to benefit from this new driver and they are LKMs instead multi-connections IKMs as it use to be. Thanks to that, it can be combined with other LKMs into the same mapping which was not the case before.

Oozie Agent

Oozie is another Apache project and they define it as “a workflow scheduler system to manage Apache Hadoop jobs”. We can create workflow of different jobs in the Hadoop stack, and then schedule it at a certain time or trigger it when data becomes available.

What Oozie does is similar to the role of the ODI agent, and it’s now possible to use directly an existing Oozie engine instead of deploying a standalone agent on the hadoop cluster.

Oozie Engine

The Oozie engine will do what your ODI agent usually does – execution, scheduling, monitoring – but it is integrated in the Hadoop ecosystem. So we will be able to schedule and monitor our ODI jobs at the same place as all our other Hadoop jobs that we use outside of ODI. Oozie can also automatically retrieve the Hadoop logs. Also we lower the footprint because it doesn’t requires to install an ODI-specific component on the cluster. However, according to the white paper (link below), it looks like Load Plans are not supported. So the idea would be to execute the Load Plans with a standalone or JEE agent that will delegate the execution of Big Data-related scenarios to the Oozie Engine.

HDFS support in file-related ODI Tools

Most of the ODI tools handling files can also do it on HDFS now. So you can delete, move, copy files and folders. You can also append files and transfer it to HDFS via FTP. It’s even possible to detect when a file is created on HDFS. All you need to do is to indicate your Hadoop Logical Schema for source, target or both. In the following example I’m copying a file from the Unix filesystem to HDFS.


I think this is a huge step forward. If we want to use ODI 12c for our Hadoop data integration, it must be able to do everything end-to-end. The maintenance or administrative tasks such as archiving, deleting or copying should also be done using ODI. So far it was a bit tedious to created a shell script using hdfs dfs commands and then launch it using OdiOsCommand tool. Now we can directly use the file tools in a package or a procedure!

New mapping components : Jagged and Flatten

The two new components can be used in a Big Data context but also in your traditional data integration. The first one, Jagged, will pivot a set of key-value pairs into a columns with their values.

The Flatten components can be used with advanced files when you have nested attributes, like in JSON. Using a flatten component will generate more rows if needed to extract different values for a same attribute nested into another attribute.


You can see the detail of all the new features in the white paper “Advancing Big Data Integration” for ODI 12c.


How to install it?

This patch must be applied on top of an existing Oracle Data Integrator installation. It is not a bundled patch and it’s only related to Big Data Options so there is no point to install it if you don’t need its functionalities. Also make sure you are licensed for ODIEE Advanced Big Data Option if you plan to use Spark or Pig technology/KMs or execute your jobs using the Oozie engine.

To showcase this, I used the excellent –and free! – Big Data Lite 4.1 VM which already has ODI 12.1.3 and all the Hadoop components we need. So this example will be on an Oracle Enterprise Linux environment.

The first step is to download it from the OTN or My Oracle Support. Also make sure you close ODI Studio and shut down the agents. Then the README recommends to update OPatch and check the OUI. So let’s do that and also set some environment variables and unzip the ODI patch.

[oracle@bigdatalite ~]$ mkdir /home/oracle/bck
[oracle@bigdatalite ~]$ ORACLE_HOME=/u01/ODI12c/
[oracle@bigdatalite ~]$ cd $ORACLE_HOME
[oracle@bigdatalite ODI12c]$ unzip /home/oracle/Desktop/ -d $ORACLE_HOME 
[oracle@bigdatalite ODI12c]$ OPatch/opatch lsinventory -jre /usr/java/latest/
[oracle@bigdatalite ODI12c]$ export PATH=$PATH:/u01/ODI12c/OPatch/
[oracle@bigdatalite ODI12c]$ unzip -d /home/oracle/bck/ /home/oracle/Desktop/ 
[oracle@bigdatalite ODI12c]$ cd /home/oracle/bck/

This patch is actually composed of three piece. One of them, the second one, is only needed if you have an enterprise installation. If you have a standalone install, you can just skip it. Note that I always specify the JRE to be used by OPatch to be sure everything works fine.

[oracle@bigdatalite bck]$ unzip
[oracle@bigdatalite ODI12c]$ cd 20042369/
[oracle@bigdatalite 20042369]$ opatch apply -jre /usr/java/latest/
[oracle@bigdatalite 20042369]$ cd /home/oracle/bck/

 //[oracle@bigdatalite bck]$ unzip
 //[oracle@bigdatalite bck]$ cd 20674616/
 //[oracle@bigdatalite 20674616]$ opatch apply -jre /usr/java/latest/
 //[oracle@bigdatalite 20674616]$ cd /home/oracle/bck/

[oracle@bigdatalite bck]$ unzip 
[oracle@bigdatalite bck]$ cd 20562777/
[oracle@bigdatalite 20562777]$ opatch apply -jre /usr/java/latest/

Now we need to run the upgrade assistant that will execute some scripts to upgrade our repositories. But in Big Data Lite, the tables of the repository have been compressed, so we first need to uncompress them and rebuild the invalid indexes as David Allan pointed it out on twitter. Here are the SQL queries that will create the DDL statement you need to run if you are also using Big Data Lite VM :

 'alter table '||t.owner||'.'||t.table_name||' move nocompress;' q
 from all_tables t
 where owner = 'DEV_ODI_REPO'
 and table_name &lt;&gt; 'SNP_DATA';

select 'alter index '||owner||'.'||index_name||' rebuild tablespace '||tablespace_name ||';'
 from all_indexes
 where owner = 'DEV_ODI_REPO'
 and status = 'UNUSABLE';

Once it’s done we can start the upgrade assistant :

[oracle@bigdatalite 20562777]$ cd /u01/ODI12c/oracle_common/upgrade/bin
[oracle@bigdatalite bin]$ ./ua

Upgrade Assistant

The steps are quite straightforward so I’ll leave it to you. Here I selected Schemas, but if you have a standalone agent you will have to run it again and select “Standalone System Component Configurations” to upgrade the domain as well.

Before opening ODI Studio we will clear the JDev cache so we are sure everything looks nice.

[oracle@bigdatalite bin]$ rm -rf /home/oracle/.odi/system12.

We can now open ODI Studio. Don’t worry the version mentioned there and in the upgrade assistant is still but if you can see the new features it has been installed properly.

The last step is to go in the topology and change the driver used for all the Hive Data Server. As all the new LKMs use the new weblogic driver, we need to define the url instead of the existing one.  We simply select “DataDirect Apache Hive JDBC Driver” instead of the existing Apache driver.

And that’s it, we can now enjoy all the new Big Data features in ODI 12c! A big thanks to David Allan and Denis Gray for their technical and licensing help. Stay tuned as I will soon publish a second blog post detailing some features.

Categories: BI & Warehousing

Previewing Four Sessions at the Atlanta Rittman Mead BI Forum 2015

Thu, 2015-04-09 08:00

In a post earlier this week I previewed three sessions at the upcoming Brighton Rittman Mead BI Forum 2015; in this post I’m going to look at four particularly interesting sessions at the Atlanta Rittman Mead BI Forum 2015 event running the week after Brighton, on May 13th-15th 2015 at the Renaissance Atlanta Midtown Hotel, Atlanta GA. As well as an optional one-day masterclass on big data development by myself and Jordan Meyer on the 13th, the main event itself has keynotes and product update sessions from Oracle’s BI product management team, a data visualisation challenge and a guest talk by John Foreman, author of the book “Data Smart” and Chief Data Scientist at Mailchimp; in terms of the main sessions though there are four that I’m particularly interested in, starting with one by a speaker new to the BI Forum, Qualogy’s Hasso Schaap, who’ll be talking to us about their use of Oracle’s new BI Cloud Service in his session “Developing strategic analytics applications on OBICS PaaS”


“In this session I’ll tell how we use the Oracle BI Cloud Service in our development plans for a strategic analytics application. Focussing on Strategic HR Planning there’s so much you can do with your data that we decided to put it in a packaged app. I will discuss the important parts of the development process and show how we fixed the issues we came up with. Developing in the BI Cloud is different and expectations are also different. 
As an example there’s the part of prediction. How do we predict based on data in the BI Cloud and what are other possibilities. With prediction we were able to tell our customers a different story. A story that was different than before using old-school tools and techniques. In this session I will uncover some of the most appreciated functionality and will happily elaborate on the story behind ‘The present, the future, development and scenario planning’.”

My second featured session is by someone very-well known to previous BI Forum attendees, and to the wider Oracle BI+DW community: Stewart Bryson. Stewart of course used to head-up Rittman Mead in the US and then went-on to become our first Chief Innovation Officer, before leaving to start his own company Red Pill Analytics with Kevin McGinley, another old friend of Rittman Mead and the BI Forum. We’re very pleased to have both Stewart and Kevin delivering sessions at the Atlanta BI Forum, and for Stewart’s session he’s talking about something very close to his heart – “Supercharging BI Delivery with Continuous Integration”:


“One of the things I’ve never understood about the lifecycle features in most BI tools is why the designers feel the need to roll their own source control and DevOps features. Instead of focusing on deeper integration with tools and processes that exist in the other 90% of development paradigms, BI vendors instead start with a clean palette and create something completely siloed and desperately alone. 
In this presentation, we’ll take a look at how some of these other development paradigms approach DevOps — paying perhaps the closest attention to the world of Java development and other JVM languages. We’ll see how approaches such as continuous integration and continuous delivery play a part in rapid, iterative delivery, and how we can apply some of those approaches to the world of OBIEE development.”

My third session is by another speaker new to the BI Forum, but someone who’s well-known in the BI and data warehousing world and who I met in-person for the first time at last year’s Oracle Openworld: Sumit Sarkar. Sumit works for Progress Software, makers of the DataDirect ODBC drivers that powers OBIEE’s connection to Hadoop, for example, as well as connectors to MongoDB, Salesforce, Oracle RightNow and Eloqua, and as he’ll explain in his session “Make sense of NoSQL data using OBIEE”, NoSQL databases : 


“NoSQL databases have stormed the top 10 db-engines rankings with MongoDB at #4 and Cassandra at #8.  It’s inevitable that these NoSQL databases, storing unstructured data without a standard query language, will have BI requirements for unarmed OBIEE teams.  Not even a complete Oracle stack can save you with the release of Oracle NoSQL.This will be the first session of its kind to tackle standards based NoSQL connectivity.  
So join me at BI Forum ’15  to take control of NoSQL data with your RPD and expand big data skills and thought leadership within your organization.  Learn how organizations are using SQL access to NoSQL databases for integration across existing business intelligence platforms. We’ll talk about common challenges and gotchas that shops are facing when exposing unstructured NoSQL data to OBIEE.  It can get out of hand pretty quickly otherwise …”

My final selection is from CERN, the European Organization for Nuclear Research and home of course of the Large Hadron Collider (and who announced on April 1st the first unequivocal evidence for The Force, almost upstaging our announcement of Oracle E-Business Suite being ported to Hadoop and MongoDB). There’s several session at both the Brighton and Atlanta BI Forums on Oracle’s new Big Data Discovery tool, and in this session CERN’s Manuel Martin Marquez will be talking about their work in this area, in his session “Governed Information Discovery: Data-driven decisions for more efficient operations at CERN”


“The European Centre for Nuclear Research, CERN, is running the world’s largest and more powerful particle accelerator complex in order to shed light on how the Universe works and which are its main building blocks.  CERN’s particle accelerators and detectors infrastructure is comprehensively heterogeneous and complex. A number of critical subsystems, which represent cutting-edge technology in several engineering fields, need to be considered: cryogenics, power converters, magnet protection, etc. The historical monitoring and control data derived from these systems has persisted mainly using Oracle database technologies, but also other sorts of data formats such as JSOM, XML and plain text files. All of these must be integrated and combined in order to provide a full picture and better understanding of the overall status of the accelerator complex.
Therefore, a key challenge is to facilitate easy access to, flexible interaction with, and dynamic visualization of heterogeneous data from different sources and domains.  In our session, we will share our experience with a potential solution for finding insights within our data, Oracle Endeca Data Discovery. In addition, we will feature practical examples relating to future possibilities for improving the control and monitoring of CERN’s accelerator complex, optimization results for accelerator operations and a demo of the implemented solution”

Full agenda details on the Atlanta Rittman Mead BI Forum 2015 can be found on the event homepage, along with details of the optional one-day masterclass on Delivering the Oracle Information Management and Big Data Reference Architecture, and our first-ever Data Visualisation Bake-Off, using the dataset. Registration is now open and the event takes place between May 13th and 15th April 2015, at the Renaissance Atlanta Midtown Hotel, Atlanta GA. 

Categories: BI & Warehousing

OBIEE and the Oracle Database 12c In-Memory Option – Article and New Services from Rittman Mead

Wed, 2015-04-08 07:00

NewImageMy latest Business Intelligence column for Oracle Magazine is on the In-Memory Option for Oracle Database 12c, and using it to speed-up dashboards and reports in OBIEE11g. In the article I go through the basics of the in-memory option explaining how it adds in-memory columnar processing to the standard Oracle Database Enterprise Edition, and then I take the Airline Flight Delays dashboard in the OBIEE11g SampleApp v406 and enable it for in-memory processing; for queries that go against the base detail-level tables in the Oracle Database queries run roughly twice-as-fast, whilst queries going against aggregate tables return data instantaneously, all without any need to alter the underlying database schema or migrate to a new database engine.

To my mind there are two main groups of customers who could benefit from moving to Oracle Database 12c and the In-Memory Option; customers who are currently using earlier version of Oracle Database with regular disk-stored row-based storage (or indeed customers using other databases, for example Teradata or Microsoft SQL Server), and customers who’ve implemented Oracle Exalytics with TimesTen as the in-memory database cache, and who would now like to take advantage of the additional features and lower cost-of-ownership with in-memory processing directly in the Oracle Database.

If you already have licenses for Oracle Database Enterprise Edition you’ll only need to add the additional In-Memory Option license to enable these new features, whereas if you’re using TimesTen on Exalytics there are special terms for customers who wish to trade-in those licenses for Oracle Database Enterprise Edition and In-Memory Options licenses – and once you’ve moved over to Oracle Database 12c and the In-Memory Option, you’ll benefit from:

  • Access to full Oracle SQL including advanced analytics functions, aggregation and transformation capabilities
  • Moving to Oracle’s strategic database technology for in-memory analytics and Exalytics in-memory aggregate caching
  • Compatibility with existing Oracle Database functionality, making it easy to move reporting databases into Exalytics and enable for in-memory analytics
  • Columnar processing, an alternative to traditional row-based storage that’s better suited to BI-style filtering against attribute values
  • Full compatibility with all reporting and ETL tools that support access to Oracle Database data sources
  • Additional optimisations around aggregation, table joining and other BI-style queries
  • Faster dashboards, more interactive reporting and less maintenance compared to maintaining TimesTen

To get you started with either of these options, Rittman Mead have created two packages for customers looking to adopt Oracle Database 12c In-Memory Option; one for customers on traditional data warehouse databases looking to use In-memory for the first time, and another for customers using Exalytics who want to migrate from Oracle TimesTen. Full details of these two packages are now up on our website at our Supercharge OBIEE with the Oracle 12c In-Memory Option web page, or you can contact us at to talk through your particular requirements in more detail.

Categories: BI & Warehousing

Take Part in the BI Survey 15, and Have Your Voice Heard!

Wed, 2015-04-08 04:00

Long-term readers of this blog will know that we’ve supported for many years the BI Survey, an independent survey of BI tools customers and implementors. Rittman Mead have no (financial or other) interest in the BI Survey or its organisers, but we like the way it gathers in detailed data on which tools work best and when, and it’s been a useful set of data for companies such as Oracle when they prioritise their investment in tools such as OBIEE, Essbase and the BI Applications.


Here’s the invite text and link to the survey:

“We would like to invite you to participate in The BI Survey 15, the world’s largest annual survey of business intelligence (BI) users.
BARC’s annual survey gathers input from thousands of organizations to analyze their buying decisions, implementation cycles and the benefits they achieve from using BI software.
As a participant, you will:

  • Receive a summary of the results from the survey when it is published
  • Be entered into a draw to win one of ten $50 Amazon vouchers
  • Ensure that your experiences are included in the final analyses

Click here to take part
Business and technical users, as well as vendors and consultants, are all welcome to participate.
You will be able to answer questions on your usage of a BI product from any vendor and your experience with your service provider.
The BI Survey 15 is strictly vendor-independent: It is not sponsored by any vendor and the results are analyzed and published independently. 
Your answers will be used anonymously and your personal details will not be passed on to software vendors or other third parties.
The BI Survey 15 should take about 20 minutes to complete. For further information, please contact Adrian Wyszogrodzki at BARC ( 

Thank you in advance for taking part.”


Categories: BI & Warehousing

Previewing Three Sessions at the Brighton Rittman Mead BI Forum 2015

Tue, 2015-04-07 03:00

As well as a one-day masterclass by myself and Jordan Meyer, a data visualisation challenge, keynotes and product update sessions from Oracle and our guest speaker from the Oracle Data Warehouse Global Leaders Program, the Brighton Rittman Mead BI Forum 2015 has of course a fantastic set of speakers and sessions on a wide range of topics around Oracle BI, data warehousing and big data. In this blog post I’m going to highlight three sessions at the Brighton BI Forum, and later in the week I’ll be doing the same with three sessions from the Atlanta event – so let’s start with a speaker who’s new to the BI Forum but very well-known to the UK OBIEE community – Steve Devine.

Steve is one of the most experienced OBIEE practitioners in the Europe, recently with Edenbrook / Hitachi Consulting, Claremont and now working with Altius in the UK. In his session at the Brighton BI Forum 2015 Steve’s going to talk to us about what’s probably the hottest topic around OBIEE at the moment in his session “The Art and Science of Creating Effective Data Visualisations”. Over to Steve:


“These days, news publications and the internet are packed with eye-catching data visualisations and infographics – the New York Times, the Guardian or Information Is Beautiful to name but a few. Yet the scientists and statisticians tell us that everything could be a bar chart, and that nothing should ever be a pie chart! How do we make sense of these seemingly disparate, contrasting views?
My presentation provides an introduction on how graphic design principles complement the more science orientated aspects of data viz design. It will focus on a simple-to-apply design framework that brings all of these principles together, enabling you to create visualisations that have the right balance of aesthetics and function. By example, I’ll apply this framework to traditional BI scenarios such as operational and exploratory dashboards, as well as new areas that BI tools are just beginning to support such as commentary and storytelling. I’ll also look at how well Oracle’s BI tools address today’s data visualisation needs, and how they compare to the competition.”

On the topic of data visualisation, I’m also very pleased to have Daniel Adams from Rittman Mead’s US office coming over to the Brighton BI Forum to talk about effective dashboard design. Daniel’s been working with Rittman Mead clients in the US and Europe for the past year helping them apply data visualisation and dashboard design best practices to their dashboards and reports, and he’ll be sharing some of his methods and approaches in his session “User Experience First: Guided information and attractive dashboard design”:


“Most front end OBI developers can give users exactly what they ask for, but will that lead to insightful dashboards that improve data culture and escalate the user xperience? One the biggest  mistakes I see as a designer, are dashboards that are a cluttered collection of tables and graphs. Poorly designed dashboards can prevent users from adopting a BI implementation, diminishing the ROI. 
In this session, attendees will learn to design dashboards that inform, instruct, and lead to smart discussion and decisions.  This includes learning to visualize data to convey meaning, implementing attractive visual design, and creating a layout that leads users through a target rich environment. We will walk through a series of “before” and “after” dashboards that demonstrate the difference between meeting a requirement, and using proven UX and UI design concepts to make OBIEE dashboards insightful and enjoyable to use.”

Finally, someone I’m very pleased to have over to the Brighton BI Forum for the first time is Gerd Aiglstorfer. I first met Gerd at an Oracle event in Germany several years ago, and since then I’ve noticed several of his blogs and the launch of his Oracle University Expert Sessions on OBIEE development, administration and RPD modelling. Gerd is one of Europe’s premier experts in OBIEE and Oracle BI, and for his inaugural BI Forum presentation he’ll be deep-diving into one of the most complex topics around repository modeling in his session “Driving OBIEE Join Semantics on Multi Star Queries as User”:


“Multi star queries are a very useful and powerful functionality of OBIEE. But when I examine reports developed by business users or report developers I often find some misunderstandings on how it is working and queries are build by OBIEE. As additionally the execution strategy in OBIEE has changed to generate SQL of multi star queries I had the idea to introduce the topic at the BI Forum. Thus, it’s a quite interesting topic to go into technical details of OBIEE SQL generator engine.
I’ll introduce how users can drive join semantics on common fields in multi star queries. You will get a full picture of the functionality for a better understanding of how report creation affects SQL generation. I recognized some inconsistencies during my tests of the new OBIEE logic in January 2014. I will demonstrate the issues and would like to discuss if you would say: “It’s a defect within the SQL generator engine” – as I do.”

Full agenda details on the Brighton Rittman Mead BI Forum 2015 can be found on the event homepage, along with details of the optional one-day masterclass on Delivering the Oracle Information Management and Big Data Reference Architecture, and our first-ever Data Visualisation Bake-Off, using the dataset.

Categories: BI & Warehousing

Realtime BI Show with Kevin and Stewart – BI Forum 2015 Special!

Mon, 2015-04-06 08:00

Jordan Meyer and I were very pleased to be invited onto the Realtime BI Show podcast last week, run by Kevin McGinley and Stewart Bryson, to talk about the upcoming Rittman Mead BI Forum running in Brighton and Atlanta in May 2015. Stewart and Kevin are of course speaking at the Atlanta BI Forum event on May 13th-15th 2015 at the Renaissance Atlanta Midtown Hotel, Atlanta, and in the podcast we talk about the one-day masterclass that Jordan and I are running, some of the sessions at the event, and the rise of big data and data discovery within the Oracle BI+DW industry.

Full details on the two BI Forum 2015 events can be found on the event homepage, along with details of the optional one-day masterclass on Delivering the Oracle Information Management and Big Data Reference Architecture, the guest speakers and the inaugural Data Visualization Challenge. Registration is now open and can be done online using the two links below.

  • Rittman Mead BI Forum 2015, Brighton –  May 6th – 8th 2015 

We’ve also set up a special discount code for listeners to the Realtime BI Show, with 10%-off both registration and the masterclass fee for both the Brighton and Atlanta events – use code RTBI10 on the Eventbrite registration forms to qualify.

Categories: BI & Warehousing

Analysing ODI performance with Flame Graphs

Wed, 2015-04-01 17:01

Flame Graphs are a visualisation that I learnt about through the excellent Linux systems performance work of Brendan Gregg, and saw Luca Canali talk about recently at UKOUG Tech 14. They’re a brilliant way of summarising extremely dense information in a way from which the main components accounting for the most time can be identified. I was recently doing some analysis for a client on their ODI batch runtime and I thought it would be a good idea to try them out. Load Plans can have complex hierarchies to them and working out which main sections account for what time can be tricky, as can following a load plan step through to a session and on to a session step and its constituent parts.

A flame graph is made up of the “stack trace” on the y-axis, and the amount of time spent in each on the x-axis. This is different from most other standard visualisations where the x-axis represents the passage of time, and instead summarises the data at multiple levels of the stack trace hierarchy. The “stack trace” in this case with ODI is Load plan -> load plan step (load plan step […]) -> session -> session step -> task. It’s as easy to see the overall run time as it is a load plan step part way down, as a constituent task of a session step. And what’s more, flame graphs look nice! This may seem a flimsy reason for using them on their own, but it’s a bonus over trawling through dull tables of data alone.

Looking at the flame graph above (taken from a demo BI Apps implementation) it’s nice and easy to see that the Warehouse Load Phase accounts for c.75% of the time, within which the two areas accounting for most time are AP and AR balances. This is from literally a single glance at one graphic. Flame Graphs are built as SVGs which enables them to be interactive (here’s an example). Clicking on any of the stack trace boxes drills into that area, so for the tasks taking less time (and so displaying less text) this is useful to see the specifics. Here’s the GL balance load in detail, showing how long the row inserts take in proportion to the index build:


Creating the flame graph is simple. You just need a stack trace that is semi-colon separated, followed by a space-delimited counter value at the end. A bit of recursive SQL magic with the SNP_ tables (helpfully documented by Oracle here) gives us this kind of output file with one line for every task executed and its duration:

;Start_Load_Plan;Global_Variable_Refresh;Source_Extract_Phase;1_General;2_General_PRE-SDE;3_PRE-SDE_Day;Finalize_Day;Finalize_W_DAY_D;CREATE_INDEXES;Create_Indexes_:_W_DAY_D_2/2;EXEC_TABLE_MAINT_PROC;TABLE_MAINT_PROC;Create Indexes 3

which you then run through the Flame Graph tool:

cat /tmp/odi.out |~/git/FlameGraph/ --title "EBSVISION FIN HR_21_20141021_223159 / 2014-10-24 15:41:42" > /tmp/odi-flame-graph.svg

Simply load the resulting SVG into a web browser such as Chrome, and you’re done. Here’s an example that you can download and try out.

Categories: BI & Warehousing

Announcing Oracle E-Business Suite for Hadoop and MongoDB

Tue, 2015-03-31 23:50

Rittman Mead are very pleased today to announce our special edition of Oracle E-Business Suite R12 running on Apache Hadoop and MongoDB, for customers looking for the ultimate in scalability, flexible data storage and lower cost-of-ownership. Powered by Hadoop technologies such as Apache Hive, HDFS and MapReduce, optional reference data storage in MongoDB and reporting provided by Apache Pig, we think this represents the ultimate platform for large deployments of Oracle’s premier ERP suite.


In this special edition of Oracle E-Business Suite R12, we’ve replaced the Oracle Database storage engine with Hadoop, MapReduce and Apache Hive, with MapReduce providing the data processing engine and Apache Hive providing a SQL layer integrated with Oracle Forms. We’ve replaced Oracle Workflow with Apache Oozie and MongoDB as the optional web-scale NoSQL database for document and reference data storage, freeing you from the size limitations of relational databases, the hassles of referential integrity and restrictions of defined schemas. Developer access is provided through Apache Hue, or you can write your own Java MapReduce and or JavaScript MongoDB API programs to extend E-Business Suite’s functionality. Best of all, there’s no need for expensive DBAs as developers handle all data-modeling themselves (with MongoDB’s collections automatically adapting to new data schemas), and HDFS’s three-node replication removes the need for complicated backup & recovery procedures.


We’ve also brought Oracle Reports into the 21st century by replacing it with Apache Pig, a high-level abstraction language for Hadoop that automatically compiles your “Pig Latin” programs into MapReduce code, and allows you to bring in data from Facebook, Twitter to combine with your main EBS dataset stored in Hive and MongoDB.


On the longer-term roadmap, features and enhancements we’re planning include:

  • Loosening the current INSERT-only restriction to allow UPDATES, DELETEs and full ACID semantics once HIVE-5317 is implemented. 
  • Adding MongoDB’s new write-reliabiity and durability so that data is always saved when EBS writes it to the underlying MongoDB collection
  • Reducing the current 5-30 minute response times to less than a minute by moving to Tez or Apache Spark
  • Providing integration with Oracle Discoverer 9iAS to delight end-users, and provide ad-hoc reporting truly at the speed-of-thought

For more details on our special Oracle E-Business Suite for Hadoop edition, contact us at – but please note we’re only accepting new customers for today, April 1st 2015. 

Categories: BI & Warehousing

Oracle GoldenGate, MySQL and Flume

Mon, 2015-03-30 13:05

Back in September Mark blogged about Oracle GoldenGate (OGG) and HDFS . In this short followup post I’m going to look at configuring the OGG Big Data Adapter for Flume, to trickle feed blog posts and comments from our site to HDFS. If you haven’t done so already, I strongly recommend you read through Mark’s previous post, as it explains in detail how the OGG BD Adapter works.  Just like Hive and HDFS, Flume isn’t a fully-supported target so we will use Oracle GoldenGate for Java Adapter user exits to achieve what we want.

What we need to do now is

  1. Configure our MySQL database to be fit for duty for GoldenGate.
  2. Install and configure Oracle GoldenGate for MySQL on our DB server
  3. Create a new OGG Extract and Trail files for the database tables we want to feed to Flume
  4. Configure a Flume Agent on our Cloudera cluster to ‘sink’ to HDFS
  5. Create and configure the OGG Java adapter for Flume
  6. Create External Tables in Hive to expose the HDFS files to SQL access

OGG and Flume

Setting up the MySQL Database Source Capture

The MySQL database I will use for this example contains blog posts, comments etc from our website. We now want to use Oracle GoldenGate to capture new blog post and our readers’ comments and feed this information in to the Hadoop cluster we have running in the Rittman Mead Labs, along with other feeds, such as Twitter and activity logs.

The database has to be configured to user binary logging and also we need to ensure that the socket file can be found in /tmp/mysql.socket. You can find the details for this in the documentation. Also we need to make sure that the tables we want to extract from are using the InnoDB engine and not the default MyISAM one. The engine can easily be changed by issuing

alter table wp_mysql.wp_posts engine=InnoDB;

Assuming we already have installed OGG for MySQL on /opt/oracle/OGG/ we can now go ahead and configure the Manager process and the Extract for our tables. The tables we are interested in are


First configure the manager

-bash-4.1$ cat dirprm/mgr.prm 
PORT 7809

Now configure the Extract to capture changes made to the tables we are interested in

-bash-4.1$ cat dirprm/mysql.prm 
SOURCEDB wp_mysql, USERID root, PASSWORD password
discardfile /opt/oracle/OGG/dirrpt/FLUME.dsc, purge
EXTTRAIL /opt/oracle/OGG/dirdat/et
TRANLOGOPTIONS ALTLOGDEST /var/lib/mysql/localhost-bin.index
TABLE wp_mysql.wp_comments;
TABLE wp_mysql.wp_posts;
TABLE wp_mysql.wp_users;
TABLE wp_mysql.wp_terms;
TABLE wp_mysql.wp_term_taxonomy;

We should now be able to create the extract and start the process, as with a normal extract.

ggsci>add extract mysql, tranlog, begin now
ggsci>add exttrail ./dirdat/et, extract mysql
ggsci>start extract mysql
ggsci>info mysql
ggsci>view report mysql

We will also have to generate metadata to describe the table structures in the MySQL database. This file will be used by the Flume adapter to map columns and data types to the Avro format.

-bash-4.1$ cat dirprm/defgen.prm 
-- To generate trail source-definitions for GG v11.2 Adapters, use GG 11.2 defgen,
-- or use GG 12.1.x defgen with "format 11.2" definition format.
-- If using GG 12.1.x as a source for GG 11.2 adapters, also generate format 11.2 trails.

-- UserId logger, Password password
SOURCEDB wp_mysql, USERID root, PASSWORD password

DefsFile dirdef/wp.def

TABLE wp_mysql.wp_comments;
TABLE wp_mysql.wp_posts;
TABLE wp_mysql.wp_users;
TABLE wp_mysql.wp_terms;
TABLE wp_mysql.wp_term_taxonomy;
-bash-4.1$ ./defgen PARAMFILE dirprm/defgen.prm 

        Oracle GoldenGate Table Definition Generator for MySQL
      Version OGGCORE_12.

**            Running with the following parameters                  **
SOURCEDB wp_mysql, USERID root, PASSWORD ******
DefsFile dirdef/wp.def
TABLE wp_mysql.wp_comments;
Retrieving definition for wp_mysql.wp_comments.
TABLE wp_mysql.wp_posts;
Retrieving definition for wp_mysql.wp_posts.
TABLE wp_mysql.wp_users;
Retrieving definition for wp_mysql.wp_users.
TABLE wp_mysql.wp_terms;
Retrieving definition for wp_mysql.wp_terms.
TABLE wp_mysql.wp_term_taxonomy;
Retrieving definition for wp_mysql.wp_term_taxonomy.

Definitions generated for 5 tables in dirdef/wp.def.

Setting up the OGG Java Adapter for Flume

The OGG Java Adapter for Flume will use the EXTTRAIL created earlier as a source, pack the data up and feed to the cluster Flume Agent, using Avro and RPC. The Flume Adapter thus needs to know

  • Where is the OGG EXTTRAIL to read from
  • How to treat the incoming data and operations (e.g. Insert, Update, Delete)
  • Where to send the Avro messages to

First we create a parameter file for the Flume Adapter

-bash-4.1$ cat dirprm/flume.prm
SETENV ( GGS_USEREXIT_CONF = "dirprm/flume.props")
SOURCEDEFS ./dirdef/wp.def
DISCARDFILE ./dirrpt/flume.dsc, purge

TABLE wp_mysql.wp_comments;
TABLE wp_mysql.wp_posts;
TABLE wp_mysql.wp_users;
TABLE wp_mysql.wp_terms;
TABLE wp_mysql.wp_term_taxonomy;

There are two things to note here

  • The OGG Java Adapter User Exit is configured in a file called flume.props
  • The source tables’ structures are defined in wp.def

The flume.props file is a ‘standard’ User Exit config file

-bash-4.1$ cat dirprm/flume.props 

# Indicates if the operation timestamp should be included as part of output in the delimited separated values
# true - Operation timestamp will be included in the output
# false - Operation timestamp will not be included in the output
# Default :- true

# Optional properties to use the transaction grouping functionality

### native library config ###

javawriter.bootoptions=-Xmx32m -Xms32m -Djava.class.path=ggjava/ggjava.jar

Some points of interest here are

  • The Flume agent we will send our data to is running on port 4545 on host
  • We want each record to be prefixed with I(nsert), U(pdated) or D(delete)
  • We want each record to be postfixed with a timestamp of the transaction date
  • The Java class will do the actual work. (The curios reader can view the code in /opt/oracle/OGG/AdapterExamples/big-data/flume/src/main/java/com/goldengate/delivery/handler/flume/

Before starting up the OGG Flume, let’s first make sure that the Flume agent on bd5node1 is configure to receive our Avro message (Source) and also what to do with the data (Sink)

a1.channels = c1
a1.sources = r1
a1.sinks = k2
a1.channels.c1.type = memory
a1.sources.r1.channels = c1 
a1.sources.r1.type = avro 
a1.sources.r1.bind = bda5node1
a1.sources.r1.port = 4545
a1.sinks.k2.type = hdfs = c1
a1.sinks.k2.hdfs.path = /user/flume/gg/%{SCHEMA_NAME}/%{TABLE_NAME} 
a1.sinks.k2.hdfs.filePrefix = %{TABLE_NAME}_ 

Here we note that

  • The agent’s source (inbound data stream) is to run on port 4545 and to use avro
  • The agent’s sink will write to HDFS and store the files  in /user/flume/gg/%{SCHEMA_NAME}/%{TABLE_NAME}
  • The HDFS files will be rolled over every 1Mb (1048576 bytes)

We are now ready to head back to the webserver that runs the MySQL database and start the Flume extract, that will feed all committed MySQL transactions against our selected tables to the Flume Agent on the cluster, which in turn will write the data to HDFS

-bash-4.1$ export LD_LIBRARY_PATH=/usr/lib/jvm/jdk1.7.0_55/jre/lib/amd64/server
-bash-4.1$ export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_55/
-bash-4.1$ ./ggsci
ggsci>add extract flume, exttrailsource ./dirdat/et 
ggsci>start flume
ggsci>info flume
EXTRACT    FLUME     Last Started 2015-03-29 17:51   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:06 ago)
Process ID           24331
Log Read Checkpoint  File /opt/oracle/OGG/dirdat/et000008
                     2015-03-29 17:51:45.000000  RBA 7742

If I now submit this blogpost I should see the results showing up our Hadoop cluster in the Rittman Mead Labs.

[oracle@bda5node1 ~]$ hadoop fs -ls /user/flume/gg/wp_mysql/wp_posts
-rw-r--r--   3 flume  flume   3030 2015-03-30 16:40 /user/flume/gg/wp_mysql/wp_posts/wp_posts_.1427729981456

We can quickly create an externally organized table in Hive to view the results with SQL

     op string, 
 ID                     int,
 post_author            int,
 post_date              String,
 post_date_gmt          String,
 post_content           String,
 post_title             String,
 post_excerpt           String,
 post_status            String,
 comment_status         String,
 ping_status            String,
 post_password          String,
 post_name              String,
 to_ping                String,
 pinged                 String,
 post_modified          String,
 post_modified_gmt      String,
 post_content_filtered  String,
 post_parent            int,
 guid                   String,
 menu_order             int,
 post_type              String,
 post_mime_type         String,
 comment_count          int,
     op_timestamp timestamp
 COMMENT 'External table ontop of GG Flume sink, landed in hdfs'
 LOCATION '/user/flume/gg/wp_mysql/wp_posts/';

hive> select post_title from gg_flume.wp_posts where op='I' and id=22112;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1427647277272_0017, Tracking URL =
Kill Command = /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/bin/hadoop job  -kill job_1427647277272_0017
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2015-03-30 16:51:17,715 Stage-1 map = 0%,  reduce = 0%
2015-03-30 16:51:32,363 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.88 sec
2015-03-30 16:51:33,422 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.38 sec
MapReduce Total cumulative CPU time: 3 seconds 380 msec
Ended Job = job_1427647277272_0017
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 3.38 sec   HDFS Read: 3207 HDFS Write: 35 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 380 msec
Oracle GoldenGate, MySQL and Flume
Time taken: 55.613 seconds, Fetched: 1 row(s)

Please leave a comment and you’ll be contributing to an OGG Flume!

Categories: BI & Warehousing

Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools

Fri, 2015-03-27 08:44

There comes the point in any sufficiently complex or difficult problem diagnosis that the log files in OBIEE alone are not sufficient for building up a complete picture of what’s going on. Even with the debug/trace data that Presentation Services and other components can be configured precisely to write you’re sometimes just left having to guess what is going on inside the black box of each of the OBIEE system components.

Here we’re going to look at a couple of examples of lifting the lid just a little bit further on what OBIEE is up to, using standard Linux diagnostic tools. These are not something to be reaching for in the first instance, but more getting on to a last resort. Almost always the problem is simpler than you’ll think, and leaping for an network trace or stack trace is going to be missing the wood for the trees.

Diagnostics in action

At a client recently they had a problem with a custom skin deployment on a clustered (scaled-out) OBIEE deployment. Amongst other things the skin was setting the default palette for charts (viewui/chart/dvt-graph-skin.xml), and they were seeing only 50% of chart executions pick up the custom palette – the other 50% used the default. If either entire node was shut down, things were fine, but otherwise it was a 50:50 chance what the colours would be. Most odd….

When you configure a custom skin in OBIEE you should be setting CustomerResourcePhysicalPath in instanceconfig.xml, along with CustomerResourceVirtualPath. Both these are necessary so that Presentation Services knows:

  1. Logical – How to generate URLs for content requested by the user’s browser (eg logos, CSS files, etc).
  2. Physical – How to physically reference files on the file system that are read by OBIEE itself (eg XML files, language files)

The way the client had configured their custom skin was that it was on storage local to each node, and in a node-specific path, something like this:

  • /data/instance1/s_custom/
  • /data/instance2/s_custom/

Writing out the details in hindsight always makes a problem’s root cause a lot more obvious, but at the time this was a tricky problem. Let’s start with the basics. Java Host is responsible for rendering charts, and for some reason, it was not reading the custom colour scheme file from the custom skin correctly. Presentation Services uses all the available Java Hosts in a cluster to request charts, presumably on some kind of round-robin basis. An analysis request on NODE01 has a 50:50 chance of getting its chart rendered on Java Host on NODE01 or Java Host on NODE02:

Turned all the log files up to 11 didn’t yield anything useful. For some reason half the time Java Host would just “ignore” the custom skin. Shutting down each node proved that in isolation the custom skin configuration on each node was definitely correct, because then the colours started working just fine. It was only when multiple Java Hosts across the nodes were active that there was a problem.

How Java Host picks up the custom skin is entirely undocumented, and I ended up figuring out that it must get the path to the skin as part of the chart request from Presentation Services. Since Presentation Services on NODE01 has been configured with a CustomerResourcePhysicalPath of /data/instance1/s_custom/, Java Host on NODE02 would fail to find this path (since on NODE02 the skin is located at /data/instance2/s_custom/) and so fall back on the default. This was my hypothesis that I then proved by making the path available for each skin available on each node (symlink, or using a standard path would also have worked, eg /data/shared/s_custom, or even better, a shared mount point), and from there everything worked just fine.

But a hypothesis and successful resolution alone wasn’t entirely enough. Sure the client was happy, but there was that little itch, that unknown “black box” system that appeared to behave how I had deduced, but could we know for sure?

tcpdump – network analysis

All of the OBIEE components communicate with each other and the outside world over TCP. When Presentation Services wants a chart rendered it does so by sending a request to Java Host – over TCP. Using the tcpdump tool we can see that in action, and inspect what gets sent:

$ sudo tcpdump -i venet0 -i lo -nnA 'port 9810'

The -A flag capture the ASCII representation of the packet; use -X if you want ASCII and hex. Port 9810 is the Java Host listen port.

The output looks like this:

You’ll note that in this case it’s intra-node communication, i.e. src and dest IP addresses are the same. The port for Java Host (9810) is clear, and we can verify that the src port (38566) is Presentation Services with the -p (process) flag of netstat:

$ sudo netstat -pn |grep 38566
tcp        0      0         ESTABLISHED 5893/sawserver

So now if you look in a bit more detail at the footer of the request from Presentation Services that tcpdump captured you’ll see loud and clear (relatively) the custom skin path with the graph customisation file:

Proof that the Presentation Services is indeed telling Java Host where to go and look for the custom attributes (including colours)! NB this is on a test environment, so that paths vary from the /data/instance... example above)

strace – system call analysis

So tcpdump gives us the smoking gun, but can we find the corpse as well? Sure we can! strace is a tool for tracing system calls, and a fantastically powerful one, but here’s a very simple example:

$strace -o /tmp/obijh1_strace.log -f -p $(pgrep -f obijh1)

-o means to write it to file, -f follows child processes as well, and -p passes the process id that strace should attach to. Have set the trace running I run my chart, and then go and pick through my trace file.

We know it’s the dvt-graph-skin.xml file that Java Host should be reading to pick up the custom colours, so let’s search for that:

Well there we go – Java Host went to go and look for the skin in the path that it was given by Presentation Services, and couldn’t find it. From there it’ll fall back on the product defaults.

Right Tool, Right Job

As as I said at the top of this article, these diagnostic tools are not the kind of things you’d be using day to day. Understanding their output is not always easy and it’s probably easy to do more harm than good with false assumption about what a trace is telling you. But, in the right situations, they are great for really finding out what is going on under the covers of OBIEE.

If you want to find out more about this kind of thing, this page is a great starting point.

Categories: BI & Warehousing

New Oracle Big Data Quick-Start Packages from Rittman Mead

Wed, 2015-03-25 05:00

Many organisations using Oracle’s business intelligence and data warehousing tools are now looking to extend their capabilities using “big data” technologies. Customers running their data warehouses on Oracle Databases are now looking to use Hadoop to extend their storage capacity whilst offloading initial data loading and ETL to this complementary platform; other customers are using Hadoop and Oracle’s Big Data Appliance to add new capabilities around unstructured and sensor data analysis, all at considerably lower-cost than traditional database storage.


In addition, as data and analytics technologies and capabilities have evolved, there has never been a better opportunity to reach further into your data to exploit more value. Big Data platforms, Data Science methods and data discovery technologies make it possible to unlock the power of your data and put it in the hands of your  executives and team members – but what is it worth to you? What’s the value to your organisation of exploring deeper int the data you have, and how do you show return?

Many organisations have begin to explore Big Data technologies to understand where they can exploit value and extend their existing analytics platforms, but what’s the business case? The good news is, using current platforms, and following architectures like the Oracle Information Management and Big Reference Architecture written in conjunction with Rittman Mead, the foundation is in place to unlock a range of growth opportunities. Finding new value in existing data, predictive analytics, data discovery, reducing the cost of data storage, ETL offloading are all starter business cases proven to return value quickly.


To help you start on the Oracle big data journey, Rittman Mead have put together two quick-start packages focuses on the most popular Oracle customer use-cases;

If this sounds like something you or your organization might be interested in, take a look at our new Quick Start Oracle Big Data and Big Data Discovery packages from Rittman Mead home page, or drop me an email at and I’ll let you know how we can help.

Categories: BI & Warehousing

RM BI Forum 2015 : Justification Letters for Employers

Tue, 2015-03-24 03:48

(Thanks to Christian Berg @Nephentur for the suggestion, and acknowledgements to ODTUG KScope for the original idea – our favourite conference after the BI Forum)

The Rittman Mead BI Forum 2015 promises to be our best BI Forum yet, with fantastic speakers at each event, keynotes and guest speakers from Oracle and John Foreman, author of the bestselling book “Data Smart”, a data visualisation challenge and an optional one-day masterclass on delivering Oracle’s new Information Management and Big Data reference architecture by Rittman Mead’s Mark Rittman and Jordan Meyer. Uniquely amongst Oracle BI events we keep the numbers attending very limited and run just a single stream at each event, so everyone takes part in the same sessions and gets to meet all the attendees and speakers over the three days.

Sometimes though, management within organizations require special justification for team members to attend events like these, and to help you put your case together and get across the unique education and networking benefits of the Rittman Mead BI Forum, we’ve prepared justification letters for you to complete with your details, one each for the Brighton and Atlanta events. Click on the links below to download sample justification letters for the Brighton BI Forum running on May 6th-8th 2015, and the Atlanta one running the week after on May 13th-15th 2015:

Full details on the BI Forum 2015 agenda and how to register can be found on the Rittman Mead BI Forum 2015 home page, with registration open until the weekend before each event – hurry though as attendee numbers are strictly limited.

Categories: BI & Warehousing

OBIEE nqcmd Tidbits

Mon, 2015-03-23 21:42

nqcmd is the ODBC command line tool that always has, and hopefully always will, shipped with OBIEE. It enables you to manually fire queries directly at the BI Server, rather than through the usual way of Presentation Services generating Logical SQL and sending it to BI Server. This can be useful in several cases:

  1. Automated cache purging, by sending one of the SAPurge[…] ODBC commands to the BI Server, usually done as part of a script
  2. Automated execution of Logical SQL, often done to support testing scenarios
  3. Load Testing the BI Server (via a magic undocumented switch, SA_NQCMD_ADVANCED)
  4. Manual interogation of the BI Server – if you want to poke and prod nqsserver without launching a web browser, nqcmd is your friend :)

In using nqcmd there’re a couple of things I want to demonstrate here that I find useful but haven’t seen discussed [in detail] elsewhere.

Query Log via nqcmd

All BI Server queries run with a LOGLEVEL>=1 will write some log details to nqquery.log. The usual route to view this is either on the server directly itself, transferring it off with a tool such as WinSCP, or through the Administration page of OBIEE. Another option that is available is from nqcmd itself. You need to do two things:

  1. Set the environment variable SA_NQCMD_ADVANCED to Yes
  2. Include the command line arguments -ShowQueryLog -H when you invoke nqcmd. I don’t know what -H does – it’s just specified as being required for this to work.

Here’s a simple example in action:

[oracle@demo ~]$ export SA_NQCMD_ADVANCED=Yes
[oracle@demo ~]$ nqcmd -d AnalyticsWeb -u prodney -p Admin123 -ShowQueryLog -H

          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved

Connection open with info:
[0][State: 01000] [DataDirect][ODBC lib] Application's WCHAR type must be UTF16, because odbc driver's unicode type is UTF16

        [T]able info
        [C]olumn info
        [D]ata type info
        [F]oreign keys info
        [P]rimary key info
        [K]ey statistics info
        [S]pecial columns info
        [Q]uery statement
Select Option: Q

Give SQL Statement: SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"
SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"
Row count: 1
[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-0] [] [ecid: 0054Sw944KmFw000jzwkno0003ac0000rl,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] ###
########################################### [[
-------------------- SQL Request, logical request hash:
SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"

[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-34] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Query Status: Successful Completion [[

[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-28] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Physical query response time 0 (seconds), id <<333971>> [[


[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-29] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Physical Query Summary Stats: Number of physical queries 1, Cumulative time 0, DB-connect time 0 (seconds) [[

[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-33] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Logical Query Summary Stats: Elapsed time 0, Response time 0, Compilation time 0 (seconds) [[


Neat! But so what? Well, I see two uses straight away:

  1. In some situations you may not have access to the filesystem of the server on which the BI Server is running. For example, as a consultant I’ve been to clients where I’m given the Administration Tool client installation only. If I want to debug an RPD that I’m developing I’ll usually want to poke around in nqquery.log to see quite what physical SQL is being generated – and now I can.
  2. There was a discussion on the EMG mailing list recently about generating Physical SQL without executing it on the database. I’m going to discuss this in the next section of this article, and to do the analysis for this rapidly I’m using the inline query log.
Generating Physical SQL for OBIEE without Executing it – SKIP_PHYSICAL_QUERY_EXEC

OBIEE generates the Physical SQL that it runs against the database dynamically, at runtime. It takes the Logical request (“Logical SQL”), runs it through the RPD and generates one or more “Physical SQL” statements to be executed on the database as required to pull back the necessary data. A question arose recently on the EMG mailing list as to whether it is possible to get the Physical SQL – without executing it. You can imagine the benefits of this (namely, regression testing) since executing the database query each time is typically going to be expensive in machine resource and time consuming.

In SampleApp v406 there is a /home/oracle/scripts/PhysicalSQLGenerator, which does two things. First off it generates the Logical SQL for a given analysis, presumably using the generateReportSQL web service. It then takes that and runs it through nqcmd, scraping the nqquery.log for the resulting Physical SQL. In all of this no database queries get run. Very cool. But what’s the “secret sauce” at play here – can we distill it down in order to use it ourselves?

First, let’s look at how the SampleApp script does it. It sets some additional request variables in the Logical SQL:

[oracle@demo PhysicalSQLGenerator]$ cat lsql-out-dir/q1.lsql
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"

And if we extract the relevant part out of the bash script we can see that it also uses a couple of extra command line arguments (-q -NoFetch) when invoking nqcmd:

nqcmd -q -NoFetch -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

When it’s run we check nqquery.log and lo-and-behold we get this: (edited for brevity)

------------------- Sending query to database named 01 - Sample App Data (ORCL) (id: <<69923>>), connection pool named Sample Relational Connection, logical request hash dd4fb54f, physical request hash 8d6f36
3d: [[
SAWITH0 AS (select sum(T42442.Revenue) as c1
     BISAMPLE.SAMP_REVENUE_FA2 T42442 /* F21 Rev. (Aggregate 2) */ )
select D1.c1 as c1, D1.c2 as c2 from ( select distinct 0 as c1,
     D1.c1 as c2
     SAWITH0 D1 ) D1 where rownum <= 5000001


Query Status: Successful Completion [[

Rows 0, bytes 24 retrieved from database query id: <<69923>> Simulation Gateway 

Physical query response time 0 (seconds), id <<69923>> Simulation Gateway

Whilst the log says it is “Sending query to database” it does no such thing, and the “Simulation Gateway” is the giveaway clue. Proof that it doesn’t connect to the database? I shut the database down, and it still worked just fine. Crude, yes, but effective.

I’ll intersperse here the little trick that I mentioned in the first part of this article : -ShowQueryLog. It’s tedious switching back and forth between nqcmd and the nqquery.log when doing this kind of testing, so let’s do it all as one:

nqcmd -H -ShowQueryLog -q -NoFetch -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

Unfortunately it looks like -ShowQueryLog is mutually exclusive to -q and -NoFetch since it doesn’t return anything, even though the nqquery.log did get additional entries. But that’s fine, since by removing these two flags in order to get -ShowQueryLog to work we’re whittling down what is actually needed to generate the physical SQL on its own without database execution. Here’s the nqcmd, showing the query log inline and showing still the “Simulation Gateway” indicative of no physical query execution:

[oracle@demo PhysicalSQLGenerator]$ export SA_NQCMD_ADVANCED=Yes
[oracle@demo PhysicalSQLGenerator]$ nqcmd -H -ShowQueryLog -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved


s_0          s_1
Row count: 0
[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-0] [] [ecid: 0054Ut7AJ33Fw000jzwkno0005UZ00005Q,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] ############################################## [[
-------------------- SQL Request, logical request hash:
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"


[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-18] [] [ecid: 0054Ut7AK5DFw000jzwkno0005UZ00005S,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Sending query to database named 01 - Sample App Data (ORCL) (id: <<70983>>), connection pool named Sample Relational Connection, logical request hash dd4fb54f, physical request hash 8d6f363d: [[
SAWITH0 AS (select sum(T42442.Revenue) as c1
     BISAMPLE.SAMP_REVENUE_FA2 T42442 /* F21 Rev. (Aggregate 2) */ )
select D1.c1 as c1, D1.c2 as c2 from ( select distinct 0 as c1,
     D1.c1 as c2
     SAWITH0 D1 ) D1 where rownum <= 5000001

[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-34] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Query Status: Successful Completion [[

[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-26] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Rows 0, bytes 24 retrieved from database query id: <<70983>> Simulation Gateway [[

[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-28] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Physical query response time 0 (seconds), id <<70983>> Simulation Gateway [[


It’s clear that the “-q -Nofetch” parameters used in nqcmd don’t have an effect on whether the physical query is executed (they’re to do with whether nqcmd as an ODBC client pulls back and displays the data you ask for). It’s actually just a single request variable that does the job, and it goes under the rather obvious name of SKIP_PHYSICAL_QUERY_EXEC. When set to 1 it generates all the necessary physical SQL but doesn’t execute it, and the presence of “Simulation Gateway” in the log signals this.

Categories: BI & Warehousing

Announcing the BI Forum 2015 Data Visualisation Challenge

Mon, 2015-03-23 03:00

The Rittman Mead BI Forum 2015 is running in Brighton from May 6th-8th 2015, and Atlanta from May 13th – 15th 2015. At this year’s events we’re introducing our first “data visualization challenge”, open to all attendees and with the dataset and scenario open from now until the start of each event. Using Oracle Business Intelligence 11g and any plugins or graphics libraries that embed and interact with OBIEE (full details and rules below), we challenge you to create the most effective dashboard or visualisation and bring it along to demo on the Friday of each event.

Help Donors Use their Funds Most Effectively

This year’s inaugural data visualisation challenge is based around the project and dataset, an online charity that makes it easy for anyone to help public school classroom projects that need funding (Rittman Mead will be making donations on behalf of the Brighton and Atlanta BI Forums to show our support for this great initiative). The project and dataset have been used in several hackathons and data crunching contests around the world, with analysis and visualisations helping to answer questions such as:

  • Why do some projects get funded, while others don’t?
  • Who donates to projects from different subjects?
  • Does proximity to schools change donation behavior?
  • What types of materials are teachers lacking the most? (eg chalk, paper, markers, etc)
  • Do poorer schools ask for more or less money from their donors?
  • If I need product x, what is the difference between projects asking for x that were successful vs those that aren’t.

More details on uses of the dataset can be found on the Donorschoose data blog, and example visualisations you could use to get some ideas and inspiration are on the Data Gallery showcase page.


Your challenge is to import this dataset into your analytical database of choice, and then create the best visualisation or dashboard in OBIEE to answer the following question: “Which project can I donate to, where my donation will have most impact?”

How Do I Take Part?

For more on the BI Forum 2015 Data Visualization Challenge including how to download the dataset and the rules of the challenge, take a look at the Rittman Mead BI Forum 2015 Data Visualisation Challenge web page where we’ve provided full details. You can either enter as an individual or as part of a team, but you must be registered for either the Brighton or Atlanta BI Forum events and come along in-person to demonstrate your solution – numbers at each event are strictly limited though, so make sure you register soon at the Rittman Mead BI Forum 2015 home page.

Categories: BI & Warehousing

Instrumenting OBIEE Database Connections For Improved Performance Diagnostics

Sun, 2015-03-22 19:30

Nearly four years ago I wrote a blog post entitled “Instrumenting OBIEE – The Final Chapter”. With hindsight, that title suffix (“The Final Chapter”) may have been a tad presumptuous and naïve of me (or perhaps I can just pretend to be ironic now and go for a five-part-trilogy style approach…). Back then OBIEE 11g had only just been released (who remembers in all its buggy-glory?), and in the subsequent years we’ve had significant patchset releases of OBIEE 11g bringing us up to now and with talk of OBIEE 12c around the corner.

As a fanboi of Cary Millsap and his approach to measuring and improving performance, instrumenting code in general – and OBIEE specifically – is something that’s interested me for a long time. The article was the final one that I wrote on my personal blog before joining Rittman Mead and it’s one that I’ve been meaning to re-publish here for a while. A recent client engagement gave me cause to revisit the instrumentation approach and refine it slightly as well as update it for a significant change made in OBIEE

What do I mean by instrumentation? Instrumentation is making your program expose information about what is being done, as well as actually doing it. Crudely put, it’s something like this:

40 GOTO 10

Rather than just firing some SQL at the database, instead we associate with that SQL information about what program sent it, and what that program was doing, who was using it, and so on. Instrumentation enables you to start analysing performance metrics against tangible actions rather than just amorphous clumps of SQL. It enables you to understand the workload profile on your system and how that’s affecting end users.

Pop quiz: which of these is going to be easier to work with for building up an understanding of a system’s behaviour and workload?

CLIENT_INFO          MODULE                    ACTION       CPU_TIME DISK_READS 
-------------------- ------------------------  ---------- ---------- ---------- 
                                               a17ff8e1         2999          1 
                                               fe6abd92         1000          6 
                                               a264593a         5999          2 
                                               571fe814         5000         12 
                                               63ea4181         7998          4 
                                               7b2fcb68        11999          5


CLIENT_INFO          MODULE                    ACTION       CPU_TIME DISK_READS
-------------------- ------------------------  ---------- ---------- ----------
06 Column Selector   GCBC Dashboard/Performan  a17ff8e1         2999          1
05 Table with condit GCBC Dashboard/Performan  a264593a         5999          2
06 View Selector     GCBC Dashboard/Performan  571fe814         5000         12
05 Table with condit GCBC Dashboard/Performan  63ea4181         7998          4
<unsaved analysis>   nqsserver@obi11-01        fe6abd92         1000          6
<unsaved analysis>   nqsserver@obi11-01        7b2fcb68        11999          5

The second one gives us the same information as before, plus the analysis being run by OBIEE, and the dashboard and page.

The benefits of instrumentation work both ways. It makes DBAs happy because they can look at resource usage on the database and trace it back easily to the originating OBIEE dashboard and user. Instrumentation also makes life much easier for troubleshooting OBIEE performance because it’s easy to trace a user’s entire session through from browser, through the BI Stack, and down into the database.

Instrumentation for OBIEE – Step By Step

If you want the ‘tl;dr’ version, the “how” rather than the “why”, here we go. For full details of why it works, see later in the article.

  1. In your RPD create three session variables. These are going to be the default values for variables that we’re going to send to the database. Make sure you set “Enable any user to set the value”.

  2. Set up a session variable initialization block to populate these variables. It is just a “dummy” init block as all you’re doing is setting them to empty/default values, so a ‘SELECT … FROM DUAL’ is just fine:

  3. For each Connection Pool you want to instrument, go to the Connection Scripts tab and add these three scripts to the Execute before query section:

    -- Pass the OBIEE user's name to CLIENT_IDENTIFIER
    call dbms_session.set_identifier('VALUEOF(NQ_SESSION.USER)')

    -- Pass the Analysis name to CLIENT_INFO
    call dbms_application_info.set_client_info(client_info=>SUBSTR('VALUEOF(NQ_SESSION.SAW_SRC_PATH)',(LENGTH('VALUEOF(NQ_SESSION.SAW_SRC_PATH)')-instr('VALUEOF(NQ_SESSION.SAW_SRC_PATH)','/',-1,1))*-1))

    -- Pass the dashboard name & page to MODULE
    -- NB OBIEE >= will set ACTION itself so there is no point setting it here (it will get overridden)
    call dbms_application_info.set_module(module_name=> SUBSTR('VALUEOF(NQ_SESSION.SAW_DASHBOARD)', ( LENGTH('VALUEOF(NQ_SESSION.SAW_DASHBOARD)') - INSTR('VALUEOF(NQ_SESSION.SAW_DASHBOARD)', '/', -1, 1) ) *- 1) || '/' || 'VALUEOF(NQ_SESSION.SAW_DASHBOARD_PG)' ,action_name=> '' );

    You can leave the comments in there, and in fact I’d recommend doing so to make it clear for future RPD developers what these scripts are for.

    Your connection pool should look like this:

    An important point to note is that you generally should not be adding these scripts to connection pools that are used for executing initialisation blocks. Initialisation block queries won’t have these request variables so if you did want to instrument them you’d need to find something else to include in the instrumentation.

Once you’ve made the above changes you should see MODULE, CLIENT_IDENTIFIER and CLIENT_INFO being populated in the Oracle system views :


SID PROGRAM CLIENT_ CLIENT_INFO              MODULE                       ACTION
--- ------- ------- ------------------------ ---------------------------- --------
 17 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 32846912
 65 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 4bc2a368
 74 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 35c9af67
193 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 10bdad6c
302 nqsserv prodney Geographical Analysis 1  11.10 Flights Delay/Overview 3a39d178
308 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 1fad81e0
421 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 4e5d36c1

You’ll note that we don’t set ACTION – that’s because OBIEE now sends a hash of the physical query text across in this column, meaning we can’t use it ourselves. Unfortunately the current version of OBIEE doesn’t store the physical query hash anywhere other than in nqquery.log, meaning that you can’t take advantage of it (i.e. link it back to data from Usage Tracking) within the database alone.

That’s all there is to it – easy! If you want to understand exactly how and why it works, read on…

Instrumentation for OBIEE – How Does it Work? Connection Pools

When OBIEE runs a dashboard, it does so by taking each analysis on that dashboard and sending a Logical Request for that analysis to the BI Server (nqsserver). The BI Server parses and compiles that Logical request into one or more Physical requests which it then sends to the source database(s).

OBIEE connects to the database via a Connection Pool which specifies the database-specific connection information including credentials, data source name (such as TNS for Oracle). The Connection Pool, as the name suggests, pools connections so that OBIEE is not going through the overhead of connecting and disconnecting for every single query that it needs to run. Instead it will open one or more connections as needed, and share that connection between queries as needed.

As well as the obvious configuration options in a connection pool such as database credentials, OBIEE also supports the option to send additional SQL to the database when it opens a connection and/or sends a new query. It’s this nice functionality that we piggy-back to enable our instrumentation.


The information that OBIEE can send back through its database connection is limited by what we can expose in variables. From the BI Server’s point of view there are three types of variables:

  1. Repository
  2. Session
  3. Request

The first two are fairly simple concepts; they’re defined within the RPD and populated with Initialisation Blocks (often known as “init blocks”) that are run by the BI Server either on a schedule (repository variables) or per user (session variables). There’s a special type of session variables known as System Session Variables, of which USER is a nice obvious example. These variables are pre-defined in OBIEE and are generally populated automatically when the user session begins (although some, like LOGLEVEL, still need an init block to set them explicitly).

The third type of variable, request variable, is slightly less obvious in function. In a nutshell, they are variables that are specified in the logical request sent to the BI Server, and are passed through to the internals of the BI Server. They’re often used for activating or disabling certain functionality. For example, you can tell OBIEE to specifically not use its cache for a request (even if it finds a match) by setting the request variable DISABLE_CACHE_HIT.

Request variables can be set manually inline in an analysis from the Advanced tab:

And they can also be set from Variable Prompts either within a report prompt or as a standalone dashboard prompt object. The point about request variables is that they are freeform; if they specify the name of an existing session variable then they will override it (if permitted), but they do not require the session variable to exist. We can see this easily enough – and see a variable request prompt in action at the same time. From the Prompts tab of an analysis I’ve added a Variable Prompt (rather than the usual Column Prompt) and given it a made up name, FOO:

Now when I run the analysis I specify a value for it:

and in the query log there’s the request variable:

-------------------- SQL Request, logical request hash:
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"

I’ve cut the quoted Logical SQL down to illustrate the point about the variable, because what was actually there is this:

-------------------- SQL Request, logical request hash:
SET VARIABLE QUERY_SRC_CD='Report',SAW_SRC_PATH='/users/prodney/request variable example',FOO='BAR', PREFERRED_CURRENCY='USD';
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"

which brings me on very nicely to the key point here. When Presentation Services sends a query to the BI Server it does so with a bunch of request variables set, including QUERY_SRC_CD and SAW_SRC_PATH. If you’ve worked with OBIEE for a while then you’ll recognise these names – they’re present in the Usage Tracking table S_NQ_ACCT. Ever wondered how OBIEE knows what values to store in Usage Tracking? Now you know. It’s whatever Presentation Services tells it to. You can easily test this yourself by playing around in nqcmd:

[oracle@demo ~]$ rlwrap nqcmd -d AnalyticsWeb -u prodney -p Admin123 -NoFetch

          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved



Statement execute succeeded

and looking at the results in S_NQ_ACCT:

BIEE_BIPLATFORM@pdborcl > select to_char(start_ts,'YYYY-MM-DD HH24:MI:SS') as start_ts,saw_src_path,query_src_cd from biee_biplatform.s_nq_acct where start_ts > sysdate -1 order by start_ts;

START_TS            SAW_SRC_PATH                             QUERY_SRC_CD
------------------- ---------------------------------------- --------------------
2015-03-21 11:55:10 /users/prodney/request variable example  Report
2015-03-21 12:44:41 BAR                                      FOO
2015-03-21 12:45:26 BAR                                      FOO
2015-03-21 12:45:28 BAR                                      FOO
2015-03-21 12:46:23 BAR                                      FOO

Key takeaway here: Presentation Services defines a bunch of useful request variables when it sends Logical SQL to the BI Server:

Embedding Variables in Connection Script Calls

There are four options that we can configure when connecting to the database from OBIEE. These are:


As of OBIEE version (i.e. OBIEE >= OBIEE automatically sets the ACTION field to a hash of the physical query – for more information see Doc ID 1941378.1. That leaves us with three remaining fields (since OBIEE sets ACTION after anything we do with the Connection Pool):


The syntax of the command in a Connection Script is physical SQL and the VALUEOF function to extract the OBIEE variable:


As a simple example here is passing the userid of the OBIEE user, using the Execute before query connection script:

-- Pass the OBIEE user's name to CLIENT_IDENTIFIER
call dbms_session.set_identifier('VALUEOF(NQ_SESSION.USER)')

This would be set for every Connection Pool – but only those used for query execution – not init blocks. Run a query that is routed through the Connection Pool you defined the script against and check out V$SESSION:

SQL> select sid,program,client_identifier from v$session where program like 'nqsserver%';

       SID PROGRAM                                          CLIENT_IDENTIFIER
---------- ------------------------------------------------ ----------------------------------------------------------------
        22 (TNS V1-V3)         prodney

The USER session variable is always present, so this is a safe thing to do. But, what about SAW_SRC_PATH? This is the path in the Presentation Catalog of the analysis being executed. Let’s add this into the Connection Pool script, passing it through as the CLIENT_INFO:

-- Pass the Analysis name to CLIENT_INFO
call dbms_application_info.set_client_info(client_info=>'VALUEOF(NQ_SESSION.SAW_SRC_PATH)')

This works just fine for analyses within a dashboard, or standalone analyses that have been saved. But what about a new analysis that hasn’t been saved yet? Unfortunately the result is not pretty:

[10058][State: S1000] [NQODBC] [SQL_STATE: S1000] [nQSError: 10058] A general error has occurred.
[nQSError: 43113] Message returned from OBIS.
[nQSError: 43119] Query Failed:
[nQSError: 23006] The session variable, NQ_SESSION.SAW_SRC_PATH, has no value definition.
Statement execute failed

That’s because SAW_SRC_PATH is a request variable and since the analysis has not been saved Presentation Services does not pass it to BI Server as a request variable. The same holds true for SAW_DASHBOARD and SAW_DASHBOARD_PG if you run an analysis outside of a dashboard – the respective request variables are not set and hence the connection pool script causes the query itself to fail.

The way around this is we cheat, slightly. If you create a session variable with the names of these request variables that we want to use in the connection pool scripts then we avoid the above nasty failures. If the request variables are set then all is well, and if they are not then we fall back on whatever value we initialise the session variable with.

The final icing on the cake of the solution given above is a bit of string munging with INSTR and SUBSTR to convert and concatenate the dashboard path and page into a single string, so instead of :

/shared/01. QuickStart/_portal/1.30 Quickstart/Overview

we get:

1.30 Quickstart/Overview

Which is much easier on the eye when looking at dashboard names. Similarly with the analysis path we strip all but the last section of it.

Granular monitoring of OBIEE on the database

Once OBIEE has been configured to be more articulate in its connection to the database, it enables the use of DBMS_MONITOR to understand more about the performance of given dashboards, analyses, or queries for a given user. Through DBMS_MONITOR the collection of statistics such as DB time, DB CPU, and so can be triggered, as well as trace-file generation for queries matching the criteria specified.

As an example, here is switching on system statistics collection for just one dashboard in OBIEE, using SERV_MOD_ACT_STAT_ENABLE

call dbms_monitor.SERV_MOD_ACT_STAT_ENABLE(
    module_name=>'GCBC Dashboard/Overview'

Now Oracle stats to collect information whenever that particular dashboard is run, which we can use to understand more about how it is performing from a database point of view:

SYS@orcl AS SYSDBA> select module,stat_name,value from V$SERV_MOD_ACT_STATS;

MODULE                   STAT_NAME                           VALUE
------------------------ ------------------------------ ----------
GCBC Dashboard/Overview  user calls                             60
GCBC Dashboard/Overview  DB time                              6789
GCBC Dashboard/Overview  DB CPU                               9996
GCBC Dashboard/Overview  parse count (total)                    15
GCBC Dashboard/Overview  parse time elapsed                    476
GCBC Dashboard/Overview  execute count                          15
GCBC Dashboard/Overview  sql execute elapsed time             3887

Similarly the CLIENT_IDENTIFIER field can be used to collect statistics with CLIENT_ID_STAT_ENABLE or trigger trace file generation with CLIENT_ID_TRACE_ENABLE. What you populate CLIENT_IDENTIFIER with it up to you – by default the script I’ve detailed at the top of this article inserts the OBIEE username in it, but you may want to put the analysis here if that’s of more use from a diagnostics point of view on the database side. The CLIENT_INFO field is still available for the other item, but cannot be used with DBMS_MONITOR for identifying queries.

Categories: BI & Warehousing

More on the Rittman Mead BI Forum 2015 Masterclass : “Delivering the Oracle Big Data and Information Management Reference Architecture”

Thu, 2015-03-12 05:29

Each year at the Rittman Mead BI Forum we host an optional one-day masterclass before the event opens properly on Wednesday evening, with guest speakers over the year including Kurt Wolff, Kevin McGinley and last year, Cloudera’s Lars George. This year I’m particularly excited that together with Jordan Meyer, our Head of R&D, I’ll be presenting the masterclass on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture”.

NewImageLast year we launched at the Brighton BI Forum event a new reference architecture that Rittman Mead had collaborated with Oracle on, that incorporated big data and schema-on-read databases into the Oracle data warehouse and BI reference architecture. In two subsequent blog posts, and in a white paper published on the Oracle website a few weeks after, concepts such as the “Discovery Lab”, “Data Reservoirs” and the “Data Factory” were introduced as a way of incorporating the latest thinking, and product capabilities, into the reference architecture for Oracle-based BI, data warehousing and big data systems.

One of the problems I always feel with reference architectures though is that they tell you what you should create, but they don’t tell you how. Just how do you go from a set of example files and a vague requirement from the client to do something interesting with Hadoop and data science, and how do you turn the insights produced by that process into a production-ready, enterprise Big Data system? How do you implement the data factory, and how do you use new tools such as Oracle Big Data Discovery and Oracle Big Data SQL as part of this architecture? In this masterclass we’re looking to explain the “how” and “why” to go with this new reference architecture, based on experiences working with clients over the past couple of years.

The masterclass will be divided into two sections; the first, led by Jordan Meyer, will focus on the data discovery and “data science” parts of the Information Management architecture, going through initial analysis and discovery of datasets using R and Oracle R Enterprise. Jordan will share techniques he uses from both his work at Rittman Mead and his work with Slacker Radio, a Silicon Valley startup, and will introduce the R and Oracle R Enterprise toolset for uncovering insights, correlations and patterns in sample datasets and productionizing them as database routines. Over his three hours he’ll cover topics including: 

Session #1 – Data exploration and discovery with R (2 hours) 

1.1 Introduction to R 

1.2 Tidy Data  

1.3 Data transformations 

1.4 Data Visualization 

Session #2 – Predictive Modeling in the enterprise (1 hr)

2.1 Classification

2.2 Regression

2.3 Deploying models to the data warehouse with ORE

After lunch, I’ll take the insights and analysis patterns identified in the Discovery Lab and turn them into production big data pipelines and datasets using Oracle Data Integrator 12c, Oracle Big Data Discovery and Oracle Big Data SQL For a flavour of the topics I’ll be covering take a look at this Slideshare presentation from a recent Oracle event, and in the masterclass itself I’ll concentrate on techniques and approaches for ingesting and transforming streaming and semi-structured data, storing it in Hadoop-based data stores, and presenting it out to users using BI tools like OBIEE, and Oracle’s new Big Data Discovery.

Session # 3 – Building the Data Reservoir and Data Factory (2 hr)

3.1 Designing and Building the Data Reservoir using Cloudera CDH5 / Hortonworks HDP, Oracle BDA and Oracle Database 12c

3.2 Building the Data Factory using ODI12c & new component Hadoop KM modules, real-time loading using Apache Kafka, Spark and Spark Streaming

Session #4 – Accessing and visualising the data (1 hr)

4.1 Discovering and Analyzing the Data Reservoir using Oracle Big Data Discovery

4.2 Reporting and Dashboards across the Data Reservoir using Oracle Big Data SQL + OBIEE

You can register for a place at the two masterclasses when booking your BI Forum 2015 place, but you’ll need to hurry as we limit the number of attendees at each event in order to maximise interaction and networking within each group. Registration is open now and the two events take place in May – hopefully we’ll see you there!

Categories: BI & Warehousing

An Introduction to Analysing ODI Runtime Data Through Elasticsearch and Kibana 4

Thu, 2015-03-12 01:39

An important part of working with ODI is analysing the performance when it runs, and identifying steps that might be inefficient as well as variations in runtime against a baseline trend. The Operator tool in ODI itself is great for digging down into individual sessions and load plan executions, but for broader analysis we need a different approach. We also need to make sure we keep the data available for trend analysis, as it’s often the case that tables behind Operator are frequently purged for performance reasons.

In this article I’m going to show how we can make use of a generic method of pulling information out of an RDBMS such as Oracle and storing it in Elasticsearch, from where it can be explored and analysed through Kibana. It’s standalone, it’s easy to do, it’s free open source – and it looks and works great! Here I’m going to use it for supporting the analysis of ODI runtime information, but it is equally applicable to any time-based data you’ve got in an RDBMS (e.g. OBIEE Usage Tracking data).

Kibana is an open-source data visualisation and analysis tool, working with data stored in Elasticsearch. These tools work really well for very rapid analysis of any kind of data that you want to chuck at them quickly and work with. By skipping the process of schema definition and data modelling the time taken to the first results is drastically reduced. It enables to you quickly start “chucking about” data and getting meaning out of it before you commit full-scale to how you want to analyse it, which is what the traditional modelling route can sometimes force you to do prematurely.

ODI writes runtime information to the database, about sessions run, steps executed, time taken and rows processed. This data is important for analysing things like performance issues, and batch run times. Whilst with the equivalent runtime data (Usage Tracking) from OBIEE there is the superb RPD/Dashboard content that Oracle ship in SampleApp v406, for ODI the options aren’t as vast, ultimately being based on home-brew SQL against the repository tables using the repository schema documentation from Oracle. Building an OBIEE metadata model against the ODI schema is one option, but then requires an OBIEE server on which to run it – or merging into an existing OBIEE deployment – which means that it can become more hassle than it’s worth. It also means a bunch of up-front modelling before you get any kind of visualisations and data out. By copying the data across into Elasticsearch it’s easy to quickly build analyses against it, and has the additional benefit of retaining the data as long as you’d like meaning that it’s still available for long-term trend analysis once the data’s been purged from the ODI repository itself.

Let’s take a bit of a walk through the ODI dashboard that I’ve put together. First up is a view on the number of sessions that have run over time, along with their duration. For duration I’ve shown 50th (median), 75th and 95th percentiles to get an idea of the spread of session runtimes. At the moment we’re looking at all sessions, so it’s not surprising that there is a wide range since there’ll always be small sessions and longer ones:

Next up on the dashboard comes a summary of top sessions by runtime, both cumulative and per-session. The longest running sessions are an obvious point of interest, but cumulative runtime is also important; something may only take a short while to run when compared to some singular long-running sessions, but if it runs hundreds of times then it all adds up and can give a big performance boost if time is shaved off it.

Plotting out session execution times is useful to be able to see both when the longest running sessions ran:

The final element on this first dashboard is one giving the detail for each of the top x long-running session executions, including the session number so that it can be examined in further detail through the Operator tool.

Kibana dashboards are interactive, so you can click on a point in a graph to zoom in on that time period, as well as click & drag to select an arbitrary range. The latter technique is sometimes known as “Brushing”, and if I’m not describing it very well have a look at this example here and you’ll see in an instant what I mean.

As you focus on a time period in one graph the whole dashboard’s time filter changes, so where you have a table of detail data it then just shows it for the range you’ve selected. Notice also that the granularity of the aggregation changes as well, from a summary of every three hours in the first of the screenshots through to 30 seconds in the last. This is a nice way of presenting a summary of data, but isn’t always desirable (it can mask extremes and abnormalities) so can be configured to be fixed as well.

Time isn’t the only interaction on the dashboard – anything that’s not a metric can be clicked on to apply a filter. So in the above example where the top session by cumulative time are listed out we might want to find out more about the one with several thousand executions

Simply clicking on it then filters the dashboard and now the session details table and graph show information just for that session, including duration, and rows processed:

Session performance analysis

As an example of the benefit of using a spread of percentiles we can see here is a particular session that had an erratic runtime with great variation, that then stabilised. The purple line is the 95th percentile response time; the green and blue are 50th and 75th respectively. It’s clear that whilst up to 75% of the sessions completed in about the same kind of time each time they ran, the remaining quarter took anything up to five times as long.

One of the most important things in performance is ensuring consistent performance, and that is what happens here from about half way along the horizontal axis at c.February:

But what was causing the variation? By digging a notch deeper and looking at the runtime of the individual steps within the given session it can be seen that the inconsistent runtime was caused by a single step (the green line in this graph) within the execution. When this step’s runtime stabilises, so does the overall performance of the session:

This is performing a port-mortem on a resolved performance problem to illustrate how useful the data is – obviously if there were still a performance problem we’d have a clear path of investigation to pursue thanks to this data.


Data’s pulled from the ODI repository tables using Elasticsearch JDBC river, from where it’s stored and indexed in Elasticsearch, and presented through Kibana 4 dashboards.


The data load from the repository tables into Elasticsearch is incremental, meaning that the solution works for both historical analysis and more immediate monitoring too. Because the data’s actually stored in Elasticsearch for analysis it means the ODI repository tables can be purged if required and you can still work with a full history of runtime data in Kibana.

If you’re interested in finding out more about this solution and how Rittman Mead can help you with your ODI and OBIEE implementation and monitoring needs, please do get in touch.

Categories: BI & Warehousing