
Rittman Mead Consulting

Delivering Oracle Business Intelligence

Once A Week Is Never Enough

Thu, 2015-10-01 04:48

Over the last few months we have been looking at User Engagement and recently asked people to take part in a global usage survey so we could create a series of benchmark metrics detailing how organisations use OBIEE.

We had a great response and have started to process the data. One of the first things that struck us was that the average user logs into their OBIEE system about once per week.



We derive the once per week figure from the DAU/MAU metric we have talked about here and here. DAU/MAU shows us the % of your user community repeatedly using the tool – so a kind of stickiness metric.

DAU and MAU stand for Daily Average Users and Monthly Average Users and are the unique number of users who log in and use the system over the period. The definition of DAU/MAU we are using is (30 day moving average of DAU)/(MAU) – using a moving average levels out any usage spikes on particular days of the week.
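To make the arithmetic concrete, here is a minimal Python sketch of the calculation – the login events below are made up for illustration, but in practice they could be pulled from Usage Tracking (S_NQ_ACCT):

from datetime import date, timedelta

# Hypothetical login events as (calendar date, user) pairs -
# in practice these could come from Usage Tracking.
logins = [
    (date(2015, 9, 1), 'bob'), (date(2015, 9, 1), 'bill'),
    (date(2015, 9, 8), 'bob'), (date(2015, 9, 15), 'bob'),
    (date(2015, 9, 22), 'bill'), (date(2015, 9, 29), 'bob'),
]

as_of = date(2015, 9, 30)
window = [as_of - timedelta(days=i) for i in range(30)]

# DAU for each of the last 30 days: distinct users seen on that day
dau_per_day = [len({u for d, u in logins if d == day}) for day in window]
avg_dau = sum(dau_per_day) / float(len(dau_per_day))

# MAU: distinct users seen at any point in the same 30-day window
mau = len({u for d, u in logins if min(window) <= d <= as_of})

print('DAU/MAU = %.1f%%' % (100 * avg_dau / mau if mau else 0))

With every user logging in roughly once a week, the ratio works out at around one in seven – which is the “about once per week” figure described above.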

The reason this measure is significant is that it tells us how integrated the system is into people’s (working) lives. If you believe that the information in your OBIEE system is invaluable to your employees carrying out their jobs, then you would expect your DAU/MAU to tend towards 100%.

Note: DAU/MAU won’t identify users who don’t log into your system at all; there is an assumption that they will log in at least once over the period of a month. If this is not happening then you have an adoption problem, not an engagement problem.

Is More Engagement Better?

My view is that the higher the engagement with the system, the more valuable the system is. So the questions become: is one visit per week enough; what does good look like; and what did you expect when you designed the system?

If you follow this reasoning, then we could assume that if we could get people using their OBIEE systems more, then their organisations would be better off.

Raising Engagement

We have previously discussed five levers that we think drive user engagement:

  • User interface and user experience
  • Quality, relevance, and confidence in data
  • Performance
  • Ability to use the system
  • Accessibility – is the system available in the right way, at the right time?

I want to talk about a couple of them here and how they can affect engagement.

User Interface And User Experience

We have been looking at report and dashboard design for a while now and believe that every dashboard should tell a story. When a user goes to a screen it must support the job they are doing. This means that dashboards have to be designed first and foremost around users’ business processes. We see too many dashboards that contain loosely coupled sets of data that developers thought should be in the same place.

If you need to understand what data a user or analyst needs to do their job, then observe them: look at which reports and spreadsheets they use, and follow their eyes to see which cells on the various tables they monitor. Are they looking at trends, comparisons or exceptions?

We cover this in depth through the Visual Redesign phase of our User Engagement Service.

The following quote is from one of our clients who has used the service:

The new designs provide a simple and at the same time powerful view of data.

Quality, relevance, and confidence in data

However good your dashboard and report design is, if the underlying data is incorrect or inconsistent, or if the user thinks it is, then the system will not be valuable. Users will go to other sources for the data, and they will stop using the untrusted source. Too many times trust is purely subjective. This is one of the reasons why Excel starts to take root in organisations as a reporting tool. It is somewhere users can self-source data, transform and aggregate it as they please and, maybe misguidedly, trust the results.

This starts to open up a much wider conversation about Data Governance that I will address in future posts. We have spoken to a lot of organisations recently where the issue is not that their users don’t have access to data; it’s that there are multiple systems they could get the answer from, and multiple routes through their information architecture along which the data flows. You need a clear business intelligence and analytics architecture and roadmap.

Business intelligence and analytics systems are typically judged by return on investment (ROI); however, I wonder whether organisations with mature information architectures supported by complex data flows could be better off looking at the cost and complexity of their data architecture. By doing so they could address both data governance and the total cost of ownership (TCO).

The first step should be to analyse the analytics-driven data flows: find out where they apply business logic, where there is duplication, and where there are silos. If you can simplify this process then you start to build the foundation for a better ROI business case.

So addressing quality, relevance, and confidence in data is a big undertaking. There are some simple things you can do, such as maintaining a taxonomy of reporting attributes across the organisation; however, in most organisations with mature information architectures the answer is likely to be more deep-rooted.


Our goal at Rittman Mead is to help companies harness the data in and around their organisations through a range of analytics strategies, processes and tools. We see User Engagement as a key part of this. We believe that with our User Engagement Service we have the right tools to help you raise your DAU/MAU.

We have a fundamental belief that the more people embed business intelligence and analytics into their working lives, the better job they will do for their company.

The great thing is that user engagement in OBIEE is measurable. Usage Tracking provides unequivocal details of how your system is performing, who is using it and how often, both now and after any changes.

Categories: BI & Warehousing

Taking a Look at Oracle Big Data Preparation Cloud Service – Spark-Based Data Transformation in the Cloud

Sun, 2015-09-27 04:54

One of the sessions I’m delivering at the upcoming Oracle Openworld 2015 in San Francisco is entitled “Oracle Business Intelligence Cloud Service—Moving Your Complete BI Platform to the Cloud [UGF4906]”, and looks at how you can now migrate your entire OBIEE11g platform into Oracle Public Cloud, including the data warehouse and data integration routines. Using Oracle BI Cloud Service’s new on-premise RPD upload feature you can upload an existing RPD into BICS and run it from there, with the accompanying data warehouse database moving into Oracle’s Database Cloud Service (and with the restriction that you can’t then edit the repository within BICS; you need to do that on-premise and upload again). For ETL and data integration you can carry on using ODI12c, which now has the ability to load data into Oracle Storage Cloud (for file sources) and BICS (via a REST API) as well as the full Oracle DBaaS, but another Oracle option for purely cloud-based data processing and enrichment has recently become available – Oracle Big Data Preparation Cloud Service. So what is it, how does it work, and how is it different to ODI12c?

Oracle Big Data Preparation Cloud Service (“BDP”) is a thin-client application within Oracle Cloud for ingesting, preparing and enriching datasets that don’t have a predefined schema and may well need certain fields obfuscated or cleansed. Being integrated with Oracle Storage Cloud and other infrastructure and platform services within Oracle Cloud, it’s obviously aimed mainly at data transformation tasks within the Oracle Cloud environment, but you can upload and download datasets from your browser for use with on-premise applications. Unlike the more general-purpose Oracle Data Integrator it’s aimed instead at a particular use-case – non-technical information analysts who need to get data transformed, wrangled and enriched before they can make use of it in an environment like Hadoop. In fact the product name is a bit misleading – it runs on a big data platform within Oracle Cloud and, like Oracle Big Data Discovery, uses Apache Spark for its data processing – but it could potentially be useful for a business analyst to prepare data for loading into Oracle BI Cloud Service, and I’ll cover this angle when I talk about data loading options in my Oracle Openworld session.

Within a logical architecture for a typical big data DW and BI system, BDP sits alongside ODI within the Data Factory and provides self-service, agile transformation capabilities to more business-orientated users. 


Oracle Big Data Preparation Cloud Service shares quite a bit of functionality and underlying technology with Oracle Big Data Discovery – both run on Hadoop, both use Apache Spark for data processing and transformation, and both offer data transformation and “wrangling” features aimed at non-technical users. Oracle are positioning Big Data Preparation Service as something you’d use in the execution layer of the Oracle Information Management Reference Architecture, whereas Big Data Discovery is associated more with the discovery layer – I’d mostly agree, but I can see a role for BDD even within the execution layer, as a front-end to the data reservoir that typically now runs alongside relationally-stored data warehouses.


Looking back at the slides from one of the recent Strata conferences, for example, we can see Oracle positioning BDP as the “operational data preparation” tool for structured and unstructured data – with no defined schema – coming into your information platform, with the enriched output then being used by BI tools, enterprise reporting and data discovery tools.



Apart from the scalability benefits of running BDP on Apache Spark, the other interesting feature in BDP is how it uses Spark’s machine learning capabilities to try to automate as much of the data preparation process as possible, for example detecting credit card numbers in data fields and recommending you obfuscate that column. Similar to the way Oracle have tried to simplify the process of creating reports and dashboards for a small team with BICS, BDP runs in the cloud and tries to automate and simplify as much of the data preparation and enrichment process as possible, with ODI12c still available for ETL developers to build more complex transformations.

The development lifecycle for BDP (from the Oracle Big Data Preparation Cloud Service e-book on Oracle’s website) uses a cycle of ingesting, cleaning, enriching and then publishing data using scripts authored using the tool and run on the Apache Spark platform. The diagram below shows the BDP development lifecycle from Oracle’s Big Data Preparation Cloud Service Handbook, and shows how ingestion, enrichment, publishing and governance go in a cycle with the common foundation of the transformation scripts that you build using BDP’s web interface.


So let’s walk through an example data preparation exercise using a file of data stored initially in Oracle Storage Cloud Service. After logging into BDP via Oracle Cloud you’re first presented with the Catalog view, listing out all your previous transformations and showing you when they were last used to process some data.


To create a transformation you first give it a name, then select the data source and then the file you’re interested in. In my environment I’ve got Oracle Storage Cloud and HDFS available as my main data sources, or I could upload a file from my desktop and start from there.


BDP ingests the file and then uses its machine learning features to process and classify data in each column, recommending column names such as “gender”, “city” and “cc_number” based on (presumably) some sort of classification model. In the screenshot below you can see a set of these recommendations on the left-hand side of the screen, with the columns themselves listed in the centre and a summary of the file profiling on the right.


Taking a closer look at the profile results panel you can see two of the columns have alerts raised, in red. Clicking on the alert shows that the two columns have credit card data stored in clear text, with the recommendation being to obfuscate or otherwise secure these fields. Clicking on a field then shows the various transformation options, with the obvious choice here being to automatically obfuscate the data in those fields.


Once you’ve worked through all the recommendations and added any transformations you choose to add yourself, the final step is to publish your transformation to one of the available targets. In the example below we’ve got Oracle Storage Cloud and HDFS again as potential targets; I’d imagine Oracle will add a connector to BICS soon, for example, so that you can use BDP as a data prep tool for file data that will then be added to your dataset in BICS.


So … it’ll be interesting to see where this one goes. It’s interesting that Oracle have split out data preparation and data discovery into two tools whilst others are saying theirs can do both, and you’ll still need ODI for the more complex integration jobs. But I like the innovative use of machine learning to do away with much of the manual work required for classification of incoming data fields, and running the whole thing on Spark certainly gives it the potential of scale. A couple of years ago I was worried Oracle didn’t really have a strategy for data integration and ETL in the cloud, but we’re starting to see something happen now.

There’s a big push from the Oracle field at the moment to move customers into the cloud, and I can see BDP getting bundled in with Big Data Cloud Service and BICS as the accompanying cloud data preparation tool. The danger then of course is that Big Data Discovery starts to look less useful, especially with Visual Analyzer already available within BICS and coming soon on-premise with OBIEE12c. My guess is that what we’re seeing now with these initial releases of BDP and BDD is just the start, with BDP adding more automatic enrichment “smarts” and starting to cover integration use-cases too, whilst BDD will put more focus on data visualization and analytics on the data reservoir.

Categories: BI & Warehousing

Using the BI Server Metadata Web Service for Automated RPD Modifications

Fri, 2015-09-25 06:33

A little-known new feature of OBIEE 11g is a web service interface to the BI Server. Called the “BI Server Metadata Web Service”, it provides a route for making calls to the BI Server using SOAP-based web service requests. Why is this useful? Because it means you can make any call to the BI Server (such as SAPurgeAllCache) from any machine without needing to install any OBIEE-related artefacts such as nqcmd, JDBC drivers, etc. The IT world has been evolving over the past decade or more towards a more service-based architecture (remember the fuss about SOA?) where we can piece together the functionality we need rather than having one monolithic black box trying to do everything. Being able to make use of this approach in our BI deployments is a really good thing. We can do simple things like BI Server cache management using a web service call, but we can also do more funky things, such as actually updating repository variable values in real time – and we can do it from within our ETL jobs, as part of an automated deployment script, and so on.

Calling the BI Server Metadata Web Service

First off, let’s get the web service configured and working. The documentation for the BI Server Metadata Web Service can be found here, and it’s important to read it if you’re planning on using this. What I describe here is the basic way to get it up and running.

Configuring Security

We need to configure security on the web service to define what kind of authentication is required in order to use it. If you don’t do this, you won’t be able to make calls to it. Setting up the security is a case of attaching a security policy in WebLogic Server to the web service (called AdminService) itself. I’ve used oracle/wss_http_token_service_policy, which means that the credentials can be passed through using standard HTTP Basic authentication.

You can do this through Enterprise Manager:

Or you can do it through WLST using the attachWebServicePolicy call.

You also need to configure WSM, adding some entries to the credential store as detailed here.

Testing the Web Service

The great thing about web services is that they can be used from pretty much anywhere. Many software languages will have libraries for making SOAP calls, and if they don’t, it’s just HTTP under the covers so you can brew your own. For my testing I’m using the free version of SoapUI. In anger, I’d use something like curl for a standalone script or hook it up to ODI for integration in the batch.

Let’s fire up SoapUI and create a new project. Web services provide a Web Services Description Language (WSDL) document that describes what they do and how to call them, which is pretty handy. We can pass this WSDL to SoapUI for it to automatically build some sample requests for us. The WSDL for this web service is


Where biserver is your biserver (duh) and port is the managed server port (usually 9704, or 7780).

We’ve now got a short but sweet list of the methods we can invoke:

Expand out callProcedureWithResults and double click on Request 1. This is a template SOAP message that SoapUI has created for you, based on the WSDL.

Edit the XML to remove the parameters section, and just to test things out in procedureName put GetOBISVersion(). Your XML request should look like this:

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  

If you try and run this now (green arrow or Cmd-Enter on the Mac) you’ll see the exact same XML appear on the right pane, which is strange… but click on Raw and you’ll see what the problem is:

Remember the security stuff we set up on the server previously? Well as the client we now need to keep our side of the bargain, and pass across our authentication with the SOAP call. Under the covers this is a case of sending HTTP Basic auth (since we’re using oracle/wss_http_token_service_policy), and how you do this depends on how you’re making the call to the web service. In SoapUI you click on the Auth button at the bottom of the screen, Add New Authentication, Type: Basic, and then put your OBIEE server username/password in.

Now you can click submit on the request again, and in the XML view of the response pane (on the right) you should see details of your BI Server version (you may have to scroll to the right to see it).
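SoapUI is handy for poking around, but the same end-to-end test is easy to script. The following is a minimal Python sketch using the requests library; the endpoint URL, the ws namespace URI and the host/credentials are assumptions here – take the real values from your own WSDL and environment:

import requests

# Endpoint, namespace and credentials below are placeholders - check your WSDL
url = 'http://biserver:9704/AdminService/AdminService'

soap_body = """<soapenv:Envelope
    xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:ws="urn:AdminService">
  <soapenv:Body>
    <ws:callProcedureWithResults>
      <procedureName>GetOBISVersion()</procedureName>
    </ws:callProcedureWithResults>
  </soapenv:Body>
</soapenv:Envelope>"""

# HTTP Basic auth, to satisfy oracle/wss_http_token_service_policy
resp = requests.post(url,
                     data=soap_body,
                     headers={'Content-Type': 'text/xml;charset=UTF-8'},
                     auth=('weblogic', 'Admin123'))

print(resp.status_code)
print(resp.text)  # the SOAP response should include the BI Server version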

This simple test is just about validating the end-to-end calling of a BI Server procedure from a web service. Now we can get funky with it….

Updating Repository Variables Programmatically

Repository variables in OBIEE are “global”, in that every user session sees the same value. They’re often used for holding things like “what was the last close of business date”, or “when was the data last updated in the data warehouse”. Traditionally these have been defined as dynamic repository variables with an initialisation block that polled a database query on a predefined schedule to get the current value. This meant that to have a variable showing when your data was last loaded you’d need to (a) get your ETL to update a row in a database table with a timestamp and then (b) write an init block to poll that table to get the value. That polling of the table would have to be as frequent as you needed in order to show the correct value, so maybe every minute. It’s kinda messy, but it’s all we had. Here I’d like to show an alternative approach.

Let’s say we have a repository variable, LAST_DW_REFRESH. It’s a timestamp that we use in report predicates and titles so that when they are run we can show the correct data based on when the data was last loaded into the warehouse. Here’s a rather over-simplified example:

The Title view uses the code:

Data correct as of: @{biServer.variables['LAST_DW_REFRESH']}

Note that we’re referencing the variable here. We could also put it in the Filter clause of the analysis. Somewhat tenuously, let’s imagine we have a near-realtime system in which we’re federating a DW directly with OLTP data, and we just want to show data from the last load into the DW:

For the purpose of this example, the report is less important than the diagnostics it gives us when run, in nqquery.log. First we see the inbound logical request:

-------------------- SQL Request, logical request hash:  
SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T05 Per Name Year" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  
WHERE  
("Time"."T00 Calendar Date" <  VALUEOF("LAST_DW_REFRESH"))  

Note the VALUEOF clause. When this is parsed out we get to see the actual value of the repository variable that OBIEE is going to execute the query with:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-24 23:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

We can see the value through the Administration Tool Manage Sessions page too, but it’s less convenient for tracking in testing:

If we update the RPD online with the Administration Tool (nothing fancy at this stage) to change the value of this static repository variable:


And then rerun the report, we can see the value has changed:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-25 00:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

Now let’s do this programmatically. First off, the easy stuff, which has been written about plenty before. Using biserverxmlgen we can create an XUDML version of the repository. Searching through this we can pull out the variable definition:

<Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
<Expr><![CDATA[TIMESTAMP '2015-09-25 00:30:00']]></Expr>  
</Variable>

and then wrap it in the correct XML structure and update the timestamp we want to use:

<?xml version="1.0" encoding="UTF-8" ?>  
<Repository xmlns:xsi="">  
    <DECLARE>  
        <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
        <Expr><![CDATA[TIMESTAMP '2015-09-26 00:30:00']]></Expr>  
        </Variable>  
    </DECLARE>  
</Repository>

(you can also get this automagically by modifying the variable in the RPD, saving the RPD file, and then comparing it to the previous copy to generate a patch file with the Administration Tool or comparerpd)

Save this XML snippet, with the updated timestamp value, as last_dw_refresh.xml. Now to update the value on the BI Server in-flight, first using the OBIEE tool biserverxmlcli. This is available on all OBIEE servers and client installations – we’ll get to web services for making this update call remotely in a moment.

biserverxmlcli -D AnalyticsWeb -R Admin123 -U weblogic -P Admin123 -I last_dw_refresh.xml

Here -D is the BI Server DSN, -U / -P are the username/password credentials for the server, and the -R is the RPD password.
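If you want to drive this from an ETL or deployment script rather than by hand, a thin wrapper is all that’s needed. A minimal Python sketch (the DSN, credentials and the assumption that biserverxmlcli is on the PATH are all placeholders for your own environment):

import subprocess

def apply_xudml(xudml_file, dsn='AnalyticsWeb', user='weblogic',
                password='Admin123', rpd_password='Admin123'):
    """Push a XUDML file to a running BI Server using biserverxmlcli."""
    cmd = ['biserverxmlcli',   # assumed to be on the PATH
           '-D', dsn,          # BI Server DSN
           '-U', user,         # server username
           '-P', password,     # server password
           '-R', rpd_password, # RPD password
           '-I', xudml_file]   # XUDML file to apply
    subprocess.check_call(cmd)

apply_xudml('last_dw_refresh.xml')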

Running the analysis again shows that it is now working with the new value of the variable:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-26 00:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

Getting Funky – Updating RPD from Web Service

Let’s now bring these two things together – RPD updates (in this case to update a variable value, but could be anything), and BI Server calls.

In the above web service example I called the very simple GetOBISVersion. Now we’re going to use the slightly more complex NQSModifyMetadata. This is actually documented, and what we’re going to do is pass across the same XUDML that we sent to biserverxmlcli above, but through the web service. As a side note, you could also do this over JDBC if you wanted (under the covers, AdminService is just a web app with a JDBC connector to the BI Server).

I’m going to do this in SoapUI here for clarity but I actually used SUDS to prototype it and figure out the exact usage.

So as a quick recap, this is what we’re going to do:

  1. Update the value of a repository variable, using XUDML. We generated this XUDML from biserverxmlgen and wrapped it in the correct XML structure.

    We could also have got it through comparerpd or the Administration Tool ‘create patch’ function
  2. Use the BI Server’s NQSModifyMetadata call to push the XUDML to the BI Server online.

    We saw that biserverxmlcli can also be used as an alternative to this for making online updates through XUDML.
  3. Use the BI Server Metadata Web Service (AdminService) to invoke the NQSModifyMetadata call on the BI Server, using the callProcedureWithResults method

In SoapUI create a new request:

Set up the authentication:

Edit the XML SOAP message to remove parameters and specify the basic BI Server call:

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  

Now the tricky bit – we need to cram the XUDML into our XML SOAP message. But XUDML is its own kind of XML, and all sorts of grim things happen here if we’re not careful (because the XUDML gets swallowed up into the SOAP message, it all being XML). The solution I came up with (which may not be optimal…) is to encode all of the HTML entities, which a tool like this does (and if you’re using a client library like SUDS this will happen automagically). So our XUDML, with another new timestamp for testing:

<?xml version="1.0" encoding="UTF-8" ?>  
<Repository xmlns:xsi="">  
    <DECLARE>  
        <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
        <Expr><![CDATA[TIMESTAMP '2015-09-21 00:30:00']]></Expr>  
        </Variable>  
    </DECLARE>  
</Repository>

Encoded, this becomes:


&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP '2015-09-21 00:30:00']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;

We’re not quite finished yet. Because this is actually a call (NQSModifyMetadata) nested in a call (callProcedureWithResults) we need to make sure NQSModifyMetadata gets the arguments (the XUDML chunk) through intact, so we wrap it in single quotes – which also need encoding (&apos;):

&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP '2015-09-21 23:30:00']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;

and then, for good measure, the single quotes around the timestamp need doubling up:

&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP ''2015-09-21 23:30:00'']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;

Nice, huh? The WSDL suggests that parameters for these calls should be able to be placed within the XML message as additional entities, which might allow for proper encoding, but I couldn’t get it to work (I kept getting java.sql.SQLException: Parameter 1 is not bound).
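Doing that escaping by hand gets tedious quickly; in a script it’s a couple of lines. Here’s a rough Python sketch of the steps just described – double up the single quotes inside the XUDML, entity-encode it, then wrap the whole thing in encoded single quotes (the namespace attributes are omitted from the sample XUDML for brevity):

from xml.sax.saxutils import escape

# The XUDML we want to push (as generated by biserverxmlgen / comparerpd)
xudml = """<?xml version="1.0" encoding="UTF-8" ?>
<Repository>
  <DECLARE>
    <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">
      <Expr><![CDATA[TIMESTAMP '2015-09-21 23:30:00']]></Expr>
    </Variable>
  </DECLARE>
</Repository>"""

# Double up single quotes so they survive being nested inside the quoted
# NQSModifyMetadata argument, then encode the XML entities so the XUDML
# doesn't get swallowed up by the surrounding SOAP message.
payload = escape(xudml.replace("'", "''"), {'"': '&quot;'})

# This string goes inside the <procedureName> element of the SOAP request
procedure_call = 'NQSModifyMetadata(&apos;%s&apos;)' % payload
print(procedure_call)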

So, stick this mess of encoding plus twiddles into your SOAP message and it should look like this: (watch out for line breaks; these can break things)

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  
         <procedureName>NQSModifyMetadata(&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP ''2015-09-21 23:30:00'']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;)</procedureName>  

Hit run, and with a bit of luck you’ll get a “nothing to report” response:

If you look in nqquery.log you’ll see:

[...] NQSModifyMetadata started.  
[...] NQSModifyMetadata finished successfully.

and all-importantly when you run your report, the updated variable will be used:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-21 23:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

If this doesn’t work … well, best of luck. Use nqquery.log and bi_server1.log (where the AdminService writes some diagnostics) to try and trace the issue. Also test calling NQSModifyMetadata from JDBC (or ODBC) directly, and then add in the additional layer of the web service call.

So, in the words of Mr Rittman, there you have it. Programmatically updating the value of repository variables, or anything else, done online, and for bonus points done through a web service call, making it possible to use without any local OBIEE client/server tools.

Categories: BI & Warehousing

Managing the OBIEE BI Server Cache from ODI 12c

Thu, 2015-09-24 13:47

I wrote recently about the OBIEE BI Server Cache and how useful it can be, but how important it is to manage it properly, both in the purging of stale data and seeding of new. In this article I want to show how to walk-the-walk and not just talk-the-talk (WAT? But you’re a consultant?!). ODI is the premier data integration tool on the market and one that we are great fans of here at Rittman Mead. We see a great many analytics implementations built with ODI for the data load (ELT, strictly speaking, rather than ETL) and then OBIEE for the analytics on top. Managing the BI Server cache from within your ODI batch makes a huge amount of sense. By purging and reseeding the cache directly after the data has been loaded into the database we can achieve optimal cache usage with no risk of stale data.

There are two options for cleanly hooking into OBIEE from ODI 12c with minimal fuss: JDBC, and Web Services. JDBC requires the OBIEE JDBC driver to be present on the ODI Agent machine, whilst Web Services have zero requirement on the ODI side, but a bit of config on the OBIEE side.

Setting up the BI Server JDBC Driver and Topology

Here I’m going to demonstrate using JDBC to connect to OBIEE from ODI. It’s a principle that was originally written up by Julien Testut here. We take the OBIEE JDBC driver bijdbc.jar from $FMW_HOME/Oracle_BI1/bifoundation/jdbc and copy it to our ODI machine. I’m just using a local agent for my testing, so put it in ~/.odi/oracledi/userlib/. For a standalone agent it should go in $AGENT_HOME/odi/agent/lib.

[oracle@ip-10-103-196-207 ~]$ cd /home/oracle/.odi/oracledi/userlib/  
[oracle@ip-10-103-196-207 userlib]$ ls -l  
total 200  
-rw-r----- 1 oracle oinstall    332 Feb 17  2014 additional_path.txt  
-rwxr-xr-x 1 oracle oinstall 199941 Sep 22 14:50 bijdbc.jar

Now fire up ODI Studio, sign in to your repository, and head to the Topology pane. Under Physical Architecture -> Technologies you’ll see Oracle BI.

Right click and select New Data Server. Give it a sensible name and put your standard OBIEE credentials (eg. weblogic) under the Connection section. Click the JDBC tab and click the search icon to the right of the JDBC Driver text box. Select the default, and then in the JDBC Url box put your server and port (9703, unless you’ve changed the listen port of the OBIEE BI Server).

Now click Test Connection (save the data server when prompted, and click OK at the message about creating a physical schema), and select the Local Agent with which to run it. If you get an error then click Details to find out the problem.

One common problem can be the connection through to the OBIEE server port, so to cut ODI out of the equation try this from the command prompt on your ODI machine (assuming it’s *nix):

nc -vz 9703

If the host resolves correctly and the port is open then you should get:

Connection to 9703 port [tcp/*] succeeded!

If not you’ll get something like:

nc: port 9703 (tcp) failed: Connection refused

Check the usual suspects – firewall (eg iptables) on the OBIEE server, firewalls on the network between the ODI and OBIEE servers, etc.

Assuming you’ve got a working connection you now need to create a Physical Schema. Right click on the new data server and select New Physical Schema.

OBIEE’s BI Server acts as a “database” to clients, within which there are “schemas” (Subject Areas) and “tables” (Presentation Tables). On the New Physical Schema dialog you just need to set Catalog (Catalog), and when you click the drop-down you should see a list of the Subject Areas within your RPD. Pick one – it doesn’t matter which.

Save the physical schema (ignore the context message). At this point your Physical Architecture for Oracle BI should look like this:

Now under Logical Architecture locate the Oracle BI technology, right click on it and select New Logical Schema. From the Physical Schemas dropdown select the one that you’ve just created. Give a name to the Logical Schema.

Your Logical Architecture for Oracle BI should look like this:

Building the Cache Management Routine

Full Cache Purge

Over in the Designer tab go to your ODI project into which you want to integrate the OBIEE cache management functions. Right click on Procedures and select Create New Procedure. Give it a name such as OBIEE Cache – Purge All and set the Target Technology to Oracle BI

Switch to the Tasks tab and add a new Task. Give it a name, and set the Schema to the logical schema that you defined above. Under Target Command enter the call you want to make to the BI Server, which in this case is

call SAPurgeAllCache();

Save the procedure and then from the toolbar menu click on Run. Over in the Operator tab you should see the session appear and soon after complete – all being well – successfully.

You can go and check your BI Server Cache from the OBIEE Administration Tool to confirm that it is now empty:

And confirm it through Usage Tracking:

From what I can see at the default log levels, nothing gets written to either nqquery.log or nqserver.log for this action unless there is an error in your syntax in which case it is logged in nqserver.log:

(For more information on that particular error see here)

Partial Cache Purge

This is the same pattern as above – create an ODI Procedure to call the relevant OBIEE command, which for purging by table is SAPurgeCacheByTable. We’re going to get a step more fancy now, and add a variable that we can pass in so that the Procedure is reusable multiple times over throughout the ODI execution for different tables.

First off create a new ODI Variable that will hold the name of the table to purge. If you’re working with multiple RPD Physical Database/Catalog/Schema objects you’ll want variables for those too:

Now create a Procedure as before, with the same settings as above but a different Target Command, based on SAPurgeCacheByTable and passing in the four parameters as single-quoted, comma-separated values. Note that these are the Database/Catalog/Schema/Table as defined in the RPD. So “Database” is not your TNS or anything like that, it’s whatever it’s called in the RPD Physical layer. Same for the other three identifiers. If there’s no Catalog (and often there isn’t) just leave it blank.

When including ODI Variable(s) make sure you still single-quote them. The command should look something like this:
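For example, with a hypothetical ODI variable #TABLE_NAME holding the table to purge, and illustrative physical Database and Schema names from the RPD Physical layer, the Target Command might be:

call SAPurgeCacheByTable('My_DW','','BISAMPLE','#TABLE_NAME');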

Now let’s seed the OBIEE cache with a couple of queries, one of which uses the physical table and one of which doesn’t. When we run our ODI Procedure we should see one cache entry go and the other remain. Here’s the seeded cache:

And now after executing the procedure:

And confirmation through Usage Tracking of the command run:

Cache Seeding

As before, we use an ODI Procedure to call the relevant OBIEE command. To seed the cache we can use SASeedQuery, which strictly speaking isn’t documented, but a quick perusal of nqquery.log when you run a cache-seeding OBIEE Agent shows that it is what is called in the background, so we’re going to use it here (and it’s mentioned in support documents on My Oracle Support, so it’s not a state secret). The documentation here gives some useful advice on what you should be seeding the cache with — not necessarily only exact copies of the dashboard queries that you want to get a cache hit for.

Since this is a cookie-cutter of what we just did previously you can use the Duplicate Selection option in ODI Designer to clone one of the other OBIEE Cache procedures that you’ve already created. Amend the Target Command to:

When you run this you should see a positive confirmation in the nqserver.log of the cache seed:

[2015-09-23T23:23:10.000+01:00] [OracleBIServerComponent] [TRACE:3]  
[USER-42] [] [ecid: 005874imI9nFw000jzwkno0007q700008K,0] [tid: 9057d700]  
[requestid: 477a0002] [sessionid: 477a0000] [username: weblogic]  
Query Result Cache: [59124]  
The query for user 'weblogic' was inserted into the query result cache.  
The filename is '/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache/NQS__735866_84190_2.TBL'. [[

A very valid alternative to calling SASeedQuery would be to call the OBIEE SOA Web Service to trigger an OBIEE Agent that populated the cache (by setting ‘Destination’ to ‘Oracle BI Server Cache (For seeding cache)’). OBIEE Agents can also be ‘daisy chained’ so that one Agent calls another on completion, meaning that ODI could kick off a single ‘master’ OBIEE Agent which then triggered multiple ‘secondary’ OBIEE Agents. The advantage of this approach over SASeedQuery is that cache seeding is more likely to change as OBIEE usage patterns do, and it is easier for OBIEE developers to maintain all the cache seeding code within ‘their’ area (OBIEE Presentation Catalog) than put in a change request to the ODI developers each time to change a procedure.

Integrating it in the ODI batch

You’ve two options here: Packages or Load Plans. Load Plans were introduced in ODI and are a clearer and more flexible way of orchestrating the batch.

To use this in a Load Plan, create a serial step that calls a mapping followed by the procedure to purge the affected table. In the procedure step of the Load Plan, set the value for the variable. At the end of the Load Plan, call the OBIEE cache seed step:

Alternatively, to integrate the above procedures into a Package instead of a load plan you need to add two steps per mapping. First, the variable is updated to hold the name of the table just loaded, and then the OBIEE cache is purged for the affected table. At the end of the flow a call is made to reseed the cache:

These are some very simple examples, but hopefully illustrate the concept and the powerful nature of integrating OBIEE calls directly from ODI. For more information about OBIEE Cache Management, see my post here.

Categories: BI & Warehousing

OBIEE BI Server Cache Management Strategies

Wed, 2015-09-23 11:36

The OBIEE BI Server cache can be one of the most effective ways of improving response times of OBIEE dashboards. By using data already in the cache it reduces load on the database, the network, and the BI Server.

Should you be using it? I always describe it as the “icing on the cake” – it’s not a fix for a badly-designed OBIEE system, but it does make a lot of sense to use once you’re happy that the foundations for the system are in place. If the foundations are not in place? Then you’re just papering over the cracks and at some point it’s probably going to come back to bite you. As Mark Rittman put it nearly seven years ago, it’s “[…]usually the last desperate throw of the dice”. The phrase “technical debt”? Yeh, that. But BI Server caching used after performance review and optimisation, rather than instead of it – then it’s a Good Thing.

So you’ve decided to use the BI Server cache, and merrily trotted over to Enterprise Manager to enable it, restarted the BI Server, and now your work is done, right? Not quite. Because the BI Server cache will start to store data from all the queries that you run, and use it to satisfy subsequent queries. Not only will it match on a direct hit for the same query, it will use a subset of an existing cache entry where appropriate, and can even aggregate up from what’s in the cache to satisfy a query at a higher level. Clever stuff. But, what happens when you load new data into your data warehouse? Well, the BI Server continues to serve requests out of the cache, because why shouldn’t it? And herein lies the problem with “just turn caching on”. You have to have a cache management strategy.

A cache management strategy sounds grand doesn’t it? But it boils down to two things:

  1. Accuracy – Flush any data from the cache that is now stale
  2. Speed – Prime the cache so that as many queries as possible get a hit on it, first time
Maintaining an Accurate Cache

Every query that is run through the BI Server, whether from a Dashboard, Answers, or more funky routes such as custom ODBC clients or JDBC, will end up in cache. It’s possible to “seed” (“prime”/“warmup”) the cache explicitly, and this is discussed later. The only time you won’t see data in the cache is if (a) you have BI Server caching disabled, or (b) you’ve disabled the Cacheable option for a physical table that is involved in providing the data for the query being run.

You can see metadata for the current contents of the cache in the Administration Tool when connected online to the BI Server, through the Manage -> Cache menu option. This gives you lots of useful information (particularly when you come to optimising cache usage) including the size of each entry, when it was created, when it was last used, and so on.

Purging Options

So we’ve a spread of queries run that hit various dimension and fact tables and created lots of cache entries. Now we’ve loaded data into our underlying database, so we need to make sure that the next time a user runs an OBIEE query that uses the new data they can see it. Otherwise we commit the cardinal sin of any analytical system and show the user incorrect data which is a Bad Thing. It may be fast, but it’s WRONG….

We can purge the whole cache, but that’s a pretty brutal approach. The cache is persisted to disk and can hold lots of data stretching back months – to blitz all of that just because one table has some new data is overkill. A more targeted approach is to purge by physical database, physical table, or even logical query. When would you use these?

  • Purge entire cache – the nuclear option, but also the simplest. If your data model is small and a large proportion of the underlying physical tables may have changed data, then go for this
  • Purge by Physical Database – less brutal than clearing the whole cache; if you have various data sources that are loaded at different points in the batch schedule then targeting a particular physical database makes sense.
  • Purge by Physical Table – if many tables within your database have remained unchanged, whilst a large proportion of particular tables have changed (or it’s a small table) then this is a sensible option to run for each affected table
  • Purge by Query – If you add a few thousand rows to a billion row fact table, purging all references to that table from the cache would be a waste. Imagine you have a table with sales by day. You load new sales figures daily, so purging the cache by query for recent data is obviously necessary, but data from previous weeks and months may well remain untouched so it makes sense to leave queries against those in the cache. The specifics of this choice are down to you and your ETL process and business rules inherent in the data (maybe there shouldn’t be old data loaded, but what happens if there is? See above re. serving wrong data to users). This option is the most complex to maintain because you risk leaving behind in the cache data that may be stale but doesn’t match the precise set of queries that you purge against.

Which one is correct depends on

  1. your data load and how many tables you’ve changed
  2. your level of reliance on the cache (can you afford low cache hit ratio until it warms up again?)
  3. time to reseed new content

If you are heavily dependent on the cache and have large amounts of data in it, you are probably going to need to invest time in a precise and potentially complex cache purge strategy. Conversely, if you use caching as the ‘icing on the cake’ and/or it’s quick to seed new content, then the simplest option is to purge the entire cache. Simple is good; OBIEE has enough moving parts without adding to its complexity unnecessarily.

Note that OBIEE itself will perform cache purges in some situations including if a dynamic repository variable used by a Business Model (e.g. in a Logical Column) gets a new value through a scheduled initialisation block.

Performing the Purge

There are several ways in which we can purge the cache. First I’ll discuss the ones that I would not recommend except for manual testing:

  1. Administration Tool -> Manage -> Cache -> Purge. Doing this every time your ETL runs is not a sensible idea unless you enjoy watching paint dry (or need to manually purge it as part of a deployment of a new RPD etc).
  2. In the Physical table, setting Cache persistence time. Why not? Because this time period starts from when the data was loaded into the cache, not when the data was loaded into your database.
    An easy mistake to make would be to think that with a daily ETL run, setting the Cache persistence time to 1 day might be a good idea. It’s not, because if your ETL runs at 06:00 and someone runs a report at 05:00, there is going to be a stale cache entry present for another 23 hours. Even if you use cache seeding, you’re still relinquishing control of the data accuracy in your cache. What happens if the ETL batch overruns or underruns?

    The only scenario in which I would use this option is if I was querying directly against a transactional system and wanted to minimise the number of hits OBIEE made against it – the trade-off being users would deliberately be seeing stale data (but sometimes this is an acceptable compromise, so long as it’s made clear in the presentation of the data).

So the two viable options for cache purging are:

  1. BI Server Cache Purge Procedures
  2. Event Polling Table
BI Server Cache Purge Procedures

These are often called “ODBC” Procedures, but technically ODBC is just one – of several – ways that the commands can be sent to the BI Server.

As well as supporting queries for data from clients (such as Presentation Services) sent as Logical SQL, the BI Server also has its own set of procedures. Many of these are internal and mostly undocumented (Christian Berg does a great job of explaining them here, and they do creep into the documentation here and here), but there are some cache management ones that are fully supported and documented. They are:

  • SAPurgeCacheByQuery
  • SAPurgeCacheByTable
  • SAPurgeCacheByDatabase
  • SAPurgeAllCache
  • SAPurgeCacheBySubjectArea (>=
  • SAPurgeCacheEntryByIDVector (>=

The names of these match up to the purge processes that I describe above. The syntax is in the documentation, but what I am interested in here is how you can invoke them. They are my preferred method for managing the BI Server cache because they enable you to tightly couple your data load (ETL) to your cache purge. Setting the cache to purge based on a drop-dead timer (whether crontab, tivoli, Agent/iBot, whatever) gives you a huge margin of error if your ETL runtime does not remain consistent. Whether it organically increases in runtime as data volumes increase, or it fails and has to be fixed and restarted, ETL does not always finish bang-on when it is ‘supposed’ to.

You can call these procedures in several ways, including:

  1. nqcmd – one of the most common ways, repeated on many a blog, but requires nqcmd/OBIEE to be installed on the machine running it. nqcmd is a command-line ODBC client for connecting to the BI Server
  2. ODBC – requires BI to be installed on the machine running it in order to make the OBIEE ODBC driver available
  3. JDBC – just requires the OBIEE JDBC driver, which is a single .jar file and thus portable
  4. Web Service – the OBIEE BI Server Web Service can be used to invoke these procedures from any machine with no dependencies other than some WSM configuration on the OBIEE server side.

My preference is for JDBC or Web Service, because they can be called from anywhere. In larger organisations the team building the ETL may have very little to do with OBIEE, and so asking them to install OBIEE components on their server in order to trigger cache purging can be quite an ask. Using JDBC only a single .jar needs copying onto the server, and using the web service not even that:

curl --silent --header "Content-Type: text/xml;charset=UTF-8" \
--user weblogic:Admin123 \
--data @purge_cache_soap.xml \

[59118] Operation SAPurgeAllCache succeeded!
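For the JDBC route, any language with a JDBC bridge will do. Here’s a rough Python sketch using the third-party jaydebeapi module – note that the driver class name and URL format shown are assumptions to verify against the documentation that accompanies bijdbc.jar:

import jaydebeapi

# Driver class and URL format are assumptions - check the bijdbc.jar docs
conn = jaydebeapi.connect('oracle.bi.jdbc.AnaJdbcDriver',
                          'jdbc:oraclebi://biserver:9703/',
                          ['weblogic', 'Admin123'],
                          '/path/to/bijdbc.jar')
try:
    curs = conn.cursor()
    curs.execute('call SAPurgeAllCache()')  # or SAPurgeCacheByTable(...), etc.
finally:
    conn.close()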

For details of configuring ODI to use the BI Server JDBC driver in order to tightly couple the cache management into an existing ODI load job, stay tuned for a future blog!

Event Polling Tables (EPT)

NB Not Event “Pooling” Tables as I’ve often seen this called

The second viable approach to automated cache purging is EPT, which is a decoupled approach to managing the cache purge, with two components:

  1. An application (your ETL) inserts a row into the table S_NQ_EPT (which is created at installation time by the RCU in the BIPLATFORM schema) with the name of the physical table in which data has been changed
  2. The BI Server polls (hence the name) the S_NQ_EPT table periodically, and if it finds entries in it, purges the cache of data that is from those tables.

So EPT is in a sense the equivalent of using SAPurgeCacheByTable, but in a manner that is not tightly coupled. It relies on configuring the BI Server for EPT, and there is no easy way to know from your ETL if the cache purge has actually happened. It also means that the cache remains stale potentially as long as the polling interval that you’ve configured. Depending on when you’re running your ETL and the usage patterns of your users this may not be an issue, but if you are running ETL whilst users are on the system (for example intra-day micro ETL batches) you could end up with users seeing stale data. Oracle themselves recommend not setting the polling interval any lower than 10 minutes.

EPT has the benefit of being very easy to implement on the ETL side, because it is simply a database table into which the ETL developers need to insert a row for each table that they update during the ETL.

Seeding the Cache

Bob runs an OBIEE dashboard, and the results are added to the cache so that when Bill runs the same dashboard Bill gets a great response rate because his dashboard runs straight from cache. Kinda sucks for Bob though, because his query ran slow as it wasn’t in the cache yet. What’d be nice would be that for the first user on a dashboard the results were already in cache. This is known as seeding the cache, or ‘priming’ it. Because the BI Server cache is not dumb and will hit the cache for queries that aren’t necessarily direct replicas of what previously ran, working out the optimal way to seed the cache can take some careful research. The documentation does a good job of explaining what will and won’t qualify for a cache hit, and it’s worth reading this first.

There are several options for seeding the cache. These all assume you’ve figured out the queries that you want to run in order to load the results into cache.

  1. Run the analysis manually, which will return the analysis data to you and insert it into the BI Server Cache too.
  2. Create an Agent to run the analysis with destination set to Oracle BI Server Cache (For seeding cache), and then either:
    1. Schedule the analysis to run from an Agent on a schedule
    2. Trigger it from a Web Service in order to couple it to your ETL data load / cache purge batch steps.
  3. Use the BI Server Procedure SASeedQuery (which is what the Agent does in the background) to load the given query into cache without returning the data to the client. This is useful for doing over JDBC/ODBC/Web Service (as discussed for purging above). You could just run the Logical SQL itself, but you probably don’t want to pull the actual data back to the client, hence using the procedure call instead.
Sidenote – Checking the RPD for Cacheable Tables

The RPD Query Tool is great for finding objects matching certain criteria. However, it seems to invert results when looking for Cacheable Physical tables – if you add a filter of Cacheable = false you get physical tables where Cacheable is enabled! And the same in reverse (Cacheable = true -> shows Physical tables where Cacheable is disabled)

Day in the Life of an OBIEE Cache Entry (Who Said BI Was Boring?)

In this example here I’m running a very simple report from SampleApp v406:

The Logical SQL for this is:

SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Why’s that useful to know? Because when working with the cache you need to resubmit queries frequently, and doing so directly from an interface like nqcmd is much faster (for me) than a web GUI. Horses for courses…

So I’ve run the query and now we have a cache entry for it. How do we know? Because we see it in the nqquery.log (and if you don’t have it enabled, go and enable it now):

[2015-09-23T15:58:18.000+01:00] [OracleBIServerComponent] [TRACE:3] [USER-42]  
[] [ecid: 00586hFR07mFw000jzwkno0005Qx00007U,0] [tid: 84a35700]  
[requestid: a9730015] [sessionid: a9730000] [username: weblogic]  
Query Result Cache: [59124] The query for user 'weblogic' was inserted into 
the query result cache.  
The filename is '/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache/NQS__735866_57498_0.TBL'.

We see it in Usage Tracking (again, if you don’t have this enabled, go and enable it now):


We can also see it in the Administration Tool (when connected online to the BI Server):

We can even see it and touch it (figuratively) on disk:

So we have the data in the cache. The same query run again will now use the cache entry, as seen in nqquery.log:

[2015-09-23T16:09:24.000+01:00] [OracleBIServerComponent] [TRACE:3] [USER-21]
[] [ecid: 11d1def534ea1be0:6066a19d:14f636f1dea:-8000-000000000000b948,0:1:1:5]  
[tid: 87455700] 
[requestid: a9730017] [sessionid: a9730000] [username: weblogic]  
Cache Hit on query: [[  
Matching Query: SET VARIABLE QUERY_SRC_CD='Report',SAW_SRC_PATH='/users/weblogic/Cache Test 01',PREFERRED_CURRENCY='USD';SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Created by:     weblogic

and in Usage Tracking:

“Interestingly” Usage Tracking shows a count of 1 for number of DB queries run, which we would not expect for a cache hit. The nqquery.log shows the same, but no query logged as being sent to the database, so I’m minded to dismiss this as an instrumentation bug.

Now what about if we want to run a query but not use the BI Server Cache? This is an easy one, plenty blogged about it elsewhere – use the Request Variable DISABLE_CACHE_HIT=1. This overrides the built in system session variable of the same name. Here I’m running it directly against the BI Server, prefixed onto my Logical SQL – if you want to run it from within OBIEE you need the Advanced tab in the Answers editor.

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',DISABLE_CACHE_HIT=1;SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Now we get a cache ‘miss’, because we’ve specifically told the BI Server to not use the cache. As you’d expect, Usage Tracking shows no cache hit, but it does show a cache insert – because why shouldn’t it?

If you want to run a query without seeding the cache either, you can use DISABLE_CACHE_SEED=1:

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',DISABLE_CACHE_SEED=1;SELECT
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

These request variables can be set per analysis, or per user by creating a session initialisation block to assign the required values to the respective variables.

Cache Location

The BI Server cache is held on disk, so it goes without saying that storing it on fast (eg SSD) disk is a Good Idea. There’s no harm in giving it its own filesystem on *nix to isolate it from other work (in terms of filesystems filling up) and to make monitoring it super easy.

Use the DATA_STORAGE_PATHS configuration element in NQSConfig.ini to change the location of the BI Server cache.
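For reference, the setting lives in the [CACHE] section of NQSConfig.ini and looks something like the excerpt below – the path is the SampleApp one seen in the nqquery.log entry earlier, and the size limit is just an illustrative value, not a recommendation:

[CACHE]
ENABLE = YES;
DATA_STORAGE_PATHS = "/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache" 500 MB;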

  1. Use BI Server Caching as the ‘icing on the cake’ for performance in your OBIEE system. Make sure you have your house in order first – don’t use it to try to get around bad design.
  2. Use the SAPurgeCache procedures to directly invoke a purge, or the Event Polling Tables for a more loosely-coupled approach. Decide carefully which purge approach is best for your particular caching strategy.
  3. If using the SAPurgeCache procedures, use JDBC or Web Services to call them so that there is minimal/no installation required to call them from your ETL server.
  4. Invest time in working out an optimal cache seeding strategy, making use of Usage Tracking to track cache hit ratios.
  5. Integrate both purge and seeding into your ETL. Don’t use a schedule-based approach because it will come back to haunt you with its inflexibility and scope for error.
Categories: BI & Warehousing

Tracing Data Lineage and Impact Analysis in Oracle Enterprise Metadata Management 12c

Fri, 2015-09-18 22:11

At the beginning of the year, I wrote a couple of posts about Oracle Enterprise Metadata Management, one about how to install the tool and another one explaining the harvesting (importing) and stitching (relating) processes. A lot of things have happened since then, including a new version of OEMM released in May with some nice features such as HTML 5 support, a workflow for the business glossary and the concept of groups in security.

OEMM is the Oracle solution for data governance and lifecycle management. “Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise”. So the processes of knowing how my data is related, ensuring its integrity, knowing who has access to which part of the data, with which application and from where, and defining the stewards of the data, are all part of data governance.

To refresh the main concepts: the first thing that you have to do when you start working with OEMM is the harvesting, or importing, process. We import the metadata, creating one model for each different metadata source. After a successful import, we need to create a configuration to relate all the required models.

One of the main features of OEMM is the ability to perform data flow lineage and impact analysis. Why is this so important? Because a typical problem that we face in IT is understanding how all the company data is related, what will happen if I modify one column in a source schema, or how I can find out where a particular piece of data comes from. In this post I will talk about these processes, which are exactly the same as in the previous version.

I created an environment with the latest release of OEMM, an Oracle Database 12c that contains the source, staging and DW schemas, and ODI 12c as the ELT tool to load and transform the data into our DW schema. In addition I also have OBIEE 11g, whose repository is based on the DW schema, and some OBI analyses to query the data. I imported all the metadata from these different sources and created a configuration that includes all of the resulting models.


Data Impact Analysis

The process of data impact analysis is to find out which data items in target databases and/or applications are affected if we change a source object, such as a column or a table.

There are several ways to start tracing data impact in OEMM: from the model that the object belongs to, or from a diagram.

We can open the model from the repository panel or from the configuration Architecture diagram. Once the model is opened, select the object on which you want to perform the impact analysis, right-click on it, and select the Trace Data Impact or Trace Lineage (Advanced) option. Either option opens the same dialog window.


Then we need to choose between the model and the configuration as the scope for this impact analysis. The most common option is to select a configuration, to see the consequences of a possible change in all the target models present in that configuration. We also have the option to see the result in graphical or in text mode.

In our example, we are going to trace the data impact of the ORDERS table, which is in the source database model and is the source for the fact table in the DW.

In the following picture, you will find the resulting diagram of the data impact analysis. In order to see the diagrams better in OEMM, you can collapse the Properties window on the right and press the Fit to Content button in the diagram menu.


There are plenty of interesting things that you can do from here. First, if you take a closer look you will see that the lines between Orders Source and Orders Staging, as well as the ones between Orders Staging and Orders_DW, are thicker than the ones that connect Orders_DW to the OBIEE model. This means there are ETL processes in the middle of these connections. When we right-click on one of these arrows and select Trace ETL details, a new tab is opened inside the configuration showing the mappings involved and the operations (a JOIN in our example). If we expand the Properties window and select the JOIN element we can also see its condition.


Coming back to the data impact diagram, you can select one particular object, such as a column, and press the Highlight Path button to emphasise only a specific data flow. This is very useful when you have many objects in the diagram and you need to focus on a particular data path.


If you want to keep this diagram saved so you can come back to it easily, you can add a bookmark. You can create folders to organise your bookmarks, and they will appear at the bottom left of the screen.




Tracing Data Lineage

The Trace Data Lineage process finds the path from the source through to the target object that you selected.

In OEMM, tracing data lineage is a very similar process to impact analysis. In fact, it uses the same dialog window but now you should select the Trace Data Lineage or Trace Lineage (Advanced) option. Again, either option opens the same dialog window.

In our example, we are going to use the table view from a particular OBI analysis. We will first perform the data lineage using the model as the scope, and then run another one in the context of the configuration, in order to make clear the differences between these two data lineages.

Once I have the OBIEE model opened, I double-click the required analysis, and the analysis with all its components (criteria and layout) appears in the Metadata Browser. I will first run the data lineage based on the model.


Another tab is shown with the data lineage based on the model. You can select a measure, for example, and see its properties, which show details such as the expression or formula (if it is a calculated item) and the default aggregation. Three main objects appear in the diagram (and can be collapsed or expanded as required): one that represents the Physical layer in the OBI repository (RPD), another for the Presentation layer, and finally the table view in the OBI analysis.

As you have probably realised by now, there are a lot of tabs and panels that appear as you open different objects and execute tasks. The panels are collapsible and can be resized, so you will need to do this a lot in order to visualise what you want.


We are going to repeat the same process, but now selecting the configuration that contains the OBIEE model. In this new diagram you will see where the data in the OBI analysis is coming from and its relationship with the other models in the configuration, including the ODI model and the different database schemas. You can expand or collapse the objects in the diagram, trace ETL details, highlight the path and use all the features that we’ve seen for the data impact diagram, as it is the same type of diagram.


Another useful new feature of this release is quick find, which allows you to search for specific words and shows a list of all the objects in the diagram that match the search criteria. If you double-click one of these results, the object is highlighted in the diagram.


Model Connection Overview

The Model Connection Overview diagram shows the connections between the objects inside a model. As its name suggests, it is just an overview of the connections: you cannot go deeper into the object details, but it is useful for getting a rough idea of the object relationships in a model. In the next pictures you will see an example of the Model Connection diagram for the OBIEE model and the diagram for a particular mapping in the ODI model. You can also save it as a bookmark, as with the other diagrams.


And this is how you can perform data lineage and impact analysis in OEMM, analysing the relationships among your company data.


Categories: BI & Warehousing

Managing Impala and Other Mixed Workloads on the Oracle Big Data Appliance

Tue, 2015-09-15 04:56

One of our current client projects uses Cloudera Impala to provide fast ad-hoc querying to the data we’re loading into their Oracle Big Data Appliance Hadoop environment. Impala bypasses MapReduce to provide faster queries than Hive, but to do so it does a lot of processing in-memory and runs server processes on each node in the cluster, leading in some cases to runaway queries blocking other workloads in the same way that OBIEE queries on an Oracle Database can sometimes block ETL and application workloads. Several projects share this same Big Data Appliance, so to try and limit the impact Impala could have on other cluster workloads the client had disabled the Impala daemons on nine of the twelve nodes in their Big Data Appliance; our concern with this approach was that an Impala query could access data from any datanode in the Big Data Appliance cluster, so whilst HDFS data is typically stored and replicated to three nodes in the cluster, running the Impala daemons on just a quarter of the available nodes was likely to lead to data locality issues for Impala, with blocks getting shipped across the network unnecessarily.

Going back to OBIEE and the Oracle Database, Oracle have a resource management feature for the Oracle database that allows you to put users and queries into separate resource pools and manage the share of overall resources that each pool gets. I covered this concept on the blog a few years ago, and the version of Cloudera Hadoop (CDH5.3) used on the client’s Big Data Appliance has a feature called “YARN”, or Yet Another Resource Negotiator, that splits out the resource management and scheduling parts that were bound into MapReduce in Hadoop 1.0. MapReduce then just runs as one workload type on Hadoop, making it possible to run other workload types, for example Apache Spark, on the same cluster management framework.



Impala isn’t however configured to use YARN by default and uses an internal scheduler to govern how concurrent queries run and use cluster resources, but it can be configured to use YARN in what Cloudera term “Integrated Resource Management”, and our initial response was to recommend this approach; however YARN is really optimised for longer-running batch jobs and not the shorter jobs that Impala generates (such that Cloudera recommends you don’t actually use YARN, and control Impala resource usage via service-level process constraints or through a new Impala feature called Admission Control instead). Taking a step back though, how do we actually see what resources Impala is using across the cluster when a query runs, and is there a feature similar to the Oracle Database’s SQL explain plan to help us understand how an Impala SQL query is executed? Then, using this and the various resource management options available to us, can we understand how YARN and the other options will affect the Impala users on the client’s cluster if we enable them? And, given that we were going to test this all out on one of our development Hadoop clusters running back at the office on VMware, how well could we simulate the multiple concurrent queries and mixed workload we’d then encounter on the real customer Big Data Appliance?

When trying to understand what goes on when a Cloudera Impala SQL query runs, the two main tools in your toolbox are EXPLAIN plans and query profiles. The concept of EXPLAIN plans will be familiar to Oracle developers, and putting “explain” before your Impala SQL query when you’re using the Impala Shell (or pressing the “Explain” button when you’re using the Impala Editor in Hue) will display an output like the one below, showing the steps the optimiser plans to take to return the query results:

[] > explain
select sum( as total_flights, d.dest_city
from flight_delays f join geog_dest d on f.dest = d.dest
join geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and   o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000;
Query: explain select sum( as total_flights, d.dest_city
from flight_delays f join geog_dest d on f.dest = d.dest
join geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and   o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000
| Explain String                                                      |
| Estimated Per-Host Requirements: Memory=154.01MB VCores=2           |
|                                                                     |
| 10:EXCHANGE [UNPARTITIONED]                                         |
| |                                                                   |
| 09:AGGREGATE [FINALIZE]                                             |
| |  output: sum:merge(                                     |
| |  group by: d.dest_city                                            |
| |  having: sum( > 3000                                    |
| |                                                                   |
| 08:EXCHANGE [HASH(d.dest_city)]                                     |
| |                                                                   |
| 05:AGGREGATE                                                        |
| |  output: sum(                                           |
| |  group by: d.dest_city                                            |
| |                                                                   |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.orig = o.orig                                 |
| |                                                                   |
| |--07:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  02:SCAN HDFS [airlines.geog_origin o]                            |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: o.orig_state IN ('Florida', 'New York', 'Alaska') |
| |                                                                   |
| 03:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.dest = d.dest                                 |
| |                                                                   |
| |--06:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  01:SCAN HDFS [airlines.geog_dest d]                              |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: d.dest_state = 'California'                       |
| |                                                                   |
| 00:SCAN HDFS [airlines.flight_delays f]                             |
|    partitions=1/1 files=1 size=64.00MB                              |
Fetched 35 row(s) in 0.21s

Like an Oracle SQL explain plan, Impala’s cost-based optimiser uses table and partition stats that you should have gathered previously using Impala’s “compute stats” command to determine what it thinks is the optimal execution plan for your query. To see the actual cost and timings for the various plan steps that are run for a query, you can then use the “summary” statement after your query has run (or, for more detail, the “profile” statement) to see the actual timings and stats for each step in the query execution.
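As an aside, if you haven’t gathered those statistics yet it is just a one-liner per table – sketched here from the OS shell via impala-shell, with table names following the example schema used in this post:

impala-shell -q "COMPUTE STATS airlines.flight_delays;"
impala-shell -q "COMPUTE STATS airlines.geog_dest;"
impala-shell -q "COMPUTE STATS airlines.geog_origin;"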

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 20.35us  | 20.35us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 142.18ms | 180.81ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 59.86us  | 123.39us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 171.72ms | 208.36ms | 60      | 1.93K      | 22.73 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 89.42ms  | 101.82ms | 540.04K | 131.88M    | 12.79 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 16.32us  | 19.63us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 302.83ms | 302.83ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 936.71ms | 1.10s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 19.02us  | 46.49us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 266.99ms | 266.99ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.07s    | 1.90s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

Output from the Summary statement gives us some useful information in working out the impact of the various resource management options for the Oracle Big Data Appliance, at least in terms of its impact on individual Impala queries – we’ll look at the impact on the overall Hadoop cluster and individual nodes later on. From the output of the above Summary report I can see that my query ran on all six nodes in the cluster (queries I ran earlier on a smaller version of the fact table ran on just a single node), and I can see how long each step in the query actually took to run. So what happens if I run the same query again on the cluster but disable the Impala daemon service role on three of the nodes, using Cloudera Manager?


Here’s the Summary output after running the query again:

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 22.01us  | 22.01us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 3      | 111.12ms | 117.24ms | 7       | 193        | 6.27 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 3      | 30.09us  | 39.02us  | 30      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 3      | 161.26ms | 173.57ms | 30      | 1.93K      | 22.84 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 3      | 156.50ms | 238.90ms | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 3      | 20.19us  | 28.93us  | 1.41K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 477.38ms | 477.38ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 3      | 1.48s    | 1.66s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 3      | 12.07us  | 14.89us  | 519     | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 308.83ms | 308.83ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 3      | 3.39s    | 6.85s    | 131.88M | 131.88M    | 74.11 MB  | 480.00 MB     | airlines.flight_delays_full f |

What the Summary statement doesn’t show you is the overall time the query took to run: the query run against three nodes took 9.48s, compared to 3.59s for the one before where I had all six nodes’ Impala daemons enabled. In fact I’d expect a query running on the client’s BDA with just three out of twelve nodes enabled to run even slower because of the block locality issue – Impala has a feature called block locality tracking which keeps track of where HDFS data blocks are actually located on the cluster and tries to run impalad tasks on the right nodes, but with only three out of twelve nodes running that job becomes really hard – and the other factor that we need to consider is how running multiple queries concurrently affects things when only a few nodes are handling all the Impala user queries.

To try and simulate concurrent queries running I opened six terminal sessions against nodes actually running Impala daemon service roles and submitted the same query from each session, with a second or two gap between each query; with all six nodes enabled the average response time rose to about 6s, but with just three enabled the response rose fairly consistently to around 27s.


This is of course what you’d expect when everything was trying to run on the same three (now resource-starved) server nodes, and again I’d expect this to be even more pronounced on the client’s twelve-node BDA. What this test of course didn’t cover was running workloads other than Impala on the same cluster, or running queries against different datasets, but it did at least show us how response-time increases fairly dramatically (albeit consistently) as more Impala users come onto the system.

So now we have some baseline benchmarking figures, let’s configure Impala to use YARN, using Cloudera Manager on the CDH5.3 setup used on the client’s BDA and our development cluster back in the office. There are actually two parts to Impala running on YARN in CDH5.x: YARN itself as the overall cluster resource management layer, and another component called Llama (Low-Latency, or “Long-Lived”, Application Master) that sits between YARN and Impala and reduces the time that each Impala query takes to obtain YARN resource allocations.

(Llama architecture diagram)

Enabling YARN and Llama (and, if you want to, configuring Llama and thereby Impala for high availability) is done through a wizard in CDH5.3 that also offers to set up a Linux feature called cgroups, which YARN can use to limit the “containers” it uses for resource management at the OS level.

Once you’ve run through the wizard and restarted the cluster, Impala should be configured to use YARN instead of its own scheduler to request resources, which in-theory will allow Hadoop and the Big Data Appliance to consider Impala workloads alongside MapReduce, Spark and HBase when scheduling jobs across the cluster. Before we get into the options YARN gives us for managing these workloads I ran the same Impala queries again, first as a single query and then with six running concurrently, to see what impact YARN on its own had on query response times.

The single query on its own took around the same time as without YARN to run (3-4s), but when I ran six concurrent queries together the response time went up from the 3-4s that I saw without YARN enabled to between 5s and 18s depending on the session, with quite a bit of variation between response times compared to the consistent times I saw when YARN wasn’t being used – which surprised me, as one of the stated benefits of YARN is making job execution times more predictable and smooth, though this could be more of an overall-cluster thing, and there are also recommendations in the Cloudera docs around making YARN and Llama’s resource estimation more efficient for Impala.

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 41.38us  | 41.38us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 115.28ms | 123.04ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 44.44us  | 67.62us  | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 170.91ms | 201.47ms | 60      | 1.93K      | 22.82 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 82.25ms  | 98.34ms  | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.39us  | 18.99us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 244.40ms | 244.40ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 850.55ms | 942.47ms | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 13.99us  | 19.05us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 222.03ms | 222.03ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.54s    | 2.88s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

But it seems clear that users of Impala on the client cluster should expect some sort of overhead from using YARN to manage Impala’s resources, with the payoff being better balance between Impala workloads and the other uses they’re putting the BDA cluster to – however I think there’s more we can do to fine-tune how Llama and YARN allocate memory to Impala queries up-front (allocating a set amount of memory for all queries, rather than making an estimate and then adding more memory mid-query if it’s needed), and of course we’ve not really tested it on a cluster with a full, mixed workload running. But what about our original scenario, where only a certain percentage of the overall cluster resources or nodes are allocated to Impala query processing? To set up that sort of division of resources we can use another feature of YARN called dynamic allocation, and dynamic resource pools that we can again set up through Cloudera Manager.

Dynamic allocation is one of the ways that YARN can be configured to manage multiple workloads on a Hadoop cluster (the other way is through static service pools, and I’ll come to those in a moment). Using dynamic allocation I can set up a resource pool for the airline flight delays application that my Impala SQL queries are associated with and allocate it 25% of overall cluster resources, with the remainder of cluster resources allocated to other applications. I can keep that weighting simple as I have done in the screenshot below, or I can allocate resources based on virtual cores and memory, but I found it simpler to just set these overall weightings and let YARN worry about cores and RAM. 


Depending on the scheduling policy you select, YARN will prioritise Impala and other jobs in different ways, but the recommended scheduling policy for mixed workloads is Dominant Resource Fairness, which balances RAM and CPU depending on which resource pool needs them most at a particular time. Note also that Impala can either be managed as part of the overall YARN workload or separately, a choice you can make in the Impala service configuration settings in Cloudera Manager (the “Enable Dynamic Resource Pools” setting that’s checked below, but was unchecked for the screenshot above).


There’s also a separate control you can place on Impala queries called Admission Control, which limits the number of queries that can run or be queued for a resource pool at any particular time. The docs are a bit vague on when to use admission control and when to use YARN or not, but my take on this is that if it’s just Impala queries you’re worried about, and throttling their use solves the problem, then use this feature and leave Impala outside of YARN; but if you need to manage overall mixed workloads then do it all through YARN. For my testing example though I just went with simple resource pool weighting, and you can see from the screenshot below, where multiple queries are running at once for my pool, that CPU and RAM resources are constrained as expected.


To make a particular Impala query run within a specific resource pool you can either allocate that user to a named resource pool, or you can specify the resource pool in your Impala shell session like this:

[] > set request_pool = airlines; 
REQUEST_POOL set to airlines
[] > select sum( as total_flights, d.dest_city
from airlines.flight_delays_full f join airlines.geog_dest d on f.dest = d.dest
join airlines.geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000;

Looking then at a typical summary output for a query running with these restrictions (25% of resources overall) and other queries running concurrently, the numbers don’t look all that different to before, and results took between 8s and 30s to return – again I was surprised at the variance, but I think YARN is more about overall cluster performance than individual queries, and you shouldn’t read too much into specific times on a dev server with an unrepresentative overall workload.

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 26.78us  | 26.78us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 209.10ms | 262.02ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 63.20us  | 118.89us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 282.56ms | 401.37ms | 60      | 1.93K      | 22.76 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 99.56ms  | 114.14ms | 540.04K | 131.88M    | 12.85 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.49us  | 17.94us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 531.08ms | 531.08ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 1.20s    | 1.54s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 24.29us  | 68.23us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 287.31ms | 287.31ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 2.34s    | 3.13s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

A point to note is that I found it very hard to get Impala queries to run when I got down to specifying virtual core and memory limits rather than just overall weightings, so I’d go with these high-level resource pool prioritisations which seemed to work and didn’t unduly affect query response times. For example the setting below looked clever, but queries always seemed to time out and I never really got a satisfactory setup working properly.


Note that for YARN dynamic resource pools to be used, all Linux/CDH users will need to be assigned to resource pools so they don’t run as “unconstrained”; this can also be done from the Dynamic Resource Pools configuration page.

Finally though, all of this management through dynamic resource pools might not be the best way to control Impala’s resource usage. The Cloudera docs say quite clearly on the Integrated Resource Management page that:

“When using YARN with Impala, Cloudera recommends using the static partitioning technique (through a static service pool) rather than the combination of YARN and Llama. YARN is a central, synchronous scheduler and thus introduces higher latency and variance which is better suited for batch processing than for interactive workloads like Impala (especially with higher concurrency). Currently, YARN allocates memory throughout the query, making it hard to reason about out-of-memory and timeout conditions.”

What this means in practice is that, if you’ve got a single project using the Big Data Appliance and you just want to specify at a high level what proportion of resources Impala, HBase, MapReduce and the other services under YARN management use, you can define this as static service pool settings in Cloudera Manager and have these restrictions enforced by Linux cgroups. In the screenshot below I unwound all of the dynamic resource pool settings I created a moment ago and allocated 25% of overall cluster resources to Impala, with the wizard then using those top-level values to set limits for services across all nodes in the cluster based on their actual RAM and CPU, the services running on them and so on.


Then, going back to Cloudera Manager and running some queries, you can see these static service pool limits being applied in real-time and their effect in the form of graphs for particular cluster resources.


So given all of this, what was our recommendation to the client about how best to set up resource management for Impala and other workloads on their Big Data Appliance? Not too much should be read into individual numbers – it’s hard to simulate a proper mixed workload on a development server, and of course their BDA has 12 nodes, more memory, faster CPUs. However it’s probably fair to say these are the obvious conclusions:

  • Running Impala daemons on just a subset of nodes isn’t actually a bad way to constrain the resources used by Impala, but it’ll only work on clusters with a small number of nodes, so that there’s a good chance one node will have one of the three copies of a data block. On a system of the scale of our customer’s, we’ll probably hit unacceptable overheads in terms of block locality. I would not carry on with this approach because of that.
  • If the customer BDA will be running a mixed workload, i.e. data loading, long-running Hive/Pig/Spark jobs as well as short-running Impala jobs, enabling Impala for YARN and setting overall resource pools for applications would be the best approach, but individual Impala queries will probably run slower than now (even given the restriction in resources), due to the overhead YARN imposes when scheduling and running jobs. However, this will be the best way to allocate resources between applications and provide a generally “smoother” experience for users.
  • If the BDA needs to be optimised mostly for Impala queries, then don’t manage Impala under YARN, leave it outside of this and just use Static service pools to allocate Impala roughly 25% of resources across all nodes. In both this and the previous instance (Impala on YARN) then all nodes should be re-enabled for Impala so as to minimize issues over block locality
  • If the only real issue is Impala queries for a particular application taking all resources/becoming runaway, Impala could be left outside of YARN but enabled for admission control so as to limit the total number of running/queued queries for a particular application.
Categories: BI & Warehousing

Oracle Big Data Discovery 1.1 now GA, and Available as part of BigDataLite 4.2.1

Sat, 2015-09-05 01:53

The new Oracle Big Data Discovery 1.1 release went GA a couple of weeks ago, and came with a bunch of features that addressed show-stoppers in the original 1.0 release: the ability to refresh and reload datasets from Hive, compatibility with both the Cloudera CDH and Hortonworks HDP Hadoop platforms, Kerberos integration, and the ability to bring in datasets from remote JDBC datasources. If you’re new to Big Data Discovery, I covered the initial release in a number of blog posts over the past year or so.

So let’s start by loading some data into Big Data Discovery so that we can explore what’s in it, see the range of attributes and their values and do some basic data clean-up and enrichment. As with Big Data Discovery 1.0 you import (or “sample”) data into Big Data Discovery’s DGraph engine either via file upload, or by using a command-line utility. Data in Hadoop has to be registered in the Hive HCatalog metadata layer, and I’ll start by importing a Hive table mapped to some webserver log files via a Hive SERDE:


To import or sample this table into BDD’s DGraph engine I use the following command to invoke the Big Data Discovery Data Processing engine, which reads the Hive table metadata, loads the Hive table data into the DGraph engine (either all rows, or a representative sample) and processes/enriches the data, adding geocoding for example:

[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t apachelog_parsed

This then runs as an Apache Spark job under YARN, progress of which you can track either from the console or through Cloudera Manager / Hue.

[2015-09-04T14:20:43.792-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:oracle] 
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 start time: 1441390841404
 final status: UNDEFINED
 tracking URL: http://bigdatalite.localdomain:8088/proxy/application_1441385366282_0001/
 user: oracle
[2015-09-04T14:20:45.794-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:oracle] Application report for application_1441385366282_0001 (state: ACCEPTED)

Going over to Big Data Discovery Studio I can then see the dataset within the catalog, and then load it into a project and start exploring and transforming the dataset. In the screenshots below I’m cleaning up the date and time field to turn it into a timestamp, and arranging the automatically-derived country, city, region and state attributes into a geography hierarchy. BDD1.1 comes with a bunch of other transformation enhancements including new options for enrichment, the ability to tag values via a whitelist and so on – a full list of new features for BDD1.1 can be found in MOS Doc.ID 2044712.1


Crucially, in BDD1.1 you can now either refresh a data set with new data from Hive (re-load), or do an incremental update after you’ve selected an attribute as the update (identity) column – in the screenshot below I’m doing this for a dataset uploaded from a file, but you can reload and update datasets from the command line too, which then opens up the possibility of scripting, scheduling and so on.


You can also define JDBC data connections in the administration part of BDD Studio, and then type in SQL queries to define data sources that can then be added into your project as a dataset – loading their data directly into the DGraph engine rather than having to stage it in Hadoop beforehand.


Then, as with the initial release of Big Data Discovery, you can define joins between the data sets in a project based on common attributes – in this case I’m joining the page URLs in the webserver logs with the page and article data extracted from our WordPress install, sourced from both Hive and Oracle (via JDBC)


Other new features in BDD1.1 include the ability to define “applications”, projects and datasets that are considered “production quality” and include details on how to refresh and incrementally load their datasets (designed presumably to facilitate migrations from Endeca Information Discovery), and a number of other new features around usability, data exploration and security. You can download BDD1.1 from Oracle’s Edelivery site, or download it pre-installed and configured as part of the new Big Data Lite 4.2.1 Virtualbox virtual machine.

Categories: BI & Warehousing

Rittman Mead’s First Courses Delivered in Spanish in Latin America

Fri, 2015-09-04 08:04

Last week, as part of the OTN Tour LA 2015, we delivered our first courses in Spanish, in Quito, Ecuador and Cali, Colombia.

It was a one-day (8 hour) workshop on OBIEE 11g front-end development.

The courses were delivered in person, with all of the theory and hands-on material in Spanish, and during the training each student had their own dedicated virtual machine on which to do the practical exercises, with the support and assistance of the instructor at all times.

I have been teaching courses for many years. The first course I delivered was back in 1999 (ahem, yes, in the last century). Since then I have taught many more on different Oracle products, including more than 10 years as an Oracle University instructor. In total I must have trained more than 500 people in my life.

For the last year and a half I have been happily working at Rittman Mead, developing and delivering courses in English, which has been a wonderful challenge. But delivering courses in Spanish again was, as they say, like coming home. It really was a lovely experience.


The feedback we received from the course was really very good, and it is great to have once again exceeded the participants’ expectations. Here are some examples of the feedback we received.



We would like to thank all of the course attendees for their participation, as well as our partners in Ecuador, Bluefrog and Refundation, and the Oracle User Groups of Ecuador and Colombia for organising the events.

If you would like to receive our courses in your own language, whether Spanish or Portuguese, or you would like to become one of our training partners, send us an email and we will get in touch with you right away.

Categories: BI & Warehousing

UKOUG Partner of the Year Awards 2015

Thu, 2015-09-03 08:48


It’s that time of year again for the UKOUG Partner of the Year Awards. This year we have been nominated for 4 awards:

  • Engineered Systems Partner of the Year Award
  • Business Analytics Partner of the Year Award
  • Training Partner of the Year Award
  • Emerging Partner of the Year Award

The awards are decided by “end users of Oracle-related products or services” i.e. you, so we would like to ask you to vote for us by going to this link.

I would like to propose four reasons why I think we deserve these awards.

Research, development and sharing

The culture at Rittman Mead has always been to delve into the tools we use, push their boundaries and share what we learn with the community. Internally, we have formalised this by having our own in house R&D department. People like Robin Moffatt, Jordan Meyer and Mark Rittman spend a lot of time and effort looking at the core Oracle data and analytics toolset to determine the optimal way to use it and see which other leading edge tools can be integrated into it.

This has given rise to a huge amount of freely available information ranging from a whole series on OBIEE performance tuning to drinks cabinet optimisation.

We have also worked with Oracle to produce a new version of their reference architecture that was the first one to incorporate the new wave of big data technologies and approaches such as Hadoop and a data reservoir.


One of the main drivers for our R&D department is to make us more effective at delivering data and analytics projects.

We are continually investigating common and new approaches and design patterns found in the world of ETL, data warehousing, business intelligence, analytics, data science, big data and agile project delivery, and combining them with our experience to define optimal ways to deliver projects.

Again, we share a lot of these approaches through talks at Oracle and community conferences, blog posts and routines shared on our GitHub repository.

Learning and education

Learning is key to everything we do in life, and as such, we see the provision of independent courses for Oracle business intelligence and data integration tools as key for the industry. We have developed our own training materials based on the different roles people play on projects, for example we have a Business Enablement Bootcamp aimed at end users and OBIEE Bootcamp aimed at developers. We know from our feedback forms how effective this training is.

To supplement the training materials we also wrote the official OBIEE Oracle Press book based around the same examples and data sets.


Our key role as an Oracle partner and member of the Oracle community is to optimise the value any organisation gets from investing in Oracle data and analytics related software and hardware.

This is something that requires a long term commitment, a high level of investment and a deep level of knowledge and experience, which is hopefully demonstrated above. To this end, we are prepared to often go beyond the level of information that Oracle can offer and in certain cases challenge their own understanding of the tools.

We were the first UK partner to buy an Exalytics server, for example, and have written a whole host of articles around the subject. Similarly we are the proud owner of a BICS pod and we are now evaluating how organisations can effectively use cloud in their strategic business intelligence architectures and then, if they do, the best approach to integrating it.

Finally, we are also investing heavily in user engagement, providing the capability to measure then optimise an organisation’s data and analytics systems. We believe user engagement is directly and measurably linked to the return organisations get from their investment in Oracle data and analytics software and hardware.

In Summary

So, in summary, I hope that the reasons that I outline above explain why we deserve some or all of the above awards, as they act as a great way to recognise the effort put in by all our staff over the course of the year. The voting link is here.

Categories: BI & Warehousing

The greatest OBIEE usage study ever…

Tue, 2015-09-01 09:03

Rittman Mead is excited to announce the first ever global OBIEE usage survey. Our goal is to provide deep analysis into how organisations use their OBIEE systems and to create a set of industry benchmarks. We will be releasing the results of this research for free to the OBIEE community.


The types of metrics we are looking at are:

  • DAU/MAU (daily average users/monthly average users) – this gives an approximation of utility;
  • Frequency of use – how often your users access dashboards;
  • Recency of use – when was the last time reports and dashboards were accessed;
  • Reach of system – how many users are accessing the system.
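For a flavour of where the metrics above come from: they are all derived from OBIEE’s usage tracking data, so the basis of the DAU figure, for example, is nothing more exotic than a count of distinct users per day against the standard S_NQ_ACCT usage tracking table. A simplified sketch, assuming usage tracking is written to an Oracle database:

SELECT TRUNC(start_ts) AS query_date,
       COUNT(DISTINCT user_name) AS distinct_daily_users
FROM   s_nq_acct
GROUP BY TRUNC(start_ts)
ORDER BY 1;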

Here’s how it works: we need your Usage Tracking data. To make providing this data easier we can send you a script for your system administrator to run to extract this. We even have an option to obfuscate key attributes so we can’t see any usernames or sensitive details.

Once we receive your data, we will analyse it individually and provide you with a free report designed to give you unique insight into your system’s usage, an example of which is available here.

We will also add your obfuscated, depersonalised and aggregated data to our benchmarking database and let you know how your system performs against industry standards.

Please note: you do have to be running the usage tracking feature of OBIEE for this to work. We strongly advise having this running in any system and can help you get it turned on, if required. Also any data you send to Rittman Mead is completely anonymous and holds no personal or sensitive attributes. It will only be used for benchmarking.

At the end of the survey we will perform a detailed analysis of the complete OBIEE usage database and publish the results.

How do I take part?

Please email us and we will send over the scripts and full instructions.

Why are we doing this?

We are currently focused on user engagement of BI and analytics systems and have been conducting research over the last few months. We have found very few tangible studies about enterprise BI usage, in particular OBIEE usage.

We are creating this database from OBIEE users around the world and will use this as the academic basis for furthering our research into user engagement and OBIEE.


Categories: BI & Warehousing

Three Easy Ways to Stream Twitter Data into ElasticSearch

Sat, 2015-08-29 00:46

For the past few months a friend has been driving me crazy with all his praise for Splunk. He was going on about how easy it is to install, integrate different sources and build reports. I eventually started playing around to see if it could be used for a personal project I’m working on. In no time at all I understood what he was on about and I could see the value and ease of use of the product. Unfortunately the price of such a product means it is not a solution for everyone so I started looking around for alternatives and ElasticSearch caught my eye as a good option.

In this post we will focus on how we can stream Twitter data into Elasticsearch and explore the different options for doing so. Storing data in Elasticsearch is just the first step, but you only gain real value when you start analysing this data. In the next post we will add sentiment analysis to our Twitter messages and see how we can analyse this data by building Kibana dashboards. But for now we will dig a bit deeper into the following three configuration options:

  • logstash, using its built-in twitter input
  • The Twitter River plugin
  • Tweepy, a Python library for the Twitter API

We will look at the installation and configuration of each of these and see how we can subscribe to twitter using the Twitter API. Data will then get processed, if required, and sent to Elasticsearch.


Why Use Elasticsearch

Elasticsearch has the ability to store large quantities of semi-structured (JSON) data and provides the ability to quickly and easily query this data. This makes it a good option for storing Twitter data which is delivered as JSON and a perfect candidate for the project I’m working on.
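As a quick illustration of that ad-hoc querying – a sketch only, with the index name taken from the logstash example later in this post – once some tweets are indexed you could pull back all those mentioning oracle with a simple URI search:

curl 'http://localhost:9200/idx_ls/_search?q=text:oracle&pretty'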


You will need a server to host all the required components. I used an AWS free tier (t2.micro) instance running Amazon Linux 64-bit. This post assumes you already have an elasticsearch cluster up and running and that you have a basic understanding of elasticsearch. There are some good blog posts, written by Robin Moffatt, which were very useful during the installation and configuration stages.

Twitter Stream API

In order to access the Twitter Streaming API, you need to register a Twitter application. Once created, you should be redirected to your app’s page, where you can get the consumer key and consumer secret, and create an access token under the “Keys and Access Tokens” tab. These values will be used as part of the configuration for all of the sample configurations to follow.

The API allows two types of subscription: you can either subscribe to specific keywords, or to a user timeline (similar to what you see as a twitter user).


We’ll start with logstash as this is probably the easiest one to configure, and it seems to be the recommended approach for integrating sources with Elasticsearch in recent versions. At the time of writing this post, logstash only supported streaming based on keywords, which meant it was not suitable for my needs, but it’s still a very useful option to cover.

logstash installation

To install logstash you need to download the correct archive based on the version of elasticsearch you are running.

curl -O

Extract the archived file and move the extracted folder to a location of your choice

tar zxf logstash-x.x.x.tar.gz
mv logstash-x.x.x /usr/share/logstash/

logstash configuration

To configure logstash we need to provide input, output and filter elements. For our example we will only specify input (twitter) and output (elasticsearch) elements as we will be storing the full twitter message.

For a full list of logstash twitter input settings see the official documentation.

Using your favourite text editor, create a file called twitter_logstash.conf and copy the text below into it. Update the consumer_key, consumer_secret, oauth_token and oauth_token_secret values with the values from your Twitter Stream App created earlier.

input {
    twitter {
        # add your data
        consumer_key => "CONSUMER_KEY_GOES_HERE"
        consumer_secret => "CONSUMER_SECRET_GOES_HERE"
        oauth_token => "ACCESS_TOKEN_GOES_HERE"
        oauth_token_secret => "ACCESS_TOKEN_SECRET_GOES_HERE"
        keywords => ["obiee","oracle"]
        full_tweet => true
    }
}

output {
    elasticsearch_http {
        host => "localhost"
        index => "idx_ls"
        index_type => "tweet_ls"
    }
}

This configuration will receive all tweets containing the keywords obiee or oracle and store them in an index called idx_ls in Elasticsearch.

To run logstash, execute the following command from the installed location

bin/logstash -f twitter_logstash.conf

If you subscribed to active twitter keywords you should see data within a few seconds. To confirm that your data is flowing you can navigate to http://server_ip:9200/_cat/indices?v, which will show you a list of indices with some relevant information.
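If you prefer the command line, the same check is just a curl away:

curl 'http://server_ip:9200/_cat/indices?v'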


With this easy configuration you can get Twitter data flowing in no time at all.

Twitter River Plugin

Next we will look at using the River Plugins to stream Twitter data. The only reason to use this approach over logstash is if you want to subscribe to a user timeline. Using this feature will show the same information as the Twitter application or viewing your timeline online.

Note: Twitter River is not supported from Elasticsearch 2.0 onwards and should be avoided if possible. Thanks to David Pilato for highlighting this point. It is still useful to know of this option for the very rare case where it might be useful.

Twitter River Plugin installation

Before installing the plugin you need to determine which version is compatible with your version of Elasticsearch. You can confirm this on the plugin’s project page and select the correct one.

To install you need to use the elasticsearch plugin installation script. From the elasticsearch installation directory, execute:

bin/plugin -install elasticsearch/elasticsearch-river-twitter/x.x.x

Then restart your Elasticsearch service.

Twitter River Plugin configuration

To configure the twitter subscriber we will again create a .conf file with the necessary configuration elements. Create a new file called twitter_river.conf and copy the following text. As with logstash, update the required fields with the values from the twitter app created earlier.

  "type": "twitter",
  "twitter" : {
        "oauth" : {
            "consumer_key" : "CONSUMER_KEY_GOES_HERE",
            "consumer_secret" : "CONSUMER_SECRET_GOES_HERE",
            "access_token" : "ACCESS_TOKEN_GOES_HERE",
            "access_token_secret" : "ACCESS_TOKEN_SECRET_GOES_HERE"
        "filter" : {
            "tracks" : ["obiee", "oracle"]
        "raw" : true,
        "geo_as_array" : true
  "index": {
    "index": "idx_rvr",
    "type": "tweet_rvr",
    "bulk_size": 100,
    "flush_interval": "5s"

This configuration is equivalent to the logstash configuration and will receive the same tweets from twitter. To subscribe to a user timeline instead of keywords, replace the filter configuration element:

"filter" : {
      "tracks" : ["obiee", "oracle"],

with a user type element

"type" : "user",

To start the plugin you need to execute the following from a terminal window.

curl -XPUT localhost:9200/_river/idx_rvr/_meta -d @twitter_river.conf

Depending on how active your subscribed tags are, you should see data in elasticsearch within a few seconds. You can again navigate to http://server_ip:9200/_cat/indices?v to confirm that your data is flowing. Note that this time you should see two new rows: one index called _river and the other called idx_rvr; idx_rvr is where your twitter data will be stored.

To stop the plugin (or change between keywords and user timeline), execute the following from a terminal window:

curl -XDELETE 'localhost:9200/_river/idx_rvr';


Tweepy

Finally we will look at the most flexible solution of them all. It is a bit more complicated to install and configure but, given what you gain, the small amount of extra time spent is well worth the effort. Once you have Tweepy working you will be able to write your own Python code to manipulate the data as you see fit.

Tweepy installation

As Tweepy is a Python package we will use pip to install the required packages. If you don't have pip installed, execute one of the following, depending on your Linux distribution.

yum -y install python-pip


apt-get install python-pip

Next we will install the Tweepy and elasticsearch packages

pip install tweepy
pip install elasticsearch

Tweepy configuration

Create a new Python file (for example, twitter_stream.py – the name is up to you) and copy the following text into it, again substituting the credentials from the Twitter app created earlier. Note that the Elasticsearch index and document type names used below are arbitrary choices – change them as you see fit.

import tweepy
import sys
import json
from textwrap import TextWrapper
from datetime import datetime
from elasticsearch import Elasticsearch

# Twitter app credentials
consumer_key = "CONSUMER_KEY_GOES_HERE"
consumer_secret = "CONSUMER_SECRET_GOES_HERE"
access_token = "ACCESS_TOKEN_GOES_HERE"
access_token_secret = "ACCESS_TOKEN_SECRET_GOES_HERE"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

es = Elasticsearch()

class StreamListener(tweepy.StreamListener):
    status_wrapper = TextWrapper(width=60, initial_indent='    ', subsequent_indent='    ')

    def on_status(self, status):
        try:
            #print '\n%s %s' % (status.author.screen_name, status.created_at)

            json_data = status._json
            #print json_data['text']

            # index the full tweet into elasticsearch (index and type names are arbitrary)
            es.index(index="idx_twp", doc_type="tweet_twp", body=json_data)
        except Exception, e:
            print e

streamer = tweepy.Stream(auth=auth, listener=StreamListener(), timeout=3000000000)

# Fill with your own keywords below
terms = ['obiee','oracle']

streamer.filter(None, terms)


As with the river plugin you can subscribe to the user timeline instead of keywords by changing the subscription call. To do this, replace the last line in the script (the streamer.filter call) with a call to the user stream:

streamer.userstream()
To start the listener, execute the Python file from a terminal window, for example:

python twitter_stream.py
Navigate to the elasticsearch index list again to ensure you are receiving data.


Getting Twitter data into Elasticsearch is actually pretty simple. Logstash is by far the easiest one to configure and if subscribing to keywords is your only requirement it should be the preferred solution. Now that we have the foundation in place, in the next post we will have a look at how we can enhance this data by adding sentiment analysis and how we can use this data to make decisions.

Categories: BI & Warehousing

Combining Spark Streaming and Data Frames for Near-Real Time Log Analysis & Enrichment

Sat, 2015-08-01 08:51

A few months ago I posted an article on the blog around using Apache Spark to analyse activity on our website, using Spark to join the site activity to some reference tables for some one-off analysis. In this article I'll be taking an initial look at Spark Streaming, a component within the overall Spark platform that allows you to ingest and process data in near real-time whilst keeping the same overall code-base as your batch-style Spark programs.


Like regular batch-based Spark programs, Spark Streaming builds on the concept of RDDs (Resilient Distributed Datasets) and provides an additional high-level abstraction called a "discretized stream" or DStream, representing a continuous stream of RDDs over a defined time period. In the example I'm going to create I'll use Spark Streaming's DStream feature to hold in-memory the last 24 hours' worth of website activity, and use it to refresh a "Top Ten Pages" Impala table once a minute.
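Conceptually the pattern is quite simple; as a rough illustration only (the implementation used in this article is the Scala code that follows), a PySpark sketch of the same DStream-plus-window idea might look like the snippet below, where the incoming directory name and the crude log-line split are just assumptions for the sketch:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="TopPagesSketch")
ssc = StreamingContext(sc, 60)                              # one-minute batches

lines = ssc.textFileStream("/user/oracle/rm_logs_incoming") # watch an HDFS directory for new log files
windowed = lines.window(86400, 60)                          # keep 24hrs of data, recompute every 60s

# count requests per URL within the current window (naive whitespace split of an Apache log line)
page_counts = (windowed.map(lambda line: (line.split(' ')[6], 1))
                       .reduceByKey(lambda a, b: a + b))

def show_top_ten(rdd):
    # collect the ten most-requested URLs in the current 24-hour window back to the driver
    for url, hits in rdd.takeOrdered(10, key=lambda kv: -kv[1]):
        print url, hits

page_counts.foreachRDD(show_top_ten)

ssc.start()
ssc.awaitTermination()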


To create the example I started with the Log Analyzer example in the set of DataBricks Spark Reference Applications, and adapted the Spark Streaming / Spark SQL example to work with our CombinedLogFormat log format that contains two additional log elements. In addition, I’ll also join the incoming data stream with some reference data sitting in an Oracle database and then output a parquet-format file to the HDFS filesystem containing the top ten pages over that period.

The bits of the Log Analyzer reference application that we reused comprise two scripts that compile into a single JAR file: one that creates a Scala object to parse the incoming CombinedLogFormat log files, and the other containing the main program. The log parsing object contains a single function that takes a set of log lines, then returns a Scala class that breaks the log entries down into the individual elements (IP address, endpoint (URL), referrer and so on). Compared to the DataBricks reference application I had to add two extra log file elements to the ApacheAccessLog class (referer and agent), and add some code in to deal with "-" values that could be in the log for the content size; I also added some extra code to ensure the URLs (endpoints) quoted in the log matched the format used in the data extracted from our WordPress install, which stores all URLs with a trailing forward-slash ("/").

package com.databricks.apps.logs

case class ApacheAccessLog(ipAddress: String, clientIdentd: String,
 userId: String, dateTime: String, method: String,
 endpoint: String, protocol: String,
 responseCode: Int, contentSize: Long,
 referer: String, agent: String) {
}

object ApacheAccessLog {
 val PATTERN = """^(\S+) (\S+) (\S+) \[([\w\d:\/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) ([\d\-]+) "([^"]+)" "([^"]+)"""".r

 def parseLogLine(log: String): ApacheAccessLog = {
  val res = PATTERN.findFirstMatchIn(log)
  if (res.isEmpty) {
   ApacheAccessLog("", "", "", "", "", "", "", 0, 0, "", "")
  }
  else {
   val m = res.get
   // treat "-" content sizes as zero, and add a trailing slash to endpoints to match the WordPress URL format
   val contentSizeSafe : Long = if (m.group(9) == "-") 0 else m.group(9).toLong
   val formattedEndpoint : String = (if (m.group(6).takeRight(1) == "/") m.group(6) else m.group(6).concat("/"))
   ApacheAccessLog(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), formattedEndpoint,
    m.group(7), m.group(8).toInt, contentSizeSafe, m.group(10), m.group(11))
  }
 }
}

The body of the main application script looks like this – I'll go through it step-by-step afterwards:

package com.databricks.apps.logs.chapter1

import com.databricks.apps.logs.ApacheAccessLog
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{StreamingContext, Duration}

object LogAnalyzerStreamingSQL {
  val WINDOW_LENGTH = new Duration(86400 * 1000)
  val SLIDE_INTERVAL = new Duration(60 * 1000)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
    val sc = new SparkContext(sparkConf)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Load the WordPress reference data from Oracle over JDBC and register it as a Spark SQL table
    val postsDF = sqlContext.load("jdbc", Map(
                  "url" -> "jdbc:oracle:thin:blog_refdata/",
                  "dbtable" -> "BLOG_REFDATA.POST_DETAILS"))
    postsDF.registerTempTable("posts")

    val streamingContext = new StreamingContext(sc, SLIDE_INTERVAL)

    val logLinesDStream = streamingContext.textFileStream("/user/oracle/rm_logs_incoming")

    val accessLogsDStream = logLinesDStream.map(ApacheAccessLog.parseLogLine).cache()

    val windowDStream = accessLogsDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL)

    windowDStream.foreachRDD(accessLogs => {
      if (accessLogs.count() == 0) {
        println("No logs received in this time interval")
      } else {

        // Filter out bots and internal WordPress URLs, then register the result as a Spark SQL table
        val accessLogsFilteredDF = accessLogs
                                      .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*"))
                                      .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin|wp-includes|favicon.ico|xmlrpc.php|wp-comments-post.php).*")).toDF()
        accessLogsFilteredDF.registerTempTable("accessLogsFiltered")

        val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) as total FROM accessLogsFiltered a JOIN posts p ON a.endpoint = p.POST_SLUG GROUP BY p.POST_TITLE, p.POST_AUTHOR ORDER BY total DESC LIMIT 10 ")

        // Persist top ten table for this window to HDFS as parquet file
        topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet", "parquet", SaveMode.Overwrite)
      }
    })

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}


The application code starts by importing Scala classes for Spark, Spark SQL and Spark Streaming, and then defines two variables that determine the amount of log data the application will consider: WINDOW_LENGTH (86400 x 1000 milliseconds, i.e. 24 hours), which determines the window of log activity that the application will consider, and SLIDE_INTERVAL, set to 60 x 1000 milliseconds or one minute, which determines how often the statistics are recalculated. Using these values means that our Spark Streaming application will recompute, every minute, the top ten most popular pages over the last 24 hours.

package com.databricks.apps.logs.chapter1
import com.databricks.apps.logs.ApacheAccessLog
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{StreamingContext, Duration}
object LogAnalyzerStreamingSQL {
 val WINDOW_LENGTH = new Duration(86400 * 1000)
 val SLIDE_INTERVAL = new Duration(60 * 1000)

In our Spark Streaming application, we're also going to load-up reference data from our WordPress site, exported and stored in an Oracle database, to add post title and post author values to the raw log entries that come in via Spark Streaming. In the next part of the script we define a new Spark context and then a Spark SQL context off-of the base Spark context, then create a Spark SQL data frame to hold the Oracle-sourced WordPress data that we'll later join to the incoming DStream data. Using Spark's new Data Frame feature and the Oracle JDBC drivers that I separately downloaded from the Oracle website, I can pull in reference data from Oracle or other database sources, or bring it in from a CSV file as I did in the previous Spark example, to supplement my raw incoming log data.

def main(args: Array[String]) {
 val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
 val sc = new SparkContext(sparkConf)

 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._

 val postsDF = sqlContext.load("jdbc", Map(
   "url" -> "jdbc:oracle:thin:blog_refdata/",
   "dbtable" -> "BLOG_REFDATA.POST_DETAILS"))
 postsDF.registerTempTable("posts")

Note also how Spark SQL lets me declare a data frame (or indeed any RDD with an associated schema) as a Spark SQL table, so that I can later run SQL queries against it – I'll come back to this at the end.

Now comes the first part of the Spark Streaming code. I start by defining a new Spark Streaming context off of the same base Spark context that I created the Spark SQL one off-of, then I use that Spark Streaming context to create a DStream that reads newly-arrived files landed in an HDFS directory – for this example I'll manually copy the log files into an "incoming" HDFS directory, whereas in real-life I'd connect Spark Streaming to Flume using FlumeUtils for a more direct connection to activity on the webserver.

val streamingContext = new StreamingContext(sc, SLIDE_INTERVAL)
val logLinesDStream = streamingContext.textFileStream("/user/oracle/rm_logs_incoming")

Then I call the Scala "map" transformation to convert the incoming DStream into an ApacheAccessLog-formatted DStream, and cache this new DStream in-memory. Next, and as the final part of this stage, I call the Spark Streaming "window" function, which packages the input data into, in this case, a 24-hour window of data, and creates a new Spark RDD every SLIDE_INTERVAL (in this case one minute).

val accessLogsDStream = logLinesDStream.map(ApacheAccessLog.parseLogLine).cache()
val windowDStream = accessLogsDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL)

Now that Spark Streaming is creating RDDs for me to represent all the log activity over my 24 hour period, I can use the .foreachRDD control structure to turn that RDD into its own data frame (using the schema I’ve inherited from the ApacheAccessLog Scala class earlier on), and filter out bot activity and references to internal WordPress pages so that I’m left with actual page accesses to then calculate the top ten list from.

windowDStream.foreachRDD(accessLogs => {
 if (accessLogs.count() == 0) {
  println("No logs received in this time interval")
 } else {
  // Filter out bots
  val accessLogsFilteredDF = accessLogs
   .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*"))
   .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin|wp-includes|favicon.ico|xmlrpc.php|wp-comments-post.php).*")).toDF()
  accessLogsFilteredDF.registerTempTable("accessLogsFiltered")

Then, I use Spark SQL's ability to join tables created against the windowed log data and the Oracle reference data I brought in earlier, to create a parquet-formatted file containing the top-ten most popular pages over the past 24 hours. Parquet is the default storage format used by Spark SQL and is best suited to BI-style columnar queries, but I could use Avro, CSV or another file format if I brought the correct library imports in.

val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) as total FROM accessLogsFiltered a JOIN posts p ON a.endpoint = p.POST_SLUG GROUP BY p.POST_TITLE, p.POST_AUTHOR ORDER BY total DESC LIMIT 10 ")
// Persist top ten table for this window to HDFS as parquet file
topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet", "parquet", SaveMode.Overwrite)

Finally, the last piece of the code starts-off the data ingestion process and then keeps it running until the process is interrupted or stopped:

streamingContext.start()
streamingContext.awaitTermination()


I can now go over to Hue and move some log files into the HDFS directory that the Spark application is monitoring, like this:


Then, based on the SLIDE_INTERVAL I defined in the main Spark application earlier on (60 seconds, in my case) the Spark Streaming application picks up the new files and processes them, outputting the results as a Parquet file back on the HDFS filesystem (these two screenshots should display as animated GIFs)


So what to do with the top-ten pages parquet file that the Spark Streaming application creates? The most obvious thing to do would be to create an Impala table over it, using the schema metadata embedded into the parquet file, like this:

CREATE EXTERNAL TABLE rm_logs_24hr_top_ten
LIKE PARQUET '/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet/part-r-00001.parquet'
LOCATION '/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet';

Then I can query the table using Hue again, or I can import the Impala table metadata into OBIEE and analyse it using Answers and Dashboards.
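As a quick aside, the same table can also be queried programmatically; the sketch below uses the impyla Python package rather than Hue or OBIEE, and since impyla isn't part of the setup described in this post, treat the package, hostname and port here as assumptions:

from impala.dbapi import connect   # pip install impyla

# hypothetical Impala daemon host; 21050 is the default HiveServer2-protocol port
conn = connect(host='cdh-node1.example.com', port=21050)
cur = conn.cursor()
cur.execute('SELECT * FROM rm_logs_24hr_top_ten ORDER BY total DESC')
for post_title, post_author, total in cur.fetchall():
    print post_title, post_author, total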


So that’s a very basic example of Spark Streaming, and I’ll be building on this example over the new few weeks to add features such as persistent storing of all processed data, and classification and clustering the data using Spark MLlib. More importantly, copying files into HDFS for ingestion into Spark Streaming adds quite a lot of latency and it’d be better to connect Spark directly to the webserver using Flume or even better, Kafka – I’ll add examples showing these features in the next few posts in this series.

Categories: BI & Warehousing

Oracle Data Integrator to load Oracle BICS and Oracle Storage Cloud Service

Mon, 2015-07-27 08:54

Oracle is focusing its sales on the cloud this year, and in the BI world we have seen the introduction of Oracle Business Intelligence Cloud Service – BICS – which we have already covered on the blog. Another great product for storing files and unstructured data is Oracle Storage Cloud Service, which can also be used as a staging area before loading a BICS schema. To provision our BI system in the cloud or to back up our files, it's important to find a reliable and maintainable way to move on-premise data to the cloud. And when we speak about a reliable and maintainable way to move data, I can't help but think about Oracle Data Integrator. Luckily, Ayush Ganeriwal wrote two excellent blog posts explaining how to use ODI to load these two cloud services! And he even gave away the Knowledge Modules, ODI Open Tools, Jar files and other things he developed to make it work.

Load data to Oracle Storage Cloud Service with a package.

Oracle Storage Cloud Service is used to store files and unstructured data in the cloud, very much like Amazon S3. A company that has moved to the cloud needs somewhere to keep the files that will be used by its other cloud services, and Oracle Storage Cloud Service is the place to store them. If you want to use it as a cloud backup of your local filesystem, you might even choose Oracle Storage Cloud Archive Service, which has cheaper storage but costs more when retrieving or editing data, similarly to Amazon Glacier.

Oracle Storage Cloud Service offers a RESTful web service API to load data, as well as a Java library directly wrapping this API and supporting client-side encryption. When it comes to loading data, what could be better than using the tool we already use for everything else? In this blog post, Ayush details how he created an ODI Open Tool that can load data to Oracle Storage Cloud Service using the Java library.

We first need to download the Java Library and place it in our ODI Studio or Agent classpaths. Below are the default locations for these classpaths, but we can also specify additional classpath if we want. Don’t forget to close ODI Studio and shutdown the agent(s) before adding the files.

ODI Studio on Unix : ~/.odi/oracledi/userlib
ODI Studio on Windows : %APPDATA%\Roaming\odi\oracledi\userlib
12c Standalone Agent : <ODI_AGENT_HOME>/odi/agent/lib
11g Standalone Agent : <ODI_AGENT_HOME>/oracledi/agent/drivers/

For a JEE agent, we need to create a server template as described in the documentation.

In the same folder, we can also drop the Jar file for the Open Tool that Ayush created and shared (the last link in the folder at the moment of writing). Remember that all the drivers, libraries and Open Tools have to be present in every ODI Studio or agent install.

We can now reopen ODI Studio as we need to register the new Open Tool in the master repository so everyone can use it. In the ODI menu, click on Add Remove Open Tools and then on the search icon to look up the Open Tool classes, which in our case sit under the package shown in the screenshot below.

Adding an ODI Open Tool

Select the OracleStorageCloudGet tool and click the Plus icon to add it. There we can see the description of the tool, the command, the parameters and the icon used. Let’s add the OracleStorageCloudPut as well and click OK.

Open Tool description

That’s it! We can now see two new ODI Open Tools under the Plugins category in the Package Toolbox. We can add it to the package and fill all the parameters or copy/paste the command to use it in an ODI Procedure.

Open Tools in a package

Load data to Oracle Business Intelligence Cloud Service with a mapping.

BICS offers a Data Loader, which is actually an Apex application taking Excel, CSV, text or zip files as input. But this is more of a manual process, used when working with sample data or really small datasets. If we plan to use BICS in a production environment, we need something more resilient that can be scheduled and that handles errors. We could use Data Sync, but maybe you don't want to introduce a new tool. Luckily, BICS also provides a RESTful web service API we can use to load data. Again, ODI is the perfect tool to create our own reusable components to integrate with new sources or targets. And once again, Ayush did all the work for us, as described in this article.

The first step is similar to what we did previously: we need to make all the drivers and libraries available to ODI Studio and the agent. The steps in the two Knowledge Modules contain Jython code, calling some Java methods from the odi-bics.jar file. These methods – just like the Oracle Storage Cloud Service Java library – use the Jersey libraries to make the RESTful web service calls, so we will need those as well. Basically, just take all the jar files from the archive Ayush published under the name "RKM and IKM for Oracle BI Cloud Service". As mentioned above, we need to add them to the classpath of the ODI Studio and agent installs and restart them.

The article mentions that we need to create a new Technology for BI Cloud Service, but an XML export is already included in the archive. Go to the Topology tab, right-click on Technology and import the XML file.

Adding a Technology

We can also import the two Knowledge Modules, either globally or under a single project. Ayush wrote one RKM to reverse-engineer the tables from our BICS schema and one Multi-Connections IKM to integrate the data from the on-premise database into the BICS schema in the cloud. I had to tweak the IKM a little bit, but that might be down to my particular setup. The steps were not displayed in the order set in the XML file and I had to reorder them to place the drop step before the create and truncate steps. This is probably a bug; I will investigate it later.

Reordering the steps in the KM

We are done with the install of the new components and we can now set up the topology. First create a Dataserver with the newly imported technology and provide a login/password and the URL to the BICS instance as the Service URL. From my testing, and unlike Ayush's screenshot, it should start with https and should not include the port number. Of course, we also need to create a physical schema, providing our Identity Domain in both Tenant fields; we can leave the Database fields empty. The last step to do in the Topology is to create a Logical schema and associate it with the Physical schema through a context.


We can then switch to the Designer tab and create a new Model with the BI Cloud Service technology and that Logical Schema. On the Reverse Engineer tab we select Customized and the RKM BI Cloud Service. We can also specify a mask to restrict the metadata import to specific table names before hitting the Reverse Engineer button at the top. I set the mask to DIM_LOC% and only my DIM_LOCATION table has been reverse-engineered from the BICS schema.

Model and reverse engineering

Finally, we can create a mapping. In this example I load two target tables. DIM_LOCATIONS (plural) sits in an on-premise Oracle Database for users accessing OBIEE from our HQ, while DIM_LOCATION (singular) is in the schema linked to our BICS instance used by remote users. There is nothing special here, except that I unselected the insert and update checkboxes for my surrogate key (LOCATION_SK) because I asked BICS to automatically populate it when I created the table – behind the scenes, a sequence is created and a trigger added to the table to populate the field with the next value of the sequence on each insert.


On the Physical tab, the LKM should be set to LKM SQL Multi-Connect because we will delegate the data transfer to the Multi-Connections IKM. This IKM is the one we imported: IKM SQL to BI Cloud Service. I chose to enable the truncate option here.


Let’s hit the execute button and watch the result in the operator and in our BICS schema.


The two articles demonstrate once more how easy it is to plug a new technology into ODI. The tool is a wonderful framework that can be used to move data to and from almost any technology, whether it's Big Data or Cloud as we have seen here. By having every integration job in the same place, we get easier maintenance and better monitoring, and we can schedule all the jobs together. It's also much easier to see the big picture across our projects.

If you have any questions about this post feel free to get in touch – we love this kind of interesting challenge at Rittman Mead.

Thanks a lot to Ayush for the excellent components and libraries he provided. I'm glad I will share the stage with him and a fellow ODI expert, Holger Friedrich, at Oracle Open World, where we will speak about Best Practices for Development Lifecycle Management. Come and see us and say hi!

Categories: BI & Warehousing

Taking a Closer Look at Knowledge Modules in ODI12c – Component-Style and Multi-Connect KMs

Sun, 2015-07-26 10:19

Another question that came up from the ODI12c Bootcamp Course I’m delivering for a client in London at the moment is how to choose between the different knowledge modules that come with ODI12c. What with the choice now between template-style KMs and the new component-style KMs, the new option of multi-connect KMs, and the general question around which KM you pick within a KM type when building a table mapping, I thought it’d be interesting to take a closer look at how knowledge modules work with ODI12c and how you go about making the right choice of KM when creating a mapping.

As a quick primer, Oracle Data Integrator up until the recent 12c release had six types of knowledge module you could use in a data mapping:


They were:

  • Load Knowledge Modules, for loading source data out of the source database server and into a staging table typically on the target database platform
  • Integrate Knowledge Modules, for taking that staging data and integrating (inserting, updating, merging etc) it into the target table
  • Reverse Knowledge Modules, for reverse-engineering the table metadata from a source system
  • Check Knowledge Modules, for performing data quality checks on source and target tables
  • Journalise Knowledge Modules, for setting up change data capture on a source table or table set
  • Service Knowledge Modules, for exposing tables or other datastore as CRUD-type web services

Using ODI11g as an example, when you created a new mapping you selected an LKM for extracting data out of your source database, an IKM for integrating the results into the target table, and optionally a CKM or JKM if you needed to run data quality checks or use table journalization (CDC). In all cases you had to first import the knowledge module definitions into the ODI11g Work Repository and your project before you could use them. To take an example, an ODI11g mapping where the source was a file and the target, an Oracle database, might look like this as a Mapping diagram:


Looking at the Flow diagram, at the start there are no knowledge modules to select from as none have yet been imported:


I therefore import a selection of IKM, LKM and other KMs for the technologies I’m using, and then I’m able to assign an LKM and IKM to my flow diagram.


For both Load Knowledge Modules (LKMs) and Integration Knowledge Modules (IKMs), you have a number of options ranging from generic JDBC/SQL-type modules that connect to a source and then transfer the data using JDBC batch routines, through to highly-specialized ones that leverage particular platform technologies. I typically start the prototyping phase of my ODI projects by selecting the simplest, most generic LKM and IKM I can find, and once I've got the mapping logic correct then shift to one that uses more of the underlying database's features – the docs also have a nice guide for making your KM selection. For example, I might assign the LKM SQL to Oracle knowledge module to the source table and the IKM SQL Control Append one to the target, like this:


In this case, ODI will first create a Java routine that extracts the rows of data from the file via a JDBC connection, and then loads that data into a staging table on the target database server. Then, that staging data will be integrated into the target table using a regular SQL INSERT statement.

Of course I could do this more efficiently using an Oracle External Table. Let’s select the LKM File to Oracle (EXTERNAL TABLE) load knowledge module and also change the IKM to IKM Oracle Incremental Update; note that this would only work if the target database server could see the file we’re loading in via this mechanism – if the file had to stay on a remote server then I’d have to stick with the IKM SQL Control Append and ODI would effectively ignore the request to use an Oracle External Table.


With ODI12c, things are a bit different due to the introduction of another type of knowledge module: “Component-Style” Knowledge Modules. Component-style KMs were introduced for a few reasons with ODI12c; they made migration of mappings and projects from OWB easier as OWB mappings are made up of lots of arbitrarily-arranged mapping operator components that can be combined into all-types of data flow, and they also made it possible for Oracle to create lots more of these granular mapping components and use them across all technology types. For example as Oracle’s David Allen talks about in this comment on one of our previous blog posts and in this follow-up blog of his own, Oracle could create a generic Table Function component-style KM and have it apply a SQL table function for Oracle sources, or run a Pig relation through an arbitrary Pig Latin script as I did in a more recent blog post.

These new component-style KMs come built-in to ODI12c which means that you don’t actually have to import any template-style KMs in to get started with a mapping; in the example below from my blog post earlier today on ODI12c and Oracle Streams, I can run the mapping I’ve just created by just selecting from the built-in component-style KMs that ship with ODI12c out-of-the-box.


So should we now use component-style KMs when creating mappings and avoid using the old template-style ones? The docs don't say this explicitly, but my impression is that component-style KMs are the way Oracle wants to take things forward and in most cases, ODI will automatically select suitable component-style LKMs and IKMs when you create the physical mapping and this is usually the best option; the only time I switch to a template-style IKM or LKM is if I'm using a platform technology that component-style KMs don't yet cover, or I'm working on some edge-case – by default though I go with component-style LKMs and IKMs. Of course you still need to import JKMs and other KM types and presumably Oracle will extend component KMs beyond IKMs and LKMs over time, but that's my recommendation for now.

So now we’ve got what component-style KMs are, there’s another new KM concept that came along with ODI12c – “Multi-Connect” KMs. Multi-Connect KMs are a special type of template-style KM that allows the staging area to be on a separate data server to the target data warehouse, whereas most template-style KMs assume the staging and target data schemas are on the same data server. I used a multi-connect IKM File-Hive to Oracle knowledge module towards the end of another article where I used Oracle Loader for Hadoop to export data out of Hadoop and into an Oracle Database; normally when you use an IKM the staging and target areas are on the same database, but in this case the staging table for the mapping was on the Hadoop (Hive) side and I therefore had to select LKM SQL Multi-Connection as the load knowledge module, whereapon I could then select IKM File-Hive to Oracle (OLH/OSCH) as the integration knowledge module.


Similarly, with our ODI12c mapping if I moved the staging area to the source database, or more commonly a separate “ETL-hub”-style database where the customer wants a more ETL (compared to ELT)-style integration setup, I can request that the staging location for the mapping moves to this hub database using a new feature in the logical mapping editor:


Then in the mapping I can add some transformations that take place on the ETL hub database, something you’d most probably do to avoid licensing ODI on their full data warehouse database server.


Then when I switch to the Physical mapping view I can see this additional execution unit with the transformations in it, I first select the LKM SQL Multi-Connect to bring the file data in, then I can select an LKM and IKM combination that supports multi-connect but uses an Oracle-specific technology (in this case, dblinks) to move the data from the ETL hub to the target database.


Or, if I didn’t like using dblinks or the ETL hub was on a non-Oracle platform, I could use the component-style LKM SQL Multi-Connect LKM at the access point and then a mono-source component-style IKM to do the data integration – note however that whenever we use generic JDBC connections and batch-extraction to bring data across to the target platform the data flows through the agent, which won’t be as efficient as using the multi-connect template-style KM that transfers data using SQL*Net and dblinks.


So – a few thoughts on the new knowledge module setup in ODI12c, and what these new component KMs and multi-connect KMs are actually for. You can find more articles on ODI12c and data integration over on our blog, if you’re interested in reading more about ODI12c development and internals.

Categories: BI & Warehousing

Using Streams with ODI12c for Oracle-to-Oracle Change Data Capture

Sun, 2015-07-26 05:10

Although Oracle GoldenGate replaced Oracle Streams a couple of years ago as the recommended data replication and change data capture technology for Oracle databases, many customers on Oracle Database 11gR2 or earlier still use Streams for Oracle-to-Oracle change data capture, as it works and, compared to GoldenGate, doesn't require any additional licensing. Oracle's GoldenGate Statement of Direction paper from 2014 states that Streams in Oracle 11gR2 will continue to be supported but that no future versions of the Oracle Database will come with Streams included, so if you're on 11gR2 and you just want to trickle-feed capture between two Oracle databases it's an interesting option.

I covered Oracle-to-Oracle data replication using Streams a few times in the past including this OTN article on ODI and Change Data Capture back from before 2007 or so, this article on OWB11gR2 and Change Data Capture from 2010 and once from back in 2006 that went into the details of setting up asynchronous hotlog change data capture with the new “Paris” OWB10gR2 release. We’re now on the 12c release of Oracle Data Integrator and I’m teaching our ODI12c Bootcamp course to a client next week who’s particularly interested in using Streams with ODI12c, so I thought it’d be worth taking a look at this feature in more detail to see if much has changed since the earlier articles.

Let’s start then with an ODI12c 12.1.3 install with a regular mapping set-up to copy and join the DEPT and EMP tables from one Oracle database into a denormalized table in another Oracle Database. Both are Oracle Database 11gR2 ( and the initial mapping looks like this:


One thing that many ODI developers don’t know about the 12c release is that it comes with a set of “component-style” knowledge modules built-into the tool, which you can use straightaway to get a mapping running without having to select and import IKMs, LKMs and other KMs from the ODI Studio filesystem. In my case the Physical mapping looks like the screenshot below with two execution units (one for each Oracle Database server) and a number of built-in component-style KMs available for selection. I choose the LKM SQL to SQL (Built-in) load knowledge module which uses a generic JDBC connection to load source records into the staging table on the target server, and then the IKM Oracle Insert integration knowledge module to take that staging data and integrate it into the target table.


I then run this mapping and see that ODI extracted data from the source database using a Java routine and batch-transferred it into the Oracle staging table, and then integrated the contents of that staging table into the target Oracle table. I could of course improve this by using the LKM Oracle to Oracle (DBLink) knowledge module and thereby avoid loading in two steps, but what I'd instead like to do is use Oracle Streams to trickle-feed new and changed data from my source tables over to the target database server, as shown in the diagram below.



In the OWB and Asynchronous Change Data Capture article I linked to earlier in the post, setting up change data capture involved quite a few steps; the database had to be put into archivelog mode, the GLOBAL_NAMES parameter had to be set and a whole bunch of PL/SQL procedures had to be called to set up the source-to-target connection. Once it's running, Streams takes transactions off of the redo log files on the source database and sends them across the network to the target database server in a similar way to how GoldenGate sends transactions in the trail file across to target database servers – except it's Oracle-to-Oracle only and in my experience is a lot more fragile than GoldenGate, which is why we and most other customers switched to GoldenGate when it came out.

ODI12c comes with a number of change data capture or “journalizing” knowledge modules that use either database triggers, UPDATE_DATE fields, Oracle Streams or GoldenGate to replicate data from source system to the target data warehouse. The journalizing knowledge module we’ll use, JKM Oracle 11g Consistent (Streams) is a template-style KM that needs to be imported first from the filesystem where ODI Studio was installed, as shown in the screenshot below – note also when you do this yourself that there’s a big “deprecated” notice next to it, saying that it could be removed at any time (presumably in-favour of the GoldenGate-based ones)


ODI and the JKM Oracle 11g Consistent (Streams) KM take a more "f*ck it, let's just do it" approach to database configuration than my OWB10gR2 version did, automatically configuring the source database for Streams by running all of the setup PL/SQL routines, leaving you just to put the database in archivelog mode if it's not already and to grant the connecting user (in my case, SYSTEM) Streams administrator privileges. Moving over to SQL*Plus on the source database I therefore run the setup commands listed in the KM notes like this:

SQL*Plus: Release Production on Sun Jul 26 06:34:44 2015


Copyright (c) 1982, 2011, Oracle.  All rights reserved.



Connected to:

Oracle Database 11g Enterprise Edition Release – 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options


SQL> grant dba to system;


Grant succeeded.


SQL> begin

  2  dbms_streams_auth.grant_admin_privilege(

  3       grantee   => 'system',

  4       grant_privileges => true);

  5  end;

  6  /


PL/SQL procedure successfully completed.


SQL> shutdown immediate;

Database closed.

Database dismounted.

ORACLE instance shut down.

SQL> startup mount;

ORACLE instance started.


Total System Global Area 2471931904 bytes

Fixed Size    2230872 bytes

Variable Size  570426792 bytes

Database Buffers  1895825408 bytes

Redo Buffers    3448832 bytes

Database mounted.

SQL> alter database archivelog;


Database altered.


SQL> alter database open;


Database altered.

Next I’ll go back to ODI Studio and enable the source Model for journalising by double-clicking on the model in ODI Studio and then selecting the JKM from the Journalizing tab, like this:


I then right-click on the EMP and DEPT tables within the source model and select Changed Data Capture > Add to CDC, where I can fine-tune the replication order so that any new departments (DEPTNO) that employees link to will always have been created before the corresponding employee record arrives.


It’s at this next stage, when I enable journalizing, that Streams is set-up on the source and target database servers and all the supporting tables and views are created. To enable journalizing I click on the model, not the individual tables, and select Changed Data Capture > Start Journal, like this:


If you’ve read any of our previous posts on ODI and changed data capture you’ll realise that this setup process is that same regardless of the underlying replication technology, which makes it easy to start with database-centric CDC technologies such as this and then move to GoldenGate later on without lots of rework or re-training. For now though lets run this setup process using the local agent and then check the Operator navigator to see what it did (and whether it worked…)


And it did work. Enabling journalising with the Oracle Streams JKM involves quite a few setup steps including checking that all the database settings and parameters are enabled correctly, then running the various DBMS_STREAMS and other packages to setup the capture and transmission process. Then, as with all of the ODI JKMs a set of J$ tables are created to hold the primary keys of new and changed records coming from the source system, along with JV$ views that join those primary keys to the full incoming replicated rows – this blog post by the Oracle ODI team explains the background to JKMs very well if you want to understand them in more detail. Looking at the source SCOTT schema in SQL*Developer I can see the CDC and J$/JV$ tables and views created in the schema; if I didn’t want these tables created in the actual data schema I could have specified a different schema as the WORK schema when I created the database connection in ODI Studio prior to this exercise.


Next I have to define one or more "subscribers" to the journals; for these more advanced "consistent set" JKMs, of which the Oracle Streams one is an example, you can define multiple consumers or "subscribers" to the changed data so that one can be further down the queue than the other (Simple JKMs only allow a single subscriber). I call my subscriber "SUNOPSIS" as this is the default subscriber name ODI adds to the mappings downstream.


Pressing OK after adding the subscriber again brings up the prompt to select an agent, and going over to the Operator navigator I can see that another set of steps have run, again doing some streams setup but also adding details of the journal subscribers to the tables created on the source database.


I can check that journalising is now working by using the View Data… feature in the Designer navigator to insert new rows into the EMP and DEPT tables, and then checking the J$ tables in the source database schema to see if the rows’ primary keys have been added – which they have been.


To now just read journalised new and changed data into my mapping, all I do then is go to the Physical mapping diagram, select the first source table and check the Use Journalized Data Only checkbox, then do the same for the other table (note it is the table source you select, not the access point for that table into the target execution block).


So now I’ll run the mapping and check the results in the Operator navigator … but instead, I get an error:


This is because, given the way that ODI handles CDC and journalising, it can’t allow two journalised tables to be directly joined in a mapping – we think this is because ODI can’t guarantee both tables are in the same update “state” and therefore makes you copy their data into a staging or intermediary table before you can do the join. I therefore amend the mapping to load the journalised tables into staging tables on the target database server, amend the joins and filter to reference the staging tables, and then join their contents to then filter and load into the target reporting table, like this:


Within the Physical mapping details, the two incoming Oracle staging tables are loaded by the same LKM SQL to SQL component-style knowledge module as we used before to extract data from the two journalised tables, and the Journalized Data Only flag is still set for the source tables, as you can see below.


What this also highlights is a key difference between the way I trickle-fed transactions across from my source database in the OWB and Changed Data Capture article back at the start of this post, and the way ODI's JKMs do it; in the OWB example I set up an actual trickle-feed process outside of OWB which transferred changed data across the network to my target data warehouse, whereupon I then read those change tables and used them to update the target DW tables in my data warehouse.

And this is actually how the GoldenGate KMs work with ODI – a GoldenGate replication process copies new and changed data from the source database to the target and ODI then reads from these change tables, whereas the Streams (and other non-GoldenGate) JKMs create the change capture tables back on the source database (the various J$ and JV$ tables I reviewed using SQL*Developer earlier on), with ODI then reading from those remote change tables and bringing across the new and changed data to the target database server. This I guess makes things easier to set up – you don’t have to worry about configuring the target database for streams replication – but it does mean that you still incur the network traffic every-time you micro-batch the changes across the network rather than spreading that traffic transaction-by-transaction.

Anyway, back to ODI and this time, when I run the mapping it works, though looking at the Operator navigator again I can see that no new data came across, and the journal data is still waiting to be consumed afterwards. Why is this?


If you’ve only used the Simple CDC JKMs from Oracle before this is normally all you need to do, but with Consistent Set ones such as this one, or the GoldenGate JKMs, you need to lock the subscriber view of the journalised data and extend the CDC window before you can access the journal records; for the Simple JKMs the IKM (Integration Knowledge Module) takes care of the unlock, extend, purge and lock operations for you automatically in the background, whereas with Consistent Set ones you typically do this as part of a wider ODI package as shown in the screenshot below, with the first and last tasks created by dragging and dropping the journalized model onto the package canvas and selecting Journalizing Model as the type (the subscriber name, “SUNOPSIS” in this case is typed-in below those settings and is off-screen in the screenshot)


Now when I run this package, as opposed to the mapping on its own, I get the row of new data I was expecting and the journal table is now empty, as I was the only subscriber.


Finally, if I was looking for real-time continuous loading into this target table, I could wrap the package in an event-detection loop that waits for in this case ten seconds for three journal rows to be written, then either processes the three as soon as the third arrives or loops around again every ten seconds (obviously in-reality you’d want to put in a mechanism to halt the loop if needed, but in my case I’ll just kill the job from the Operator navigator when I want it to stop)


So that’s the basics of using ODI with Oracle Streams for Oracle-to-Oracle changed data capture; if you’re just copying data between two Oracle databases and you’re on 11gR2 or earlier this might be an option, but long-term you’ll need to think about GoldenGate as Oracle aren’t developing Streams beyond the 11gR2 release. Note also that all the streams activity happens over on the source database server so you still need this additional step to copy the journaled data across to the target data warehouse, but it’s still a fairly non-invasive way to capture changes on the source Oracle database and it does have the considerable advantage (compared to GoldenGate) of being free-to-use.

Categories: BI & Warehousing

Security patches released for OBIEE and ODI DQ

Wed, 2015-07-15 00:14

Oracle issued their quarterly Critical Patch Update yesterday, and with it notice of several security issues of note:

  • The most serious for OBIEE (CVE-2013-2186) rates 7.5 (out of 10) on the CVSS scale, affecting the OBIEE Security Platform on both affected 11g releases. The access vector is the network, there's no authentication required, and it can partially affect confidentiality, integrity, and availability.
    • For one of the affected releases, the fix is to install the latest patchset (3GB, released – by no coincidence I'm sure – just yesterday too).
    • For the other affected release there is a small patch (64Kb), number 21235195.
  • There’s also an issue affecting BI Mobile on the iPad prior to, the impact being partial impact on integrity.
  • For users of ODI DQ there’s a whole slew of issues, fixed in CPU patch 21418574.
  • Exalytics users who are on ILOM versions earlier than 3.2.6 are also affected by two issues (one of which is 10/10 on the CVSS scale)

The CPU document also notes that this is the final patch date for the 10g release of OBIEE. If you are still on 10g, now really is the time to upgrade!

Full details of the issues can be found in the Critical Patch Update document, and information about the patches on My Oracle Support, Doc ID 2005667.1.

Categories: BI & Warehousing

Options for Enabling ODI11g+12c Standalone Agents for High-Availability (or … Why JEE Agents are the Best Option)

Sat, 2015-07-11 08:31

A few years ago we posted some articles on the blog around Oracle Data Integrator ETL restartability and resilience, and making ODI's agents highly available. The context around those posts was a large data integration project we were doing for a client who was deploying Oracle Fusion Middleware across the entire enterprise, and they wanted us to use the new ability within ODI11g to deploy agents within WebLogic Server Managed Servers to provide true high-availability for their ETL platform; ODI11g in this topology uses Oracle Coherence to provide a middleware-based data grid that holds a shared record of the agent schedule so that another agent can pick-up and resume tasks in the schedule if the main agent falls-over mid-way through execution. The diagram below shows this topology, which was then brought-forward into ODI12c and can be used now, through a feature called "JEE Agents".


And this is our standard recommendation to clients who want to make their ODI agents as highly-available as possible, so that there's always an active agent ready to receive execution or schedule requests, and schedules keep running even if the agent that's running them falls down mid-way through execution. But the JEE Agent approach has some significant prerequisites:

  • You need to deploy them within Oracle WebLogic Server Enterprise Edition, with at least two machines in a cluster within the WebLogic domain, which can be expensive for clients without the required WebLogic Server licenses or who haven’t got suitable hardware available
  • You also need Oracle Coherence for the shared record of ODI schedules that each JEE agent can get access to, and the client may not have a license for Coherence in their current WebLogic/Fusion Middleware license deal
  • The customer might also, understandably, want to properly understand what they can do with just Standalone Agents before they shell-out for WebLogic Server Enterprise Edition, Coherence and a bunch of new servers

For anyone new to Oracle Data Integrator, agents are Java processes that either sit within WebLogic Server clusters or directly on source and target servers, and take the instructions within your ODI mappings, packages and load plans and execute them as SQL instructions sent to databases, web service calls or whatever other interactions your ETL processes require. Agents can either receive instructions ad-hoc from the ODI Studio application or via command-line or web service calls, or they can be assigned one or more schedules that they then execute at the appropriate time to move data into your data warehouse or between various applications. ODI11g has two types of agents: JEE agents as mentioned earlier, and what are called Standalone Agents, agents that run in a Java JVM and are started from the command-line, that don't require WebLogic Server and are mainly used because of their small footprint – you can run them directly on the database server that'll be performing the Load and Transform part of an ELT job, for example, but they don't have the HA capabilities of agents running in WebLogic Server and they don't have access to all the "enterprise" Java features provided to agents running within WebLogic Server.

Borkur and I worked with just such a customer recently and it was a useful opportunity to dig into exactly how Standalone Agents handle failover, and how close you can get to JEE agents if you'd like some degree of high-availability for your ODI setup. As this was a new implementation of ODI for the customer and they were currently using OBIEE11g, their first thought was to deploy ODI11g until such time as they could upgrade their entire Fusion Middleware stack to 12c; as we were only talking about Standalone Agents though my advice was that there was no harm in deploying ODI12c instead of the 11g version as it didn't really involve any WebLogic integration, and they could then benefit from the improved developer features in ODI12c such as flow-based mappings and deployment specifications.

High-Availability for ODI11g Standalone Agents – OPMN and “Cross-Wiring” to Create a Single Physical/Logical Standalone Agent

So let’s start with Standalone Agents within an ODI11g deployment. In this instance, the customer had two datacenters and wanted to put an agent in each to provide some degree of HA and disaster recovery, with one agent at any one time being the “primary” agent that received job execution requests and was assigned the scheduled daily load. In this instance we were able to also deploy OPMN (Oracle Process Manager and Notification Server), installed as part of the Oracle Web Tier Utilities download and the recommend way to provide some degree of high-availability for ODI11g Standalone Agents.

What we thought would be interesting here though would be to try and follow John Goodwin’s instructions on cross-wiring two Standalone Agent + OPMN installs to provide active/passive failover across two server nodes. In John’s case he’s talking about agents running Hyperion data loading jobs but it’s just as applicable to our situation, so we went with the topology below to try-out active/passive failover for our ODI11g Standalone Agents across two nodes.


To start with then, each Standalone Agent is registered with its local OPMN instance so that opmnctl can be used to stop, start and restart it. In our setup, the primary ODI node currently has the active Standalone Agent instance running, as I can see by using the opmnctl command-line utility:

[oracle@odi11g-node1 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   21891 | Alive   
ohs1                             | OHS                |     N/A | Down

If I switch over to the second node, I'm currently leaving that agent disabled, as it's the "passive" part of my active/passive arrangement.

[oracle@odi11g-node2 ~]$ cd /home/oracle/Middleware/Oracle_WT1/instances/instance1/bin
[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |     N/A | Down    
ohs1                             | OHS                |     N/A | Down

The clever part now, and as detailed in John Goodwin’s article, is registering each OPMN server with the other server in the active/passive setup, so that if one goes down, OPMN brings up the other one. This method doesn’t seem to be documented in the ODI11g documentation and later on I’ll explain why I think this is, but to “cross-wire” the OPMN servers in this way you have to make a few manual changes to each OPMN server’s opmn.xml configuration file, in my case located at /home/oracle/Middleware/Oracle_WT1/instances/instance1/config/OPMN/opmn/opmn.xml:


1. First you create a new <topology></topology> section within the file that lists out the hosts that have OPMN installed that you want to include in the failover cluster (use the remote 6701 port rather than the local 6700 port for this)

2. Next you add the service-failover="1" and service-weight settings to the <process-type></process-type> section for the agent you want to fail over between the two nodes. With both opmn.xml files updated and the OPMN instances reloaded, I can test the behaviour by killing the active agent's process on the primary node; OPMN first restarts it locally, and only marks it as down after repeated failures:

[oracle@odi11g-node1 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   22064 | Alive   
ohs1                             | OHS                |     N/A | Down    
[oracle@odi11g-node1 bin]$ kill -9 22064
[oracle@odi11g-node1 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   22147 | Init    
ohs1                             | OHS                |     N/A | Down    
[oracle@odi11g-node1 bin]$ kill -9 22147
[oracle@odi11g-node1 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |     N/A | Down    
ohs1                             | OHS                |     N/A | Down

and then moving over to the secondary node’s OPMN instance, you can see that its agent has now changed from disabled to running:

[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |     N/A | Down    
ohs1                             | OHS                |     N/A | Down    
[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   10263 | Alive   
ohs1                             | OHS                |     N/A | Down

… which is pretty cool. But there are a couple of drawbacks to this approach. The first is that whilst OPMN can move the active agent between two or more server nodes, it doesn’t update the agent’s hostname in the ODI Master Repository, which means you either need to hack around with the ODI Master Repository tables to update its record of the hostname each time the agent fails over, or you do what I did and create a CNAME or VIP entry in the DNS server’s records so that the agent is reached via a virtual hostname rather than a physical node name.


The problem, though, is that you need a means of updating this CNAME or VIP record when OPMN fails the agent over to the secondary node; otherwise any further requests to this agent through ODI Studio will be directed to the failed node, not the one that’s now running the active Standalone Agent. In practice you could set up something that pings both agents and updates the CNAME record in your DNS zone when the primary agent moves (something like the sketch below), but this involves working with your organization’s network or sysadmin team, which can sometimes be a challenge.
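As a minimal sketch only: the script below assumes the agents answer HTTP on the default Standalone Agent port (20910) at the usual /oraclediagent path, that your DNS zone accepts dynamic updates via nsupdate, and the hostnames, record names and nameserver are all illustrative. Adjust everything to your own environment.

#!/bin/bash
# Hypothetical watchdog, run from cron: point the odi-agent CNAME at whichever
# node is currently running the active Standalone Agent.
PRIMARY=odi11g-node1
SECONDARY=odi11g-node2
AGENT_PORT=20910                 # default Standalone Agent port - adjust to your install
CNAME=odi-agent.example.com      # illustrative virtual hostname
ZONE=example.com

agent_up () {
  # the agent answers HTTP requests on its port when it is alive
  curl -s -o /dev/null --max-time 5 "http://$1:$AGENT_PORT/oraclediagent/"
}

if agent_up $PRIMARY; then
  TARGET=$PRIMARY
elif agent_up $SECONDARY; then
  TARGET=$SECONDARY
else
  exit 1                         # neither agent is up, so leave DNS alone
fi

# push the new CNAME via dynamic DNS (needs a suitable key and permissions)
nsupdate <<EOF
server ns1.$ZONE
update delete $CNAME CNAME
update add $CNAME 60 CNAME $TARGET.$ZONE.
send
EOF

In reality you’d want this wired into proper monitoring rather than a cron job, but it gives an idea of the glue that’s needed.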

The other issue is what happens when an ad-hoc or scheduled job is running when the active agent goes down. Let’s check this now by creating a Load Plan and then running it through what’s currently the active agent:


I’ll now kill the process, and let OPMN restart it on that server node:

[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   10263 | Alive   
ohs1                             | OHS                |     N/A | Down    
[oracle@odi11g-node2 bin]$ kill -9 10263
[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   10452 | Alive   
ohs1                             | OHS                |     N/A | Down

Notice how the pid (process ID) has now changed for the agent, which indicates that OPMN has restarted it on that node. Going back to ODI Studio, after a while I can see that the agent on restart has marked the load plan execution as having failed:


If I restart the load plan execution, though, a new load plan execution instance is created which automatically skips the steps that ran successfully before:
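As an aside, you don’t have to do this restart from ODI Studio; the agent’s bin directory also ships a restartloadplan.sh script that can be called from the command line or a wrapper script. The parameters below are indicative only (check the script’s usage output for your version), and the agent URL is the virtual hostname discussed earlier:

# run from the standalone agent's bin directory; instance id and log level are placeholders
./restartloadplan.sh <load_plan_instance_id> <log_level> "-AGENT_URL=http://odi-agent.example.com:20910/oraclediagent"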


If I kill the agent’s process three times in a row though, like this:

[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   10452 | Alive   
ohs1                             | OHS                |     N/A | Down    
[oracle@odi11g-node2 bin]$ kill -9 10452
[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   10557 | Init    
ohs1                             | OHS                |     N/A | Down    
[oracle@odi11g-node2 bin]$ kill -9 10557
[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |   10602 | Init    
ohs1                             | OHS                |     N/A | Down    
[oracle@odi11g-node2 bin]$ kill -9 10602
[oracle@odi11g-node2 bin]$ ./opmnctl status
Processes in Instance: instance1
ias-component                    | process-type       |     pid | status  
ODI11_Standalone                 | odiagent           |     N/A | Down    
ohs1                             | OHS                |     N/A | Down

the other agent will become the active agent, but it may take some time before it syncs up correctly with the Master Repository, because the CNAME created to point to the active node might take a while to change or be updated. In my case the load plan continues to display as “running” for a couple of minutes more, until it’s marked as failed.


What is good to see, though, is that the agent that OPMN has failed over to still has the schedule assigned to it; as far as the Master Repository is concerned it’s still the same logical/physical agent, even though OPMN has moved it to a new node, because we access it via the virtual hostname.


So using OPMN and this active/passive failover setup has the advantage of making sure there’s always an agent ready to receive ad-hoc job requests or run jobs on the schedule, but it requires you to maintain a CNAME or Virtual IP for the agent, and to restart any failed load plans, to make it work. So how does this all look with ODI12c?

High-Availability for ODI12c Standalone Agents – Node Manager and an Active/Active Setup

ODI12c has the same Standalone Agents and JEE Agents that ODI11g does, but introduces a third type of agent, the Co-Located Agent, which confuses things a bit more. Moreover, ODI12c Standalone Agents are managed by Node Manager rather than OPMN, which has been removed from the overall Fusion Middleware 12c architecture; Node Manager in this case runs within a lightweight pseudo-domain created especially for the ODI12c Standalone Agent, whilst the new Co-Located Agents run within a full WebLogic domain but outside of a WebLogic Server Managed Server.

What this means in practice is that each node’s Standalone Agent in an ODI12c install is managed by Node Manager rather than OPMN, but because a Node Manager outside of a full WebLogic domain can’t link up with another Node Manager to provide an active/passive high-availability setup, each Standalone Agent has to be active all the time (or as much as possible), in what is now an active/active HA setup.


Within ODI Studio and the Master Repository topology, then, each agent is registered separately and it’s the operator’s choice which agent jobs are sent to, and which agent runs the daily ETL schedule. You could of course set these agents up in a load-balancing cluster with a parent or master agent accepting all the job execution requests, but then you’d need to make that master agent highly-available itself, and you’re back to the same issue again.


Each agent in this instance can be started either directly from the command line, or via Node Manager like this (edited for brevity):

[oracle@odi12c-node1 bin]$ ./ ODI12_Primary_Standalone
Starting system Component ODI12_Primary_Standalone ...
Initializing WebLogic Scripting Tool (WLST) ...
Welcome to WebLogic Server Administration Scripting Shell
Type help() for help on available commands
Reading domain from /home/oracle/odi12c/user_projects/domains/base_domain
Please enter Node Manager password:
Connecting to Node Manager ...
Successfully Connected to Node Manager.
Starting server ODI12_Primary_Standalone ...
Successfully started server ODI12_Primary_Standalone ...
Successfully disconnected from Node Manager.
Exiting WebLogic Scripting Tool.

By default, as with ODI11g Standalone Agents and OPMN, Node Manager won’t automatically restart an agent under its control if it falls over. If, however, you set the CrashRecoveryEnabled setting to “true” in the /home/oracle/odi12c/user_projects/domains/base_domain/nodemanager/ file as shown in the screenshot below, ODI12c Standalone Agents that were started by Node Manager will get auto-restarted if the process fails.
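The change amounts to a single property; as a quick sanity check, something like this (assuming the standard nodemanager.properties file within that directory) confirms it is set:

[oracle@odi12c-node1 ~]$ grep -i crashrecovery /home/oracle/odi12c/user_projects/domains/base_domain/nodemanager/nodemanager.properties
CrashRecoveryEnabled=true

Node Manager reads this file at startup, so restart it after making the change.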


As you can see in the Terminal session screenshot below, Node Manager picks up the fact that the ODI12c Standalone Agent process it started is no longer running (by checking running processes against lock files the system component created on startup), and then automatically restarts it so that it’s running again at the end.


As with ODI11g’s Standalone Agents and OPMN, load plans that are running when an ODI12c Standalone Agent fails are themselves marked as failed, as shown in the screenshot below:


You can in fact restart failed load plans with either active agent – in this case I’m restarting it with the secondary Standalone Agent, simulating a complete failure of the first node rather than a process failure that Node Manager could automatically deal with.


Where things do get complicated with this active/active setup is when the agent you’ve assigned the schedule to fails. Unlike ODI11g, where there was just a single logical agent, in this case the second agent is completely independent of the first, and will need to have the schedule assigned to it if the primary agent fails; the primary agent, once its server node is running again, will then need to have the schedule taken away from it, otherwise you’ll have two agents acting as the scheduler, each running its own copy of the daily data load. One way around this would be to have the scheduling done by a third-party scheduler, which would either access the two active/active agents through a load balancer in a round-robin fashion, or use one agent where possible and fail over to the other; a simple sketch of the idea is shown below.
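For instance, a cron entry (or any enterprise scheduler) could kick off the daily load via the startloadplan.sh script that ships in the agent’s bin directory, directing it at whichever agent is currently preferred. Treat the path, load plan name, context and parameters below as illustrative only, and check the script’s usage output for the exact syntax in your version:

# hypothetical cron entry on a scheduling host: run the daily load at 02:00 via the primary agent
0 2 * * * /home/oracle/odi12c/user_projects/domains/base_domain/bin/startloadplan.sh LOAD_DWH_DAILY GLOBAL "-AGENT_URL=http://odi12c-node1:20910/oraclediagent"

The important point is that the scheduling logic then lives outside the agents themselves, so losing one agent doesn’t mean losing, or duplicating, the daily schedule.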

Conclusions – And why JEE Agents, WebLogic and Coherence Aren’t Such Overkill, After All…

Once you’ve been through all of this and looked at the limitations of a Standalone Agent-only setup, you can sort of see why Oracle went down the WebLogic and Coherence route for their fully HA-capable agent topology: Node Manager can work across multiple servers when you run it within a WebLogic Server domain, and using Coherence to store a local (middleware-level) copy of the schedule means that all of the agents can keep track of what’s been executed and where they are in the schedule. Coupled with a load balancer and a Virtual IP that provides a consistent address for all of the JEE Agents in the cluster, it’s a bit more engineering, but you can see why it was needed. In the meantime, the customer went with 12c in the end and put in place manual processes to deal with Standalone Agent failure in this active/active setup, which should be enough to tide them over until the next budget round, when they can put additional hardware and the JEE Agent setup into place.

Categories: BI & Warehousing

Essbase Studio checkModelInSyncExternal Error

Wed, 2015-07-08 02:12

This week I am back at one of my favourite clients to assist with some issues since their latest upgrade. These guys are seriously technical and know their stuff, so when issues pop up I’m always ready for an interesting and challenging day in the office.

As with all the recent 11g versions, they had to decide between in-place and out-of-place upgrades, and here they opted for an in-place upgrade because they have a fairly settled configuration using Answers, Publisher and Essbase. They were trying to avoid the need to reconfigure components like:

  • SQL Group Providers
  • Customisations
  • Application Roles used in OBIEE and Essbase Filters

Plus, when reconfiguring the above you also run the risk of missing something, and it could take a long time to track down what went wrong.

The Problem

In the end neither the problem nor the solution was very complicated, but seeing as we couldn’t find anything on the web or on Oracle Support regarding this error, I thought it might be useful to share it in case others run into the same issue.

After performing the OBIEE upgrade, the client was unable to view or edit the Essbase Model properties in Essbase Studio. In fact, they couldn’t even see their Essbase Model at all; only the Cube Schema was visible.


When we tried to select or edit the Schema the following message appeared:

Error in the external procedure 'checkModelInSyncExternal'. Line = 136.


Oracle Support Came To The Party

After trying several different options to fix it, none of which made any difference, the client raised a P1 query with Oracle Support. After going through the initial standard questions and a few messages between the client and Oracle Support, they came back with a solution. All of this within 24 hours…

The Reason

After applying the patch, the catalog version is no longer synchronised with the new version of the Essbase Studio Server.

The Solution

Even though we couldn’t find any reference to this in the post-patching section of the documentation, the Oracle Essbase Studio Readme does contain a section describing the problem and its solution.

To fix the problem we followed these simple steps:

  1. Navigate to
  2. Copy startCommandLineClient.bat.template to
  3. Rename the new startCommandLineClient.bat.template to startCommandLineClient.bat
  4. Edit startCommandLineClient.bat so it looks like this
    NOTE – Update the paths according to your environment AND use full paths
    @echo off
    set JAVA_HOME=C:\oracle\Oracle_BI1\jdk\jre
    "%JAVA_HOME%\bin\java" -Xmx128m %JAVA_OPTIONS% -jar "C:\oracle\Oracle_BI1\products\Essbase\EssbaseStudio\Server\client.jar"
    if %errorlevel% == 0 goto finish
    echo .
    echo Press any key to finish the client session
    pause > nul
    :finish
  5. Open a CMD window and start startCommandLineClient.bat
  6. Enter the requested information (Host, Username and Password)
  7. Enter the following command and wait for it to complete
  8. Type Exit to leave the script and close the CMD window

You should now be able to open the Model Properties in Essbase Studio without any issues.


Categories: BI & Warehousing

How Engaged Are Your OBIEE Users?

Thu, 2015-06-25 10:58

Following on from Jon’s blog post “User Engagement: Why does it matter?”, I would like to take this one step further by talking about measurement. At Rittman Mead we believe that if you can’t measure it, you can’t improve it. So how do you measure user engagement?


User engagement for OBIEE works much like it does for most web-based products or services:

  • both have users who access the product or service and then take actions.
  • users of both use it repeatedly if they get value from those actions.

A lot of thought has gone into measuring customer experience and engagement for web-based products and services. Borrowing some of these concepts will help us understand how to measure user engagement for BI solutions.

We look at three metrics:

  • Frequency of use
  • Recency of use
  • Reach of the system

Usage Tracking Data

OBIEE offers visibility of what its users are doing through its Usage Tracking feature; we can use this data to drive our metrics.

Figure 1


As we can see from Figure 1, the usage tracking data can support our three metrics.

Frequency of use
  • Number of times a user or group of users visit in a specific period (Day / Month / Year)
  • Number of times a dashboard / report is accessed in a specific period.
  • How are these measures changing over time?

Recency of use
  • How recently was a report / dashboard used by relevant user groups?
  • What is the average number of days between uses of each report / dashboard by relevant user groups?
  • Number of dashboards / reports used or not used in a specific period (Day / Month / Year)
  • Number of users that have used or not used OBIEE in a specific period (Day / Month / Year)
  • How are these changing over time?

Reach of the system
  • Overall number of users that have used or not used OBIEE. This can be further broken down by user groups.
  • How is it changing over time?

User engagement KPI perspective

We compared BI solutions to web-based products and services earlier in this post. Let’s look at some popular KPIs that many web-based products use to measure engagement, and at how they can be applied to OBIEE.

  • Stickiness: Generally defined as the amount of time spent at a site over a given period.
  • Daily Active Users (DAU): Number of unique users active in a day.
  • Monthly Active Users (MAU): Number of unique users active in a month.

DAU and MAU are also used as a ratio (DAU / MAU) to give an approximation of utility.
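If you want a quick feel for these numbers from your own system, something like the sketch below against the usage tracking data will get you started. It assumes the default S_NQ_ACCT usage tracking table with its USER_NAME and START_TS columns; the connection details are placeholders, and you would normally exclude system accounts such as BISystemUser from the counts.

# rough point-in-time DAU and MAU check against the usage tracking schema
sqlplus -s biee_ut/password@orcl <<EOF
-- users active today (DAU)
SELECT COUNT(DISTINCT user_name) AS dau
FROM   s_nq_acct
WHERE  start_ts >= TRUNC(SYSDATE);

-- users active over the last 30 days (MAU)
SELECT COUNT(DISTINCT user_name) AS mau
FROM   s_nq_acct
WHERE  start_ts >= TRUNC(SYSDATE) - 30;
EOF

Dividing the first number by the second gives a point-in-time DAU/MAU figure; the Rittman Mead User Engagement Toolkit builds the moving averages, trends and forecasts shown in the figures below on top of this kind of data.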

The R&D division of Rittman Mead has developed the Rittman Mead User Engagement Toolkit, a set of tools and reports to capture and visualise user engagement metrics. The example charts given below have been developed using the R programming language.

Figure 2 – DAU over time with a trailing 30-day average (Red line)


Figure 3 – Forecast DAU/MAU for 30 days after the data was generated


What Can You Do With These Insights?

Recall that Jon’s blog post points out the following drivers of user engagement:

  • User interface and user experience
  • Quality, relevance, and confidence in data
  • Performance
  • Ability to use the system
  • Accessibility – is the system available in the right way, at the right time?

There are several actions you can take to influence these drivers as a result of monitoring the metrics above:

  • Identify users or groups that are not using the system as much as they used to. Understand their concerns and address the user engagement drivers that are causing this.
  • Verify usage of any significant enhancement to the BI solution over time.
  • Analyse one of the key drivers, performance, from usage data.
  • Determine peak usage to project future hardware needs.

User engagement is ultimately how users get value from their OBIEE systems. Measuring user engagement on an ongoing basis is important, and it can be monitored with some standard metrics and KPIs.

Future blog posts in this series will address some of the key drivers behind user engagement in addition to providing an overview of the Rittman Mead User Engagement Toolkit.

If you are interested in hearing more about User Engagement please sign up to our mailing list below.


Other posts in this series
Categories: BI & Warehousing