
BI & Warehousing

Orphan Table Rows ... ugh!

Tim Dexter - Fri, 2015-10-09 09:57

This week, orphaned table rows and how to avoid them.

It's a bit more subtle than rows breaking across a page border, and the solution is a doozy!

I'm using another video to demonstrate because:

  1. I don't have to type and grab screenshots, even though I have one above.
  2. It's faster and more easily understood, even in my umming and erring English.
  3. I'm hip and happening and video help is the future, kids!
  4. You get to hear my Southern (England) drawl; a great sleep aid for insomniacs!

Here it is. You might want to 'fullscreen' it. Enjoy!

Categories: BI & Warehousing

Amazon Quick Sight – BI on Cloud?

Dylan's BI Notes - Thu, 2015-10-08 08:08
In my post Data Warehouses on Cloud – Amazon Redshift, I mentioned that what would be really useful is providing BI on Cloud, not just Data Warehouse on Cloud. I felt that BICS makes more sense compared to Amazon Redshift. I discussed this with a couple of people last night at a meetup.  Some of them […]
Categories: BI & Warehousing

Fundamentals of SQL Writeback in Dodeca

Tim Tow - Mon, 2015-10-05 21:00
One of the features of Dodeca is read-write functionality to SQL databases.  We often get questions as to how to write data back to a relational database, so I thought I would post a quick blog entry for our customers to reference.

This example will use a simple table structure in SQL Server though the concepts are the same when using Oracle, DB2, and most other relational databases.  The example will use a simple Dodeca connection to a JDBC database.  Here is the Dodeca SQL Connection object used for the connection.

The table I will use for this example was created with the following CREATE TABLE  statement.

CREATE TABLE [dbo].[Test](
[TestID] [int] IDENTITY(1,1) NOT NULL,
[TestCode] [nvarchar](50) NULL,
[TestName] [nvarchar](50) NULL,
CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED
  ([TestID] ASC)
)

First, I used the Dodeca SQL Excel View Wizard to create a simple view in Dodeca to retrieve the data into a spreadsheet.  The view, before setting up writeback capabilities, looks like this.
To make this view writeable, follow these steps.
  1. Add the appropriate SQL insert, update, or delete statements to the Dodeca SQL Passthrough Dataset object.  The values to be replaced in the SQL statement must be specified using the notation @ColumnName where ColumnName is the column name, or column alias, of the column containing the data.
  2. Add the column names of the primary key for the table to the PrimaryKey property of the SQL Passthrough DataSet object.
  3. Depending on the database used, define the column names and their respective JDBC datatypes in the Columns property of the SQL Passthrough Dataset.  This mapping is optional for SQL Server because Dodeca can obtain the required information from the Microsoft JDBC driver; however, the Oracle and DB2 JDBC drivers do not provide this information and it must be entered by the developer.
For insert, update, and delete operations, Dodeca parses the SQL statement to read the parameters that use the @ indicator and creates a JDBC prepared statement to execute the statements.  The prepared statement format is very efficient as it compiles the SQL statement once and then executes it multiple times.  Each inserted row is also passed to the server during the transaction.  The values from each row are then used in conjunction with the prepared statement to perform the operation.
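
To make the idea concrete, here is a minimal sketch in Python of the general pattern (the row values are made up; the Test, TestCode and TestName names come from the example table above): pull the @-prefixed parameters out of the SQL, swap them for prepared-statement placeholders, and bind each row's values in turn. Dodeca does the equivalent for you under the covers.

import re

# An insert statement using the @ColumnName notation described in step 1 above.
sql = "INSERT INTO Test (TestCode, TestName) VALUES (@TestCode, @TestName)"

params = re.findall(r"@(\w+)", sql)     # ['TestCode', 'TestName']
prepared = re.sub(r"@\w+", "?", sql)    # JDBC-style '?' placeholders

# Hypothetical rows coming back from the spreadsheet.
rows = [{"TestCode": "T01", "TestName": "First test"},
        {"TestCode": "T02", "TestName": "Second test"}]

for row in rows:
    bind_values = [row[name] for name in params]
    # cursor.execute(prepared, bind_values)   # run once per row against a real cursor
    print(prepared, bind_values)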

Here is the completed Query definition.

Next, modify the DataSetRanges property of the Dodeca View object and, to enable insert operations, set the AllowAddRow property to True.  Note that if you added update and/or delete SQL to your SQL Passthrough Dataset object, be sure to enable those operations on the worksheet via the AllowDeleteRow and AllowModifyRow properties.

Once this step is complete, you can run the Dodeca View, add a row, and press the Save button to save the record to the relational database.

The insert, update, and delete functionality using plain SQL statements is limited to operations on a single table.  If you need to do updates on multiple tables, you must use stored procedures to accomplish the functionality.  You can call a stored procedure in Dodeca using syntax similar to the following example:

{call sp_InsertTest(@TestCode, @TestName)}

Dodeca customers can contact support for further information.
Categories: BI & Warehousing

Do we really need semantic layer from OBIEE?

Dylan's BI Notes - Mon, 2015-10-05 09:22
Not all BI tools have a semantic layer.  For example, Oracle Discoverer does not seem to have a strong semantic layer. This page summarizes what the OBIEE semantic layer can do for you: BI Platform Semantic Layer. I think that if these features can be accomplished in other ways and can be proven that they are not necessary, […]
Categories: BI & Warehousing

Data Lake vs. Data Warehouse

Dylan's BI Notes - Sun, 2015-10-04 17:18
These are different concepts. Data Lake – Collect data from various sources in a central place.  The data are stored in the original form.  Big data technologies are used and thus the typical data storage is Hadoop HDFS. Data Warehouse – “Traditional” way of collecting data from various sources for reporting.  The data are consolidated […]
Categories: BI & Warehousing

Use DMZ to access BI from outside firewall

Dylan's BI Notes - Fri, 2015-10-02 13:12
A DMZ is a technique that allows you to configure part of your network to be accessible from outside the firewall. Some users may want to access corporate reports from mobile devices or from their personal computers. While VPN and Citrix may be useful for these cases, a DMZ can provide another option. A good article – OBIEE Security […]
Categories: BI & Warehousing

Once A Week Is Never Enough

Rittman Mead Consulting - Thu, 2015-10-01 04:48

Over the last few months we have been looking at User Engagement and recently asked people to take part in a global usage survey so we could create a series of benchmark metrics detailing how organisations used OBIEE.

We had a great response and have started to process the data. One of the first things that struck us was that the average user logs into their OBIEE system about once per week.



We derive the once per week figure from the DAU/MAU metric we have talked about here and here. DAU/MAU shows us the % of your user community repeatedly using the tool – so a kind of stickiness metric.

DAU and MAU stand for Daily Average Users and Monthly Average Users and are the unique number of users who log in and use the system over the period. The definition of DAU/MAU we are using is (30 day moving average of DAU)/(MAU) – using a moving average levels out any usage spikes on particular days of the week.
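
To make the definition concrete, here is a minimal sketch in Python that computes it from a list of (date, username) login events such as you might extract from Usage Tracking; the data and names here are made up for illustration.

import datetime as dt
from collections import defaultdict

def dau_mau(logins, as_of, window=30):
    """(30 day moving average of DAU) / (MAU) for the window ending at as_of."""
    days = [as_of - dt.timedelta(days=i) for i in range(window)]
    users_by_day = defaultdict(set)
    for day, user in logins:
        users_by_day[day].add(user)
    avg_dau = sum(len(users_by_day[d]) for d in days) / float(window)
    mau = len(set.union(set(), *(users_by_day[d] for d in days)))
    return avg_dau / mau if mau else 0.0

# Made-up example: one user logs in every day, another only once in the month.
logins = [(dt.date(2015, 9, 1) + dt.timedelta(days=i), 'bob') for i in range(30)]
logins.append((dt.date(2015, 9, 28), 'bill'))
print(dau_mau(logins, as_of=dt.date(2015, 9, 30)))   # roughly 0.52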

The reason this measure is significant is that it tells us how integrated the system is into people’s (working) lives. If you believe that the information in your OBIEE system is invaluable to your employees carrying out their jobs, then you would expect your DAU/MAU to tend to 100%.

Note: DAU/MAU won’t identify users who don’t log into your system at all; there is an assumption that they will log in at least once over the period of a month. If this is not happening then you have an adoption problem, not an engagement problem.

Is More Engagement Better?

My view is that the higher the engagement with the system, the more valuable the system is. So the questions become: is one visit per week enough; what does good look like; and what did you expect when you designed the system?

If you follow this reasoning, then we could assume that if we could get people using their OBIEE systems more, then their organisations would be better off.

Raising Engagement

We have previously discussed five levers that we think drive user engagement:

  • User interface and user experience
  • Quality, relevance, and confidence in data
  • Performance
  • Ability to use the system
  • Accessibility – is the system available in the right way, at the right time?

I want to talk about a couple of them here and how they can affect engagement.

User Interface And User Experience

We have been looking at report and dashboard design for a while now and believe that every dashboard should tell a story. When a user goes to a screen it must support the job they are doing. This means that dashboards have to be designed first and foremost around users’ business processes. We see too many dashboards that contain loosely coupled sets of data that developers thought should be in the same place.

If you need to understand what data a user or analyst needs to do their job, then observe them: look at which reports and spreadsheets they use, follow their eyes to see which cells on the various tables they monitor. Are they looking at trends, comparisons or exceptions?

We cover this in depth through the Visual Redesign phase of our User Engagement Service.

The following quote is from one of our clients who has used the service:

The new designs provide a simple and at the same time powerful view of data.

Quality, relevance, and confidence in data

However good your dashboard and report design is, if the underlying data is incorrect or inconsistent, or if the user thinks it is, then the system will not be valuable. Users will go to other sources for the data, and they will stop using the untrusted source. Too many times trust is purely subjective. This is one of the reasons why Excel starts to take root in organisations as a reporting tool. It is somewhere users can self-source data, transform and aggregate it as they please and, maybe misguidedly, trust the results.

This starts to open up a much wider conversation about Data Governance that I will address in future posts. We have spoken to a lot of organisations recently where the issue is not that their users don’t have access to data; it’s the fact that there are multiple systems they could get the answer from, and multiple routes around their information architecture along which the data flows. You need a clear business intelligence and analytics architecture and roadmap.

Business intelligence and analytics systems are typically judged by return on investment (ROI); however, I wonder if organisations with mature information architectures supported by complex data flows could be better off looking at the cost and complexity of their data architecture. By looking at this they could address both data governance and the total cost of ownership (TCO).

The first step should be to analyse the analytics driven data flows, find out where they apply business logic, where there is duplication, and where there are silos. If you can simplify this process then you start to build the foundation for building a better ROI business case.

So addressing quality, relevance, and confidence in data is a big undertaking. There are some simple things you can do, such as maintaining a taxonomy of reporting attributes across the organisation; however, in most organisations with mature information architectures the answer is likely to be more deep-rooted.


Our goal at Rittman Mead is to help companies harness data in and around their organisations through a range of analytics strategies, processes and tools. We see User Engagement as a key part of this. We believe in our User Engagement Service we have the right tools to help you raise your DAU/MAU.

We have a fundamental belief that the more people embed business intelligence and analytics into their work lives, the better job they will do for their company.

The great thing is that user engagement in OBIEE is measurable. Usage Tracking provides unequivocal details of how your system is performing, who is using it and how often, now and after any changes.

The post Once A Week Is Never Enough appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Taking a Look at Oracle Big Data Preparation Cloud Service – Spark-Based Data Transformation in the Cloud

Rittman Mead Consulting - Sun, 2015-09-27 04:54

One of the sessions I’m delivering at the upcoming Oracle Openworld 2015 in San Francisco is entitled “Oracle Business Intelligence Cloud Service—Moving Your Complete BI Platform to the Cloud [UGF4906]”, and looks at how you can now migrate your entire OBIEE11g platform into Oracle Public Cloud, including data warehouse and data integration routines. Using Oracle BI Cloud Service’s new on-premise RPD upload feature you can upload an existing RPD into BICS and run it from there, with the accompanying data warehouse database moving into Oracle’s Database Cloud Service (and with the restriction that you can’t then edit the repository within BICS; you need to do that on-premise and upload again). For ETL and data integration you can carry on using ODI12c, which now has the ability to load data into Oracle Storage Cloud (for file sources) and BICS (via a REST API) as well as the full Oracle DBaaS, but another Oracle option for doing purely cloud-based data processing and enrichment has recently become available – Oracle Big Data Preparation Cloud Service. So what is it, how does it work, and how is it different to ODI12c?

Oracle Big Data Preparation Cloud Service (“BDP”) is a thin-client application within Oracle Cloud for ingesting, preparing and enriching datasets that don’t have a predefined schema and may well need certain fields obfuscated or cleansed. Being integrated with Oracle Storage Cloud and other infrastructure and platform services within Oracle Cloud, it’s obviously aimed mainly at data transformation tasks within the Oracle Cloud environment, but you can upload and download datasets from your browser for use with on-premise applications. Unlike the more general-purpose Oracle Data Integrator it’s aimed instead at a particular use-case – non-technical information analysts who need to get data transformed, wrangled and enriched before they can make use of it in an environment like Hadoop. In fact the product name is a bit misleading – it runs on a big data platform within Oracle Cloud and, like Oracle Big Data Discovery, uses Apache Spark for its data processing – but it could potentially be useful for a business analyst to prepare data for loading into Oracle BI Cloud Service, and I’ll cover this angle when I talk about data loading options in my Oracle Openworld session.

Within a logical architecture for a typical big data DW and BI system, BDP sits alongside ODI within the Data Factory and provides self-service, agile transformation capabilities to more business-orientated users. 


Oracle Big Data Preparation Cloud Service shares quite a bit of functionality and underlying technology with Oracle Big Data Discovery – both run on Hadoop, both use Apache Spark for data processing and transformation, and both offer data transformation and “wrangling” features aimed at non-technical users. Oracle are positioning Big Data Preparation Service as something you’d use in the execution layer of the Oracle Information Management Reference Architecture whereas Big Data Discovery is associated more with the discovery layer – I’d mostly agree, but I can see a role for BDD even within the execution layer, as a front-end to the data reservoir that typically now runs alongside relationally-stored data warehouses.


Looking back at the slides from one of the recent Strata conferences, for example, you can see Oracle positioning BDP as the “operational data preparation” tool for structured and unstructured data – with no defined schema – coming into your information platform, with the enriched output then being used by BI tools, enterprise reporting and data discovery tools.



Apart from the scalability benefits of running BDP on Apache Spark, the other interesting feature in BDP is how it uses Spark’s machine learning capabilities to try to automate as much of the data preparation process as possible, for example detecting credit card numbers in data fields and recommending you obfuscate that column. Similar to how Oracle have tried to simplify the process of creating reports and dashboards for a small team with BICS, BDP runs in the cloud and tries to automate and simplify as much of the data preparation and enrichment process as possible, with ODI12c still available for ETL developers to develop more complex transformations.

The development lifecycle for BDP (from the Oracle Big Data Preparation Cloud Service e-book on Oracle’s website) uses a cycle of ingesting, cleaning, enriching and then publishing data using scripts authored using the tool and run on the Apache Spark platform. The diagram below shows the BDP development lifecycle from Oracle’s Big Data Preparation Cloud Service Handbook, and shows how ingestion, enrichment, publishing and governance go in a cycle with the common foundation of the transformation scripts that you build using BDP’s web interface.


So let’s walk through an example data preparation exercise using a file of data stored initially in Oracle Storage Cloud Service. After logging into BDP via Oracle Cloud you’re first presented with the Catalog view, listing out all your previous transformations and showing you when they were last used to process some data.


To create a transformation you first give it a name, then select the data source and then the file you’re interested in. In my environment I’ve got Oracle Storage Cloud and HDFS available as my main data sources, or I could upload a file from my desktop and start from there.


BDP then ingests the file and then uses its machine learning features to process and classify data in each column, recommending column names such as “gender”, “city” and cc_number based on (presumably) some sort of classification model. In the screenshot below you can see a set of these recommendations on the left-hand side of the screen, with the columns themselves listed centre and a summary of the file profiling on the right.


Taking a closer look at the profile results panel you can see two of the columns have alerts raised, in red. Clicking on the alert shows that the two columns have credit card data stored in clear text, with the recommendation being to obfuscate or otherwise secure these fields. Clicking on a field then shows the various transformation options, with the obvious choice here being to automatically obfuscate the data in those fields.


Once you’ve worked through all the recommendations and added any transformations you choose to add yourself, the final step is to publish your transformation to one of the available targets. In the example below we’ve got Oracle Storage Cloud and HDFS again as potential targets; I’d imagine Oracle will add a connector to BICS soon, for example, so that you can use BDP as a data prep tool for file data that will then be added to your dataset in BICS.


So … it’ll be interesting to see where this one goes. It’s interesting that Oracle have split out data preparation and data discovery into two tools whilst others are saying theirs can do both, and you’ll still need ODI for the more complex integration jobs. But I like the innovative use of machine learning to do away with much of the manual work required for classification of incoming data fields, and running the whole thing on Spark certainly gives it the potential to scale. A couple of years ago I was worried Oracle didn’t really have a strategy for data integration and ETL in the cloud, but we’re starting to see something happen now.

There’s a big push from the Oracle field at the moment to move customers into the cloud, and I can see BDP getting bundled in with Big Data Cloud Service and BICS as the accompanying cloud data preparation tool. The danger then of course is that Big Data Discovery starts to look less useful, especially with Visual Analyzer already available within BICS and coming soon on-premise with OBIEE12c. My guess is that what we’re seeing now with these initial releases of BDP and BDD is just the start, with BDP adding more automatic enrichment “smarts” and starting to cover integration use-cases too, whilst BDD will put more focus on data visualization and analytics on the data reservoir.

Categories: BI & Warehousing

Using the BI Server Metadata Web Service for Automated RPD Modifications

Rittman Mead Consulting - Fri, 2015-09-25 06:33

A little-known new feature of OBIEE 11g is a web service interface to the BI Server. Called the “BI Server Metadata Web Service” it gives a route into making calls into the BI Server using SOAP-based web services calls. Why is this useful? Because it means you can make any call to the BI Server (such as SAPurgeAllCache) from any machine without needing to install any OBIEE-related artefacts such as nqcmd, JDBC drivers, etc. The IT world has been evolving over the past decade or more towards a more service-based architecture (remember the fuss about SOA?) where we can piece together the functionality we need rather than having one monolithic black box trying to do everything. Being able to make use of this approach in our BI deployments is a really good thing. We can do simple things like BI Server cache management using a web service call, but we can also do more funky things, such as actually updating repository variable values in real time – and we can do it from within our ETL jobs, as part of an automated deployment script, and so on.

Calling the BI Server Metadata Web Service

First off, let’s get the web service configured and working. The documentation for the BI Server Metadata Web Service can be found here, and it’s important to read it if you’re planning on using this. What I describe here is the basic way to get it up and running.

Configuring Security

We need to configure security on the web service to define what kind of authentication is required in order to use it. If you don’t do this, you won’t be able to make calls to it. Setting up the security is a case of attaching a security policy in WebLogic Server to the web service (called AdminService) itself. I’ve used oracle/wss_http_token_service_policy, which means that the credentials can be passed through using standard HTTP Basic authentication.

You can do this through Enterprise Manager:

Or you can do it through WLST using the attachWebServicePolicy call.

You also need to configure WSM, adding some entries to the credential store as detailed here.

Testing the Web Service

The great thing about web services is that they can be used from pretty much anywhere. Many software languages will have libraries for making SOAP calls, and if they don’t, it’s just HTTP under the covers so you can brew your own. For my testing I’m using the free version of SoapUI. In anger, I’d use something like curl for a standalone script or hook it up to ODI for integration in the batch.

Let’s fire up SoapUI and create a new project. Web services provide a Web Service Definition Language (WSDL) that describes what and how they work, which is pretty handy. We can pass this WSDL to SoapUI for it to automatically build some sample requests for us. The WSDL for this web service is:

http://biserver:port/AdminService/AdminService?wsdl
Where biserver is your biserver (duh) and port is the managed server port (usually 9704, or 7780).

We’ve now got a short but sweet list of the methods we can invoke:

Expand out callProcedureWithResults and double click on Request 1. This is a template SOAP message that SoapUI has created for you, based on the WSDL.

Edit the XML to remove the parameters section, and just to test things out in procedureName put GetOBISVersion(). Your XML request should look like this:

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  

If you try and run this now (green arrow or Cmd-Enter on the Mac) you’ll see the exact same XML appear on the right pane, which is strange… but click on Raw and you’ll see what the problem is:

Remember the security stuff we set up on the server previously? Well as the client we now need to keep our side of the bargain, and pass across our authentication with the SOAP call. Under the covers this is a case of sending HTTP Basic auth (since we’re using oracle/wss_http_token_service_policy), and how you do this depends on how you’re making the call to the web service. In SoapUI you click on the Auth button at the bottom of the screen, Add New Authentication, Type: Basic, and then put your OBIEE server username/password in.

Now you can click Submit on the request again, and you should see details of your BI Server version in the response pane (on the right) in the XML view (you may have to scroll to the right to see it).
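
The same call can be scripted rather than made through SoapUI; most SOAP libraries will read the WSDL and handle the HTTP Basic authentication for you. Here is a minimal sketch using the Python suds library (which is also what I used to prototype the trickier calls later in this post) – the host, port, WSDL path and credentials are assumptions to adjust for your own environment:

from suds.client import Client
from suds.transport.https import HttpAuthenticated

# Assumed endpoint and credentials - substitute your own.
wsdl = 'http://biserver:9704/AdminService/AdminService?wsdl'
transport = HttpAuthenticated(username='weblogic', password='Admin123')
client = Client(wsdl, transport=transport)

# Same call as the SoapUI request above: just the procedure name, no parameters.
print(client.service.callProcedureWithResults(procedureName='GetOBISVersion()'))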

This simple test is just about validating the end-to-end calling of a BI Server procedure from a web service. Now we can get funky with it….

Updating Repository Variables Programmatically

Repository variables in OBIEE are “global”, in that every user session sees the same value. They’re often used for holding things like “what was the last close of business date”, or “when was the data last updated in the data warehouse”. Traditionally these have been defined as dynamic repository variables with an initialisation block that polled a database query on a predefined schedule to get the current value. This meant that to have a variable showing when your data was last loaded you’d need to (a) get your ETL to update a row in a database table with a timestamp and then (b) write an init block to poll that table to get the value. That polling of the table would have to be as frequent as you needed in order to show the correct value, so maybe every minute. It’s kinda messy, but it’s all we had. Here I’d like to show an alternative approach.

Let’s say we have a repository variable, LAST_DW_REFRESH. It’s a timestamp that we use in report predicates and titles so that when they are run we can show the correct data based on when the data was last loaded into the warehouse. Here’s a rather over-simplified example:

The Title view uses the code:

Data correct as of: @{biServer.variables['LAST_DW_REFRESH']}

Note that we’re referencing the variable here. We could also put it in the Filter clause of the analysis. Somewhat tenuously, let’s imagine we have a near-realtime system that we’re federating a DW across direct from OLTP, and we just want to show data from the last load into the DW:

For the purpose of this example, the report is less important than the diagnostics it gives us when run, in nqquery.log. First we see the inbound logical request:

-------------------- SQL Request, logical request hash:  
SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T05 Per Name Year" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  
WHERE ("Time"."T00 Calendar Date" <  VALUEOF("LAST_DW_REFRESH"))  

Note the VALUEOF clause. When this is parsed out we get to see the actual value of the repository variable that OBIEE is going to execute the query with:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-24 23:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

We can see the value through the Administration Tool Manage Sessions page too, but it’s less convenient for tracking in testing:

If we update the RPD online with the Administration Tool (nothing fancy at this stage) to change the value of this static repository variable:


And then rerun the report, we can see the value has changed:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-25 00:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

Now let’s do this programmatically. First off, the easy stuff, which has been written about plenty before. Using biserverxmlgen we can create an XUDML version of the repository. Searching through this we can pull out the variable definition:

<Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
<Expr><![CDATA[TIMESTAMP '2015-09-25 00:30:00']]></Expr>  
</Variable>

and then wrap it in the correct XML structure and update the timestamp we want to use:

<?xml version="1.0" encoding="UTF-8" ?>  
<Repository xmlns:xsi="">  
    <DECLARE>  
        <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
        <Expr><![CDATA[TIMESTAMP '2015-09-26 00:30:00']]></Expr>  
        </Variable>  
    </DECLARE>  
</Repository>

(you can also get this automagically by modifying the variable in the RPD, saving the RPD file, and then comparing it to the previous copy to generate a patch file with the Administration Tool or comparerpd)

Save this XML snippet, with the updated timestamp value, as last_dw_refresh.xml. Now to update the value on the BI Server in-flight, first using the OBIEE tool biserverxmlcli. This is available on all OBIEE servers and client installations – we’ll get to web services for making this update call remotely in a moment.

biserverxmlcli -D AnalyticsWeb -R Admin123 -U weblogic -P Admin123 -I last_dw_refresh.xml

Here -D is the BI Server DSN, -U / -P are the username/password credentials for the server, and the -R is the RPD password.

Running the analysis again shows that it is now working with the new value of the variable:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-26 00:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

Getting Funky – Updating RPD from Web Service

Let’s now bring these two things together – RPD updates (in this case to update a variable value, but could be anything), and BI Server calls.

In the above web service example I called the very simple GetOBISVersion. Now we’re going to use the slightly more complex NQSModifyMetadata. This is actually documented, and what we’re going to do is pass across the same XUDML that we sent to biserverxmlcli above, but through the web service. As a side note, you could also do this over JDBC if you wanted (under the covers, AdminService is just a web app with a JDBC connector to the BI Server).

I’m going to do this in SoapUI here for clarity but I actually used SUDS to prototype it and figure out the exact usage.

So as a quick recap, this is what we’re going to do:

  1. Update the value of a repository variable, using XUDML. We generated this XUDML from biserverxmlgen and wrapped it in the correct XML structure.
    We could also have got it through comparerpd or the Administration Tool ‘create patch’ function
  2. Use the BI Server’s NQSModifyMetadata call to push the XUDML to the BI Server online.
    We saw that biserverxmlcli can also be used as an alternative to this for making online updates through XUDML.
  3. Use the BI Server Metadata Web Service (AdminService) to invoke the NQSModifyMetadata call on the BI Server, using the callProcedureWithResults method

In SoapUI create a new request:

Set up the authentication:

Edit the XML SOAP message to remove parameters and specify the basic BI Server call:

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  

Now the tricky bit – we need to cram the XUDML into our XML SOAP message. But, XUDML is its own kind of XML, and all sorts of grim things happen here if we’re not careful (because the XUDML gets swallowed up into the SOAP message, it all being XML). The solution I came up with (which may not be optimal…) is to encode all of the HTML entities, which a tool like this does (and if you’re using a client library like SUDS will happen automagically). So our XUDML, with another new timestamp for testing:

<?xml version="1.0" encoding="UTF-8" ?>  
<Repository xmlns:xsi="">  
    <DECLARE>  
        <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
        <Expr><![CDATA[TIMESTAMP '2015-09-21 00:30:00']]></Expr>  
        </Variable>  
    </DECLARE>  
</Repository>


&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP '2015-09-21 00:30:00']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;

We’re not quite finished yet. Because this is actually a call (NQSModifyMetadata) nested in a call (callProcedureWithResults) we need to make sure NQSModifyMetadata gets the arguments (the XUDML chunk) through intact, so we wrap it in single quotes – which also need encoding (&apos;):

&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP '2015-09-21 23:30:00']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;

and then for final good measure, the single quotes around the timestamp need double-single quoting:

&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP ''2015-09-21 23:30:00'']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;

Nice, huh? The WSDL suggests that parameters for these calls should be able to be placed within the XML message as additional entities, which I wonder if would allow for proper encoding, but I couldn’t get it to work (I kept getting java.sql.SQLException: Parameter 1 is not bound).

So, stick this mess of encoding plus twiddles into your SOAP message and it should look like this: (watch out for line breaks; these can break things)

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  
         <procedureName>NQSModifyMetadata(&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP ''2015-09-21 23:30:00'']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;)</procedureName>  

Hit run, and with a bit of luck you’ll get a “nothing to report” response:

If you look in nqquery.log you’ll see:

[...] NQSModifyMetadata started.  
[...] NQSModifyMetadata finished successfully.

and all-importantly when you run your report, the updated variable will be used:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-21 23:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

If this doesn’t work … well, best of luck. Use nqquery.log, bi_server1.log (where the AdminService writes some diagnostics) to try and trace the issue. Also test calling NQSModifyMetadata from JDBC (or ODBC) directly, and then add in the additional layer of the web service call.
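
Alternatively, you can sidestep most of the hand-encoding by scripting the call: a SOAP library such as SUDS (which, as mentioned above, is what I used to prototype this) will do the XML escaping of the XUDML for you, leaving only the quote-doubling to handle yourself. A minimal sketch, with the same assumed host and credentials as before and the last_dw_refresh.xml file created earlier:

from suds.client import Client
from suds.transport.https import HttpAuthenticated

with open('last_dw_refresh.xml') as f:
    xudml = f.read()

# Double up the single quotes (around the timestamp) so the XUDML survives as a
# quoted argument to NQSModifyMetadata; suds takes care of the XML entity encoding.
procedure = "NQSModifyMetadata('{0}')".format(xudml.replace("'", "''"))

# Assumed endpoint and credentials - substitute your own.
client = Client('http://biserver:9704/AdminService/AdminService?wsdl',
                transport=HttpAuthenticated(username='weblogic', password='Admin123'))
client.service.callProcedureWithResults(procedureName=procedure)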

So, in the words of Mr Rittman, there you have it. Programmatically updating the value of repository variables, or anything else, done online, and for bonus points done through a web service call, making it possible to use without any local OBIEE client/server tools.

Categories: BI & Warehousing

Managing the OBIEE BI Server Cache from ODI 12c

Rittman Mead Consulting - Thu, 2015-09-24 13:47

I wrote recently about the OBIEE BI Server Cache and how useful it can be, but how important it is to manage it properly, both in the purging of stale data and seeding of new. In this article I want to show how to walk-the-walk and not just talk-the-talk (WAT? But you’re a consultant?!). ODI is the premier data integration tool on the market and one that we are great fans of here at Rittman Mead. We see a great many analytics implementations built with ODI for the data load (ELT, strictly speaking, rather than ETL) and then OBIEE for the analytics on top. Managing the BI Server cache from within your ODI batch makes a huge amount of sense. By purging and reseeding the cache directly after the data has been loaded into the database we can achieve optimal cache usage with no risk of stale data.

There are two options for cleanly hooking into OBIEE from ODI 12c with minimal fuss: JDBC, and Web Services. JDBC requires the OBIEE JDBC driver to be present on the ODI Agent machine, whilst Web Services have zero requirement on the ODI side, but a bit of config on the OBIEE side.

Setting up the BI Server JDBC Driver and Topology

Here I’m going to demonstrate using JDBC to connect to OBIEE from ODI. It’s a principle that was originally written up by Julien Testut here. We take the OBIEE JDBC driver bijdbc.jar from $FMW_HOME/Oracle_BI1/bifoundation/jdbc and copy it to our ODI machine. I’m just using a local agent for my testing, so put it in ~/.odi/oracledi/userlib/. For a standalone agent it should go in $AGENT_HOME/odi/agent/lib.

[oracle@ip-10-103-196-207 ~]$ cd /home/oracle/.odi/oracledi/userlib/  
[oracle@ip-10-103-196-207 userlib]$ ls -l  
total 200  
-rw-r----- 1 oracle oinstall    332 Feb 17  2014 additional_path.txt  
-rwxr-xr-x 1 oracle oinstall 199941 Sep 22 14:50 bijdbc.jar

Now fire up ODI Studio, sign in to your repository, and head to the Topology pane. Under Physical Architecture -> Technologies you’ll see Oracle BI.

Right click and select New Data Server. Give it a sensible name and put your standard OBIEE credentials (eg. weblogic) under the Connection section. Click the JDBC tab and click the search icon to the right of the JDBC Driver text box. Select the default driver, and then in the JDBC Url box put your server and port (9703, unless you’ve changed the listen port of the OBIEE BI Server).

Now click Test Connection (save the data server when prompted, and click OK at the message about creating a physical schema), and select the Local Agent with which to run it. If you get an error then click Details to find out the problem.

One common problem can be the connection through to the OBIEE server port, so to cut ODI out of the equation try this from the command prompt on your ODI machine (assuming it’s *nix):

nc -vz <obiee-host> 9703

If the host resolves correctly and the port is open then you should get:

Connection to 9703 port [tcp/*] succeeded!

If not you’ll get something like:

nc: port 9703 (tcp) failed: Connection refused

Check the usual suspects – firewall (eg iptables) on the OBIEE server, firewalls on the network between the ODI and OBIEE servers, etc.

Assuming you’ve got a working connection you now need to create a Physical Schema. Right click on the new data server and select New Physical Schema.

OBIEE’s BI Server acts as a “database” to clients, within which there are “schemas” (Subject Areas) and “tables” (Presentation Tables). On the New Physical Schema dialog you just need to set Catalog (Catalog), and when you click the drop-down you should see a list of the Subject Areas within your RPD. Pick one – it doesn’t matter which.

Save the physical schema (ignore the context message). At this point your Physical Architecture for Oracle BI should look like this:

Now under Logical Architecture locate the Oracle BI technology, right click on it and select New Logical Schema. From the Physical Schemas dropdown select the one that you’ve just created. Give a name to the Logical Schema.

Your Logical Architecture for Oracle BI should look like this:

Building the Cache Management Routine

Full Cache Purge

Over in the Designer tab go to your ODI project into which you want to integrate the OBIEE cache management functions. Right click on Procedures and select Create New Procedure. Give it a name such as OBIEE Cache – Purge All and set the Target Technology to Oracle BI

Switch to the Tasks tab and add a new Task. Give it a name, and set the Schema to the logical schema that you defined above. Under Target Command enter the call you want to make to the BI Server, which in this case is

call SAPurgeAllCache();

Save the procedure and then from the toolbar menu click on Run. Over in the Operator tab you should see the session appear and soon after complete – all being well – successfully.

You can go and check your BI Server Cache from the OBIEE Administration Tool to confirm that it is now empty:

And confirm it through Usage Tracking:

From what I can see at the default log levels, nothing gets written to either nqquery.log or nqserver.log for this action unless there is an error in your syntax in which case it is logged in nqserver.log:

(For more information on that particular error see here)

Partial Cache Purge

This is the same pattern as above – create an ODI Procedure to call the relevant OBIEE command, which for purging by table is SAPurgeCacheByTable. We’re going to get a step more fancy now, and add a variable that we can pass in so that the Procedure is reusable multiple times over throughout the ODI execution for different tables.

First off create a new ODI Variable that will hold the name of the table to purge. If you’re working with multiple RPD Physical Database/Catalog/Schema objects you’ll want variables for those too:

Now create a Procedure as before, with the same settings as above but a different Target Command, based on SAPurgeCacheByTable and passing in the four parameters as single quoted, comma separated values. Note that these are the Database/Catalog/Schema/Table as defined in the RPD. So “Database” is not your TNS or anything like that, it’s whatever it’s called in the RPD Physical layer. Same for the other three identifiers. If there’s no Catalog (and often there isn’t) just leave it blank.

When including ODI Variable(s) make sure you still single-quote them. The command should look something like this:

Now let’s seed the OBIEE cache with a couple of queries, one of which uses the physical table and one of which doesn’t. When we run our ODI Procedure we should see one cache entry go and the other remain. Here’s the seeded cache:

And now after executing the procedure:

And confirmation through Usage Tracking of the command run:

Cache Seeding

As before, we use an ODI Procedure to call the relevant OBIEE command. To seed the cache we can use SASeedQuery which strictly speaking isn’t documented but a quick perusal of the nqquery.log when you run a cache-seeding OBIEE Agent shows that it is what is called in the background, so we’re going to use it here (and it’s mentioned in support documents on My Oracle Support, so it’s not a state secret). The documentation here gives some useful advice on what you should be seeding the cache with — not necessarily only exact copies of the dashboard queries that you want to get a cache hit for.

Since this is a cookie-cutter of what we just did previously you can use the Duplicate Selection option in ODI Designer to clone one of the other OBIEE Cache procedures that you’ve already created. Amend the Target Command to:

When you run this you should see a positive confirmation in the nqserver.log of the cache seed:

[2015-09-23T23:23:10.000+01:00] [OracleBIServerComponent] [TRACE:3]  
[USER-42] [] [ecid: 005874imI9nFw000jzwkno0007q700008K,0] [tid: 9057d700]  
[requestid: 477a0002] [sessionid: 477a0000] [username: weblogic]  
Query Result Cache: [59124]  
The query for user 'weblogic' was inserted into the query result cache.  
The filename is '/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache/NQS__735866_84190_2.TBL'. [[

A very valid alternative to calling SASeedQuery would be to call the OBIEE SOA Web Service to trigger an OBIEE Agent that populated the cache (by setting ‘Destination’ to ‘Oracle BI Server Cache (For seeding cache)’). OBIEE Agents can also be ‘daisy chained’ so that one Agent calls another on completion, meaning that ODI could kick off a single ‘master’ OBIEE Agent which then triggered multiple ‘secondary’ OBIEE Agents. The advantage of this approach over SASeedQuery is that cache seeding is more likely to change as OBIEE usage patterns do, and it is easier for OBIEE developers to maintain all the cache seeding code within ‘their’ area (OBIEE Presentation Catalog) than put in a change request to the ODI developers each time to change a procedure.

Integrating it in the ODI batch

You’ve two options here, using Packages or Load Plans. Load Plans were introduced in ODI and are a clearer and more flexible way of orchestrating the batch.

To use it in a load plan create a serial step that will call a mapping followed by the procedure to purge the affected table. In the procedure step in the load plan set the value for the variable. At the end of the load plan, call the OBIEE cache seed step:

Alternatively, to integrate the above procedures into a Package instead of a load plan you need to add two steps per mapping. First, the variable is updated to hold the name of the table just loaded, and then the OBIEE cache is purged for the affected table. At the end of the flow a call is made to reseed the cache:

These are some very simple examples, but hopefully illustrate the concept and the powerful nature of integrating OBIEE calls directly from ODI. For more information about OBIEE Cache Management, see my post here.

Categories: BI & Warehousing

OBIEE BI Server Cache Management Strategies

Rittman Mead Consulting - Wed, 2015-09-23 11:36

The OBIEE BI Server cache can be one of the most effective ways of improving response times of OBIEE dashboards. By using data already in the cache it reduces load on the database, the network, and the BI Server.

Should you be using it? I always describe it as the “icing on the cake” – it’s not a fix for a badly-designed OBIEE system, but it does make a lot of sense to use once you’re happy that the foundations for the system are in place. If the foundations are not in place? Then you’re just papering over the cracks and at some point it’s probably going to come back to bite you. As Mark Rittman put it nearly seven years ago, it’s “[…]usually the last desperate throw of the dice”. The phrase “technical debt”? Yeh, that. But BI Server caching used after performance review and optimisation, rather than instead of it – then it’s a Good Thing.

So you’ve decided to use the BI Server cache, and merrily trotted over to Enterprise Manager to enable it, restarted the BI Server, and now your work is done, right? Not quite. Because the BI Server cache will start to store data from all the queries that you run, and use it to satisfy subsequent queries. Not only will it match on a direct hit for the same query, it will use a subset of an existing cache entry where appropriate, and can even aggregate up from what’s in the cache to satisfy a query at a higher level. Clever stuff. But, what happens when you load new data into your data warehouse? Well, the BI Server continues to serve requests out of the cache, because why shouldn’t it? And herein lies the problem with “just turn caching on”. You have to have a cache management strategy.

A cache management strategy sounds grand doesn’t it? But it boils down to two things:

  1. Accuracy – Flush any data from the cache that is now stale
  2. Speed – Prime the cache so that as many queries get a hit on it, first time
Maintaining an Accurate Cache

Every query that is run through the BI Server, whether from a Dashboard, Answers, or more funky routes such as custom ODBC clients or JDBC, will end up in cache. It’s possible to “seed” (“prime”/“warmup”) the cache explicitly, and this is discussed later. The only time you won’t see data in the cache is if (a) you have BI Server caching disabled, or (b) you’ve disabled the Cacheable option for a physical table that is involved in providing the data for the query being run.

You can see metadata for the current contents of the cache in the Administration Tool when connected online to the BI Server, through the Manage -> Cache menu option. This gives you lots of useful information (particularly when you come to optimising cache usage) including the size of each entry, when it was created, when it was last used, and so on.

Purging Options

So we’ve a spread of queries run that hit various dimension and fact tables and created lots of cache entries. Now we’ve loaded data into our underlying database, so we need to make sure that the next time a user runs an OBIEE query that uses the new data they can see it. Otherwise we commit the cardinal sin of any analytical system and show the user incorrect data which is a Bad Thing. It may be fast, but it’s WRONG….

We can purge the whole cache, but that’s a pretty brutal approach. The cache is persisted to disk and can hold lots of data stretching back months – to blitz all of that just because one table has some new data is overkill. A more targeted approach is to purge by physical database, physical table, or even logical query. When would you use these?

  • Purge entire cache – the nuclear option, but also the simplest. If your data model is small and a large proportion of the underlying physical tables may have changed data, then go for this
  • Purge by Physical Database – less brutal than clearing the whole cache; if you have various data sources that are loaded at different points in the batch schedule then targeting a particular physical database makes sense.
  • Purge by Physical Table – if many tables within your database have remained unchanged, whilst a large proportion of particular tables have changed (or it’s a small table) then this is a sensible option to run for each affected table
  • Purge by Query – If you add a few thousand rows to a billion row fact table, purging all references to that table from the cache would be a waste. Imagine you have a table with sales by day. You load new sales figures daily, so purging the cache by query for recent data is obviously necessary, but data from previous weeks and months may well remain untouched so it makes sense to leave queries against those in the cache. The specifics of this choice are down to you and your ETL process and business rules inherent in the data (maybe there shouldn’t be old data loaded, but what happens if there is? See above re. serving wrong data to users). This option is the most complex to maintain because you risk leaving behind in the cache data that may be stale but doesn’t match the precise set of queries that you purge against.

Which one is correct depends on

  1. your data load and how many tables you’ve changed
  2. your level of reliance on the cache (can you afford low cache hit ratio until it warms up again?)
  3. time to reseed new content

If you are heavily dependent on the cache and have large amounts of data in it, you are probably going to need to invest time in a precise and potentially complex cache purge strategy. Conversely, if you use caching as the ‘icing on the cake’ and/or it’s quick to seed new content then the simplest option is to purge the entire cache. Simple is good; OBIEE has enough moving parts without adding to its complexity unnecessarily.

Note that OBIEE itself will perform cache purges in some situations including if a dynamic repository variable used by a Business Model (e.g. in a Logical Column) gets a new value through a scheduled initialisation block.

Performing the Purge

There are several ways in which we can purge the cache. First I’ll discuss the ones that I would not recommend except for manual testing:

  1. Administration Tool -> Manage -> Cache -> Purge. Doing this every time your ETL runs is not a sensible idea unless you enjoy watching paint dry (or need to manually purge it as part of a deployment of a new RPD etc).
  2. In the Physical table, setting Cache persistence time. Why not? Because this time period starts from when the data was loaded into the cache, not when the data was loaded into your database.
    An easy mistake to make would be to think that with a daily ETL run, setting the Cache persistence time to 1 day might be a good idea. It’s not, because if your ETL runs at 06:00 and someone runs a report at 05:00, there is going to be a stale cache entry present for another 23 hours. Even if you use cache seeding, you’re still relinquishing control of the data accuracy in your cache. What happens if the ETL batch overruns or underruns?
    The only scenario in which I would use this option is if I was querying directly against a transactional system and wanted to minimise the number of hits OBIEE made against it – the trade-off being users would deliberately be seeing stale data (but sometimes this is an acceptable compromise, so long as it’s made clear in the presentation of the data).

So the two viable options for cache purging are:

  1. BI Server Cache Purge Procedures
  2. Event Polling Table
BI Server Cache Purge Procedures

These are often called “ODBC” Procedures, but technically ODBC is just one – of several – ways that the commands can be sent to the BI Server for invocation.

As well as supporting queries for data from clients (such as Presentation Services) sent as Logical SQL, the BI Server also has its own set of procedures. Many of these are internal and mostly undocumented (Christian Berg does a great job of explaining them here, and they do creep into the documentation here and here), but there are some cache management ones that are fully supported and documented. They are:

  • SAPurgeCacheByQuery
  • SAPurgeCacheByTable
  • SAPurgeCacheByDatabase
  • SAPurgeAllCache
  • SAPurgeCacheBySubjectArea (>=
  • SAPurgeCacheEntryByIDVector (>=

The names of these match up to the purge processes that I describe above. The syntax is in the documentation, but what I am interested in here is how you can invoke them. They are my preferred method for managing the BI Server cache because they enable you to tightly couple your data load (ETL) to your cache purge. Setting the cache to purge based on a drop-dead timer (whether crontab, tivoli, Agent/iBot, whatever) gives you a huge margin of error if your ETL runtime does not remain consistent. Whether it organically increases in runtime as data volumes increase, or it fails and has to be fixed and restarted, ETL does not always finish bang-on when it is ‘supposed’ to.

You can call these procedures in the several ways, including:

  1. nqcmd – one of the most common ways, repeated on many a blog, but requires nqcmd/OBIEE to be installed on the machine running it. nqcmd is a command-line ODBC client for connecting to the BI Server
  2. ODBC – requires BI to be installed on the machine running it in order to make the OBIEE ODBC driver available
  3. JDBC – just requires the OBIEE JDBC driver, which is a single .jar file and thus portable
  4. Web Service – the OBIEE BI Server Web Service can be used to invoke these procedures from any machine with no dependencies other than some WSM configuration on the OBIEE server side.

My preference is for JDBC or Web Service, because they can be called from anywhere. In larger organisations the team building the ETL may have very little to do with OBIEE, and so asking them to install OBIEE components on their server in order to trigger cache purging can be quite an ask. Using JDBC only a single .jar needs copying onto the server, and using the web service not even that:

curl --silent --header "Content-Type: text/xml;charset=UTF-8" 
--user weblogic:Admin123 
--data @purge_cache_soap.xml

[59118] Operation SAPurgeAllCache succeeded!
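
And if the machine running the call does have an OBIEE install to hand (option 2 above), the same procedure can be invoked over ODBC from a script. A minimal sketch in Python with pyodbc, assuming a BI Server DSN called AnalyticsWeb is configured locally; the credentials are illustrative:

import pyodbc

# Assumes the OBIEE ODBC driver is installed and a BI Server DSN called
# AnalyticsWeb is configured; username/password are illustrative.
conn = pyodbc.connect('DSN=AnalyticsWeb;UID=weblogic;PWD=Admin123', autocommit=True)
cursor = conn.cursor()
cursor.execute('call SAPurgeAllCache()')
print(cursor.fetchall())    # e.g. the "SAPurgeAllCache succeeded" confirmation row
conn.close()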

For details of configuring ODI to use the BI Server JDBC driver in order to tightly couple the cache management into an existing ODI load job, stay tuned for a future blog!

Event Polling Tables (EPT)

NB Not Event “Pooling” Tables as I’ve often seen this called

The second viable approach to automated cache purging is EPT, which is a decoupled approach to managing the cache purge, with two components:

  1. An application (your ETL) inserts a row into the table S_NQ_EPT (which is created at installation time by the RCU in the BIPLATFORM schema) with the name of the physical table in which data has been changed
  2. The BI Server polls (hence the name) the S_NQ_EPT table periodically, and if it finds entries in it, purges the cache of data that is from those tables.

So EPT is in a sense the equivalent of using SAPurgeCacheByTable, but in a manner that is not tightly coupled. It relies on configuring the BI Server for EPT, and there is no easy way to know from your ETL if the cache purge has actually happened. It also means that the cache remains stale potentially as long as the polling interval that you’ve configured. Depending on when you’re running your ETL and the usage patterns of your users this may not be an issue, but if you are running ETL whilst users are on the system (for example intra-day micro ETL batches) you could end up with users seeing stale data. Oracle themselves recommend not setting the polling interval any lower than 10 minutes.
EPT has the benefit of being very easy to implement on the ETL side, because it is simply a database table into which the ETL developers need to insert a row for each table that they update during the ETL.
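
For example, a post-load step in the ETL might run something like the following – a sketch only: FACT_SALES, the 'orcl' database and the BISAMPLE schema are hypothetical placeholders, the database/schema/table values must match the names used in the RPD Physical layer, and you should check the exact column definitions of S_NQ_EPT in your own BIPLATFORM schema before relying on them:

INSERT INTO s_nq_ept
   (update_type, update_ts, database_name, catalog_name, schema_name, table_name, other_reserved)
VALUES
   (1, SYSDATE, 'orcl', NULL, 'BISAMPLE', 'FACT_SALES', NULL);
COMMIT;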

Seeding the Cache

Bob runs an OBIEE dashboard, and the results are added to the cache so that when Bill runs the same dashboard Bill gets a great response rate because his dashboard runs straight from cache. Kinda sucks for Bob though, because his query ran slow as it wasn’t in the cache yet. What’d be nice would be that for the first user on a dashboard the results were already in cache. This is known as seeding the cache, or ‘priming’ it. Because the BI Server cache is not dumb and will hit the cache for queries that aren’t necessarily direct replicas of what previously ran, working out the optimal way to seed the cache can take some trial and error and careful research. The documentation does a good job of explaining what will and won’t qualify for a cache hit, and it’s worth reading this first.

There are several options for seeding the cache. These all assume you’ve figured out the queries that you want to run in order to load the results into cache.

  1. Run the analysis manually, which will return the analysis data to you and insert it into the BI Server Cache too.
  2. Create an Agent to run the analysis with destination set to Oracle BI Server Cache (For seeding cache), and then either:
    1. Schedule the analysis to run from an Agent on a schedule
    2. Trigger it from a Web Service in order to couple it to your ETL data load / cache purge batch steps.
  3. Use the BI Server Procedure SASeedQuery (which is what the Agent does in the background) to load the given query into cache without returning the data to the client, as sketched below. This is useful for doing over JDBC/ODBC/Web Service (as discussed for purging above). You could just run the Logical SQL itself, but you probably don’t want to pull the actual data back to the client, hence using the procedure call instead.
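
A minimal sketch of that procedure call, using the SampleApp Logical SQL that appears later in this post – check the SASeedQuery syntax in the documentation for your release before relying on it:

Call SASeedQuery('SELECT 0 s_0, "A - Sample Sales"."Time"."T02 Per Name Month" s_1, "A - Sample Sales"."Base Facts"."1- Revenue" s_2 FROM "A - Sample Sales"');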
Sidenote – Checking the RPD for Cacheable Tables

The RPD Query Tool is great for finding objects matching certain criteria. However, it seems to invert results when looking for Cacheable Physical tables – if you add a filter of Cacheable = false you get physical tables where Cacheable is enabled! And the same in reverse (Cacheable = true -> shows Physical tables where Cacheable is disabled)

Day in the Life of an OBIEE Cache Entry (Who Said BI Was Boring?)

In this example here I’m running a very simple report from SampleApp v406:

The Logical SQL for this is:

SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Why’s that useful to know? Because when working with the cache you frequently need to resubmit queries, and doing so directly from an interface like nqcmd is much faster (for me) than a web GUI. Horses for courses…

So I’ve run the query and now we have a cache entry for it. How do we know? Because we see it in the nqquery.log (and if you don’t have it enabled, go and enable it now):

[2015-09-23T15:58:18.000+01:00] [OracleBIServerComponent] [TRACE:3] [USER-42]  
[] [ecid: 00586hFR07mFw000jzwkno0005Qx00007U,0] [tid: 84a35700]  
[requestid: a9730015] [sessionid: a9730000] [username: weblogic]  
Query Result Cache: [59124] The query for user 'weblogic' was inserted into 
the query result cache.  
The filename is '/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache/NQS__735866_57498_0.TBL'.

We see it in Usage Tracking (again, if you don’t have this enabled, go and enable it now):


We can also see it in the Administration Tool (when connected online to the BI Server):

We can even see it and touch it (figuratively) on disk:

So we have the data in the cache. The same query run again will now use the cache entry, as seen in nqquery.log:

[2015-09-23T16:09:24.000+01:00] [OracleBIServerComponent] [TRACE:3] [USER-21]
[] [ecid: 11d1def534ea1be0:6066a19d:14f636f1dea:-8000-000000000000b948,0:1:1:5]  
[tid: 87455700] 
[requestid: a9730017] [sessionid: a9730000] [username: weblogic]  
Cache Hit on query: [[  
Matching Query: SET VARIABLE QUERY_SRC_CD='Report',SAW_SRC_PATH='/users/weblogic/Cache Test 01',PREFERRED_CURRENCY='USD';SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Created by:     weblogic

and in Usage Tracking:

“Interestingly” Usage Tracking shows a count of 1 for number of DB queries run, which we would not expect for a cache hit. The nqquery.log shows the same, but no query logged as being sent to the database, so I’m minded to dismiss this as an instrumentation bug.

Now what about if we want to run a query but not use the BI Server Cache? This is an easy one, and plenty has been blogged about it elsewhere – use the Request Variable DISABLE_CACHE_HIT=1, which overrides the built-in system session variable of the same name. Here I’m running it directly against the BI Server, prefixed onto my Logical SQL – if you want to run it from within OBIEE you need the Advanced tab in the Answers editor.

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',
DISABLE_CACHE_HIT=1;SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Now we get a cache ‘miss’, because we’ve specifically told the BI Server to not use the cache. As you’d expect, Usage Tracking shows no cache hit, but it does show a cache insert – because why shouldn’t it?

If you want to run a query without seeding the cache either, you can use DISABLE_CACHE_SEED=1:

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',
DISABLE_CACHE_SEED=1;SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

These request variables can be set per analysis, or per user by creating a session initialisation block to assign the required values to the respective variables.

Cache Location

The BI Server cache is held on disk, so it goes without saying that storing it on fast (eg SSD) disk is a Good Idea. There’s no harm in giving it its own filesystem on *nix to isolate it from other work (in terms of filesystems filling up) and to make monitoring it super easy.

Use the DATA_STORAGE_PATHS configuration element in NQSConfig.ini to change the location of the BI Server cache.
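
For example, the relevant entry in the [CACHE] section looks something like this – the path is the default location seen earlier in this post, and the 500 MB quota is just an illustrative value:

[CACHE]
# location(s) and size quota for the BI Server query cache
DATA_STORAGE_PATHS = "/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache" 500 MB;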

In Summary

  1. Use BI Server Caching as the ‘icing on the cake’ for performance in your OBIEE system. Make sure you have your house in order first – don’t use it to try to get around bad design.
  2. Use the SAPurgeCache procedures to directly invoke a purge, or the Event Polling Tables for a more loosely-coupled approach. Decide carefully which purge approach is best for your particular caching strategy.
  3. If using the SAPurgeCache procedures, use JDBC or Web Services to call them so that there is minimal/no installation required to call them from your ETL server.
  4. Invest time in working out an optimal cache seeding strategy, making use of Usage Tracking to track cache hit ratios.
  5. Integrate both purge and seeding into your ETL. Don’t use a schedule-based approach because it will come back to haunt you with its inflexibility and scope for error.
Categories: BI & Warehousing

Lessons Learned – Data Warehouse on Cloud – Amazon Redshift

Dylan's BI Notes - Tue, 2015-09-22 11:37
I feel that these are the rules applicable for any cloud based data warehouse solution.  In general, I feel that the on-premise data warehouse deployment probably will remain for a while. 1. For a columnar database, “select *” is bad I think that the projection needs to be done as early as possible and should […]
Categories: BI & Warehousing

Tracing Data Lineage and Impact Analysis in Oracle Enterprise Metadata Management 12c

Rittman Mead Consulting - Fri, 2015-09-18 22:11

At the beginning of the year, I wrote a couple of posts about Oracle Enterprise Metadata Management, one about how to install the tool and another one explaining the harvesting (importing) and stitching (relating) processes. A lot of things have happened since then, including a new version of OEMM released in May with some nice features such as HTML 5 support, a workflow for the business glossary and the concept of groups in security.

OEMM is the Oracle solution for data governance and lifecycle management. “Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise”. So the processes of knowing how my data is related, ensuring its integrity, knowing who has access to which part of the data and with which application, where people can access the data from, defining the stewards of the data, and so on are all part of data governance.

To refresh the main concepts, the first thing that you have to do when you start working with OEMM is the harvesting or importing process. We import the metadata, creating one model for each different metadata source. After a successful import, we need to create a configuration to relate all the required models.

One of the main features of OEMM is the ability to perform Data Flow Lineage and Impact Analysis. Why is this so important? Because the typical problem that we face in IT is understanding how all the company data is related, what will happen if I modify one column in a source schema, or how I can find out where a particular piece of data comes from. In this post I will talk about these processes, which are exactly the same as in the previous version.

I created an environment where I installed the latest release of OEMM, Oracle Database 12c containing the source, staging and DW schemas, and ODI 12c, which is the ELT tool used to load and transform the data into our DW schema. In addition I also have OBIEE 11g, whose repository is based on the DW schema, and some OBI analyses to query the data. I imported all the metadata from these different sources and created a configuration that includes all the models.


Data Impact Analysis

The process of Data Impact Analysis is to find out which data items in target databases and/or applications are affected if we change a source object, such as a column or a table.

There are several ways to start the data impact tracing process in OEMM: from the model the object belongs to, or from a diagram.

We can open the model from the repository panel or from the configuration Architecture diagram. Once the model is opened, select the object on which you want to perform the impact analysis, right-click on it, and select the Trace Data Impact or Trace Lineage (Advanced) option. Either option opens the same dialog window.


Then we need to choose between the model and the configuration as the scope for this impact analysis. The most common option is to select a configuration, to see the consequences of a possible change in all the target models present in that configuration. We also have the option to see the result in graphical or text mode.

In our example, we are going to trace the data impact of the ORDERS table which is in the database source model and is the source for the fact table in the DW.

In the following picture, you will find the resulting diagram of the data impact analysis. In order to see the diagrams better in OEMM, you can collapse the Properties window on the right and press the Fit to Content button in the diagram menu.


There are plenty of interesting things that you can do from here. First, if you take a closer look you will see that the lines between the Orders Source and Orders Staging, as well as the ones between Orders Staging and Orders_DW, are thicker than the ones that connect Orders_DW to the OBIEE Model. This means there are ETL processes in the middle of these connections. When we right-click on one of these arrows and select Trace ETL details, a new tab is opened inside the configuration showing the mappings involved and the operations (a JOIN in our example). Also, if we expand the Properties window and select the JOIN element, we can see its condition.


Coming back to the Data Impact diagram, you can select one particular object, such as a column, and press the highlight path button to emphasise only a specific data flow. This is very useful when you have many objects in the diagram and you need to focus on a particular data path.


If you want to keep this diagram saved so you can come back to it easily, you can add a bookmark. You can create folders to organise your bookmarks, and they will appear at the bottom left of the screen.




Tracing Data Lineage

The Trace Data Lineage process finds the path from the source to the target object that you selected.

In OEMM, tracing data lineage is a very similar process to impact analysis. In fact, it uses the same dialog window but now you should select the Trace Data Lineage or Trace Lineage (Advanced) option. Again, either option opens the same dialog window.

In our example, we are going to use the Table View from a particular OBI Analysis. We will first perform the data lineage using the model as the scope, and then run another one in the context of the configuration, in order to make clear the difference between these two data lineages.

Once I have the OBIEE Model opened, I double-click the required analysis and it appears, with all its components (criteria and layout), in the Metadata Browser. I will first run the data lineage based on the model.


Another tab is shown with the data lineage based on the model. You can select one measure, for example, and see its properties, which show details such as the expression or formula (if it is a calculated item) and the default aggregation. Three main objects appear in the diagram (and can be collapsed or expanded as required): one that represents the physical layer in the OBI repository (RPD), another for the Presentation layer, and finally the table view in the OBI Analysis.

As you probably realise by now, there are a lot of tabs and panels that appear as you open different objects and execute tasks. The panels are collapsible and can be resized, so you will need to do this a lot in order to visualise what you want.


We are going to repeat the same process but now selecting the configuration that contains the OBIEE model. In this new diagram you will see where the data in the OBI Analysis is coming from and its relationships with the other models in the configuration, including the ODI model and the different database schemas. You can expand or collapse the objects in the diagram, trace ETL details, highlight the path, and use all the other features that we’ve seen for the data impact diagram, as it is the same diagram.


Another useful new feature of this release is the quick find, which allows you to search for specific words and shows you a list of all the objects in the diagram that match the search criteria. If you double-click on one of these results, the object will appear highlighted in the diagram.


Model Connection Overview

The Model Connection Overview diagram shows the connections between the objects inside a model. As its name suggests, it is just an overview of the connections. You cannot go deeper into the object details, but it is useful for getting a rough idea of the object relationships in a model. In the next pictures you will see an example of the Model Connection diagram of the OBIEE model and the diagram for a particular mapping in the ODI model. You can also save it as a bookmark, like the other diagrams.


And this is how you can perform data lineage and impact analysis in OEMM, analysing the relationships among your company data.


Categories: BI & Warehousing

Data Warehouse on Cloud – Amazon Redshift

Dylan's BI Notes - Thu, 2015-09-17 15:50
Here is a brief summary of what I learned by reading these materials. 1. The data warehouse is stored in clusters It can support scale out, not scale up. “Extend the existing data warehouse rather than adding hardware” 2. Use SQL to access the data warehouse 3. Load data from Amazon S3 (Storage Service) using […]
Categories: BI & Warehousing

OTBI Enterprise

Dylan's BI Notes - Tue, 2015-09-15 09:39
OTBI Enterprise is the BI cloud service, a SaaS deployment of OBIA.  It is using a data warehouse based architecture.  The ETL processes are handled within the cloud.  The data are first loaded from either on premise or cloud sources using various means in the original formats.   The data are first loaded into the […]
Categories: BI & Warehousing

Managing Impala and Other Mixed Workloads on the Oracle Big Data Appliance

Rittman Mead Consulting - Tue, 2015-09-15 04:56

One of our current client projects uses Cloudera Impala to provide fast ad-hoc querying to the data we’re loading into their Oracle Big Data Appliance Hadoop environment. Impala bypasses MapReduce to provide faster queries than Hive, but to do so it does a lot of processing in-memory and runs server processes on each node in the cluster, leading in some cases to runaway queries blocking other workloads, in the same way that OBIEE queries on an Oracle Database can sometimes block ETL and application workloads. Several projects share this same Big Data Appliance, so to try and limit the impact Impala could have on other cluster workloads the client had disabled the Impala Daemons on nine of the twelve nodes in their Big Data Appliance. Our concern with this approach was that an Impala query could access data from any datanode in the Big Data Appliance cluster, so whilst HDFS data is typically stored and replicated to three nodes in the cluster, running the Impala daemons on just a quarter of the available nodes was likely to lead to data locality issues for Impala and blocks getting shipped across the network unnecessarily.

Going back to OBIEE and the Oracle Database, Oracle have a resource management feature for the Oracle database that allows you to put users and queries into separate resource pools and manage the share of overall resources that each pool gets. I covered this concept on the blog a few years ago, and the version of Cloudera Hadoop (CDH5.3) used on the client’s Big Data Appliance has a feature called “YARN”, or Yet Another Resource Negotiator, that splits out the resource management and scheduling parts that were bound into MapReduce in Hadoop 1.0, so that MapReduce then just runs as one workload type on Hadoop, making it possible to run other workload types, for example Apache Spark, on that same cluster management framework.



Impala isn’t however configured to use YARN by default, and uses an internal scheduler to govern how concurrent queries run and use cluster resources, but it can be configured to use YARN in what Cloudera term “Integrated Resource Management”, and our initial response was to recommend this approach. However, YARN is really optimised for longer-running batch jobs and not the shorter jobs that Impala generates (such that Cloudera recommends you don’t actually use YARN, and control Impala resource usage via service-level process constraints or through a new Impala feature called Admission Control instead). Taking a step back though, how do we actually see what resources Impala is using across the cluster when a query runs, and is there a feature similar to the Oracle Database’s SQL Explain Plan to help us understand how an Impala SQL query is executed? Then, using this and the various resource management options available to us, can we understand how YARN and other options will affect the Impala users on the client’s cluster if we enable them? And, given that we were going to test this all out on one of our development Hadoop clusters running back at the office on VMWare, how well could we simulate the multiple concurrent queries and mixed workload we’d then encounter on the real customer Big Data Appliance?

When trying to understand what goes on when a Cloudera Impala SQL query runs, the two main tools in your toolbox are EXPLAIN plans and query profiles. The concept of EXPLAIN plans will be familiar to Oracle developers, and putting “explain” before your Impala SQL query when you’re using the Impala Shell (or pressing the “Explain” button when you’re using the Impala Editor in Hue) will display an output like the one below, showing the steps the optimiser plans to take to return the query results:

[] > explain
select sum( as total_flights, d.dest_city
from flight_delays f join geog_dest d on f.dest = d.dest
join geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and   o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000;
Query: explain select sum( as total_flights, d.dest_city
from flight_delays f join geog_dest d on f.dest = d.dest
join geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and   o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000
| Explain String                                                      |
| Estimated Per-Host Requirements: Memory=154.01MB VCores=2           |
|                                                                     |
| 10:EXCHANGE [UNPARTITIONED]                                         |
| |                                                                   |
| 09:AGGREGATE [FINALIZE]                                             |
| |  output: sum:merge(                                     |
| |  group by: d.dest_city                                            |
| |  having: sum( > 3000                                    |
| |                                                                   |
| 08:EXCHANGE [HASH(d.dest_city)]                                     |
| |                                                                   |
| 05:AGGREGATE                                                        |
| |  output: sum(                                           |
| |  group by: d.dest_city                                            |
| |                                                                   |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.orig = o.orig                                 |
| |                                                                   |
| |--07:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  02:SCAN HDFS [airlines.geog_origin o]                            |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: o.orig_state IN ('Florida', 'New York', 'Alaska') |
| |                                                                   |
| 03:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.dest = d.dest                                 |
| |                                                                   |
| |--06:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  01:SCAN HDFS [airlines.geog_dest d]                              |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: d.dest_state = 'California'                       |
| |                                                                   |
| 00:SCAN HDFS [airlines.flight_delays f]                             |
|    partitions=1/1 files=1 size=64.00MB                              |
Fetched 35 row(s) in 0.21s

Like an Oracle SQL explain plan, Impala’s cost-based optimiser uses table and partition stats that you should have gathered previously using Impala’s “compute stats” command to determine what it thinks is the optimal execution plan for your query. To see the actual cost and timings for the various plan steps that are run for a query, you can then use the “summary” statement after your query has run (or, for more detail, the “profile” statement) to see the actual timings and stats for each step in the query execution.
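
As an aside before we look at the summary output, the table and partition statistics the optimiser relies on are gathered in impala-shell with the compute stats command, run against each table involved in the query, for example:

[] > compute stats airlines.flight_delays;
[] > compute stats airlines.geog_dest;
[] > compute stats airlines.geog_origin;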

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 20.35us  | 20.35us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 142.18ms | 180.81ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 59.86us  | 123.39us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 171.72ms | 208.36ms | 60      | 1.93K      | 22.73 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 89.42ms  | 101.82ms | 540.04K | 131.88M    | 12.79 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 16.32us  | 19.63us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 302.83ms | 302.83ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 936.71ms | 1.10s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 19.02us  | 46.49us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 266.99ms | 266.99ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.07s    | 1.90s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

Output from the Summary statement gives us some useful information in working out the impact of the various resource management options for the Oracle Big Data Appliance, at least in terms of its impact on individual Impala queries – we’ll look at the impact on the overall Hadoop cluster and individual nodes later on. From the output of the above Summary report I can see that my query ran on all six nodes in the cluster (queries I ran earlier on a smaller version of the fact table ran on just a single node), and I can see how long each step in the query actually took to run. So what happens if I run the same query again on the cluster but disable the Impala daemon service role on three of the nodes, using Cloudera Manager?


Here’s the Summary output after running the query again:

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 22.01us  | 22.01us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 3      | 111.12ms | 117.24ms | 7       | 193        | 6.27 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 3      | 30.09us  | 39.02us  | 30      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 3      | 161.26ms | 173.57ms | 30      | 1.93K      | 22.84 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 3      | 156.50ms | 238.90ms | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 3      | 20.19us  | 28.93us  | 1.41K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 477.38ms | 477.38ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 3      | 1.48s    | 1.66s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 3      | 12.07us  | 14.89us  | 519     | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 308.83ms | 308.83ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 3      | 3.39s    | 6.85s    | 131.88M | 131.88M    | 74.11 MB  | 480.00 MB     | airlines.flight_delays_full f |

What the Summary statement doesn’t show you is the overall time the query took to run; the query run against three nodes took 9.48s compared to 3.59s for the one before, where I had the Impala daemon enabled on all six nodes. In fact I’d expect a query running on the client’s BDA with just three out of twelve nodes enabled to run even slower because of the block locality issue – Impala has a feature called block locality tracking, which keeps track of where HDFS data blocks are actually located on the cluster and tries to run impalad tasks on the right nodes, but three out of twelve nodes running makes that job really hard – but the other factor that we need to consider is how running multiple queries concurrently affects things when only a few nodes are handling all the Impala user queries.

To try and simulate concurrent queries running I opened six terminal sessions against nodes actually running Impala Daemon service roles and submitted the same query from each session, with a second or two gap between each query (a scripted version of this test is sketched below); with all six nodes enabled the average response time rose to about 6s, but with just three enabled the response rose fairly consistently to around 27s.
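
If you would rather script that sort of concurrency test than juggle terminal sessions by hand, a crude shell loop along the following lines does the job – a sketch only, with placeholder hostnames and the test query saved to a local file:

#!/bin/bash
# Submit the same Impala query to six impalad nodes in parallel,
# staggering the submissions by a couple of seconds each
for NODE in bda-node1 bda-node2 bda-node3 bda-node4 bda-node5 bda-node6; do
  impala-shell -i ${NODE}:21000 -f flight_delays_query.sql &
  sleep 2
done
wait   # block until all six queries have finished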


This is of course what you’d expect when everything was trying to run on the same three (now resource-starved) server nodes, and again I’d expect this to be even more pronounced on the client’s twelve-node BDA. What this test of course didn’t cover was running workloads other than Impala on the same cluster, or running queries against different datasets, but it did at least show us how response-time increases fairly dramatically (albeit consistently) as more Impala users come onto the system.

So now we have some baseline benchmarking figures, let’s configure Impala to use YARN, using Cloudera Manager on the CDH5.3 setup used on the client’s BDA and our development cluster back in the office. There are actually two parts to Impala running on YARN in CDH5.x: YARN itself as the overall cluster resource management layer, and another component called Llama (Low-Latency, or “Long-Lived”, Application Master) that sits between YARN and Impala and reduces the time that each Impala query takes to obtain YARN resource allocations.


Enabling YARN and Llama (and if you want to, configuring Llama and thereby Impala for high-availability) is done through a wizard in CDH5.3 that also offers to set up a Linux feature called cgroups, which YARN can use to limit the “containers” it uses for resource management at the OS level.

Once you’ve run through the wizard and restarted the cluster, Impala should be configured to use YARN instead of its own scheduler to request resources, which in theory will allow Hadoop and the Big Data Appliance to consider Impala workloads alongside MapReduce, Spark and HBase when scheduling jobs across the cluster. Before we get into the options YARN gives us for managing these workloads I ran the same Impala queries again, first as a single query and then with six running concurrently, to see what impact YARN on its own had on query response times.

The single query on its own took around the same time to run as without YARN (3-4s), but when I ran six concurrent queries together the response time went up from the 3-4s that I saw without YARN enabled to between 5s and 18s depending on the session, with quite a bit of variation between response times compared to the consistent times I saw when YARN wasn’t being used – which surprised me, as one of the stated benefits of YARN is making job execution times more predictable and smooth, though this could be more of an overall-cluster thing, and there are also recommendations in the Cloudera docs around configuring YARN and Llama’s resource estimation to be more efficient for Impala.

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 41.38us  | 41.38us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 115.28ms | 123.04ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 44.44us  | 67.62us  | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 170.91ms | 201.47ms | 60      | 1.93K      | 22.82 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 82.25ms  | 98.34ms  | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.39us  | 18.99us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 244.40ms | 244.40ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 850.55ms | 942.47ms | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 13.99us  | 19.05us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 222.03ms | 222.03ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.54s    | 2.88s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

But it seems clear that users of Impala on the client cluster should expect some sort of overhead from using YARN to manage Impala’s resources, with the payoff being better balance between Impala workloads and the other uses they’re putting the BDA cluster to – however I think there’s more we can do to fine-tune how Llama and YARN allocate memory to Impala queries up-front (allocating a set amount of memory for all queries, rather than making an estimate and then adding more memory mid-query if it’s needed), and of course we’ve not really tested it on a cluster with a full, mixed workload running. But what about our original scenario, where only a certain percentage of the overall cluster resources or nodes are allocated to Impala query processing? To set up that sort of division of resources we can use another feature of YARN called dynamic allocation, and dynamic resource pools that we can set up through Cloudera Manager again.

Dynamic allocation is one of the ways that YARN can be configured to manage multiple workloads on a Hadoop cluster (the other way is through static service pools, and I’ll come to those in a moment). Using dynamic allocation I can set up a resource pool for the airline flight delays application that my Impala SQL queries are associated with and allocate it 25% of overall cluster resources, with the remainder of cluster resources allocated to other applications. I can keep that weighting simple as I have done in the screenshot below, or I can allocate resources based on virtual cores and memory, but I found it simpler to just set these overall weightings and let YARN worry about cores and RAM. 


Depending on the scheduling policy you select, YARN will prioritise Impala and other jobs in different ways, but the recommended scheduling policy for mixed workloads is dominant resource fairness, which balances RAM and CPU depending on which resource pool needs them most at a particular time. Note also that Impala can either be managed as part of the overall YARN workload or separately, a choice you can make in the Impala service configuration settings in Cloudera Manager (the “Enable Dynamic Resource Pools” setting that’s checked below, but was unchecked for the screenshot above).


There’s also a separate control you can place on Impala queries called Admission Control, which limits the number of queries that can run or be queued for a resource pool at any particular time. The docs are a bit vague on when to use admission control, when to use YARN or not and so on, but my take on this is that if it’s just Impala queries you’re worried about, and throttling their use solves the problem, then use this feature and leave Impala outside of YARN; but if you need to manage overall mixed workloads then do it all through YARN. For my testing example though I just went with simple resource pool weighting, and you can see from the screenshot below that, with multiple queries running at once for my pool, CPU and RAM resources are constrained as expected.


To make a particular Impala query run within a specific resource pool you can either allocate that user to a named resource pool, or you can specify the resource pool in your Impala shell session like this:

[] > set request_pool = airlines; 
REQUEST_POOL set to airlines
[] > select sum( as total_flights, d.dest_city
from airlines.flight_delays_full f join airlines.geog_dest d on f.dest = d.dest
join airlines.geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000;

Looking then at a typical summary output for a query running with these restrictions (25% of resources overall) and other queries running concurrently, the numbers don’t look all that different to before, and results took between 8s and 30s to return – again I was surprised at the variance, but I think YARN is more about overall cluster performance rather than individual queries, and you shouldn’t read too much into specific times on a dev server with an unrepresentative overall workload.

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 26.78us  | 26.78us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 209.10ms | 262.02ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 63.20us  | 118.89us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 282.56ms | 401.37ms | 60      | 1.93K      | 22.76 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 99.56ms  | 114.14ms | 540.04K | 131.88M    | 12.85 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.49us  | 17.94us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 531.08ms | 531.08ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 1.20s    | 1.54s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 24.29us  | 68.23us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 287.31ms | 287.31ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 2.34s    | 3.13s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

A point to note is that I found it very hard to get Impala queries to run when I got down to specifying virtual core and memory limits rather than just overall weightings, so I’d go with these high-level resource pool prioritisations which seemed to work and didn’t unduly affect query response times. For example the setting below looked clever, but queries always seemed to time out and I never really got a satisfactory setup working properly.


Note that for YARN dynamic resource pools to be used, all Linux/CDH users will need to be assigned to resource pools so they don’t run as “unconstrained”; this can also be done from the Dynamic Resource Pools configuration page.

Finally though, all of this management through YARN dynamic resource pools might not be the best way to control Impala’s resource usage. The Cloudera docs say quite clearly on the Integrated Resource Management page that:

“When using YARN with Impala, Cloudera recommends using the static partitioning technique (through a static service pool) rather than the combination of YARN and Llama. YARN is a central, synchronous scheduler and thus introduces higher latency and variance which is better suited for batch processing than for interactive workloads like Impala (especially with higher concurrency). Currently, YARN allocates memory throughout the query, making it hard to reason about out-of-memory and timeout conditions.”

What this means in practice is that, if you’ve got a single project using the Big Data Appliance and you just want to specify at a high level what proportion of resources Impala, HBase, MapReduce and the other services under YARN management use, you can define this as static service pool settings in Cloudera Manager and have these restrictions enforced by Linux cgroups. In the screenshot below I unwound all of the dynamic resource pool settings I created a moment ago and allocated 25% of overall cluster resources to Impala, with the wizard then using those top-level values to set limits for services across all nodes in the cluster based on their actual RAM and CPU, the services running on them and so on.


Then, going back to Cloudera Manager and running some queries, you can see these static service pool limits being applied in real-time and their effect in the form of graphs for particular cluster resources.


So given all of this, what was our recommendation to the client about how best to set up resource management for Impala and other workloads on their Big Data Appliance? Not too much should be read into individual numbers – it’s hard to simulate a proper mixed workload on a development server, and of course their BDA has 12 nodes, more memory, faster CPUs. However it’s probably fair to say these are the obvious conclusions:

  • Running Impala daemons on just a subset of nodes isn’t actually a bad way to constrain resources used by Impala, but it’ll only work on clusters with a small number of nodes, where there’s a good chance one node will have one of the three copies of a data block. On a system of the scale of our customer’s, we’ll probably hit unacceptable overheads in terms of block locality. I would not carry on with this approach because of that.
  • If the customer BDA will be running a mixed workload, i.e. data loading, long-running Hive/Pig/Spark jobs as well as short-running Impala jobs, enabling Impala for YARN and setting overall resource pools for applications would be the best approach, but individual Impala queries will probably run slower than now (even given the restriction in resources), due to the overhead YARN imposes when scheduling and running jobs. However, this will be the best way to allocate resources between applications and provide a generally “smoother” experience for users.
  • If the BDA needs to be optimised mostly for Impala queries, then don’t manage Impala under YARN; leave it outside of this and just use static service pools to allocate Impala roughly 25% of resources across all nodes. In both this and the previous option (Impala on YARN), all nodes should be re-enabled for Impala so as to minimise issues over block locality.
  • If the only real issue is Impala queries for a particular application taking all resources/becoming runaway, Impala could be left outside of YARN but enabled for admission control so as to limit the total number of running/queued queries for a particular application.
Categories: BI & Warehousing

Is Apache Spark becoming a DBMS?

Dylan's BI Notes - Wed, 2015-09-09 21:02
I attended a great meetup and this is the question I have after the meeting. Perhaps the intent is to make it like a DBMS, like Oracle, or even a BI platform, like OBIEE? — The task flow is actually very similar to a typical database profiling and data analysis job. 1. Define your question […]
Categories: BI & Warehousing

Oracle Big Data Discovery 1.1 now GA, and Available as part of BigDataLite 4.2.1

Rittman Mead Consulting - Sat, 2015-09-05 01:53

The new Oracle Big Data Discovery 1.1 release went GA a couple of weeks ago, and came with a bunch of features that addressed show-stoppers in the original 1.0 release: the ability to refresh and reload datasets from Hive, compatibility with Cloudera CDH and Hortonworks HDP Hadoop platforms, Kerberos integration, and the ability to bring in datasets from remote JDBC datasources. If you’re new to Big Data Discovery I covered the initial release in a number of blog posts over the past year or so:

So let’s start by loading some data into Big Data Discovery so that we can explore what’s in it, see the range of attributes and their values and do some basic data clean-up and enrichment. As with Big Data Discovery 1.0 you import (or “sample”) data into Big Data Discovery’s DGraph engine either via file upload, or by using a command-line utility. Data in Hadoop has to be registered in the Hive HCatalog metadata layer, and I’ll start by importing a Hive table mapped to some webserver log files via a Hive SERDE:


To import or sample this table into BDD’s DGraph engine I use the following command to invoke the Big Data Discovery Data Processing engine, which reads the Hive table metadata, loads the Hive table data into the DGraph engine (either all rows, or a representative sample) and processes/enriches the data, adding geocoding for example:

[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t apachelog_parsed

This then runs as an Apache Spark job under YARN, progress of which you can track either from the console or through Cloudera Manager / Hue.

[2015-09-04T14:20:43.792-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:oracle] 
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 start time: 1441390841404
 final status: UNDEFINED
 tracking URL: http://bigdatalite.localdomain:8088/proxy/application_1441385366282_0001/
 user: oracle
[2015-09-04T14:20:45.794-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:oracle] Application report for application_1441385366282_0001 (state: ACCEPTED)
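
You can also check on the job from the command line using the standard YARN CLI – for example, with the application ID reported in the log above:

[oracle@bigdatalite ~]$ yarn application -status application_1441385366282_0001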

Going over to Big Data Discovery Studio I can then see the dataset within the catalog, and then load it into a project and start exploring and transforming the dataset. In the screenshots below I’m cleaning up the date and time field to turn it into a timestamp, and arranging the automatically-derived country, city, region and state attributes into a geography hierarchy. BDD1.1 comes with a bunch of other transformation enhancements including new options for enrichment, the ability to tag values via a whitelist and so on – a full list of new features for BDD1.1 can be found in MOS Doc.ID 2044712.1


Crucially, in BDD1.1 you can now either refresh a data set with new data from Hive (re-load), or do an incremental update after you’ve selected an attribute as the update (identity) column – in the screenshot below I’m doing this for a dataset uploaded from a file, but you can reload and update datasets from the command line too, which then opens up the possibility of scripting, scheduling and so on.


You can also define JDBC data connections in the administration part of BDD Studio, and then type in SQL queries to define data sources that can then be added into your project as a dataset – loading their data directly into the DGraph engine rather than having to stage it in Hadoop beforehand.
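
The data source definition itself is just the SQL you type into BDD Studio – something along these lines for a hypothetical Oracle table holding the blog post data used in the join later on (the table and column names here are invented for illustration):

SELECT post_id, post_title, post_date, post_author, post_url
FROM   blog_posts
WHERE  post_status = 'publish'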


Then, as with the initial release of Big Data Discovery, you can define joins between the data sets in a project based on common attributes – in this case I’m joining the page URLs in the webserver logs with the page and article data extracted from our WordPress install, sourced from both Hive and Oracle (via JDBC)


Other new features in BDD1.1 include the ability to define “applications”, projects and datasets that are considered “production quality” and include details on how to refresh and incrementally load their datasets (designed presumably to facilitate migrations from Endeca Information Discovery), and a number of other new features around usability, data exploration and security. You can download BDD1.1 from Oracle’s Edelivery site, or download it pre-installed and configured as part of the new Big Data Lite 4.2.1 Virtualbox virtual machine.

Categories: BI & Warehousing

Rittman Mead’s First Courses in Spanish in Latin America

Rittman Mead Consulting - Fri, 2015-09-04 08:04

Last week, as part of the OTN Tour LA 2015, we delivered our first courses in Spanish, in Quito, Ecuador and Cali, Colombia.

It was a one-day (8-hour) workshop on OBIEE 11g Front-End Development.

On this occasion the courses were delivered in person: all of the theoretical and hands-on material was in Spanish, and during the training each student had a dedicated virtual machine on which to work through the practical exercises, with the support and assistance of the instructor at all times.

I have been teaching courses for many years. The first course I taught was back in 1999 (ahem, yes, in the last century). Since then I have delivered many more on various Oracle products, including as an Oracle University instructor, which I have been for more than 10 years. In total I must have taught courses to more than 500 people.

For the last year and a half I have been happily working at Rittman Mead, developing and delivering courses in English, which has been a wonderful challenge. But teaching courses in Spanish again was, as they say, like coming home. A really lovely experience.

The feedback we received on the course was really very good, and it fills us with joy to have once again exceeded the participants’ expectations. These are some examples of the feedback we have received.

We want to thank all of the course attendees for their participation, and also our partners in Ecuador, Bluefrog and Refundation, and the Oracle User Groups of Ecuador and Colombia, for organising everything.

If you would like to receive our courses in your own language, whether Spanish or Portuguese, or you would like to become our training partner, drop us an email and we will get in touch with you right away.

Categories: BI & Warehousing

Schema On Demand and Extensibility

Dylan's BI Notes - Fri, 2015-09-04 00:44
Today I saw a quite impressive demo at the Global Big Data Conference. AtScale provides a BI metadata tool for data stored in Hadoop. At first, I thought that this was just another BI tool that accesses Hadoop via Hive, like what we have in OBIEE. I heard that the SQL performance for BI […]
Categories: BI & Warehousing