
BI & Warehousing

Preview of Maria Colgan, and Andrew Bond/Stewart Bryson Sessions at RM BI Forum 2014

Rittman Mead Consulting - Wed, 2014-04-16 02:11

We’ve got a great selection of presentations at the two upcoming Rittman Mead BI Forum 2014 events in Brighton and Atlanta, including sessions on Endeca, TimesTen, OBIEE (of course), ODI, GoldenGate, Essbase and Big Data (full timetable for both events here). Two of the sessions I’m particularly looking forward to, though, are one by Maria Colgan, product manager for the new In-Memory Option for Oracle Database, and another by Andrew Bond and Stewart Bryson on an update to Oracle’s reference architecture for Data Warehousing and Information Management.

The In-Memory Option for Oracle Database was of course the big news item from last year’s Oracle Openworld, promising to bring in-memory analytics and column-storage to the Oracle Database. Maria is well known to the Oracle BI and Data Warehousing community through her work with the Oracle Database Cost-Based Optimizer, so we’re particularly glad to have her at the Atlanta BI Forum 2014 to talk about what’s coming with this new feature. I asked Maria to jot down a few words for the blog on what she’ll be covering, so over to Maria:


“At Oracle Open World last year, Oracle announced the upcoming availability of the Oracle Database In-Memory option, a solution for accelerating database-driven business decision-making to real-time. Unlike specialized In-Memory Database approaches that are restricted to particular workloads or applications, Oracle Database 12c leverages a new in-memory column store format to speed up analytic workloads. Given this announcement and the performance improvements promised by this new functionality, is it still necessary to create a separate access and performance layer in your data warehouse environment, or to run your Oracle data warehouse on an Exadata environment?

This session explains in detail how Oracle Database In-Memory works and will demonstrate just how much of a performance improvement you can expect. We will also discuss how it integrates into the existing Oracle Data Warehousing Architecture and with an Exadata environment.”

The other session I’m particularly looking forward to is one being delivered jointly by Stewart Bryson and Andrew Bond, who heads up Enterprise Architecture at Oracle and was responsible, along with Doug Cackett, for the various data warehousing, information management and big data reference architectures we’ve covered on the blog over the past few years, including the first update to include “big data” a year or so ago.


Back towards the start of this year, Stewart, Jon Mead and I met up with Andrew and his team to work together on an update to this reference architecture, and Stewart carried on with the collaboration afterwards, bringing some of our ideas around agile development, big data and data warehouse design into the final architecture. Stewart and Andrew will be previewing the updated reference architecture at the Brighton BI Forum event, and in the meantime, here’s a preview from Andrew:

“I’m very excited to be attending the event and unveiling Oracle’s latest iteration of the Information Management reference architecture. In this version we have focused on a pragmatic approach to “Analytics 3.0” and in particular looked at bringing an agile methodology to break the IT / business barrier. We’ve also examined the exploitation of in-memory technologies and the Hadoop ecosystem, and how to navigate the plethora of new technology choices.

We’ve worked very closely with a number of key customers and partners on this version – most notably Rittman Mead – and I’m delighted that Stewart and I will be able to co-present the architecture and receive immediate feedback from delegates.”

Full details of the event, running in Brighton on May 7th-9th 2014 and Atlanta on May 14th-16th 2014, can be found on the Rittman Mead BI Forum 2014 homepage, and the agendas for the two days are on this blog post from earlier in the week.

Categories: BI & Warehousing

Final Timetable and Agenda for the Brighton and Atlanta BI Forums, May 2014

Rittman Mead Consulting - Mon, 2014-04-14 07:00

It’s just a few weeks now until the Rittman Mead BI Forum 2014 events in Brighton and Atlanta, and there’s still a few spaces left at both events if you’d still like to come – check out the main BI Forum 2014 event page, and the booking links for Brighton (May 7th – 9th 2014) and Atlanta (May 14th – 16th 2014).

We’re also now able to publish the timetable and running order for the two events – session order can still change between now and the events, but this is what we’re planning to run, first of all in Brighton, with the photos below from last year’s BI Forum.

Brighton

Brighton BI Forum 2014, Hotel Seattle, Brighton

Wednesday 7th May 2014 – Optional 1-Day Masterclass, and Opening Drinks, Keynote and Dinner

  • 9.00 – 10.00 : Registration
  • 10.00 – 11.00 : Lars George Hadoop Masterclass Part 1
  • 11.00 – 11.15 : Morning Coffee 
  • 11.15 – 12.15 : Lars George Hadoop Masterclass Part 2
  • 12.15 – 13.15 : Lunch
  • 13.15 – 14.15 : Lars George Hadoop Masterclass Part 3
  • 14.15 – 14.30 : Afternoon Tea/Coffee/Beers
  • 14.30 – 15.30 : Lars George Hadoop Masterclass Part 4
  • 17.00 – 19.00 : Registration and Drinks Reception
  • 19.00 – Late :  Oracle Keynote and Dinner at Hotel
Thursday 8th May 2014
  • 08.45 – 09.00 : Opening Remarks Mark Rittman, Rittman Mead
  • 09.00 – 10.00 : Emiel van Bockel : Extreme Intelligence, made possible by …
  • 10.00 – 10.30 : Morning Coffee 
  • 10.30 – 11.30 : Chris Jenkins : TimesTen for Exalytics: Best Practices and Optimisation
  • 11.30 – 12.30 : Robin Moffatt : No Silver Bullets : OBIEE Performance in the Real World
  • 12.30 – 13.30 : Lunch
  • 13.30 – 14.30 : Adam Bloom : Building a BI Cloud
  • 14.30 – 14.45 : TED : Paul Oprea : “Extreme Data Warehousing”
  • 14.45 – 15.00 : TED : Michael Rainey : “A Picture Can Replace A Thousand Words”
  • 15.00 – 15.30 : Afternoon Tea/Coffee/Beers
  • 15.30 – 15.45 : Reiner Zimmerman : About the Oracle DW Global Leaders Program
  • 15.45 – 16.45 : Andrew Bond & Stewart Bryson : Enterprise Big Data Architecture
  • 19.00 – Late: Depart for Gala Dinner, St Georges Church, Brighton

Friday 9th May 2014

  • 9.00 – 10.00 : Truls Bergensen – Drawing in a New Rock on the Map – How Will Endeca Fit into Your Oracle BI Topography
  • 10.00 – 10.30 : Morning Coffee 
  • 10.30 – 11.30 : Nicholas Hurt & Michael Rainey : Real-time Data Warehouse Upgrade – Success Stories
  • 11.30 – 12.30 : Matt Bedin & Adam Bloom : Analytics and the Cloud
  • 12.30 – 13.30 : Lunch
  • 13.30 – 14.30 : Gianni Ceresa : Essbase within/without OBIEE – not just an aggregation engine
  • 14.30 – 14.45 : TED : Marco Klaasen : “Speed up RPD Development”
  • 14.45 – 15.00 : TED : Christian Berg : “Neo’s Voyage in OBIEE”
  • 15.00 – 15.30 : Afternoon Tea/Coffee/Beers
  • 15.30 – 16.30 : Alistair Burgess : “Tuning TimesTen with Aggregate Persistence”
  • 16.30 – 16.45 : Closing Remarks (Mark Rittman)
Then directly after Brighton we’ve got the US Atlanta event, running the week after, Wednesday – Friday, with last year’s photos below:

Atlanta BI Forum 2014, Renaissance Mid-Town Hotel, Atlanta

Wednesday 14th May 2014 – Optional 1-Day Masterclass, and Opening Drinks, Keynote and Dinner

  • 9.00 – 10.00 : Registration
  • 10.00 – 11.00 : Lars George Hadoop Masterclass Part 1
  • 11.00 – 11.15 : Morning Coffee 
  • 11.15 – 12.15 : Lars George Hadoop Masterclass Part 2
  • 12.15 – 13.15 : Lunch
  • 13.15 – 14.15 : Lars George Hadoop Masterclass Part 3
  • 14.15 – 14.30 : Afternoon Tea/Coffee/Beers
  • 14.30 – 15.30 : Lars George Hadoop Masterclass Part 4
  • 16.00 – 18.00 : Registration and Drinks Reception
  • 18.00 – 19.00 : Oracle Keynote & Dinner

Thursday 15th May 2014

  • 08.45 – 09.00 : Opening Remarks Mark Rittman, Rittman Mead
  • 09.00 – 10.00 : Kevin McGinley : Adding 3rd Party Visualization to OBIEE
  • 10.00 – 10.30 : Morning Coffee 
  • 10.30 – 11.30 : Chris Linskey : Endeca Information Discovery for Self-Service and Big Data
  • 11.30 – 12.30 : Omri Traub : Endeca and Big Data: A Vision for the Future
  • 12.30 – 13.30 : Lunch
  • 13.30 – 14.30 : Dan Vlamis : Capitalizing on Analytics in the Oracle Database in BI Applications
  • 14.30 – 15.30 : Susan Cheung : TimesTen In-Memory Database for Analytics – Best Practices and Use Cases
  • 15.30 – 15.45 : Afternoon Tea/Coffee/Beers
  • 15.45 – 16.45 : Christian Screen : Oracle BI Got MAD and You Should Be Happy
  • 18.00 – 19.00 : Special Guest Keynote : Maria Colgan : An introduction to the new Oracle Database In-Memory option
  • 19.00 – leave for dinner

Friday 16th May 2014

  • 09.00 – 10.00 : Patrick Rafferty : More Than Mashups – Advanced Visualizations and Data Discovery
  • 10.00 – 10.30 : Morning Coffee 
  • 10.30 – 11.30 : Matt Bedin : Analytics and the Cloud
  • 11.30 – 12.30 : Jack Berkowitz : Analytic Applications and the Cloud
  • 12.30 – 13.30 : Lunch
  • 13.30 – 14.30 : Philippe Lions : What’s new on 2014 HY1 OBIEE SampleApp
  • 14.30 – 15.30 : Stewart Bryson : ExtremeBI: Agile, Real-Time BI with Oracle Business Intelligence, Oracle Data Integrator and Oracle GoldenGate
  • 15.30 – 16.00 : Afternoon Tea/Coffee/Beers
  • 16.00 – 17.00 : Wayne Van Sluys : Everything You Know about Oracle Essbase Tuning is Wrong or Outdated!
  • 17.00 – 17.15 : Closing Remarks (Mark Rittman)
Full details of the two events, including more on the Hadoop Masterclass with Cloudera’s Lars George, can be found on the BI Forum 2014 home page.

Categories: BI & Warehousing

The Riley Family, Part III

Chet Justice - Thu, 2014-04-10 20:44


That's Mike and Lisa, hanging out at the hospital. Mike's in his awesome cookie monster pajamas and robe...must be nice, right? Oh wait, it's not. You probably remember why he's there, Stage 3 cancer. The joys.

In October, we helped to send the entire family to Game 5 of the World Series (Cards lost, thanks Red Sox for ruining their night).

In November I started a GoFundMe campaign, to date, with your help, we've raised $10,999. We've paid over 9 thousand dollars to the Riley family (another check to be cut shortly).

In December, Mike had surgery. Details can be found here. Shorter: things went fairly well, then they didn't. Mike spent 22 days in the hospital and lost 40 lbs. He missed Christmas and New Years at home with his family. But, as I've learned over the last 6 months, the Riley family really knows how to take things in stride.

About 6 weeks ago Mike started round 2 of chemo, he's halfway through that one now. He complains (daily, ugh) about numbness, dizziness, feeling cold (he lives in St. Louis, are you sure it's not the weather?), and priapism (that's a lie...I hope).

Mike being Mike though, barely a complaint (I'll let you figure out where I'm telling a lie).

Four weeks ago, a chilly (65) Saturday night, Mike and Lisa call. "Hey, I've got some news for you."

"Sweet," I think to myself. Gotta be good news.

"Lisa was just diagnosed with breast cancer."

WTF?

ARE YOU KIDDING ME? (Given Mike's gallows humor, it's possible).

"Nope. Stage 1. Surgery on April 2nd."

FFS

(Surgery was last week. It went well. No news on that front yet.)

Talking to the two of them that evening you would have no idea they BOTH have cancer. Actually, one of my favorite stories of the year...the hashtag for the Riley Family campaign was #fmcuta. Fuck Mike's Cancer (up the ass). I thought that was hilarious, but I didn't think the Rileys would appreciate it. They did. They loved it. I still remember Lisa's laugh when I first suggested it. They've dropped the latest bad news and Lisa is like, "Oh, wait until you hear this. I have a hashtag for you."

"What is it?" (I'm thinking something very...conservative. Not sure why, I should know better by now).

#tna

I think about that for about .06 seconds. Holy shit! Did you just say tna? Like "tits and ass?"

(sounds of Lisa howling in the background).

Awesome. See what I mean? Handling it in stride.

"We're going to need a bigger boat." All I can think about now is, "what can we do now?"

First, I raised the campaign goal to 50k. This might be ambitious, that's OK, cancer treatments are expensive enough for one person, and 10K (the original amount) was on the low side. So...50K.

Second, Scott Spendolini created a very cool APEX app, ostensibly called the Riley Support Group (website? gah). It's a calendar/scheduling app that allows friends and family to coordinate things like meals, young human (children) care and other things that most of us probably take for granted. Pretty cool stuff. For instance, Tim Gorman provides pizza on Monday nights (Dinner from pizza hut...1 - large hand-tossed cheese lovers, 1 - large thin-crispy pepperoni, 1 - 4xpepperoni rolls, 1 - cheesesticks).

Third. There is no third.

So many of you have donated your hard earned cash to the Riley family, they are incredibly humbled by, and grateful for, everyone's generosity. They aren't out of the woods yet. Donate more. Please. If you can't donate, see if there's something you can help out with (hit me up for details, Tim lives in CO, he's not really close). If you can't do either of those things, send them your prayers or your good thoughts. Any and all help will be greatly appreciated.
Categories: BI & Warehousing

Details on Two New Upcoming Courses in Brighton – Including ODI12c

Rittman Mead Consulting - Fri, 2014-04-04 05:09

Just a quick note to mention two courses that are being run from our Brighton office soon, that might interest readers of the blog.

Our OBIEE 11g Bootcamp course, newly updated for OBIEE 11.1.1.7, is a start-to-finish introduction to Oracle Business Intelligence Enterprise Edition 11g aimed at developers and admins new to the platform. Starting with an overview of the platform, then taking you through RPD development, creating reports and dashboards and through to security, configuring for high-availability and systems management, this is our most popular course and includes a free copy of my book, “Oracle Business Intelligence Developers Guide”.

This 5-day course covers the following topics:

  • OBIEE 11g Overview & Product Architecture
  • Installation, Configuration & Upgrades
  • Creating Repositories from Relational Sources
  • Advanced Repository Modelling from Relational Sources
  • Systems Management using Oracle Enterprise Manager
  • Creating Analyses and Dashboards
  • Actionable Intelligence
  • KPIs, Scorecards & Strategy Management
  • Creating Published Reports (BI Publisher)
  • OBIEE 11g Security
  • High-Availability, Scaleout & Clustering

Details on the full course agenda are available here, and the next run of the course is in Brighton on April 28th – May 2nd, 2014 – register using that link.

We’re also very pleased to announce the availability of our new Data Integrator 12c 3-day course, aimed at developers new to ODI as well as those upgrading their knowledge from the ODI11g release. Written entirely by our ODI development team and also used by us to train our own staff, this is a great opportunity to learn ODI12c based on Rittman Mead’s own, practical field experience.

The topics we’ll be covering in this course are:

  • Getting Started with ODI 12c
  • ODI 12c Topology
  • ODI 12c Projects
  • Models & Datastores
  • Data Quality in a Model
  • Introduction to ODI Mappings
  • ODI Procedures, Variables, Sequences, & User Functions
  • Advanced ODI Mappings
  • ODI Packages
  • Scenarios in ODI
  • The ODI Debugger
  • ODI Load Plans

We’re running this course for the first time, in Brighton on May 12th – 14th 2014, and you can register using this link.

Full details of all our training courses, including public scheduled courses and private, standard or customised courses, can be found on our Training web page or for more information, contact the Rittman Mead Training Team.

Categories: BI & Warehousing

BI Forum 2014 preview – No Silver Bullets : OBIEE Performance in the Real World

Rittman Mead Consulting - Thu, 2014-04-03 03:35

I’m honoured to have been accepted to speak at this year’s Rittman Mead BI Forum, the sixth year of this expert-level conference that draws some of the best Oracle BI/DW minds together from around the world. It’s running May 8th-9th in Brighton, and May 15-16th in Atlanta, with an optional masterclass from Cloudera’s Lars George the day before the conference itself at each venue.

My first visit to the BI Forum was in 2009 where I presented Performance Testing OBIEE, and now five years later (five years!) I’m back, like a stuck record, talking about the same thing – performance. That I’m still talking about it means that there’s still an audience for it, and this time I’m looking beyond just testing performance, but how it’s approached by people working with OBIEE. For an industry built around 1s and 0s, computers doing just what you tell them to and nothing else, there is a surprising amount of suspect folklore and “best practices” used when it comes to “fixing” performance problems.

OBIEE performance good luck charm

Getting good performance with OBIEE is just a matter of being methodical. Understanding where to look for information is half the battle. By understanding where the time goes, improvements can be targeted where they will be most effective. Heavily influenced by Cary Millsap and his Method-R approach to performance, I will look at how to practically apply this to OBIEE. Most of the information needed to build up a full picture is readily available from OBIEE’s log files.

I’ll also dig a bit deeper into OBIEE, exploring how to determine how the system’s behaving “under the covers”. The primary technique for this is through OBIEE’s DMS metrics, which I have written about recently in relation to the new Rittman Mead open-source tool, obi-metrics-agent, and which I am using day-to-day to rapidly examine and resolve performance problems that clients see.

I’m excited to be presenting again on this topic, and I hope to see you in Brighton next month. The conference always sells out, so don’t delay – register today!

Categories: BI & Warehousing

Data Integration Tips: ODI 12c – Varchar2 (CHAR or BYTE)

Rittman Mead Consulting - Tue, 2014-04-01 11:28

Continuing with our Data Integration Tips series, here’s one that applies to both Oracle Data Integrator 11g and 12c. This “tip” was actually discovered as a part of a larger issue involving GoldenGate, Oracle Datapump, and ODI. Maybe a future post will dive deeper into those challenges, but here I’m going to focus just on the ODI bit.

The Scenario

During our setup of GoldenGate and ODI, it was discovered that the source and target databases were set to use different character sets.

Source:  WE8ISO8859P1

Target (DW):  AL32UTF8

During my research, I found that the source is a single-byte character set and the target is multi-byte. What this means is that a special character, such as “Ǣ”, for example, may take up more than one byte when stored in a column with a VARCHAR2 datatype (as described in the Oracle documentation – “Choosing a Character Set”). When attempting to load a column of datatype VARCHAR2(1) containing the text “Ǣ”, the load would fail with an error, similar to the one below.

ORA-12899: value too large for column "COL_NAME" (actual: 2, maximum: 1)

The difference in character sets is clearly the issue, but how do we handle this when performing a load between the two databases? Reading through the Oracle doc referenced above, we can see that it all depends on the target database column length semantics. Specifically, for the attributes of VARCHAR2 datatype, we need to use character semantics in the target, “VARCHAR2(1 CHAR)”, rather than byte semantics, “VARCHAR2(1 BYTE)”. The former can handle multi-byte character sets simply by storing the characters as they are inserted. The latter will store each byte necessary for the character value individually. Looking back at the example, the character “Ǣ” inserted into a column using byte semantics (which is the default, in this case, when BYTE or CHAR is not specified) would require 2 bytes, thus causing the error.
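To see the behaviour in isolation, here is a minimal sketch you could run against the AL32UTF8 target with SQL*Plus – the connect string is a placeholder, and the table exists purely for the demonstration:

sqlplus -s dw_user/dw_password@target_dw <<'EOF'
-- Byte semantics (the default here): the 2-byte character overflows VARCHAR2(1)
CREATE TABLE semantics_test (col_byte VARCHAR2(1 BYTE), col_char VARCHAR2(1 CHAR));
INSERT INTO semantics_test (col_byte) VALUES ('Ǣ');  -- fails: ORA-12899 (actual: 2, maximum: 1)
INSERT INTO semantics_test (col_char) VALUES ('Ǣ');  -- succeeds: one character, regardless of byte count
DROP TABLE semantics_test PURGE;
EOF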

Here’s the Tip…

The overall solution is to modify any VARCHAR2 columns that may have special characters inserted to use character semantics in the target database. Quite often we cannot determine which columns may or may not contain certain data, requiring the modification of all columns to use character semantics. Using the database system tables, the alter table script to make the necessary changes to existing columns can be generated and executed. But what about new columns generated by ODI? Here we’ll need to use the power of the Oracle Data Integrator metadata to create a new datatype.
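As a sketch of that generation step (the credentials are placeholders, and the query assumes you are connected as the schema owner), something like this produces an ALTER statement for every byte-semantics VARCHAR2 column:

sqlplus -s dw_user/dw_password@target_dw <<'EOF'
SET PAGESIZE 0 LINESIZE 200 FEEDBACK OFF
-- One ALTER statement per VARCHAR2 column currently defined with byte semantics
SELECT 'ALTER TABLE ' || table_name || ' MODIFY ' || column_name
       || ' VARCHAR2(' || char_length || ' CHAR);'
FROM   user_tab_columns
WHERE  data_type = 'VARCHAR2'
AND    char_used = 'B'
ORDER  BY table_name, column_name;
EOF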

In the ODI Topology, under the Physical Architecture accordion, the technologies that can be used as a data source or target are listed. Each technology, in turn, has a set of datatypes defined that may be used as Datastore Attributes when the technology is chosen in a Model.

Oracle Technology

Further down in the list, you will find the VARCHAR2 datatype. Double-click the name to open the object. In the SQL Generation Code section we will find the syntax used when DDL is generated for a column of type VARCHAR2.

Oracle technology - VARCHAR2 datatype

As you can see, the default is to omit the type of semantics used in the datatype syntax, which most likely means BYTE semantics are used, as this is typically the default in an Oracle database. This syntax can be modified to always produce character semantics by adding the CHAR keyword after the length substitution value.

VARCHAR2(%L CHAR)

Before making the change to the “out of the box” VARCHAR2 datatype, you may want to think about how this datatype will be used on Oracle targets and sources. Any DDL generated by ODI will use this syntax when VARCHAR2 is selected for an attribute datatype. In some cases, this might be just fine as the ODI tool is only used for a single target data warehouse. But quite often, ODI is used in many different capacities, such as data migrations, data warehousing, etc. To handle both forms of semantics, the best approach is to duplicate the VARCHAR2 datatype and create a new version for the use of characters.

VARCHAR2 datatype edited

Now we can assign the datatype VARCHAR2 (CHAR) to any of our Datastore columns. I recommend the use of a Groovy script if changing Attributes in multiple Datastores.

Change Datatype - VARCHAR2 CHAR

Now when Generate DDL is executed on the Model, the Create Table step will have the appropriate semantics for the VARCHAR2 datatype.

Generate DDL - VARCHAR2 CHAR

As you can see, the power of Oracle Data Integrator and the ability to modify and customize its metadata provided me with the solution in this particular situation. Look for more Data Integration Tips from Rittman Mead – coming soon!

Categories: BI & Warehousing

Visualising OBIEE DMS metrics with Graphite

Rittman Mead Consulting - Sun, 2014-03-30 13:50

Assuming you have set up obi-metrics-agent and collectl as described in my previous post, you have a wealth of data at your disposal for graphing and exploring in Graphite, including:

  • OS (CPU, disk, network, memory)
  • OBIEE’s metrics
  • Metrics about DMS itself
  • Carbon (Graphite’s data collector agent) metrics

In this post I’ll show you some of the techniques we can use to put together a simple dashboard.

Building graphs

First off, let’s see how Graphite actually builds graphs. When you select a data series from the Metrics pane it is added to the Graphite composer where you can have multiple metrics. They’re listed in a legend, and if you click on Graph Data you can see the list of them.

Data held in Graphite (or technically, held in whisper) can be manipulated and pre-processed in many ways before Graphite renders it. This can be mathematical transforms of the data (eg Moving Average), but also how the data and its label is shown. Here I’ll take the example of several of the CPU metrics (via collectl) to see how we can manipulate them.

To start with, I’ve just added idle, wait and user from the cputotals folder, giving me a nice graph thus:

We can do some obvious things like add in a title, from the Graph Options menu

Graphite functions

Looking at the legend there’s a lot of repeated text (the full qualification of the metric name) which makes the graph more cluttered and less easy to read. We can use a Graphite function to fix this. Click on Graph Data, and use ctrl-click to select all three metrics:

Now click on Apply Function -> Set Legend Name By Metric. The aliasByMetric function is wrapped around the metrics, and the legend on the graph now shows just the metric names which is much smarter:

You can read more about Graphite functions here.

Another useful technique is being able to graph out metrics using a wildcard. Consider the ProcessInfo group of metrics that DMS provides about some of the OBIEE processes:

Let’s say we want a graph that shows cpuTime for each of the processes (not all are available). We could add each metric individually:

But that’s time consuming, and assumes there are only two processes. What if DMS gives us data for other processes? Instead we can use a wildcard in place of the process name:

obieesample.DMS.dms_cProcessInfo.ProcessInfo.*.cpuTime

You can do this by selecting a metric and then amending it in the Graph Data view, or from the Graph Data view itself click on Add and use the auto-complete to manually enter it.

But now the legend is pretty unintelligible, and this time using the aliasByMetric function won’t help because the metric name is constant (cpuTime). Instead, use the Set Legend Name By Node function. In this example we want the third node (the name of the process). Combined with a graph title this gives us:

This aliasByNode method works well for Connection Pool data too. However it can be sensitive to certain characters (including brackets) in the metric name, throwing an IndexError: list index out of range error. The latest version of obi-metrics-agent should work around this by modifying the metric names before sending them to carbon.

The above graph shows a further opportunity for using Graphite functions. The metric is a cumulative one – the amount of CPU time that the process has used, in total. What would be more useful would be if we could show the delta between each occurrence. For this, the derivative function is appropriate:
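For example, the two functions can be combined into a single target and requested straight from Graphite’s render API – the hostname and metric prefix below are illustrative, and aliasByNode indexes are zero-based, so adjust the node number to match your own prefix:

# Per-process cpuTime deltas, with the legend set to the process name (node 4 in this path)
curl -o cputime.png \
  'http://<server>/render?target=aliasByNode(derivative(obieesample.DMS.dms_cProcessInfo.ProcessInfo.*.cpuTime),4)&from=-1hours'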

Sometimes you’ll get graphs with gaps in; maybe the server was busy and the collector couldn’t keep up.


To “gloss over” these, use the Keep Last Value function:


Saving graphs

You don’t have to login to Graphite by default, but to save and return to graphs and dashboards between sessions you’ll want to. If you used the obi-metrics-agent installation script then Graphite will have a user oracle with password Password01. Click the Login button in the top right of the Graphite screen and enter the credentials.

Once logged in, you should see a Save icon (for you young kids out there, that’s a 3.5″ floppy disk…).

You can return to saved graphs from the Tree pane on the left:

flot

As well as the standard Graphite graphing described above, you also have the option of using flot, which is available from the link in the top-right options, or the icon on an existing graph:


Graphlot/Flot is good for things like examining data values at specific times:


Creating a dashboard

So far we’ve seen individual graphs in isolation, which is fine for ad-hoc experimentation but doesn’t give us an overall view of a system. Click on Dashboard in the top-right of the Graphite page to go to the dashboards area, ignoring the error about the default theme.

You can either build Graphite dashboards from scratch, or you can bring in graphs that you have prepared already in the Graphite Composer and saved.

At the top of the Graphite Dashboard screen is the metrics available to you. Clicking on them drills down the metric tree, as does typing in the box underneath

Selecting a metric adds it in a graph to the dashboard, and selecting a second adds it into a second graph:

You can merge graphs by dragging and dropping one onto the other:

Metrics within a graph can be modified with functions in exactly the same way as in the Graphite Composer discussed above:

To add in a graph that you saved from Graphite Composer, use the Graphs menu

You can resize the graphs shown on the dashboard, again using the Graphs menu:

To save your dashboard, use the Dashboard -> Save option.

Example Graphite dashboards

Here are some examples of obi-metrics-agent/Graphite being used in anger. Click on an image to see the full version.

  • OS stats (via collectl)
    OS stats from collectl
  • Presentation Services sessions, cache and charting
    Presentation Services sessions, cache and charting
  • BI Server (nqserver) Connection and Thread Pools
    BI Server (nqserver) Connection and Thread Pools
  • Response times vs active users (via JMeter)
    Response times vs active users (via JMeter)
Categories: BI & Warehousing

Built-In OBIEE Load Testing with nqcmd

Rittman Mead Consulting - Fri, 2014-03-28 05:21

nqcmd ships with all installations of OBIEE and includes some very useful hidden functionality – the ability to generate load tests against OBIEE. There are lots of ways of generating load against OBIEE, but most require third party tools of varying degrees of complexity to work with.

It’s easy to try this out. First set the OBIEE environment:  [I'm using SampleApp v309R2 as an example; your FMW_HOME path will vary]

. ~/obiee/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

and then the “open sesame” setting which enables the hidden nqcmd functionality:

export SA_NQCMD_ADVANCED=Yes

On Windows, run set SA_NQCMD_ADVANCED=YES instead. If you don’t set this environment variable then nqcmd just throws an error if you try to use one of the hidden options.

Now if you list the available options for nqcmd you’ll see lots of new options in addition to the usual ones:

Command: nqcmd - a command line client which can issue SQL statements
                 against either Oracle BI server or a variety
                 of ODBC compliant backend databases.
SYNOPSIS
         nqcmd [OPTION]...
DESCRIPTION
         -d<data source name>
         -u<user name>
         -p<password>
         -s<sql input file name>
         -o<output result file name>
         -D<Delimiter>
         -b<super batch file name>
         -w<# wait seconds>
         -c<# cancel interval seconds>
         -C<# number of fetched rows by column-wise binding>
         -n<# number of loops>
         -r<# number of requests per shared session>
         -R<# number of fetched rows by row-wise binding>
         -t<# number of threads>
         -T (a flag to turn on time statistics)
         -a (a flag to enable async processing)
         -f (a flag to enable to flush output file for each write)
         -H (a flag to enable to open/close a request handle for each query)
         -z (a flag to enable UTF8 in the output result file
         -utf16 (a flag to enable UTF16 for communicating to Oracle BI ODBC driver)
         -q (a flag to turn off row output)
         -NoFetch (a flag to disable data fetch with query execution)
         -SmartDiff (a flag to enable SmartDiff tags in output)
         -NotForwardCursor (a flag to disable forwardonly cursor)
         -v (a flag to display the version)
         -P<the percent of statements to disable cache hit>
         -impersonate <the impersonate username>
         -runas <the runas username>
         -td <the time duration to run >
         -qsel <the query selection>
         -ds <the dump statistics duration in secs>
         -qstats <print Query statistics at end of run>
         -login <login scenario for PSR. login/execute sqls/logout for sql file>
         -ShowQueryLog <to display query log from server, -H is required for this setting>
         -i <ramup interval for each user for load testing, -i is required for this setting>
         -ONFormat<FormatString, i.e. TM9, 0D99>

You’re on your own figuring the new options out as they’re not documented (and therefore presumably not supported and liable to change or be dropped at any time). What I’ve done below is my best guess at how to use them – don’t take this as gospel. The one source that I did find is a post on Oracle’s CEAL blog: OBIEE 11.1.1 – Advanced Usage of nqcmd command, from which I’ve taken some of the detail below.

Let’s have a look at how we can generate a load test. First off, I’ll create a very simple query:

and from the Advanced tab extract the Logical SQL from it:

SELECT
   0 s_0,
   "A - Sample Sales"."Products"."P2  Product Type" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2
FROM "A - Sample Sales"
ORDER BY 1, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

This Logical SQL I’ve saved to a file, report01.lsql.

To run this Logical SQL from nqcmd I use the standard (documented) syntax, passing the Logical SQL filename with the -s flag:

[oracle@obieesample loadtest]$ nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql

-------------------------------------------------------------------------------
          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

Connection open with info:
[0][State: 01000] [DataDirect][ODBC lib] Application's WCHAR type must be UTF16, because odbc driver's unicode type is UTF16
SELECT
   0 s_0,
   "A - Sample Sales"."Products"."P2  Product Type" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2
FROM "A - Sample Sales"
ORDER BY 1, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY
[...]

0            Smart Phones   6773120.36
--------------------
Row count: 11
--------------------

Processed: 1 queries

Adding the -q flag will do the same, but suppress the data output:

[oracle@obieesample loadtest]$ nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q

[...]
----------------------------------------------------------------------
Row count: 11
-------------------------------------------------------------------------------------------------------------   
Processed: 1 queries

The basic parameters for load testing are

  • -t – how many threads [aka Virtual Users]
  • -td – test duration
  • -ds – how frequently to write out load test statistics
  • -T – enable time statistics [without this they will not be reported correctly]

You also need to supply -o with an output filename. Even if you’re not writing the data returned from the query to disk (which you shouldn’t, and -q disables), nqcmd needs this in order to be able to write its load test statistics properly (I got a lot of zeros and nan otherwise). In addition, the -T (Timer) flag should be enabled for accurate timings.

So to run a test for a minute with 5 threads, writing load test stats to disk every 5 seconds, you’d run:

nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -T -td 60 -t 5 -ds 5 -o output

The load test stats are written to a file based on the name given in the -o parameter, with a _Counters.txt suffix:

$ cat output_Counters.txt
                        nQcmd Load Testing
TimeStamp       Sqls/Sec        Avg RT  CumulativePrepareTime   CumulativeExecuteTime   CumulativeFetchTime
00:00:05        56.200000       0.065925        2.536000                13.977000               2.012000
00:00:10        66.800000       0.065009        5.641000                33.479000               4.306000
00:00:15        69.066667       0.066055        8.833000                52.234000               7.366000
00:00:20        73.100000       0.063984        11.978000               71.944000               9.622000
[...]

Using obi-metrics-agent to pull out the OBIEE metrics and Graphite to render them we can easily visualise what happened when we ran the test. The Oracle_BI_General.Total_sessions metric shows:


Ramping Up the Load

nqcmd also has a -i parameter, to specify the ramp up per thread. Most load tests should incorporate a “ramp up”, whereby the load is introduced gradually. This is important so that you don’t overwhelm a server all at once. It might be that the server will not support the total number of users planned, so by using a ramp up period you can examine the server’s behaviour as the load increases gradually, spotting the point at which the wheels begin to come off.

The -i parameter for nqcmd is the delay between each thread launching, and this has an interesting effect on the duration of the test. If you specify a test duration (-td) of 5 seconds, five threads (-t), and a rampup (-i) of 10 seconds the total elapsed will be c.55 seconds (5×10 + 5).

I’ve used the standard time command on Linux to validate this by specifying it before the nqcmd call.

$ time nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -td 5 -t 5 -ds 1 -o $(date +%Y-%m-%d-%H%M%S) -T -i 10 

[...]

real    0m56.896s
user    0m2.350s
sys     0m1.434s

So basically the -td is the “Steady State” once all threads are ramped up, and the literal test duration is equal to (rampup * number of threads) + (desired steady state)

The above ramp-up can be clearly seen:


BTW a handy trick I’ve used here is to use a timestamp for the output name so that the Counters.txt from one test doesn’t overwrite another, by specifying date using an inline bash command:

nqcmd [...]   -o $(date +%Y-%m-%d-%H%M%S)   [...]

Whilst we’re at it for tips & tricks – if you want to stop nqcmd running but Ctrl-C isn’t instant enough for you, the following will stop it in its tracks:

pkill -9 nqcmd

Wait a Moment…

…or two. Wait time, or “think time”, is also important in producing a realistic load test. Unless you want to hammer your server just for the lulz to see how fast you can overload it, you’ll want to make sure the workload you’re simulating represents how it is actually used — and in reality users will be pausing (thinking) between report requests. The -w flag provides this option to nqcmd.
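For example, adding a think time to the earlier one-minute, five-thread test is just a case of including -w (the five-second value here is purely illustrative):

nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -T -td 60 -t 5 -ds 5 -w 5 -o output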

In this test below, whilst the Total Sessions is as before (no ramp up), the Connection Pool shows far fewer busy connections. On previous tests the busy connections were equal to the number of active threads, because the server was continuously running queries.


And the CPU, which in the previous test was exhausted at five users with no wait time, now is a bit more relaxed


For comparison, this was the CPU in the first test we ran (5 threads, no wait time, no ramp up). Note that ‘idle’ drops to zero, i.e. the CPU is flat-out.


Load Test in Action

Let’s combine ramp up and wait times to run a load test and see what we can see in the underlying OBIEE metrics. I’m specifying:

  • Write the output to a file with the current timestamp (via date, in the format YYYY-MM-DD-HHMMSS)
    -o $(date +%Y-%m-%d-%H%M%S)
  • 20 threads
    -t 20
  • 10 second gap between starting each new thread
    -i  10
  • 5 second wait between each thread submitting a new query
    -w 5
  • Run for a total of 230 seconds (20 thread x 10 second ramp up = 200 seconds, plus 30 second steady state)
    -td 230

$ date;time nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s queries.lsql -q -T -o $(date +%Y-%m-%d-%H%M%S) -t 20 -ds 5 -td 230 -w 5 -i 10;date

Here’s what happened.

  • At first, as the users ramp up the Connection Pool gets progressively busier
  • However, when we hit c.14 threads, things start to go awry. The busy count stays at 10, even though the user count is increasing:
    (This was displayed in flot which you can get to on the /graphlot URL of your Graphite server)
  • So the user count is increasing, but we’re not seeing increasing activity on the Connection Pool… so what does that do for the response times?
    OK, so the Average Query Elapsed Time is a metric I’d normally be wary of, but this is a dedicated server running just my load test workload (and a single query within it) so in this case it’s a valid indicator — and it’s showing that the response time is going up. Why’s it going up?
  • Looking more closely at the Connection Pool we can see a problem — we’re hitting the capacity of ten connections, and requests are starting to queue up:
    Note how once the Current Busy Connection Count hits the Capacity of ten, the Current Queued Requests value starts to increase — because the number of users is increasing, trying to run more queries, but having to wait.

So this is a good example of where users would see slow performance, but some of the usual “Silver Bullets” around hardware and the database would completely miss the target, because the bottleneck here is actually in the configuration of the Connection Pool.

If you’re interested in hearing more about this subject, make sure you register for the BI Forum in Brighton, 7-9 May, where I’m delighted to be speaking for the second time, presenting “No Silver Bullets : OBIEE Performance in the Real World”.

Categories: BI & Warehousing

Data Integration Tips: ODI 12c – Substitution API GUIDs

Rittman Mead Consulting - Thu, 2014-03-27 13:08

With the release of Oracle Data Integrator 12c and Oracle GoldenGate 12c still fresh in our minds, I thought I would start an ongoing series focused on Data Integration Tips – specifically around the latest version of Oracle’s data integration tools. The goal is to provide insight into some of the hidden, undocumented, or lesser known features of these products. Most of these tips and tricks will come from our experiences when solving challenges for our clients. I’m sure some of my data integration colleagues at Rittman Mead will jump on board and share their knowledge as well.

The Scenario

While working with a client who is using ODI 12c, we were attempting to capture the session ID (known as the Batch ID) for each execution of a Load Plan. The process uses an ODI Variable, which, when refreshed, would capture the identifier from the following substitution API call. This variable would then be mapped to the audit dimension key column in each dimension mapping for use in ETL process auditing.

<%=odiRef.getLoadPlanInstance( "BATCH_ID" )%>

Much to my surprise, this code produced an error. Even more surprising, the ODI 12c documentation showed that I had the syntax and parameter value correctly defined for this API call. The error received was different than any I had seen before:

java.lang.Exception: The application script threw an exception: com.sunopsis.tools.core.exception.SnpsSimpleMessageException: Exception getLoadPlanInstance("BATCH_ID") : ### KEY com.sunopsis.res.gen / ODI-17769: OdiRef: Internal ID 5 of IObjects SnpLoadPlanInstance should not be used when the repository is 12c mode.### BSF info: null at line: 0 column: columnNo

The message states that this code “should not be used when the repository is in 12c mode“. Hmm…something about this call is clearly not compatible with ODI 12c.

Here’s the Tip…

As you may recall, ODI 11g used numeric IDs, made up of the object ID and the Work Repository ID, to uniquely identify objects stored in the repository. This had the potential for issues when migrating objects between repositories, should the repository ID be the same across environments. This led Oracle to change how objects are stored in Oracle Data Integrator 12c – switching from numeric IDs to GUIDs.

During my research of this error, I was reminded of this change and directed to the ODI 12c “Upgrading your environment” documentation. Throughout this document, the decision on which upgrade path to take depends on the use of “legacy numeric identifiers” in odiRef API calls versus the new GUID approach in ODI 12c. This is a hint that certain substitution API calls using identifiers are not going to work in a standard ODI 12c installation.

I was on the right track, but still hadn’t found anything that looked like a solution. I reached out on Twitter, and once again Mr. David Allan of Oracle’s Data Integration team had the “unwritten answer” to which the documentation was referring.

@mRainey you have to use odiRef.getLoadPlanInstance(“BATCH_GUID”) in #odi12c … Guids(12c) versus Ids(11g)

— David Allan (@i_m_dave) March 25, 2014

Now we’re getting somewhere. I made the change in the ODI variable call to use the GUID, and it worked as expected.

<%=odiRef.getLoadPlanInstance( "BATCH_GUID" )%>


Final thought – if you’re using any “ID” parameter calls in your substitution API code, be sure to modify them to use GUIDs. And, as in this case, make sure you make the change to any logical and physical table attributes to handle the difference in datatypes (numeric vs varchar2).

More Data Integration Tips will be coming soon – I look forward to your feedback on the blog series.

Categories: BI & Warehousing

Installing obi-metrics-agent, Graphite, and collectl

Rittman Mead Consulting - Thu, 2014-03-27 09:31

In the previous post I introduced obi-metrics-agent, an open-source tool from Rittman Mead that can collect DMS metrics from OBIEE and write them to several formats including graphite. Graphite is a tool for storing and graphing time-based metrics.

Now in this post we will look at how to install the tools and get them up and running on your server. The install process has been tested on both OL5 and OL6 (and so by extension, Cent OS and RHEL). There’s no reason why it oughtn’t work on other *nixes, but it’s not been tested.

I’m using Oracle’s SampleApp v309R2 server which is built on Oracle Linux (OL) 5 so that anyone can try this out at home running it on VirtualBox (or VMWare Fusion).

There are three parts to the obi-metrics-agent installation.

  1. obi-metrics-agent – python script to extract DMS metrics from OBIEE
  2. [optional] Graphite – store and graph data, including that extracted from obi-metrics-agent and collectl
  3. [optional] collectl – OS metrics viewer. Can send data for graphing in Graphite

Only one of these is mandatory – the installation of obi-metrics-agent itself. That in itself is just a case of cloning the repository in order to get the obi-metrics-agent python script on the server. However, the remaining articles in this blog series will assume that graphite and collectl have also been installed, so these are covered too.

Assumptions & pre-requisites
  1. Server has internet access (try ping google.com to check)
  2. Installation is running as the oracle user
  3. Installation path is /home/oracle/graphite
  4. User has sudo rights

Before you run this on SampleApp v309R2, you must first run the following to fix a bug in yum’s configuration. If you’re not running SampleApp, you don’t need to do this:

sudo sed -i.bak -e 's/proxy=/#proxy=/g' /etc/yum.conf

You need the EPEL yum repository setup (for packages such as git, etc):

  • OL 5/Cent OS 5/RHEL 5 (including SampleApp v309R2):
    sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
  • OL 6/Cent OS 6/RHEL 6:
    sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/`uname -p`/epel-release-6-8.noarch.rpm
Installation

The bulk of the installation is done by a couple of shell scripts. To get these, you should clone the git repository. The following steps will do this.

Now run the following to install git and clone the repository:

sudo yum install -y git
export FMW_HOME=/home/oracle/obiee # Change this value for your installation
cd $FMW_HOME
git clone https://github.com/RittmanMead/obi-metrics-agent.git

Then launch the installation:

  • OL5:
    cd obi-metrics-agent
    ./install-ol5.sh
  • OL6:
    cd obi-metrics-agent
    ./install-ol6.sh

This will install git (if not present), and create a folder called obi-metrics-agent in the installation folder of OBIEE (FMW Home). You can put obi-metrics-agent wherever you want; this location is just a suggestion.

The installation for Graphite builds a standalone python environment using virtualenv in which graphite will run under Python 2.6. The reason for this is that OL5 ships with Python 2.4 (Noah was seen coding in this shortly before boarding his ark) and the OS’ python version cannot be easily upgraded without causing all sorts of complications. Virtualenv is generally a good way to deploy python applications in isolation from others, so is also used for the OL6 version of the install.
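Purely to illustrate the pattern (this is a simplified sketch, not a substitute for the provided install script, which handles the Python 2.6 build and Graphite dependencies for you), a virtualenv-based install looks roughly like this:

# Illustrative only: create an isolated Python environment and install Graphite into it
virtualenv --python=python2.6 /home/oracle/graphite   # assumes a python2.6 binary is available
source /home/oracle/graphite/bin/activate
pip install whisper carbon graphite-web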

The script will take a few minutes to run. Once complete, you can test the installation following the steps below. Do note that Graphite is somewhat notorious for installation problems, so whilst these instructions have been tested, you may hit quirks on your own server that may need a bit of Google-Fu to resolve. If you want to follow step-by-step instructions detailing each step, they are provided:

If you’re on a different *nix platform then feel free to adapt the install scripts and submit a pull request. Graphite works flawlessly on Debian-based distributions, and there is no reason why you have to run it local to the OBIEE server on which obi-metrics-agent is running.

Testing the installation

Testing obi-metrics-agent

First, start up the OBIEE stack and make sure that you can login to OBIEE. Once it is running, you can run obi-metrics-agent by entering:

export FMW_HOME=/home/oracle/obiee # Change this value for your installation
export OPMN_BIN=$FMW_HOME/instances/instance1/bin/opmnctl
python $FMW_HOME/obi-metrics-agent/obi-metrics-agent.py

Expected output:

[...]
--Gather metrics--
    Time of sample: Wed, 19 Mar 2014 22:27:42 +0000 (1395268062)

    Get metrics for coreapplication_obips1
         Processed :    469 data values @ Wed, 19 Mar 2014 22:27:42 +0000       Oracle BI Presentation Server
            Appended CSV data to ./data/metrics.csv
    Get metrics for coreapplication_obis1
         Processed :    229 data values @ Wed, 19 Mar 2014 22:27:42 +0000       Oracle BI Server
            Appended CSV data to ./data/metrics.csv
    Get metrics for opmn
         Processed :    91 data values @ Wed, 19 Mar 2014 22:27:42 +0000        opmn
            Appended CSV data to ./data/metrics.csv

    Processed: 3    Valid: 3 (100.00%)      Invalid: 0 (0.00%)
    -- Sleeping for 5 seconds (until 1395268062)--
[...]

For more advanced options, including sending data to graphite, see below.

Testing Graphite

Go to http://<server>/ and you should see the Graphite launch page:


By default Graphite will collect data about the performance of its own carbon data-collection agent. This means you can easily test that Graphite’s graphs are working by selecting one of the carbon metrics to graph. Expand the Metrics tree and click cpuUsage which will add it into the graph in the main pane:


Graphite is installed with a default user oracle with password Password01.

Testing collectl

From the command line enter collectl:

[oracle@obieesample obi-metrics-agent]$ collectl
waiting for 1 second sample...
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   1   0   340    430      0      0      0      0      0      0      0       0
   0   0   444    499     48      2      0      0      0      1      0       1
   1   0   316    430      0      0      0      0      0      1      0       1

(press Ctrl-C to exit)

To check that collectl is successfully sending data to Graphite go to Graphite and under the Metrics folder you should see your machine’s hostname (for example obieesample for SampleApp). Under that should be the metrics that collectl is sending to Graphite:


Graphing OBIEE DMS metrics in Graphite

Now that we’ve gone through the installation, let us see how to use the three components together, graphing out some OBIEE DMS data in Graphite.

Run obi-metrics-agent.py as you did in the test above, but this time specify the --output carbon and --carbon-server parameters:

export OPMN_BIN=$FMW_HOME/instances/instance1/bin/opmnctl
python $FMW_HOME/obi-metrics-agent/obi-metrics-agent.py --output carbon --carbon-server localhost

Now go to Graphite at http://<server>/ and under Metrics you should see all of the OBIEE DMS metrics in folders. Note that it may take a couple of minutes for the metrics to first appear in Graphite. If you don’t see them, or the complete list, wait a minute and then refresh the metrics tree.


(the DMS folder contains metrics relating to DMS itself – the OBIEE DMS metrics are under OBI)

Find the metric for Presentation Services logged in sessions, Oracle_BI_PS_Sessions.Sessions_Logged_In and click on it to add it to the graph. By default the graph shows a large time period so click on the Select Recent Data icon


Set the time period to 5 minutes, and click the Auto-Refresh button at the bottom of the graph.

Now log in to OBIEE, and go back to the Graphite page – you should see that the metric has increased accordingly.


I’ll look at using Graphite in detail in a later blog post; this is just to demonstrate that the collection and rendering is working.

Different ways of using obi-metrics-agent

obi-metrics-agent has three modes of operation:

  1. Collect and parse metrics to output
    • Near-real-time rendering of OBI metrics in Graphite
    • Near-real-time load of data into Oracle (via CSV & external table)
  2. Collect metrics to disk only
    • For analysis at a later date
    • If graphite/carbon is not available to send data to
    • If outbound network bandwidth is constrained (or could be, by OBIEE)
    • For lowest host system overhead
  3. Parse existing metrics from disk
    • Parse previously collected data and output to Graphite or Oracle (via CSV & external table)
Syntax

Options:
  -h, --help            show this help message and exit
  -o OUTPUTFORMAT, --output=OUTPUTFORMAT
                        The output format(s) of data, comma separated. More
                        than one can be specified. Unparsed options: raw, xml
                        Parsed options: csv , carbon, sql
  -d DATA, --data-directory=DATA
                        The directory to which data files are written. Not
                        needed if sole output is carbon.
  -p, --parse-only      If specified, then all raw and xml files specified in
                        the data-directory will be processed, and output to
                        the specified format(s) Selecting this option will
                        disable collection of metrics.
  --fmw-instance=FMW_INSTANCE
                        Optional. The name of a particular FMW instance. This
                        will be prefixed to metric names.
  --carbon-server=CARBON_SERVER
                        The host or IP address of the Carbon server. Required
                        if output format 'carbon' specified.
  --carbon-port=CARBON_PORT
                        Alternative carbon port, if not 2003.
  -i INTERVAL, --interval=INTERVAL
                        The interval in seconds between metric samples.
  --opmnbin=OPMN_BIN    The complete path to opmnctl. Watch out for spaces.
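Putting those options together, a parse-only run that replays previously collected raw data into Graphite might look like this (the paths and carbon hostname are illustrative):

python ./obi-metrics-agent.py --parse-only --data-directory ./data --output carbon --carbon-server localhost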

For further details and syntax examples, see the USAGE.MD file in the github repository.

Categories: BI & Warehousing

Introducing obi-metrics-agent – an Open-Source OBIEE Metrics Collector

Rittman Mead Consulting - Wed, 2014-03-26 14:47

Understanding what is going on inside OBIEE is important for being able to diagnose issues that arise, monitor its health, and dig deep into its behaviour under stress in a load test. OBIEE exposes a set of metrics through the Dynamic Monitoring Service (DMS) and viewable through Enterprise Manager (EM) Fusion Middleware Control. EM is a great tool but doesn’t meet all requirements for accessing these metrics, primarily because it doesn’t retain any history.

obi-metrics-agent is a tool that extracts OBIEE’s performance metrics from the Dynamic Monitoring Service (DMS) functionality. Venkat wrote the original version, which I have rewritten in python and added additional functionality. It polls DMS on a specified interval and outputs the data to a variety of formats. It was written to aid OBIEE performance monitoring either as part of testing or longer-term use. Its features include:

  • Multiple output options, including CSV, XML, and Carbon (for rendering in Graphite etc)
  • Parse data as it is collected, or write to disk
  • Parse data collected previously

2014-03-26_07-03-19

How does it work?

obi-metrics-agent is written in Python, and uses the documented OPMN functionality for exposing DMS metrics: opmnctl metric op=query. We experimented with the WLST route but found the overhead was too great. OPMN supplies the DMS data as a large XML message, which obi-metrics-agent can either store raw, or parse out into the constituent metrics. It can write these to CSV or XML files, generate INSERT statements for sending them to a database, or send them to Graphite (see below).
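
If you’re curious what obi-metrics-agent is working with, you can run the same OPMN command yourself and inspect the raw XML – something like this, assuming the default instance1 path used elsewhere in this post (adjust for your own environment):

$ $FMW_HOME/instances/instance1/bin/opmnctl metric op=query > dms_metrics.xml
$ head -20 dms_metrics.xml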

obi-metrics-agent can also parse previously-extracted raw data, so if you want to store data in graphite but don’t have the server to hand at execution time it can be loaded retrospectively.

2014-03-26_06-07-18

On which platforms does it work?
  • Tested thoroughly on Oracle Linux 5 and 6
  • Works on Windows 2003, should work on later versions
Which OBI metrics are collected?

All of the ones that OPMN supports. Currently, BI Server and BI Presentation Services, plus the opmn process metrics (such as CPU time of each OBI component)

To explore the DMS metrics available, you can use Enterprise Manager, or the DMS Spy servlet that is installed by default with OBIEE and available at http://<obi-server>:7001/dms/ (assuming your AdminServer is on port 7001).

perf01

I have used DMS metrics primarily when investigating OBIEE’s behaviour under stress in performance testing, but some of the higher-level metrics are useful for day-to-day monitoring too. The DMS metrics let you peer into OBIEE’s workings and deduce or hypothesise the cause of behaviour you are seeing.

  • How many users does OBIEE see as logged in?
  • How many active requests are there from Presentation Services to BI Server (nqserver)?
  • How many active connections are there from each Connection Pool to the database? How many queued connections?
  • What’s the average (careful…) response time by database?
  • What’s the error rate for Presentation Services queries?
  • How does the memory profile of each OBIEE component behave during testing?
  • How are the BI Server’s internal thread pools coping with the load? Do they need resizing?
  • How many queries per second are being run on each database?
  • How is the graphing engine behaving? Is it queuing requests?
  • What’s the Presentation Services and BI Server cache hit rate?
Sounds great, where do I get it?

Rittman Mead have released obi-metrics-agent as open source. You can find it on GitHub: https://github.com/RittmanMead/obi-metrics-agent.

Simply clone the repository and run the python script. You need to install the lxml library first – full details are supplied in the repository.
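
For example, on an RPM-based system such as the Oracle Linux install used by SampleApp, one way to get lxml is via yum or pip – an illustration only, so follow the instructions in the repository for your platform:

$ sudo yum install python-lxml
# or, alternatively
$ sudo pip install lxml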

$ export OPMN_BIN=$FMW_HOME/instances/instance1/bin/opmnctl
$ python ./obi-metrics-agent.py

# ===================================================================
# Developed by @rmoff / Rittman Mead (http://www.rittmanmead.com)
# Absolutely no warranty, use at your own risk
# Please include this notice in any copy or reuse of the script you make
# ===================================================================

---------------------------------------
Output format             : csv
raw/csv/xml/carbon/sql    : False/True/False/False/False
Data dir                  : ./data
FMW instance              : None
OPMN BIN                  : /u01/app/oracle/product/fmw/instances/instance1/bin/opmnctl
Sample interval (seconds) : 5
---------------------------------------

--Gather metrics--
        Time of sample: Wed, 26 Mar 2014 10:38:38 +0000 (1395830318)

        Get metrics for coreapplication_obips1
                 Processed :    469 data values @ Wed, 26 Mar 2014 10:38:38 +0000       Oracle BI Presentation Server
                        Appended CSV data to ./data/metrics.csv
        Get metrics for coreapplication_obis1
                 Processed :    230 data values @ Wed, 26 Mar 2014 10:38:38 +0000       Oracle BI Server
                        Appended CSV data to ./data/metrics.csv
        Get metrics for opmn
                 Processed :    91 data values @ Wed, 26 Mar 2014 10:38:38 +0000        opmn
                        Appended CSV data to ./data/metrics.csv

        Processed: 3    Valid: 3 (100.00%)      Invalid: 0 (0.00%)
[...]

See the next blog post for a step-by-step on getting it set up and running using SampleApp v309R2 as an example server, including with Graphite and Collectl for visualising the data and collecting OS stats too.

Visualising the collected data – Graphite

Graphite is an open-source graphing tool that comes with a daemon called carbon, which receives incoming data and stores it in its own time-series database (called whisper). Graphite is a very popular tool, meaning there’s lots of support out there for it and additional tools written to complement it, some of which I’ll be exploring in later blog posts. It’s also very easy to get data into Graphite, and because it stores it as a time series you can then display OBIEE DMS data alongside anything else you may have – for example, OS metrics from collectl, or JMeter performance test counters.
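
Carbon listens by default on port 2003 for metrics in a simple plaintext format of “metric.path value timestamp”, so as a quick sanity check you can send a test datapoint to your Graphite server with nothing more than netcat (the hostname here is just a placeholder):

$ echo "test.obi.metric 42 $(date +%s)" | nc my-graphite-server 2003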

obi-metrics-agent architecture

Whilst obi-metrics-agent can be used on its own, with data stored to CSV for subsequent parsing or accessed in Oracle as an external table, the focus of this and subsequent blog posts will primarily be on using obi-metrics-agent to write data to Graphite, and the benefits this brings when it comes to visualising it.

“Graphite?! Haven’t we got a visualisation tool already in OBIEE?”

You can graph this data out through OBIEE, if you want. The OBI metric data can be loaded in by external table from the CSV files, or using the generated INSERT statements.

The benefit of using Graphite is twofold:

  1. Its primary purpose is graphing time-based metrics. Metrics in, time-based graphs out. You don’t need to build an RPD or model a time dimension. It also supports one-click rendering of wildcarded metric groups (for example, current connection count on all connection pools), as well as one-click transformations such as displaying deltas of a cumulative measure.
  2. It gives an alternative to a dependency on OBIEE. If we use OBIEE we’re then rendering the data on the system that we’re also monitoring, thus introducing the significant risk of just measuring the impact of our monitoring! Setting up a second OBIEE server just for rendering graphs opens up the question of what’s the best graphing tool for this particular job – and Graphite is a strong option here.

Graphite just works very well in this particular scenario, which is why I use it….YMMV.

Contributing

There are several obvious features that could be added so do please feel free to fork the repository and submit your own pull requests! Ideas include:

  • support for scaled-out clusters
  • init.d / run as a daemon
  • selective metric collection
Known Issues

Prior to OBIEE 11.1.1.7, there was a bug in the opmn process which sometimes caused corrupt XML – in some cases as much as 15% of samples. On corrupt samples, the datapoint is simply dropped.

The FMW patch from Oracle for this issue is 13055259.

License

===================================================================
Developed by @rmoff / Rittman Mead (http://www.rittmanmead.com)
Absolutely no warranty, use at your own risk
Please include this notice in any copy or reuse of the script you make
===================================================================

What next?

There are several posts on this subject to come, including:

  1. Installing obi-metrics-agent, graphite, and collectl.
  2. Exploring the graphite interface, building simple dashboards
  3. Alternative front-ends to graphite
Categories: BI & Warehousing

The Secret Life of Conditional Formatting in OBIEE

Rittman Mead Consulting - Wed, 2014-03-26 05:27

When dealing with conditional formatting, the GUI allows us to add as many formats as we want, but every single format has exactly one condition – no more, no less. In this post I am going to show how it is possible in OBIEE to use multiple conditions to define a special format on an analysis column. Sometimes the requirement for conditional formatting is a little more complex, mainly when using pivots, and requires two conditions, for example “year = current-year AND country = UK”. In general we try to manage this requirement with a set of conditional formats sorted in a specific order, trying to achieve something as close as possible to the requirement, sometimes also setting one format first and then overriding it in some cells of a pivot to make them look like the cells without a special format. It would be a lot easier to be able to just define a more complex rule for the special format, but the GUI doesn’t allow it.

Does it mean OBIEE can’t manage complex conditional formatting? No, not at all! Let me introduce you to the secret life of conditional formatting…

Before we look at the practical implementation, let us first consider the theory behind conditional formatting, and why we can be optimistic that the method I describe will work.

It must be noted that, because this is not functionality available through the GUI, there is no guarantee it will work correctly or that OBIEE will support it in future releases, but for me it works fine in OBIEE 11.1.1.7 and related patched versions.

Some theory

OBIEE analyses are stored in an XML format that OBIEE interprets when building and running the report. Using the Advanced tab of an analysis we can examine this XML. The XML can tell you a lot, and gives some hints about functionality the GUI doesn’t let you perform but which, based on the XML, you can reasonably expect to work if coded by hand. Let’s focus on the XML related to conditional formatting, as it’s the topic of this post.

Part of the XML standard is an associated XSD file, the XML Schema Definition. OBIEE’s XML honours this, and we can use it to examine the XML that OBIEE will accept and expect for an analysis. JDeveloper is a good tool to analyse an XSD thanks to its built-in “design/source” viewer.
This is the graphical representation of the XSD describing the code behind conditional formatting (click to zoom). If you compare it with the XML you will recognise most of the elements. The important element is inside the formatRule block, inside condition: the XSD expects a sawx:expr element there.

cond-format_displayFormat_XSD

Why is this important? Well, look at a separate part of the XSD, the part which deals with the filters in an analysis.

cond-format_filter_XSD

Surprise: it’s the same sawx:expr. What does this mean for us? Because filters accept the AND and OR logical operators to join conditions together, the XML of the analysis will accept the same syntax in the conditional formatting condition block. There is no guarantee at this point that it will work: the XML will be valid, but if the code parsing and interpreting the XML doesn’t implement this functionality it will not produce anything (or will produce an error in the worst case). Time to move to the practical part and try it.

Step by step example

I start by creating my pivot, where I display Revenue by Country (filtered to a short list of countries) and Line of Business (LOB).

cond-format_simple_pivot

Now my target is to highlight figures for “Switzerland” in the countries and “Games” in the LOB, so I add a conditional format on the Revenue column and set a red background for Switzerland and yellow background for Games.

cond-format_normal_condition

As expected the result is my pivot with a red row (for Switzerland) and a yellow column (for Games).

cond-format_normal_condition_result

It doesn’t really look good, because the intersection of Switzerland and Games will use the format of the last matched condition – in my case the yellow of Games, because it’s the second conditional format defined. I would prefer to have this intersection in a nice orange background, but the GUI doesn’t have an option to combine conditions.

The GUI doesn’t do it, but that doesn’t mean it’s not possible! Let’s have a look at the XML managing the conditional formatting. First we look at an example of the code producing the red background based on the country.

<saw:conditionalDisplayFormat>
  <saw:formatRule>
    <saw:condition>
      <sawx:expr xsi:type="sawx:comparison" op="equal">
        <sawx:expr xsi:type="sawx:columnRefExpr" columnID="c632a4f0428ecfa91"/>
        <sawx:expr xsi:type="xsd:string">Switzerland</sawx:expr>
      </sawx:expr>
    </saw:condition>
    <saw:formatSpec backgroundColor="#FF0000" wrapText="true"/>
  </saw:formatRule>
</saw:conditionalDisplayFormat>

Now, as we saw before, the XSD allows me to use the syntax of filters, including filters containing logical operators (AND, OR) to join conditions together. So if I add the XML with the AND operator, I will be able to use the condition country=Switzerland AND LOB=Games to apply a different format.

In a text editor I prepare the new XML for the analysis with an additional <saw:conditionalDisplayFormat> element with my new hand-made condition.

It looks good and I can now try to paste it back in OBIEE and see if the syntax is correct and my XML will be accepted: in the Advanced tab of the analysis I replace the existing XML with my own version of the code.

<saw:conditionalDisplayFormat>
  <saw:formatRule>
    <saw:condition>
      <sawx:expr xsi:type="sawx:logical" op="and">
        <sawx:expr xsi:type="sawx:comparison" op="equal">
          <sawx:expr xsi:type="sawx:columnRefExpr" columnID="c3d00191589cdd4e2"/>
          <sawx:expr xsi:type="xsd:string">Games</sawx:expr>
        </sawx:expr>
        <sawx:expr xsi:type="sawx:comparison" op="equal">
          <sawx:expr xsi:type="sawx:columnRefExpr" columnID="c632a4f0428ecfa91"/>
          <sawx:expr xsi:type="xsd:string">Switzerland</sawx:expr>
        </sawx:expr>
      </sawx:expr>
    </saw:condition>
    <saw:formatSpec backgroundColor="#FF9900" wrapText="true"/>
  </saw:formatRule>
</saw:conditionalDisplayFormat>

When clicking “Apply XML”, OBIEE will parse and evaluate the XML. If there is an error, a message is displayed and the code is not applied (so as not to break the analysis). If you get an error, double-check your XML, make sure every tag you open is closed, and read the error message as it tells you what the problem is.

cond-format_edit_xml

Time to click the Results tab and check the new result of my analysis. YES! There is a nice orange background in the intersection now, proving that even if the GUI can’t do it OBIEE is able to accept complex conditions to define conditional formatting.

cond-format_normal_condition_final_result

What do we see if we check the properties of the column in the GUI? As you can see in the next screenshot, there is a condition setting an orange background, but the condition itself is just shown as sawx:expr:sawx:logicaland. Where does this text come from? It’s a concatenation of the name of the first XML tag containing the logical operator and the values of the attributes of the tag itself. So if you use an OR instead of AND you will see a condition named sawx:expr:sawx:logicalor.

cond-format_final_condition

Can we edit the condition in the GUI? Not really: the edit window opens, but the formula will not be recognised, and if you set an operator and a value you will lose your hand-made conditional formatting and generate a meaningless piece of XML – so don’t do it, for any reason. On the other hand, the format itself can be edited using the GUI with no impact on the rule, so feel free to do that and correct the format if you don’t feel comfortable coding it by hand in the XML.

cond-format_final_condition_edit

To summarise, and some advice
  • It’s possible to have a complex formula as the condition in a conditional formatting rule.
  • It’s not supported by the GUI, and needs to be done by hand in the XML.
  • Do it as one of the last steps when building an analysis, because you can’t edit it via the GUI.
  • Be familiar with the produced XML before you play with it.
  • Use the Filters section to create the condition and avoid mistakes: a copy/paste of the filter XML it generates will help in creating the condition.
  • Document, document, document! As the condition is visible only in the XML, document it somewhere, even just in a Static Text view you add to the analysis itself (and don’t display on the screen).
  • Be aware of the maintenance overhead that direct XML manipulation will have – don’t abuse the “cheating”.
  • If you don’t want to hack the XML, an alternative solution can be to use an extra column added to the analysis to evaluate the complex condition and produce a flag used for the conditional formatting. The downside of this is that the condition will be sent to the database as part of the query for evaluation.
Categories: BI & Warehousing

Rittman Mead BI Forum 2014 Registration Now Open – Don’t Miss Out!

Rittman Mead Consulting - Tue, 2014-03-25 03:00

Just a quick reminder to say that registration for the Rittman Mead BI Forum 2014 is now open, with the speaker and presentation list now up on the event website. As with previous years, the BI Forum runs in Brighton on the first week, and then moves over to Atlanta on the second, with the dates and venues as follows:

We’ve got a fantastic line-up of sessions and speakers, including:

  • Oracle ACE and past BI Forum best speaker winner Kevin McGinley, on adding third-party visualisations to OBIEE
  • Sessions from TimesTen PMs Chris Jenkins and Susan Cheung on what’s coming with TimesTen
  • Wayne Van Sluys from InterRel, on Essbase optimisation
  • Oracle’s Andrew Bond, and our own Stewart Bryson (Oracle ACE) with an update to Oracle’s reference BI, DW and Big Data Architecture
  • Dan Vlamis on using Oracle Database analytics with the Oracle BI Applications
  • Sessions from Oracle’s Jack Berkowitz, Adam Bloom and Matt Bedin on what’s coming with OBIEE and Oracle BI Applications
  • Peak Indicators’ Alastair Burgess on tuning TimesTen with Aggregate Persistence
  • Endeca sessions from Chris Lynskey (PM) and Omri Traub (Development Manager), along with ones from Branchbird’s Patrick Rafferty and Truls Bergersen
  • And sessions from Rittman Mead’s Robin Moffatt (OBIEE performance), Gianni Ceresa (Essbase) and Michael Rainey (ODI, with Nick Hurt from IFPI)

NewImage

We’ve also got some excellent keynote sessions including one in the US from Maria Colgan on the new in-memory database option, and another in Brighton from Matt Bedin and Adam Bloom on BI in the Cloud – along with the opening-night Oracle product development keynote in both Brighton and Atlanta.

We’re also very excited to welcome Lars George from Cloudera to deliver this year’s optional one-day masterclass, this time on Hadoop, big data, and how Oracle BI&DW developers can get started with this technology. Lars is Cloudera’s Chief Architect in EMEA and an HBase committer, and he’ll be covering topics such as:

  • What is Hadoop, what’s in the Hadoop ecosystem and how do you design a Hadoop cluster
  • Using tools such as Flume and Sqoop to import data into Hadoop, and then analyse it using Hive, Pig, Impala and Cloudera Search
  • Introduction to NoSQL and HBase
  • Connecting Hadoop to tools such as OBIEE and ODI using JDBC, ODBC, Impala and Hive

If you’ve been meaning to take a look at Hadoop, or if you’ve made a start but would like a chance to discuss techniques with someone who’s out in the field every week designing and building Hadoop systems, this session is aimed at you – it’s on the Wednesday before each event and you can book at the same time as registering for the main BI Forum days.

NewImage

Attendance is limited to around seventy at each event, and we’re running the Brighton BI Forum back at the Hotel Seattle, whilst the US one is running at the Renaissance Midtown Hotel, Atlanta. We encourage attendees to stay at the hotel as well so as to maximise networking opportunities, and this year you can book US accommodation directly with the hotel so you can collect any Marriott points, corporate discounts etc. As usual, we’ll take good care of you over the two or three days, with meals each night, drinks receptions and lots of opportunities to meet colleagues and friends in the industry.

Full details are on the BI Forum 2014 web page, including links to the registration sites. Book now so you don’t miss out – each year we sell out in advance, so don’t leave it to the last minute if you’re thinking of coming. Hopefully see you all in Brighton and Atlanta in May 2014!

Categories: BI & Warehousing

What's New in Dodeca 6.7.1?

Tim Tow - Mon, 2014-03-24 10:28
Last week, we released Dodeca version 6.7.1.4340 which focuses on some new relational functionality. The major new features in 6.7.1 are:
  • Concurrent SQL Query Execution
  • Detailed SQL Timed Logging
  • Query and Display PDF format
  • Ability to Launch Local External Processes from Within Dodeca
Concurrent SQL Query Execution

Dodeca has a built-in SQLPassthroughDataSet object that supports queries to a relational database.  The SQLPassthroughDataSet functionality was engineered such that a SQLPassthroughDataSet object can include multiple queries that get executed and returned on a single trip to the server and several of our customers have taken great advantage of that functionality.  We have at least one customer, in fact, that has some SQLPassthroughDataSets that execute up to 20 queries in a single trip to the server.  The functionality was originally designed to run the queries sequentially, but in some cases it would be better to run the queries concurrently.  Because this is Dodeca, of course concurrent query execution is configurable at the SQLPassthroughDataSet level.

Detailed SQL Timed Logging

In Dodeca version 6.0, we added detailed timed logging for Essbase transactions.  In this version, we have added similar functionality for SQL transactions and have formatted the logs in pipe-delimited format so they can easily be loaded into Excel or into a database for further analysis.  The columns of the log include the log message level, timestamp, sequential transaction number, number of active threads, transaction GUID, username, action, description, and time to execute in milliseconds.

Below is an example of the log message.


Query and Display PDF format

The PDF View type now supports the ability to load the PDF directly from a relational table via a tokenized SQL statement.  This functionality will be very useful for those customers who have contextual information, such as invoice images, stored relationally and need a way to display that information.  We frequently see this requirement as the end result of a drill-through operation from either Essbase or relational reports.

Ability to Launch Local External Processes from Within Dodeca

Certain Dodeca customers store data files for non-financial systems in relational data stores and use Dodeca as a central access point.  The new ability to launch a local process from within Dodeca is implemented as a Dodeca Workbook Script method which provides a great deal of flexibility in how the process is launched.

The new 6.7.1 functionality follows closely on the 6.7.0 release that introduced proxy server support and new MSAD and LDAP authentication services.  If you are interested in seeing all of the changes in Dodeca, highly detailed Dodeca release notes are available on our website at http://www.appliedolap.com/resources/downloads/dodeca-technical-docs.

Categories: BI & Warehousing

Using Oracle R Enterprise to Analyze Large In-Database Datasets

Rittman Mead Consulting - Sun, 2014-03-23 12:18

The other week I posted an article on the blog about Oracle R Advanced Analytics for Hadoop, part of Oracle’s Big Data Connectors and used for running certain types of R analysis over a Hadoop cluster. ORAAH lets you move data in and out of HDFS and Hive and into in-memory R data frames, and gives you the ability to create Hadoop MapReduce jobs but using R commands and syntax. If you’re looking to use R to analyse, prepare and explore your data, and you’ve got access to a large Hadoop cluster, ORAAH is a useful way to go beyond the normal memory constraints of R running on your laptop.

But what if the data you want to analyse is currently in an Oracle database? You can export the relevant tables to flat files and then import them into HDFS, or you can use a tool such as Sqoop to copy the data directly into HDFS and Hive tables. Another option you could consider though is to run your R analysis directly on the database tables, avoiding the need to move data around and taking advantage of the scalability of your Oracle database – which is where Oracle R Enterprise comes in.

Oracle R Enterprise is part of the Oracle Database Enterprise Edition “Advanced Analytics Option”, so it’s licensed separately to ORAAH and the Big Data Connectors. What it gives you is three things:

image2

  • Some client packages to install locally on your desktop, installed into regular R (or ideally, Oracle’s R distribution)
  • Some database server-side R packages to provide a “transparency layer”, converting R commands into SQL ones, along with extra SQL stats functions to support R
  • The ability to spawn off R engines within the Oracle Database using the extproc mechanism, for performing R analysis directly on the data rather than through the client on your laptop

Where this gets interesting for us is that the ORE transparency layer makes it simple to move data in and out of the Oracle Database, but more importantly it allows us to use database tables and views as R “ore.frames” – proxies for “data frames”, the equivalent of database tables in R and the basic data set that R commands work on. Going down this route avoids the need to export the data we’re interested in out of the Oracle Database, with the ORE transparency layer converting most R function calls into Oracle Database SQL ones – meaning that we can use the data analyst-friendly R language whilst using Oracle under the covers for the heavy lifting.

NewImage

There’s more to ORE than just the transparency layer, but let’s take a look at how you might use ORE and this feature, using the same “flight delays” dataset I used in my post a couple of months ago on Hadoop, Hive and Impala. We’ll use the OBIEE 11.1.1.7.1 SampleApp v309R2 that you can download from OTN as it’s got Oracle R Enterprise already installed, although you’ll need to follow step 10 in the accompanying deployment guide to install the R packages that Oracle couldn’t distribute along with SampleApp.

In the following examples, we’ll:

  • Connect to the main PERFORMANCE fact table in the BI_AIRLINES schema, read in its metadata (columns), and then set it up as a “virtual” R data frame that actually points through to the database table
  • Then we’ll perform some basic analysis, binning and totalling for that table, to give us a sense of what’s in it
  • And then we’ll run some more R analysis on the table, outputting the results in the form of graphs and answering questions such as “which days of the week are best to fly out on?” and “how has airlines’ relative on-time performance changed over time?”

Let’s start off then by starting the R console and connecting to the database schema containing the flight delays data.

[oracle@obieesample ~]$ R
 
Oracle Distribution of R version 2.15.1  (--) -- "Roasted Marshmallows"
Copyright (C)  The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.
 
[Previously saved workspace restored]
 
> library(ORE)
Loading required package: OREbase
 
Attaching package: ‘OREbase’
 
The following object(s) are masked from ‘package:base’:
 
    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table
 
Loading required package: OREstats
Loading required package: MASS
Loading required package: OREgraphics
Loading required package: OREeda
Loading required package: OREdm
Loading required package: lattice
Loading required package: OREpredict
Loading required package: ORExml
> ore.connect("bi_airlines","orcl","localhost","BI_AIRLINES",all=TRUE)
Loading required package: ROracle
Loading required package: DBI
> 

Note that “library(ORE)” loads up the Oracle R Enterprise R libraries, and “ore.connect” connects the R session to the relevant Oracle database.

I then synchronise R’s view of the objects in this database schema with its own metadata views, list out what tables are available to us in that schema, and attach that schema to my R session so I can manipulate them from there.

> ore.sync()
> ore.ls()
 [1] "AIRCRAFT_GROUP"           "AIRCRAFT_TYPE"           
 [3] "AIRLINE_ID"               "AIRLINES_USER_DATA"      
 [5] "CANCELLATION"             "CARRIER_GROUP_NEW"       
 [7] "CARRIER_REGION"           "DEPARBLK"                
 [9] "DISTANCE_GROUP_250"       "DOMESTIC_SEGMENT"        
[11] "OBIEE_COUNTY_HIER"        "OBIEE_GEO_AIRPORT_BRIDGE"
[13] "OBIEE_GEO_ORIG"           "OBIEE_ROUTE"             
[15] "OBIEE_TIME_DAY_D"         "OBIEE_TIME_MTH_D"        
[17] "ONTIME_DELAY_GROUPS"      "PERFORMANCE"             
[19] "PERFORMANCE_ENDECA_MV"    "ROUTES_FOR_LINKS"        
[21] "SCHEDULES"                "SERVICE_CLASS"           
[23] "UNIQUE_CARRIERS"         
> ore.attach("bi_airlines")
> 

Now although we know these objects as database tables, what ORE does is present them to R as “data frames”, using ore.frame as a proxy – the fundamental data structure in R, which looks just like a table in the relational database world. Behind the scenes, ORE maps these data frames to the underlying Oracle structures via the ore.frame proxy, and turns R commands into SQL function calls, including a bunch of new ones added specifically for ORE. Note that this is conceptually different to Oracle R Advanced Analytics for Hadoop, which doesn’t map (or overload) standard R functions to their Hadoop (MapReduce or Hive) equivalents – it instead gives you a set of new R functions that you can use to create MapReduce jobs, which you then submit to a Hadoop cluster for processing, giving you a more R-native way of creating MapReduce jobs. ORE, in contrast, tries to map all of R’s functionality to Oracle database functions, allowing you to run normal R sessions but with the Oracle Database letting you process bigger R queries closer to the data.

Let’s use another two R commands to see how it views the PERFORMANCE table in the flight delays data set, and get some basic sizing metrics.

> class(PERFORMANCE)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> dim(PERFORMANCE)
[1] 6362422     112

Now at this point I could pull the data from one of those tables directly into an in-memory R data frame, like this:

> carriers <- ore.pull(UNIQUE_CARRIERS)
Warning message:
ORE object has no unique key - using random order 
> class(UNIQUE_CARRIERS)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> class(carriers)
[1] "data.frame"
> 

As you see, R sees the UNIQUE_CARRIERS object as an ore.frame, whilst carriers (into which data from UNIQUE_CARRIERS was loaded) is a regular data.frame object. In some cases you might want to load data from Oracle tables into a regular data.frame, but what’s interesting here is that we can work directly with ore.frame objects and let the Oracle database do the hard work. So let’s get to work on the PERFORMANCE ore.frame object and do some initial analysis and investigation.

> df <- PERFORMANCE[,c("YEAR","DEST","ARRDELAY")]
> class(df)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> head(df)
  YEAR DEST ARRDELAY
1 2010  BOI      -13
2 2010  BUF       44
3 2010  BUF      -14
4 2010  BUR       -6
5 2010  BUR       -2
6 2010  BUR       -9
Warning messages:
1: ORE object has no unique key - using random order 
2: ORE object has no unique key - using random order 
> options(ore.warn.order = FALSE)
> head(PERFORMANCE[,c(1,4,23)])
  YEAR DAYOFMONTH DESTWAC
1 2010         16      83
2 2010         16      22
3 2010         16      22
4 2010         16      91
5 2010         16      91
6 2010         16      91
>

In the above script, the first command creates a temporary ore.frame object made up of just three of the columns from the PERFORMANCE table / ore.frame. Then I switch off the warning about these tables not having unique keys (“options(ore.warn.order = FALSE)”), and then I select three more columns directly from the PERFORMANCE table / ore.frame.

> aggdata <- aggregate(PERFORMANCE$DEST,
+                      by = list(PERFORMANCE$DEST),
+                      FUN = length)
> class(aggdata)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
 
> head(aggdata)
    Group.1     x
ABE     ABE  4104
ABI     ABI  2497
ABQ     ABQ 33298
ABR     ABR     5
ABY     ABY  1028
ACK     ACK   346
 
> (t <- table(PERFORMANCE$DAYOFWEEK))
 
     1      2      3      4      5      6      7 
943305 924442 932113 942066 956123 777203 887170
 
> dat = PERFORMANCE[PERFORMANCE$ARRDELAY<100 & PERFORMANCE$ARRDELAY>-100,]
> ad = with(dat, split(ARRDELAY,UNIQUECARRIER))
> boxplot(ad,col = "blue", notch = TRUE, cex = 0.5, varwidth = TRUE)

In the above set of scripts, I first aggregate flights by destination airports, then count flights by day of week. In the final set of commands I get a bit more advanced and create a box plot graph showing the range of flight delays by airline, which produces the following graph from the R console:

NewImage

whereas in the next one I create a histogram of flight delays (minutes), showing the vast majority of delays are just a few minutes.

> ad = PERFORMANCE$ARRDELAY
> ad = subset(ad, ad>-200&ad<200)
> hist(ad, breaks = 100, main = "Histogram of Arrival Delay")

NewImage

All of this so far, to be fair, you could do just as easily in SQL or in a tool like Excel, but they’re the sort of commands an R analyst would want to run before getting onto the interesting stuff, and it’s great that they could now do this on the full dataset in an Oracle database, not just on what they can pull into memory on their laptop. Let’s do something more interesting now, and answer the question “which day of the week is best for flying out, in terms of not hitting delays?”

> ontime <- PERFORMANCE
> delay <- ontime$ARRDELAY
> dayofweek <- ontime$DAYOFWEEK
> bd <- split(delay, dayofweek)
> boxplot(bd, notch = TRUE, col = "red", cex = 0.5,
+         outline = FALSE, axes = FALSE,
+         main = "Airline Flight Delay by Day of Week",
+         ylab = "Delay (minutes)", xlab = "Day of Week")

NewImage

Looks like Tuesday’s the best. So how has a selection of airlines performed over the past few years?

> ontimeSubset <- subset(PERFORMANCE, UNIQUECARRIER %in% c("AA", "AS", "CO", "DL","WN","NW")) 
> res22 <- with(ontimeSubset, tapply(ARRDELAY, list(UNIQUECARRIER, YEAR), mean, na.rm = TRUE))
> g_range <- range(0, res22, na.rm = TRUE)
> rindex <- seq_len(nrow(res22))
> cindex <- seq_len(ncol(res22))
> par(mfrow = c(2,3))
> for(i in rindex) {
+   temp <- data.frame(index = cindex, avg_delay = res22[i,])
+   plot(avg_delay ~ index, data = temp, col = "black",
+        axes = FALSE, ylim = g_range, xlab = "", ylab = "",
+        main = attr(res22, "dimnames")[[1]][i])
+        axis(1, at = cindex, labels = attr(res22, "dimnames")[[2]]) 
+        axis(2, at = 0:ceiling(g_range[2]))
+        abline(lm(avg_delay ~ index, data = temp), col = "green") 
+        lines(lowess(temp$index, temp$avg_delay), col="red")
+ } 
>

NewImage

See this presentation from the BIWA SIG for more examples of ORE queries against the flight delays dataset, which you can adapt from the ONTIME_S dataset that ships with ORE as part of the install.

Now where R and ORE get really interesting, in the context of BI and OBIEE, is when you embed R scripts directly in the Oracle Database and use them to provide forecasting, modelling and other “advanced analytics” features, using the database’s extproc mechanism and an R engine that gets spun up on demand. Once you’ve done this, you can expose the calculations through an OBIEE RPD, as Oracle have done in the OBIEE 11.1.1.7.1 SampleApp, shown below:

NewImage

But that’s really an article in itself – so I’ll cover this process and how you surface it all through OBIEE in a follow-up post soon.

Categories: BI & Warehousing

Using Sqoop for Loading Oracle Data into Hadoop on the BigDataLite VM

Rittman Mead Consulting - Fri, 2014-03-21 17:21

This is old-hat for most Hadoop veterans, but I’ve been meaning to note it on the blog for a while, for anyone whose first encounter with Hadoop is Oracle’s BigDataLite VM.

Most people looking to bring external data into Hadoop do so through flat-file exports that they then import into HDFS, using the “hadoop fs” command-line tool or Hue, the web-based developer tool in BigDataLite, Cloudera CDH, Hortonworks and so on. They then often create Hive tables over them, either creating them from the Hive / Beeswax shell or through Hue, which can create a table for you out of a file you upload from your browser. ODI, through the ODI Application Adapter for Hadoop, also gives you a knowledge module (IKM File to Hive), used in the ODI demos on BigDataLite to load data into Hive tables from an Apache Avro-format log file.

NewImage

What a lot of people who are new to Hadoop don’t know is that you can skip the “dump to file” step completely, and load data straight into HDFS direct from the Oracle database, without an intermediate file export step. The tool you use for this comes as part of the Cloudera CDH4 Hadoop distribution that’s on BigDataLite, and it’s called “Sqoop”.

“Sqoop”, short for “SQL to Hadoop”, gives you the ability to do the following Oracle data transfer tasks amongst other ones:

  • Import whole tables, or whole schemas, from Oracle and other relational databases into Hadoop’s file system, HDFS
  • Export data from HDFS back out to these databases – with the export and import being performed through MapReduce jobs
  • Import using an arbitrary SQL SELECT statement, rather than grabbing whole tables (see the example below)
  • Perform incremental loads, specifying a key column used to determine which rows are new since the last import
  • Load directly into Hive tables, creating HDFS files in the background and the Hive metadata automatically

Documentation for Sqoop as shipped with CDH4 can be found on the Cloudera website here, and there are even optimisations and plugins for databases such as Oracle to enable faster, direct loads – for example OraOOP.
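
As an example of the free-form query import mentioned in the list above, you can pass Sqoop a SELECT statement instead of a table name. Note that with --query you have to include a WHERE $CONDITIONS token (which Sqoop uses to split the work across mappers) and supply a --target-dir, along with either a --split-by column or a single mapper (-m 1) – the target directory below is just an example:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --query 'SELECT * FROM ACTIVITY WHERE $CONDITIONS' --target-dir /user/oracle/ACTIVITY_QUERY -m 1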

Normally, you’d need to download and install JDBC drivers for sqoop before you can use it, but BigDataLite comes with the required Oracle JDBC drivers, so let’s just have a play around and see some examples of Sqoop in action. I’ll start by importing the ACTIVITY table from the MOVIEDEMO schema that comes with the Oracle 12c database also on BigDataLite (make sure the database is running, first though):

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY

You should then see sqoop process your command in its console output, and then run the MapReduce jobs to bring in the data via the Oracle JDBC driver:

14/03/21 18:21:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.5.0
14/03/21 18:21:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/03/21 18:21:37 INFO manager.SqlManager: Using default fetchSize of 1000
14/03/21 18:21:37 INFO tool.CodeGenTool: Beginning code generation
14/03/21 18:21:38 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM ACTIVITY t WHERE 1=0
14/03/21 18:21:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
14/03/21 18:21:38 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/03/21 18:21:41 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.jar
14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:42 INFO mapreduce.ImportJobBase: Beginning import of ACTIVITY
14/03/21 18:21:42 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/03/21 18:21:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ACTIVITY_ID), MAX(ACTIVITY_ID) FROM ACTIVITY
14/03/21 18:21:45 INFO mapred.JobClient: Running job: job_201403111406_0015
14/03/21 18:21:47 INFO mapred.JobClient: map 0% reduce 0%
14/03/21 18:22:19 INFO mapred.JobClient: map 25% reduce 0%
14/03/21 18:22:27 INFO mapred.JobClient: map 50% reduce 0%
14/03/21 18:22:29 INFO mapred.JobClient: map 75% reduce 0%
14/03/21 18:22:37 INFO mapred.JobClient: map 100% reduce 0%
...
14/03/21 18:22:39 INFO mapred.JobClient: Map input records=11
14/03/21 18:22:39 INFO mapred.JobClient: Map output records=11
14/03/21 18:22:39 INFO mapred.JobClient: Input split bytes=464
14/03/21 18:22:39 INFO mapred.JobClient: Spilled Records=0
14/03/21 18:22:39 INFO mapred.JobClient: CPU time spent (ms)=3430
14/03/21 18:22:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=506802176
14/03/21 18:22:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2714157056
14/03/21 18:22:39 INFO mapred.JobClient: Total committed heap usage (bytes)=506724352
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Transferred 103 bytes in 56.4649 seconds (1.8241 bytes/sec)
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Retrieved 11 records.

By default, sqoop will put the resulting file in your user’s home directory in HDFS. Let’s take a look and see what’s there:

[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/ACTIVITY
Found 6 items
-rw-r--r-- 1 oracle supergroup 0 2014-03-21 18:22 /user/oracle/ACTIVITY/_SUCCESS
drwxr-xr-x - oracle supergroup 0 2014-03-21 18:21 /user/oracle/ACTIVITY/_logs
-rw-r--r-- 1 oracle supergroup 27 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00000
-rw-r--r-- 1 oracle supergroup 17 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00001
-rw-r--r-- 1 oracle supergroup 24 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00002
-rw-r--r-- 1 oracle supergroup 35 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00003
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00000
1,Rate
2,Completed
3,Pause
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00001
4,Start
5,Browse

What you can see there is that sqoop has imported the data as a series of “part-m” files – CSV files, one per map task. There are various options in the docs for specifying compression and other performance features for sqoop imports, but the basic format is a series of CSV files, one per mapper.
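
For example – illustrative only, and you’d want a different --target-dir (or to remove the previous output directory) if you’ve already imported the table – you can compress the output files and control the number of mappers like this:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY --target-dir /user/oracle/ACTIVITY_COMPRESSED --compress -m 2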

You can also import Oracle and other RDBMS data directly into Hive, with sqoop creating equivalent datatypes for the data coming in (basic datatypes only, none of the advanced spatial and other Oracle ones). For example, I could import the CREW table in the MOVIEDEMO schema like this, directly into an equivalent Hive table:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import

Taking a look at Hive, I can then see the table this has created, describe it and count the number of rows it contains:

[oracle@bigdatalite ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/oracle/hive_job_log_effb9cb5-6617-49f4-97b5-b09cd56c5661_1747866494.txt

hive> desc CREW;
OK
crew_id double 
name string 
Time taken: 2.211 seconds

hive> select count(*) from CREW;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403111406_0017, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201403111406_0017
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201403111406_0017
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-03-21 18:33:40,756 Stage-1 map = 0%, reduce = 0%
2014-03-21 18:33:46,797 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:47,812 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:48,821 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:49,907 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:50,916 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:51,929 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:52,942 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:53,951 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:54,961 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_201403111406_0017
MapReduce Jobs Launched: 
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 1.75 sec HDFS Read: 135111 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
6860
Time taken: 20.095 seconds

I can even do an incremental import to bring in new rows, appending their contents to the existing ones in Hive/HDFS:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import --incremental append --check-column CREW_ID

The data I’ve now loaded can be processed by a tool such as ODI, or you can use a tool such as Pig to do some further analysis or number crunching, like this:

grunt> RAW_DATA = LOAD 'ACTIVITY' USING PigStorage(',') AS  
grunt> (act_type: int, act_desc: chararray);  
grunt> B = FILTER RAW_DATA by act_type < 5; 
grunt> STORE B into 'FILTERED_ACTIVITIES' USING PigStorage(','); 

the output of which is another file on HDFS. Finally, I can export this data back to my Oracle database using the sqoop export feature:

[oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY_FILTERED --export-dir FILTERED_ACTIVITIES

So there’s a lot more to sqoop than just this, including features and topics such as compression, transforming data and so on, but it’s a useful tool and also something you could call from the command line, using an ODI Tool if you want to. Recent versions of Hue also come with a GUI for Sqoop, giving you the ability to create jobs graphically and also schedule them using Oozie, the Hadoop scheduler in CDH4.

Categories: BI & Warehousing

OBIEE Dashboard prompt: at least one mandatory

Rittman Mead Consulting - Mon, 2014-03-17 07:07

It’s almost two years now since I joined the Rittman Mead team, and I’ve never written on this blog before – time to start. For this first post, nothing long: I wanted to share something I had to set up a few weeks ago on a project, and it may be interesting for others as well.

My need was related to dashboard prompts: it’s easy to have all of them optional (the default behavior) or to make a prompt (or many of them) mandatory. But what if you need to have at least one mandatory field among many in a dashboard prompt? For example, say you have a prompt to select a customer and it contains two fields: “customer name” and “customer number”. A user can select a customer based on its name or number, and I want the user to always select a customer, so at least one of these two fields needs to be mandatory.

There is no magic checkbox in OBIEE to achieve this behavior, and setting both fields as mandatory means the user needs to enter the customer details twice every time – not acceptable from a usability point of view.

There is a way to simulate this behavior that has no impact (or a negligible one) on performance and doesn’t generate unneeded queries on the database.

First I create my dashboard prompt, and for my two fields I set presentation variables: pv_CustomerName and pv_CustomerNumber in my example.

dbprompt_prompt_customer_name

Then I create an analysis where I will perform the check to detect if at least one of my prompt fields is set. Add a single column – it doesn’t matter which one from your subject area, because I will edit the formula to set a fixed value, a value I’m absolutely sure doesn’t exist in my prompted fields, for example something like ‘x0xx_I_can_not_exist_xx0x’. Thanks to this single column with a fixed value, this analysis will never send a query to the database; the BI Server will manage it.

dbprompt_condition_analysis_column_formula

Time to add some filters to this analysis: my filters will be set on this single column and will use the presentation variables I defined in the dashboard prompt. I set the condition to be ‘is equal to / is in’ and, as the value, I choose “Presentation variable”, enter the name of the first of my variables and – really important – set a default value: x0xx_I_can_not_exist_xx0x. Does it remind you of something? Yes, it’s the same strange value I manually set as the value of my single column (really important: it must be exactly the same value).

dbprompt_condition_analysis_add_filter

What will this filter do? If no variable is set, the filter behaves like a “1=1”; if a variable is set from the prompt, it behaves like a “1=2”. Now add all the other presentation variables in the same way, using an “AND” condition.

dbprompt_condition_analysis_filters

This analysis will return exactly one row, with the value “x0xx_I_can_not_exist_xx0x”, or no rows at all; I will use it as a condition to know whether at least one dashboard prompt field is set.

dbprompt_condition_analysis_result

Save this analysis, ideally with the word “condition” in the name, as we will use it as a condition to know whether the prompt is filled (mine is called “condition_no_prompt_set”).

Time now to finish the work, as we have all the elements. Edit the dashboard where you want to add the “at least one mandatory field” functionality, and add the dashboard prompt plus two additional sections: in one of these sections add the analysis you want to run when at least one field is set, and in the other add a text object with a polite message asking the user to use at least one of the dashboard prompt fields.

dbprompt_dashboard_2_sections

Last step is to add conditions to decide when to display the first or the second section.

dbprompt_dashboard_add_condition

For the section with the message, set a condition based on the “condition_no_prompt_set” analysis returning 1 row; for the other section (the one with the analysis to be executed if the prompt is correctly set), set a condition on the same “condition_no_prompt_set” analysis returning 0 rows.

dbprompt_dashboard_condition_no_prompt

dbprompt_dashboard_condition_prompt_ok

Done! I have my “at least one mandatory dashboard prompt” as you can see in the next screenshots:

dbprompt_final_output_no_prompt

dbprompt_final_output_customer_name

dbprompt_final_output_customer_number

And the nice thing is that the condition managing the display doesn’t generate any query on the database (as you can see in the logs below); only the BI Server gets the query, and it’s a really simple one, similar to a “SELECT ‘xxxx’ FROM dual WHERE 1=1”.

dbprompt_condition_query_log1
dbprompt_condition_query_log2

Categories: BI & Warehousing

Oracle Data Integrator 12c Patch – Missing Components

Rittman Mead Consulting - Wed, 2014-03-12 13:26

While working with a client this week, I came across an interesting issue, and ultimately a workaround, when patching Oracle Data Integrator 12c. But first, a little background…

The client was working through the OWB to ODI 12c migration process, detailed nicely by Stewart Bryson in his OTN article, “Making the Move from Oracle Warehouse Builder to Oracle Data Integrator 12c“. After patching OWB to enable the OWB to ODI Migration Tool, they performed the migration and assessed the migration report for errors. While most of the mappings were generated successfully in ODI, albeit with a few minor issues, there was one that was skipped completely. The report entry stated:

[ERROR][Migration][MU-FRAMEWORK.null.NO??][ERROR][Migration][ODI]No valid component type found with name: PIVOT @ oracle.migration.impl.handlers.OWBMappingMigrationHandler.internalMigrate#89

“No valid component type found with name: PIVOT“. That line indicates that the patch to ODI 12c had not yet been completed. Easy enough to resolve, right? Following the patch guidelines that Stewart provided, we installed patch 17053768, reopened ODI Studio, and …

Components - No PIVOT!

Still no PIVOT component!

After verifying that indeed the correct patch was applied, we dug deeper. A query against the ODI repository metadata showed that the PIVOT/UNPIVOT components did indeed exist, but for some reason, they were not being displayed in ODI Studio.

Query component metadata

After a call for help to several ODI experts, I received the same response from both Jérôme Françoisse and David Allan (sorry David, Jérôme beat you to it!). While ODI Studio is closed, delete the following folder:

C:\Users\<USERNAME>\AppData\Roaming\odi\system12.1.2.0.0

Once ODI Studio is reopened, this folder will recreate itself and the components will magically appear.

Components - with PIVOT

It seems this has something to do with caching of the ODI Studio components, specifically for OSGi – done, according to David Allan, in an attempt to speed things up for the developer. Hopefully, subsequent patches will remove the folder for us so we don’t have to think so hard next time!

Big thanks to Jérôme and David for their help!

Edit: As Jeff pointed out in the comments (and my colleague Pete Scott did after reading this post!), the ReadMe.txt file for the patch mentions the need to delete the  folder. Thanks Jeff and Pete! 

post-install

Categories: BI & Warehousing

Running R on Hadoop using Oracle R Advanced Analytics for Hadoop

Rittman Mead Consulting - Tue, 2014-03-11 18:16

When most people think about analytics and Hadoop, they tend to think of technologies such as Hive, Pig and Impala as the main tools a data analyst uses. When you talk to data analysts and data scientists though, they’ll usually tell you that their primary tool when working on Hadoop and big data sources is in fact “R”, the open-source statistical and modelling language “inspired” by the S language but now with its own rich ecosystem, and particularly suited to the data preparation, data analysis and data correlation tasks you’ll often do on a big data project.

You can see examples where R has been used in recent copies of Oracle’s OBIEE SampleApp, where R is used to predict flight delays, with the results then rendered through Oracle R Enterprise – part of the Oracle Database Enterprise Edition Advanced Analytics Option, which allows you to run R scripts in the database and output the results in the form of database functions.

NewImage

Oracle actually distribute two R packages as part of a wider set of Oracle R technology products – Oracle R, their own distribution of open-source R that you can download via the Oracle Public Yum repository, and Oracle R Enterprise (ORE), an add-in to the database that provides efficient connectivity between R and Oracle and also allows you to run R scripts directly within the database. ORE is actually surprisingly good, with its main benefit being that you can perform R analysis directly against data in the database, avoiding the need to dump data to a file and giving you the scalability of the Oracle Database to run your R models, rather than being constrained by the amount of RAM in your laptop. In the screenshot below, you can see part of an Oracle R Enterprise script we’ve written that analyses data from the flight delays dataset:

NewImage

with the results then output in the R console:

NewImage
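
To give a rough flavour of what such an ORE session looks like, here’s a minimal sketch rather than our actual script: the connection details are placeholders for your own environment, and the ONTIME_S flight-delays table with its UNIQUECARRIER and ARRDELAY columns is assumed to be present in the connected schema (it ships with the ORE sample data).

library(ORE)                                        # Oracle R Enterprise client packages
 
# Connection details below are placeholders - substitute your own
ore.connect(user = "RQUSER", password = "welcome1",
            host = "localhost", sid = "orcl", all = TRUE)
 
ore.ls()                                            # database tables now appear as ore.frame proxies
 
# The aggregation below is pushed down to the database as SQL rather than
# run against data pulled into local R memory (ONTIME_S assumed to exist)
delays <- aggregate(ONTIME_S$ARRDELAY,
                    by  = list(carrier = ONTIME_S$UNIQUECARRIER),
                    FUN = mean)
head(ore.pull(delays))                              # bring just the small result set back to the client
 
# Embedded R execution: this function runs in an R engine on the database server
res <- ore.doEval(function() summary(rnorm(1000)))
ore.pull(res)                                       # deserialise the result back into the client session
 
ore.disconnect()

The key point is that ONTIME_S here is only a proxy for a database table, so no data is shipped back to the client until ore.pull() is called.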

In many cases though, clients won’t want you to run R analysis on their main databases due to the load it’ll put on them, so what do you do when you need to analyse large datasets? A common option is to run your R queries on Hadoop, giving you the flexibility and power of the R language whilst taking advantage of the horizontal scalability of Hadoop, HDFS and MapReduce. There are quite a few options for doing this (the open-source RHIPE and RHadoop projects both provide R-on-Hadoop capabilities), but Oracle also have a product in this area, “Oracle R Advanced Analytics for Hadoop” (ORAAH), previously known as “Oracle R Connector for Hadoop”, which according to the docs is particularly well-designed for parallel reads and writes, has resource management and database connectivity features, and comes as part of Oracle Big Data Appliance, Oracle Big Data Connectors and the recently released BigDataLite VM. The payoff here is that by using ORAAH you can scale up R to work at Hadoop scale, giving you an alternative to the more set-based Hive and Pig languages when working with super-large datasets.

NewImage

Oracle R Advanced Analytics for Hadoop comes already set up and configured on the BigDataLite VM, and you can try it out by opening a command-line session and running the “R” executable:

NewImage

Type in library(ORCH) to load the ORAAH R libraries (ORCH being the old name of the product), and you should see something like this:

NewImage

At that point you can now run some demo programs to see examples of what ORAAH can do; for example, read and write data to HDFS using demo(“hdfs_cpmv.basic”,”ORCH”), or use Hive as a source for R data frames with demo(“hive_analysis”,”ORCH”), like this:

NewImage
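
If you want to go a little beyond the canned demos, a minimal HDFS round trip from the R console looks something like the sketch below; the object names are our own, the built-in “cars” dataset is used as the input, and paths are relative to the HDFS working directory ORCH reports when it loads (/user/oracle on the BigDataLite VM).

library(ORCH)                               # load ORAAH/ORCH and connect to HDFS
 
dfs.cars <- hdfs.put(cars, key = "speed")   # copy the built-in cars data frame into HDFS
hdfs.ls()                                   # list objects in the current HDFS working directory
 
local.cars <- hdfs.get(dfs.cars)            # pull the data back into the local R session
head(local.cars)
 
hdfs.rm(dfs.cars)                           # tidy up the HDFS copy when finished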

The most important feature though is being able to run R jobs on Hadoop, with ORAAH converting the R scripts you write into Hadoop jobs via the Hadoop Streaming utility, which gives MapReduce the ability to use any executable or script as the mapper or reducer. The “mapred_basic” R demo that ships with ORAAH shows a basic example of this working, and you can see the MapReduce jobs being kicked off in the R console, and in the Hadoop JobTracker web UI:

NewImage

But what if you want to use ORAAH on your own Hadoop cluster? Let’s walk through a typical setup process using a three-node Cloudera CDH 4.5 cluster running on a VMware ESXi server I’ve got at home, where the three nodes are configured like this:

  • cdh45d-node1.rittmandev.com : 8GB RAM VM running OEL6.4 64-bit and with a full desktop available
  • cdh45d-node2.rittmandev.com : 2GB RAM VM running Centos 6.2 64-bit, and with a minimal Linux install with no desktop etc.
  • cdh45d-node3.rittmandev.com, as per cdh45d-node2

The idea here is that “cdh45d-node1” is where the data analyst can run their R session, or of course they can just SSH into it. I’ve also got Cloudera Manager installed on there using the free “standard” edition of CDH4.5, with the cluster spanning the other two nodes like this:

NewImage

If you’re going to use R as a regular OS user (for example, “oracle”) then you’ll also need to use Hue to create a home directory for that user in HDFS, as ORAAH will automatically set that directory as the user’s working directory when they load up the ORAAH libraries.

Step 1: Configuring the Hadoop Node running R Client Software

On the first node, where I’m using OEL6.4, Oracle R 2.15.3 is already installed by Oracle, but we want R 3.0.1 for ORAAH 2.3.1, which you can install using Oracle’s Public Yum repository like this:

sudo yum install R-3.0.1

Once you’ve done that, you need to download and install the R packages in the ORAAH 2.3.1 zip file, which you can get hold of from the Big Data Connectors download page on OTN. Assuming you’ve downloaded and unzipped them to /root/ORCH2.3.1, then, running as root, install the R packages within the zip file in the correct order, like this:

cd /root/ORCH2.3.1/
R --vanilla CMD INSTALL OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCHcore_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCHstats_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCH_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz

You’ll also need to download and install the “png” source package from the R CRAN website (http://cran.r-project.org/web/packages/png/index.html), and possibly install the libpng-devel libraries from the Oracle Public Yum site before it’ll compile:

sudo yum install libpng-devel
R --vanilla CMD INSTALL png_0.1-7.tar.gz 

At that point, you should be good to go. If you’ve installed CDH4.5 using parcels rather than packages though, you’ll need to set a couple of environment variables in your .bash_profile script to tell ORAAH where to find the Hadoop scripts (parcels install Hadoop to /opt/cloudera, whereas packages install to /usr/lib, which is where ORAAH looks for them normally):

# .bash_profile
 
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi
 
# User specific environment and startup programs
 
PATH=$PATH:$HOME/bin
 
export PATH
export ORCH_HADOOP_HOME=/opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30
export HADOOP_STREAMING_JAR=/opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30/lib/hadoop-0.20-mapreduce/contrib/streaming

You should then be able to log into the R console from this particular machine and load up the ORCH libraries:

[oracle@cdh45d-node1 ~]$ . ./.bash_profile
[oracle@cdh45d-node1 ~]$ R
 
Oracle Distribution of R version 3.0.1  (--) -- "Good Sport"
Copyright (C)  The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.
 
> library(ORCH)
Loading required package: OREbase
 
Attaching package: ‘OREbase’
 
The following objects are masked from ‘package:base’:
 
    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table
 
Loading required package: OREstats
Loading required package: MASS
Loading required package: ORCHcore
Oracle R Connector for Hadoop 2.3.1 (rev. 288)
Info: using native C base64 encoding implementation
Info: Hadoop distribution is Cloudera's CDH v4.5.0
Info: using ORCH HAL v4.1
Info: HDFS workdir is set to "/user/oracle"
Info: mapReduce is functional
Info: HDFS is functional
Info: Hadoop 2.0.0-cdh4.5.0 is up
Info: Sqoop 1.4.3-cdh4.5.0 is up
Warning: OLH is not found
Loading required package: ORCHstats
> 

Step 2: Configuring the other Nodes in the Hadoop Cluster

Once you’ve set up your main Hadoop node with the R client software, you’ll also need to install R 3.0.1 onto all of the other nodes in your Hadoop cluster, along with a subset of the ORCH library files. If your other nodes are also running OEL you can just repeat the “yum install R-3.0.1” step from before, but in my case I’m running a very stripped-down CentOS 6.2 install on my other nodes, so I need to install wget first, then grab the Oracle Public Yum repo file along with the GPG key file that it’ll require before allowing you to download from that server:

[root@cdh45d-node2 yum.repos.d]# wget http://public-yum.oracle.com/public-yum-ol6.repo
 
--2014-03-10 07:50:25--  http://public-yum.oracle.com/public-yum-ol6.repo
Resolving public-yum.oracle.com... 109.144.113.166, 109.144.113.190
Connecting to public-yum.oracle.com|109.144.113.166|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4233 (4.1K)
Saving to: `public-yum-ol6.repo'
 
100%[=============================================>] 4,233       --.-K/s   in 0s      
 
2014-03-10 07:50:26 (173 MB/s) - `public-yum-ol6.repo' saved [4233/4233]
 
[root@cdh45d-node2 yum.repos.d]# wget http://public-yum.oracle.com/RPM-GPG-KEY-oracle-ol6 -O /etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
 
--2014-03-10 07:50:26--  http://public-yum.oracle.com/RPM-GPG-KEY-oracle-ol6
Resolving public-yum.oracle.com... 109.144.113.190, 109.144.113.166
Connecting to public-yum.oracle.com|109.144.113.190|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1011
Saving to: `/etc/pki/rpm-gpg/RPM-GPG-KEY-oracle'
 
100%[=============================================>] 1,011       --.-K/s   in 0s      
 
2014-03-10 07:50:26 (2.15 MB/s) - `/etc/pki/rpm-gpg/RPM-GPG-KEY-oracle' saved [1011/1011]

You’ll then need to edit the public-yum-ol6.repo file to enable the correct repository for your VMs (RHEL/CentOS/OEL 6.4 in my case) and also enable the add-ons repository; the file contents below show the repo file after I made the required changes.

[ol6_latest]
name=Oracle Linux $releasever Latest ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/latest/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
 
[ol6_addons]
name=Oracle Linux $releasever Add ons ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/addons/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
 
[ol6_ga_base]
name=Oracle Linux $releasever GA installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/0/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u1_base]
name=Oracle Linux $releasever Update 1 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/1/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u2_base]
name=Oracle Linux $releasever Update 2 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/2/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u3_base]
name=Oracle Linux $releasever Update 3 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/3/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u4_base]
name=Oracle Linux $releasever Update 4 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/4/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

Once you’ve saved the file, you can then run the install of R-3.0.1 as before, copy the ORCH files across to the server and install just the OREbase, OREstats, OREmodels and OREserver packages, like this:

[root@cdh45d-node2 yum.repos.d]# yum install R-3.0.1
ol6_UEK_latest                                                  | 1.2 kB     00:00     
ol6_UEK_latest/primary                                          |  13 MB     00:03     
ol6_UEK_latest                                                                 281/281
ol6_addons                                                      | 1.2 kB     00:00     
ol6_addons/primary                                              |  42 kB     00:00     
ol6_addons                                                                     169/169
ol6_latest                                                      | 1.4 kB     00:00     
ol6_latest/primary                                              |  36 MB     00:10     
ol6_latest                                                                 24906/24906
ol6_u4_base                                                     | 1.4 kB     00:00     
ol6_u4_base/primary                                             | 2.7 MB     00:00     
ol6_u4_base                                                                  8396/8396
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package R.x86_64 0:3.0.1-2.el6 will be installed
 
[root@cdh45d-node2 ~]# cd ORCH2.3.1\ 2/
[root@cdh45d-node2 ORCH2.3.1 2]# ls
ORCH_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
ORCHcore_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
ORCHstats_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREbase’ ...
* DONE (OREbase)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREstats’ ...
* DONE (OREstats)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREmodels’ ...
* DONE (OREmodels)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREserver’ ...
* DONE (OREserver)
Making 'packages.html' ... done

Now, you can open a terminal window on the main node with the R client software, or SSH into the server, and run one of the ORAAH demos where the R job gets run on the Hadoop cluster.

> demo("mapred_basic","ORCH")
 
 
demo(mapred_basic)
---- ~~~~~~~~~~~~
 
> #
> #     ORACLE R CONNECTOR FOR HADOOP DEMOS
> #
> #     Name: mapred_basic
> #     Description: Demonstrates running a mapper and a reducer containing 
> #                  R script in ORCH.
> #
> #
> #
> 
> 
> ##
> # A simple example of how to operate with key-values. Input dataset - cars.
> # Filter cars with with "dist" > 30 in mapper and get mean "dist" for each 
> # "speed" in reducer.
> ##
> 
> ## Set page width
> options(width = 80)
 
> # Put the cars dataset into HDFS
> cars.dfs <- hdfs.put(cars, key='speed')
 
> # Submit the hadoop job with mapper and reducer R scripts
> x <- try(hadoop.run(
+     cars.dfs,
+     mapper = function(key, val) {
+             orch.keyvals(key[val$dist > 30], val[val$dist > 30,])
+     },
+     reducer = function(key, vals) {
+         X <- sum(vals$dist)/nrow(vals)
+         orch.keyval(key, X)
+     },
+     config = new("mapred.config",
+         map.tasks = 1,
+         reduce.tasks = 1
+     )
+ ), silent = TRUE)
 
> # In case of errors, cleanup and return
> if (inherits(x,"try-error")) {
+  hdfs.rm(cars.dfs)
+  stop("execution error")
+ }
 
> # Print the results of the mapreduce job
> print(hdfs.get(x))
   val1     val2
1    10 34.00000
2    13 38.00000
3    14 58.66667
4    15 54.00000
5    16 36.00000
6    17 40.66667
7    18 64.50000
8    19 50.00000
9    20 50.40000
10   22 66.00000
11   23 54.00000
12   24 93.75000
13   25 85.00000
 
> # Remove the HDFS files created above
> hdfs.rm(cars.dfs)
[1] TRUE
 
> hdfs.rm(x)
[1] TRUE

with the job tracker web UI on Hadoop confirming the R script ran on the Hadoop cluster.

NewImage

So that’s a basic tech intro and some tips on the install. Documentation for ORAAH and the rest of the Big Data Connectors is available on OTN (including a list of all the R commands you get as part of the ORAAH/ORCH package). Keep an eye on the blog though for more on R, ORE and ORAAH as I try to share some examples from datasets we’ve worked on.

Categories: BI & Warehousing

Oracle Enterprise Data Quality Revisited

Rittman Mead Consulting - Mon, 2014-03-10 10:00

Recently I have been working with Oracle Enterprise Data Quality (EDQ) combined with Oracle Data Integrator (ODI). We will tell you more about exactly what we have been doing in another blog post, but for now I would like to revisit our earlier EDQ blogs. I am going to keep to the batch ETL paradigm and not really look at the real-time aspects of Oracle EDQ.

My colleague, Koen Vantomme, blogged about EDQ back in August 2012, and Mark Rittman wrote about EDQ and ODI in December 2012. New versions of ODI (12c) and EDQ (11.1.7.3) have been released since then, and this has introduced a few changes in the way we use the two products together. Before discussing that, I will give a brief overview of what EDQ can do and three typical use cases we may encounter in a data integration / data warehousing project.

The product name says it all. “Oracle Enterprise Data Quality” is Oracle’s data quality toolset for enterprises. It provides a comprehensive set of functionality for data stewardship, data profiling and data correction (both in real time through web services and, more traditionally, in batch operations). Additional, added-cost components can be added to provide functionality such as address verification, CRM address services and watch-list screening.

Out-of-the-box we get a web front end giving access to various applets.

Screen Shot 2014 03 08 at 10 56 12

As developers, we will probably use the Director applet most often. Director is a web-launched Java application (started via Java Web Start) and is the IDE for creating and managing EDQ processes and jobs. One or more processes can be chained together in an EDQ job, and this job can be executed from the Director GUI, the command line or, as I will cover later, from inside an ODI package (and hence from an ODI scenario or load plan).

In the data warehousing world success is usually measured by whether we give the data users what they want. This usually boils down to being ’timely’ (are the data recent enough for their purposes?) and ‘correct’. Timely is about ETL process and load design. Correct, however, is mainly about data quality (although ETL logic flaws also play a part). Typically we see three kinds of issue:

  • Incomplete: attributes or values that are missing. For example an order in process will not have a shipping date until the order is shipped; and in the case of a website data feed non-mandatory fields on a web form may be left blank.
  • Incorrect: I have seen an e-retailer with a disproportionate number of customers from Albania, where Albania just happened to be the first country listed on the web-form country selector on the customer registration page.
  • Inconsistent: the same data being presented in different ways so that we don’t always spot multiple records referring to the same entity. This can be as simple as inconsistent use of letter case in a description, or it can be more complex, such as the use of synonyms and variant spellings; for example my father was William Scott, but he could appear as ‘Will Scott’, ‘Bill Scott’, ‘Wm Scott’, ‘W Scott’ or ‘W G Scott’. However we can’t just blindly convert all cases of Bill to William, as Bill could be a legitimate data value in its own right; we need to build more complex data rules for this. Sometimes inconsistent data is more physical in nature; for example one of our data sources may use a different character set and we see unexpected character conversions in our data.

Fortunately, EDQ gives us some great tools that we as ETL developers can use to investigate and handle these kinds of issues. Just dropping a data source onto the EDQ Director canvas prompts us to say whether we want to auto-profile the data; click ‘yes’ and we automatically generate a process that we can use to inspect our data for a host of anomalies. We click on the green arrow above the canvas to execute the profiling and then click on any of the pre-built Profiler processor icons to see the results. A nice touch is the ability to drill down on the blue-coloured results to see the underlying data.
Screen Shot 2014 03 10 at 10 29 49
Having profiled our data we are in a position to decide whether our ETL processes need to be written to handle anticipated data anomalies (of course we should always code defensively), or whether we go back to the data source owners for improved data feeds. If we adopt the “handle anomalies” route we can use the EDQ toolset to help with this. Notice I say handle and not “fix”: I am a strong believer that a data warehouse is the single source of truth and not the “place where truth is created”. If we receive data that needs fixing we should separate it from the data suitable for loading and ideally return it to the data owner for fixing. We may do some automatic fix-ups, but only if the rules can be firmly agreed with the business, and we should always maintain the original source data alongside so we can go back if we find our ‘rule’ was incorrect.

In the EDQ sample data (available on OTN for download) we find a customer CSV file where amongst other things the LastName column is highly inconsistent; some names in lowercase, some in uppercase, some in proper case (first character of words is uppercase). We can build out a simple process that does a proper case conversion on all of the data. We can be even smarter, and get it to ignore embedded uppercase so we don’t mess up names like McDonnell and DiCampo. In the real world we would possibly add an exceptions list to handle prefix words such as van, de and von which are traditionally in lower case. In my data correction task I have added another processor to add the current date to outward-bound data so we have a trail of when data was processed. The data written out also has both the original last name and the amended version.

Screen Shot 2014 03 10 at 11 23 58

Sometimes (I’d say “mostly!”) fixing data in the ETL is the wrong thing to do. In cases like these we can build a process using the Data Match processors and, based on defined data criteria, elect to process data, reject data or flag it for further investigation. We can use these match operators as a trigger for instigating Oracle EDQ case management workflows, or just use the inbuilt Director functionality to split good and rejected data and raise an email alert using the email processor. Using my customer data as an example, we have several customers that are likely to be duplicates, with variant versions of either the company name or the address. We can build match rules and assign weights to the rule components so that certain conditions trigger matching, non-matching or review. Oracle EDQ supports some very sophisticated match algorithms, such as “within a given number of character substitutions” (aka typo matching) and “matching a percentage of words in a string”.

Screen Shot 2014 03 10 at 12 45 20

We can review our results from the data match tool; in my case I had one record that my rules considered to be an exact match, and 52 records that were flagged for human review as likely matches. Note that we can highlight the discrepancies.

Screen Shot 2014 03 10 at 12 48 26

As I mentioned above we can call these EDQ jobs from the command line or as part of ODI packages. The simplest way is probably using the EDQ tool that has been available since ODI 11.1.1.6.

Screen Shot 2014 03 10 at 13 52 20

However, calling the EDQ command line interface from ODI is also feasible for earlier ODI versions, or for more complex use cases that may not be supported by the tool (for example the use of execution parameters, although this restriction can be worked around in other ways). To use the ODI tool we need to know some information about the EDQ instance, particularly the host, the EDQ JMX server port (this is not the same as the browser URL port), the user name and password, the EDQ project and job names, and the EDQ domain. There are a few gotchas that are not clear from the documentation, but setting this up is relatively easy. The biggest of these gotchas is that the EDQ domain name (that is, the base MBean to which you connect) has changed in the current EDQ release from the default name provided by ODI; it is now edq. We can verify this base MBean name by using Java Console on the server and drilling into the EDQ server process. The JMX server port number has also changed (8090 is now the default on the EDQ server). We need to explicitly set these values on the ODI EDQ tool General tab (the ODI defaults no longer work). There are two further restrictions we need to be aware of, but these are probably unchanged from earlier releases: the execution user for EDQ must be a full EDQ Administration user, and that user must not be currently logged into EDQ on another connection. For my testing I created a WebLogic EDQ-USER user and an EDQ_ODI_GROUP group in the WebLogic security realm and mapped the WebLogic group to the EDQ Administrators role on the edq administration web page. That is, I have pushed user administration down to the WebLogic server.

Admin

Installing EDQ

Oracle EDQ installs onto an application server such as Apache Tomcat, IBM WebSphere and, of course, Oracle WebLogic. For this blog post I have downloaded the latest available versions of EDQ and ODI. It should be noted that WebLogic 12 is not supported by the current Oracle EDQ release, so for my application server I am using a basic WebLogic 10.3.6 install. Download and run the EDQ Repository Creation Assistant to create the required metadata and schemas on the repository database, and then run the EDQ-provided Oracle Universal Installer. When the base install is complete, run the WebLogic configuration utility and extend the domain to include Oracle EDQ, then start up the services from the WebLogic console. If the EDQ URL does not bring up the home page, it may be that you need to make a fix to the node manager configuration (see MOS note 1617687.1). If you are using the latest Java 1.7 you will also probably need to change some Java security settings to allow the Java Web Start applets to launch; to do this, open the Java Control Panel and add the EDQ server and port to the exception site list.

In a follow-up post we will take a look at some of the Director functionality in detail.

Categories: BI & Warehousing