
BI & Warehousing

Oracle Enterprise Data Quality Revisited

Rittman Mead Consulting - Mon, 2014-03-10 10:00

Recently I have been working with Oracle Enterprise Data Quality (EDQ) combined with Oracle Data Integrator (ODI). We will tell you more about exactly what we have been doing in another blog, but for now I would like to revisit our earlier EDQ blogs. I am going to keep to the batch ETL paradigm and not really look at the real-time aspects of Oracle EDQ.

My colleague, Koen Vantomme, blogged about EDQ back in August 2012, and Mark Rittman wrote about EDQ and ODI in December 2012. New versions of ODI (12c) and EDQ (11.1.7.3) have been released since then, and these have introduced a few changes in the way we use the two products together. Before discussing that I will give a brief overview of what EDQ can do and three typical ways we might use it in a data integration / data warehousing project.

The product name says it all. “Oracle Enterprise Data Quality” is Oracle’s data quality toolset for enterprises. It provides a comprehensive set of functionality for data stewardship, data profiling and data correction (both in real time through web services and, more traditionally, in batch operations). Additional components, at added cost, can be licensed to provide functionality such as address verification, CRM address services and watch-list screening.

Out-of-the-box we get a web front end giving access to various applets.

Screen Shot 2014 03 08 at 10 56 12

As developers, we will probably use the Director applet most often. Director is a Java Web Start application and is the IDE for creating and managing EDQ processes and jobs. One or more processes can be chained together in an EDQ job, and this job can be executed from the Director GUI, the command line or, as I will cover later, from inside an ODI package (and hence from an ODI scenario or load plan).

In the data warehousing world success is usually measured by whether we give the data users what they want. This usually boils down to being ’timely’ (are the data recent enough for their purposes?) and ‘correct’. Timely is about ETL process and load design. Correct, however, is mainly about data quality (although ETL logic flaws also play a part). Typically we see three kinds of issue:

  • Incomplete: attributes or values that are missing. For example an order in process will not have a shipping date until the order is shipped; and in the case of a website data feed non-mandatory fields on a web form may be left blank.
  • Incorrect: I have seen an e-retailer with a disproportionate number of customers from Albania, where Albania just happened to be the first country listed on the web-form country selector on the customer registration page.
  • Inconsistent: the same data being presented in different ways, so that we don’t always spot multiple records referring to the same entity. This can be as simple as inconsistent use of letter case in a description, or it can be more complex, such as the use of synonyms and variant spellings; for example my father was William Scott, but he could appear as ‘Will Scott’, ‘Bill Scott’, ‘Wm Scott’, ‘W Scott’ or ‘W G Scott’. However, we can’t just blindly convert all cases of Billy to William, as Billy could be a legitimate data value; we need to build more complex data rules about this. Sometimes inconsistent data is more physical in nature, for example one of our data sources may use a different character set and we see unexpected character conversions in our data.

Fortunately, EDQ gives us some great tools that we as ETL developers can use to investigate and handle these kinds of issue. Just dropping a data source onto the EDQ Director canvas asks us whether we want to auto-profile the data; click ‘yes’ and we automatically generate a process that we can use to inspect our data for a host of anomalies. We click on the green arrow above the canvas to execute the profiling and then click on any of the pre-built Profiler processor icons to see the results. A nice touch is the ability to drill down on the blue-coloured results to see the underlying data.
Screen Shot 2014 03 10 at 10 29 49
Having profiled our data we are in a position to decide whether our ETL processes need to be written to handle anticipated data anomalies (of course we should always code defensively) or whether we go back to the data source owners for improved data feeds. If we adopt the “handle anomalies” route we can use the EDQ tool set to help with this. Notice I say handle and not “fix”: I am a strong believer that a data warehouse is the single source of truth and not the “place where truth is created”. If we receive data that needs fixing we should separate it from the data suitable for loading and ideally return it to the data owner for fixing. We may do some automatic fix-ups, but only if the rules can be firmly agreed with the business and we always maintain the original source data alongside, so we can go back if we find our ‘rule’ was incorrect.

In the EDQ sample data (available on OTN for download) we find a customer CSV file where, amongst other things, the LastName column is highly inconsistent; some names are in lowercase, some in uppercase, and some in proper case (the first character of each word is uppercase). We can build out a simple process that does a proper case conversion on all of the data. We can be even smarter and get it to ignore embedded uppercase so we don’t mess up names like McDonnell and DiCampo. In the real world we would possibly add an exceptions list to handle prefix words such as van, de and von, which are traditionally in lower case. In my data correction task I have added another processor to add the current date to outward-bound data so we have a trail of when data was processed. The data written out also has both the original last name and the amended version.

Screen Shot 2014 03 10 at 11 23 58
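For illustration only, here is a rough SQL equivalent of that case-correction rule, assuming an Oracle-style source and a hypothetical CUSTOMERS table; in EDQ the same logic is built with processors on the canvas rather than hand-written SQL:

-- Sketch: proper-case last names, but leave values with embedded uppercase
-- (McDonnell, DiCampo) untouched. Table and column names are hypothetical.
SELECT last_name                                   AS original_last_name,
       CASE
         WHEN REGEXP_LIKE(last_name, '[a-z][A-Z]') -- embedded capital detected, leave as-is
           THEN last_name
         ELSE INITCAP(last_name)                   -- otherwise convert to proper case
       END                                         AS corrected_last_name,
       SYSDATE                                     AS processed_date   -- audit trail of when the row was processed
FROM   customers;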

Sometimes (I’d say “mostly!”) fixing data in the ETL is the wrong thing to do. In cases like these we can build a process using the Data Match processors and based on defined data criteria elect to process data, reject data or flag for further investigation. We can use these match operators as a trigger for instigating Oracle EDQ case management workflows or just use the inbuilt Director functionality to split good and rejected data and raise an email alert using the email processor. Using my customer data as an example we have several customers that are likely to be duplicates, with either variant versions of company name or address. We can build match rules and assign weights to the rule components so that certain conditions trigger matching, non-matching or review. Oracle EDQ supports some very sophisticated match algorithms, such as “within a given number of character substitutions” (aka typo matching) and “matching a percentage of words in a string”.

Screen Shot 2014 03 10 at 12 45 20

We can review our results from the data match tool; in my case I had one record that my rules considered to be an exact match and 52 records that were flagged for human review as likely matches. Note that we can highlight the discrepancies:

Screen Shot 2014 03 10 at 12 48 26

As I mentioned above we can call these EDQ jobs from the command line or as part of ODI packages. The simplest way is probably using the EDQ tool that has been available since ODI 11.1.1.6.

Screen Shot 2014 03 10 at 13 52 20

However, calling the EDQ command line interface from ODI is also feasible for earlier ODI versions or for more complex use cases that may not be supported by the tool (for example the use of execution parameters, although this restriction can be worked around in other ways). To use the ODI tool we need to know some information about the EDQ instance, particularly the host, the EDQ JMX server port (this is not the same as the browser URL port), the user name and password, the EDQ project and job names, and the EDQ domain. There are a few gotchas that are not clear from the documentation, but setting this up is relatively easy. The biggest of these gotchas is that the EDQ domain name (that is, the base MBean to which you connect) has changed in the current EDQ release from the default name provided by ODI; it is now edq. We can verify this base MBean name by using Java Console on the server and drilling into the EDQ server process. The JMX server port number has also changed (8090 is now the default on the EDQ server). We need to explicitly set these values on the ODI EDQ tool General tab (the ODI defaults no longer work). There are two further restrictions we need to be aware of, but these are probably unchanged from earlier releases: the execution user for EDQ must be a full EDQ Administration user, and that user must not be currently logged into EDQ on another connection. For my testing I created a WebLogic EDQ-USER user and an EDQ_ODI_GROUP group in the WebLogic Security Realm and mapped the WebLogic group to the EDQ Administrators role on the EDQ Administration web page. That is, I have pushed user administration down to the WebLogic server.

Admin

Installing EDQ

Oracle EDQ installs on to an application server such as Apache Tomcat, IBM WebSphere and, of course, Oracle WebLogic. For this blog post I have downloaded the latest available versions of EDQ and ODI. It should be noted that WebLogic 12 is not supported by the current Oracle EDQ release, so for my application server I am using a basic WebLogic 10.3.6 install. Download and run the EDQ Repository Creation Assistant to create the required metadata and schemas on the repository database, and then run the EDQ-provided Oracle Universal Installer. When the base install is complete, run the WebLogic configuration utility and extend the domain to include Oracle EDQ, then start up the services from the WebLogic console. If the EDQ URL does not bring up the home page it may be that you need to make a fix to the node manager configuration (see MOS note 1617687.1). If you are using the latest Java 1.7 you will also probably need to change some Java security settings to allow the Java Web Start applets to launch; to do this open the Java Control Panel and add the EDQ server and port to the exception site list.

In a follow-up post we will take a look at some of the Director functionality in detail.

Categories: BI & Warehousing

Internal Links

Tim Dexter - Wed, 2014-03-05 20:06

Another great question today, this time from friend and colleague Jerry, the master house re-fitter. I think we are competing on who can completely rip and replace their entire house in the shortest time on their own. Every conversation we have starts with 'so what are you working on?' He's in the midst of a kitchen re-fit; I'm finishing off odds and ends before I re-build our stairwell and start work on my hidden man cave under said stairs. Anyhoo, his question!

Can you create a PDF document that shows a summary on the first page and provides links to more detailed sections further down in the document?

Why yes you can Jerry. Something like this? Click on the department names in the first table and the return to top links in the detail sections. Pretty neat huh? Dynamic internal links based on the data, in this case the department names.

It's not that hard to do either. Here's the template, RTF only right now.


The important fields in this case are the ones in red; here are their contents.

TopLink

<fo:block id="doctop" />

Just think of it as an anchor to the top of the page called doctop

Back to Top

<fo:basic-link internal-destination="doctop" text-decoration="underline">Back to Top</fo:basic-link>

Just a live link 'Back to Top' if you will, that takes the user to the doc top location i.e. to the top of the page.

DeptLink

<fo:block id="{DEPARTMENT_NAME}"/>

Just like the TopLink above, this just creates an anchor in the document. The neat thing here is that we dynamically name it the actual value of the DEPARTMENT_NAME. Note that this link is inside the for-each:G_DEPT loop so the {DEPARTMENT_NAME} is evaluated each time the loop iterates. The curly braces force the engine to fetch the DEPARTMENT_NAME value before creating the anchor.

DEPARTMENT_NAME

<fo:basic-link  internal-destination="{DEPARTMENT_NAME}" ><?DEPARTMENT_NAME?></fo:basic-link>

This is the link for the user to be able to navigate to the detail for that department. It does not use a regular MS Word URL; we have to create a field in the template to hold the department name value and apply the link. Note, no text decoration this time, i.e. no underline.

You can add a dynamic link on to anything in the summary section. You just need to remember to keep link 'names' as unique as needed for source and destination. You can combine multiple data values into the link name using the concat function.
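For example, a concatenated anchor and its matching link might look like this (a sketch only; the DEPARTMENT_ID element is illustrative and assumes it exists in your data):

<fo:block id="{concat(DEPARTMENT_NAME,'_',DEPARTMENT_ID)}"/>

<fo:basic-link internal-destination="{concat(DEPARTMENT_NAME,'_',DEPARTMENT_ID)}"><?DEPARTMENT_NAME?></fo:basic-link>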

Template and data available here. Tested with 10 and 11g, will work with all BIP flavors.

Categories: BI & Warehousing

The OLAP Extension is now available in SQL Developer 4.0

Keith Laker - Tue, 2014-03-04 14:57


The OLAP Extension is now in SQL Developer 4.0.

See http://www.oracle.com/technetwork/developer-tools/sql-developer/downloads/sqldev-releasenotes-v4-1925251.html for the details.

The OLAP functionality is mentioned toward the bottom of the web page. You will still need AWM 12.1.0.1.0 to:
  • Manage and enable cube and dimension MV's.
  • Manage data security.
  • Create and edit nested measure folders (i.e. measure folders that are children of other measure folders)
  • Create and edit Maintenance Scripts
  • Manage multilingual support for OLAP Metadata objects
  • Use the OBIEE plugin or the Data Validation plugin
What is new or improved:
  • New Calculation Expression editor for calculated measures. This allows the user to nest different types of calculated measures easily. For instance, a user can now create a Moving Total of a Prior Period as one calculated measure; in AWM, the user would have had to create a Prior Period measure first and then create a Moving Total calculated measure which referred to it. The new Calculation Expression editor also displays hypertext helper templates when the user selects the OLAP API syntax in the editor.
  • Support for OLAP DML command execution in the SQL Worksheet. Simply prefix OLAP DML commands with a '~' and then select the execute button to run them from the SQL Worksheet. The output of the command will appear in the DBMS Output window if it is opened, or the Script Output window if the user has executed 'set serveroutput on' before executing the DML command (see the example after this list).
  • Improved OLAP DML Program Editor integrated within the SQL Developer framework.
  • New diagnostic reports in the SQL Developer Report navigator.
  • Ability to create a fact view with a measure dimension (i.e. "pivot cube").  This functionality is accessible from the SQL Developer Tools-OLAP menu option.
  • Cube scripts have been renamed to Build Specifications and are now accessible within the Create/Edit Cube dialog. The Build Specifications editor there is similar to the Calculation Expression editor in terms of functionality.
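As a quick illustration of the OLAP DML support mentioned above, the following could be run as a script in the SQL Worksheet (a sketch only; it assumes the OLAP extension is installed and you have access to an analytic workspace):

-- enable script output so the OLAP DML results are displayed in the Script Output window
set serveroutput on

-- any OLAP DML command can be prefixed with '~'; AW LIST shows the attached analytic workspaces
~aw list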
Categories: BI & Warehousing

Congratulations to Stewart Bryson, and Welcome to Edel Kammermann

Rittman Mead Consulting - Tue, 2014-03-04 13:45

Stewart’s too modest to mention it on the blog himself, but I just wanted to congratulate Stewart Bryson on being awarded Oracle ACE Director status by the Oracle OTN ACE program. Stewart was given the Oracle ACE award a few years ago to recognise past work he’d done for the Oracle BI, DW and ETL community, but this higher ACE Director award recognises the ongoing work he’s since been doing to share his knowledge and experience with the Oracle community. For a taste of what he’s been doing recently, check out this recent article on OTN on migrating OWB to ODI, where he goes through ODI12c installation, and then walks through several scenarios for interoperating with, and migrating to, Oracle Data Integrator 12c.

I also wanted to take this opportunity to share the exciting news that Edelweiss Kammermann, who many of you will know from her work with the Latin American Oracle User Group community, joined us back at the start of this year to work with Robin Moffatt in our training development area.

We’ve known Edel for many years, and she came over to Brighton last year to present on BI and SOA/BPM at our 2013 BI Forum. Edel is initially working with Robin on developing our new ODI12c course, followed later on in 2014 with our new BI Apps 11g course, and she’ll be delivering training in Europe, USA and Latin America later on in the year. We’re incredibly pleased and proud that Edel has joined us, and being an Oracle ACE herself, we’ve now got two Oracle ACE Directors, and two Oracle ACEs, in the Rittman Mead team.

I’m sure you’ll all join me in congratulating Stewart, and welcoming Edel to the Rittman Mead team!

Categories: BI & Warehousing

Rittman Mead BI Forum 2014 Now Open for Registration!

Rittman Mead Consulting - Sun, 2014-03-02 10:28

I’m very pleased to announce that the Rittman Mead BI Forum 2014 running in Brighton and Atlanta, May 2014, is now open for registration. Keeping the format as before – a single stream at each event, world-class speakers and expert-level presentations, and a strictly-limited number of attendees – this is the premier Oracle BI tech conference for developers looking for something beyond marketing and beginner-level content.

This year we have a fantastic line-up of speakers and sessions, including:

  • Oracle ACE and past BI Forum best speaker winner Kevin McGinley, on adding third-party visualisations to OBIEE
  • Tony Heljula, winner of multiple best speaker awards and this year presenting on Exalytics and TimesTen Columnar Storage
  • Sessions from TimesTen PMs Chris Jenkins and Susan Cheung on what’s coming with TimesTen
  • Edward Roske, author of multiple books on Essbase, on Essbase optimisation
  • Oracle’s Andrew Bond, and our own Stewart Bryson (Oracle ACE) with an update to Oracle’s reference BI, DW and Big Data Architecture
  • Sessions from Oracle’s Jack Berkowitz, Adam Bloom and Matt Bedin on what’s coming with OBIEE and Oracle BI Applications
  • Endeca sessions from Chris Lynskey (PM) and Omri Traub (Development Manager), along with ones from Branchbird’s Patrick Rafferty and Truls Bergersen
  • And sessions from Rittman Mead’s Robin Moffatt (OBIEE performance), Gianni Ceresa (Essbase) and Michael Rainey (ODI, with Nick Hurt from IFPI)

We’ve also got some excellent keynote sessions including one in the US from Maria Colgan on the new in-memory database option, and another in Brighton from Matt Bedin and Adam Bloom on BI in the Cloud – along with the opening-night Oracle product development keynote in both Brighton and Atlanta.

We’re also very excited to welcome Lars George from Cloudera to deliver this year’s optional one-day masterclass, this year on Hadoop, big data, and how Oracle BI&DW developers can get started with this technology. Lars is Cloudera’s Chief Architect in EMEA and an HBase committer, and he’ll be covering topics such as:

  • What is Hadoop, what’s in the Hadoop ecosystem and how do you design a Hadoop cluster
  • Using tools such as Flume and Sqoop to import data into Hadoop, and then analyse it using Hive, Pig, Impala and Cloudera Search
  • Introduction to NoSQL and HBase
  • Connecting Hadoop to tools such as OBIEE and ODI using JDBC, ODBC, Impala and Hive

If you’ve been meaning to take a look at Hadoop, or if you’ve made a start but would like a chance to discuss techniques with someone who’s out in the field every week designing and building Hadoop systems, this session is aimed at you – it’s on the Wednesday before each event and you can book at the same time as registering for the main BI Forum days.


Attendance is limited to around seventy at each event, and we’re running the Brighton BI Forum back at the Hotel Seattle, whilst the US one is running at the Renaissance Midtown Hotel, Atlanta. We encourage attendees to stay at the hotel as well so as to maximise networking opportunities, and this year you can book US accommodation directly with the hotel so you can collect any Marriott points, corporate discounts etc. As usual, we’ll take good care of you over the two or three days, with meals each night, drinks receptions and lots of opportunities to meet colleagues and friends in the industry.

Full details are on the BI Forum 2014 web page, including links to the registration sites. Book now so you don’t miss out – each year we sell out in advance, so don’t leave it to the last minute if you’re thinking of coming. Hopefully see you all in Brighton and Atlanta in May 2014!

Categories: BI & Warehousing

Waterfall Charts

Tim Dexter - Fri, 2014-02-28 19:35

Great question came through the ether from Holger on waterfall charts last night.

"I know that Answers supports waterfall charts and BI Publisher does not.
Do you have a different solution approach for waterfall charts with BI Publisher (perhaps stacked bars with white areas)?
Maybe you have already implemented something similar in the past and you can send me an example."

I didn't have one to hand, but I do now. Little-known fact: the Publisher chart engine is based on the Oracle Reports chart engine. Therefore, this document came straight to mind. It's awesome for chart tips and tricks. Will you have to get your hands dirty in the chart code? Yep. Will you get the chart you want with a little effort? Yep. Now, I know, I know, in this day and age you should get waterfalls with no effort, but then you'd be bored, right?

First things first, for the uninitiated, what is a waterfall chart? From some kind person at Wikipedia, "The waterfall chart is normally used for understanding how an initial value is affected by a series of intermediate positive or negative values. Usually the initial and the final values are represented by whole columns, while the intermediate values are denoted by floating columns. The columns are color-coded for distinguishing between positive and negative values."

We'll get back to that last sentence later; for now let's get the basic chart working.

Checking out the Oracle Reports charting doc, search for 'floating', their term for 'waterfall', and it will get you to the section on building a 'floating column chart' or, in more modern parlance, a waterfall chart. If you have already got your feet wet in the dark arts world of Publisher chart XML, get on with it and get your waterfall working.

If not, read on.

When I first started looking at this chart, I decided to ignore the 'negative values' in the definition above. Being a glass-half-full kind of guy I don't see negatives, right? :)

Without them it's a pretty simple job of rendering a stacked bar chart with four series for the colors: one for the starting value, one for the ending value, one for the diffs (steps) and one for the base values. The base values' color could be set to white, but that obscures any tick lines in the chart. Better to use the transparency option from the Oracle Reports doc.

<Series id="0" borderTransparent="true" transparent="true"/> 

Pretty simple; even the data structure is reasonably easy to get working. But the negative values were nagging at me, and Holger, whom I had pointed at the Oracle Reports doc, had come back and could not get negative values to show correctly. So I took another look. What a pain in the butt!

In the chart above (that's my first BIP waterfall, maybe the first ever BIP waterfall) I have lime green start and finish bars, red for negative and green for positive values. Look a little closer at the hidden bar values where we transition from red to green: ah man, royal pain in the butt! Not because of anything tough in the chart definition, that's pretty straightforward; I just need the following columns: START, BASE, DOWN, UP and FINISH.

Bar                    START   BASE   UP   DOWN   FINISH
Bar 1 - Start Value      200      0    0      0        0
Bar 2 - PROD1              0    180    0     20        0
Bar 3 - PROD2              0    150    0     30        0

and so on. The start, up, down and finish values are reasonably easy to get. The real trick is calculating that hidden BASE value correctly for the transition from -ve to +ve and vice versa. Hitting Google, I found the key to that calculation in a great page on building a waterfall chart in Excel from the folks at Contextures. Excel is great at referencing previous cell values to create complex calculations, and I guess I could have fudged this article and used an Excel sheet as my data source. I could even have used an Excel template against my database table to create the data for the chart and fed the resulting Excel output back into the report as the data source for the chart. But I digress; that would be très cool though, gotta look at that.
On that page is the formula to get the hidden base bar values and I adapted that into some sql to get the same result.

Let's assume I have the following data in a table:

PRODUCT_NAME   SALES
PROD1            -20
PROD2            -30
PROD3             50
PROD4             60

The sales values are versus the same period last year, i.e. a delta value. I have a starting value of 200 total sales; let's assume this is pulled from another table.
I have spent the majority of my time on generating the data; the actual chart definition is pretty straightforward. Getting that BASE value has been most tricksy!

I need to generate the following for each column:

PRODUCT_NAME   STRT   BASE_VAL   DOWN   UP   END_TOTAL
START           200          0      0    0           0
PROD1             0        180     20    0           0
PROD2             0        150     30    0           0
PROD3             0        150      0   50           0
PROD4             0        200      0   60           0
END               0          0      0    0         260

Ignoring the START and END values for a second, here's the query for the PRODx columns:

 SELECT 2 SORT_KEY 
, PRODUCT_NAME
, STRT
, SALES
, UP
, DOWN
, 0 END_TOTAL
, 200 + (SUM(LAG_UP - DOWN) OVER (ORDER BY PRODUCT_NAME)) AS BASE_VAL
FROM
(SELECT P.PRODUCT_NAME
,  0 AS STRT
, P.SALES
, CASE WHEN P.SALES > 0 THEN P.SALES ELSE 0 END AS UP  
, CASE WHEN P.SALES < 0 THEN ABS(P.SALES) ELSE 0 END AS DOWN
, LAG(CASE WHEN P.SALES > 0 THEN P.SALES ELSE 0 END,1,0) 
      OVER (ORDER BY P.PRODUCT_NAME) AS LAG_UP
FROM PRODUCTS P
)

The inner query is breaking the UP and DOWN values into their own columns based on the SALES value. The LAG function is the cool bit to fetch the UP value in the previous row. That column is the key to getting the BASE values correctly.

The outer query just has a calculation for the BASE_VAL.

200 + (SUM(LAG_UP - DOWN) OVER (ORDER BY PRODUCT_NAME))

The SUM...OVER allows me to iterate over the rows to get the calculation I need, i.e. the starting value (200) plus the running sum of LAG_UP - DOWN. Remember the LAG_UP value is fetching the value from the previous row.
Is there a neater way to do this? I'm sure there is; I could probably eliminate the inner query with a little effort, but for the purposes of this post it's quite handy to be able to break things down.

For the start and end values I used more queries and then just UNIONed the three together. One note on that union: the sorting. For the chart to work, I need START, PRODx, FINISH, in that order. The easiest way to get that was to add a SORT_KEY value to each query and then sort by it. So my total query for the chart was:

SELECT 1 SORT_KEY
, 'START' PRODUCT_NAME
, 200 STRT
, 0 SALES
, 0 UP
, 0 DOWN
, 0 END_TOTAL
, 0 BASE_VAL
FROM PRODUCTS
UNION
SELECT 2 SORT_KEY 
, PRODUCT_NAME
, STRT
, SALES
, UP
, DOWN
, 0 END_TOTAL
, 200 + (SUM(LAG_UP - DOWN) 
      OVER (ORDER BY PRODUCT_NAME)) AS BASE_VAL
FROM
(SELECT P.PRODUCT_NAME
,  0 AS STRT
, P.SALES
, CASE WHEN P.SALES > 0 THEN P.SALES ELSE 0 END AS UP  
, CASE WHEN P.SALES < 0 THEN ABS(P.SALES) ELSE 0 END AS DOWN
, LAG(CASE WHEN P.SALES > 0 THEN P.SALES ELSE 0 END,1,0) 
       OVER (ORDER BY P.PRODUCT_NAME) AS LAG_UP
FROM PRODUCTS P
)
UNION
SELECT 3 SORT_KEY 
, 'END' PRODUCT_NAME
, 0 STRT
, 0 SALES
, 0 UP
, 0 DOWN
, SUM(SALES) + 200 END_TOTAL
, 0 BASE_VAL
FROM PRODUCTS
GROUP BY 1,2,3,4,6
ORDER BY 1 

A lot of effort for a dinky chart, but now it's done once, doing it again will be easier. Of course no one will want just a single chart in their report; there will be other data, tables, charts, etc. I think if I was doing this in anger I would just break out this query as a separate item in the data model, i.e. a query just for the chart. It will make life much simpler.
Another option that I considered was to build a sub-template in XSL to generate the XML tree to support the chart and assign that to a variable. I'm sure it can be done with a little effort; I'll save it for another time.

On the last leg, we have the data; now to build the chart. This is actually the easy bit. Sadly I have found an issue in the online template builder that precludes using the chart builder in those templates. However, RTF templates to the rescue!

Insert a chart and in the dialog set up the data like this (click the image to see it full scale.)

It's just a vertical stacked bar with the BASE_VAL color set to white. You can still see the 'hidden' bars and they are overwriting the tick lines, but if you are happy with it, leave it as is. You can double-click the chart and the dialog box can read it, no problem. If, however, you want those 'hidden' bars truly hidden then click on the Advanced tab of the chart dialog and replace:

<Series id="1" color="#FFFFFF" />

with

<Series id="1" borderTransparent="true" transparent="true" />

and the bars will become completely transparent. You can do the 3D and gradient thang if you want and play with colors and themes. You'll then be done with your waterfall masterpiece!

A lot of work? Not really; more than out of the box for sure, but hopefully I have given you enough to decipher the data needs and how to do it, at least with an Oracle DB. If you need all my files, including table definition, sample XML, BIP DM, report and templates, you can get them here.

Categories: BI & Warehousing

Wildcard Filtering continued

Tim Dexter - Mon, 2014-02-24 13:27

I wrote up a method for using wildcard filtering in your layouts a while back here. I spotted a follow-up question on that blog post last week and thought I would try and address it using another wildcard method.

I want to use the bi publisher to look for several conditions using a wild card. For example if I was sql it would look like this:

if name in ('%Wst','%Grt')

How can I utilize bi publisher to look for the same conditions.

This, in XPath speak, is an OR condition and we can treat it as such. In the last article I used the 'starts-with' function; it's a little limiting, and there is a better one: 'contains'. This is a much more powerful function that allows you to look for any string within another string. It's case insensitive, so you do not need to do any upper- or lower-casing of the string you are searching to get the desired results.
Here it is in action:

For the clerks filter I use :

<?for-each-group:G_1[contains(JOB_TITLE,'Clerk')];./JOB_TITLE?>

and to find all clerks and managers, I use:

<?for-each-group:G_1[contains(JOB_TITLE,'Clerk') or contains(JOB_TITLE,'Manager')];./JOB_TITLE?>

Note that I'm using re-grouping here; you can use the same XPath with a regular for-each. Also note the lowercase 'or' in the second expression. You can use an 'and' too.
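For example, an 'and' version (a sketch only; the job titles are illustrative) would pick up only the sales managers:

<?for-each-group:G_1[contains(JOB_TITLE,'Sales') and contains(JOB_TITLE,'Manager')];./JOB_TITLE?>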

This works in 10 and 11g flavors of BIP. Sample files available here.

Categories: BI & Warehousing

Thoughts on Using Amazon Redshift as a Replacement for an Oracle Data Warehouse

Rittman Mead Consulting - Thu, 2014-02-20 03:00

Recently, my colleague, Pete Carpenter, described a proof of concept we carried out using Amazon Redshift as the data warehouse storage layer in a system capturing data from Oracle E-Business Suite (EBS) using Attunity CloudBeam in conjunction with Oracle Data Integrator (ODI) for specialised ETL processing and Oracle Business Intelligence (OBI) as the reporting tool.

In this blog I will look at Amazon Redshift and how it compares with a more traditional DW approach using, as my example, Oracle. I am not going to talk performance in absolute terms as your mileage is going to vary.

What is Redshift?

Redshift is the Amazon Cloud Data Warehousing server; it can interact with Amazon EC2 and S3 components but is managed separately using the Redshift tab of the AWS console. As a cloud based system it is rented by the hour from Amazon, and broadly the more storage you hire the more you pay. Currently, there are 2 families of Redshift servers, the traditional hard-disk based, and the recently introduced SSD family, which has less storage but far more processing power and faster CPUs. For our trials we looked at the traditional disk based storage on a 2 node cluster to give us 4TB of disk spread across 4 CPU cores. Apart from single node configurations, Redshift systems consist of a leader node and two or more database nodes; the leader node is supplied free of charge (you only pay for the storage nodes) and is responsible for acting as the query parser, coordinating the results from the database nodes, and being a central network address for user access.

The Redshift product has its origins in ParAccel, and that in turn in Postgres, and thus it supports ANSI SQL and the Postgres ODBC and JDBC drivers. In basic terms it is a share-nothing, parallel-processing, columnar-store database that supports columnar compression.

At the cluster level all sorts of robustness features come into play to handle routine hardware failures such as a failed node or disk; regular automatic backups occur, and on-demand backups can be made to S3 storage for DR or replication to other AWS networks. It is possible to dynamically change the number and/or type of Redshift nodes in use; in effect a new cluster is spun up and the data copied from the existing system to the new one before dropping the old system. The original database remains open for query (but not update) during the scale-out (or scale-down) process. As Pete Carpenter described, creating a new Redshift instance is a simple matter of completing a few web forms and waiting for the cluster to come up. Once up, you can connect to the database using the master credentials you specified at cluster creation and then create databases, users, and schemas as required.

Databases, users, schemas and security

Although it is possible to run a Redshift database using the master user and the default database, good practice suggests that we do a bit more than this. In some ways Redshift is a little like the Oracle 12c database in that we can create additional databases within the master database, much in the style of pluggable databases; a major difference comes with the concept of a USER. In Oracle 12c a user belongs to a pluggable database; in Redshift all users belong to the master (container) database and can see any of the contained databases (subject to grants). Schemas are logical groupings for objects and need not be aligned to database user names. Standard object and role grants allow users to access specific databases, schemas, and tables, or to have role-rights such as administrator. The final aspect of security is outside the database and is in effect a firewall rule to permit nominated AWS users or specified IP addresses to speak to the database listener; by default the rule is no inbound access. The diagram below is a block representation of how databases, users, schemas and the firewall interrelate. Note user names are descriptive and not valid names!

Screen Shot 2014 02 15 at 13 12 34

Database Design

A key point of difference between Amazon Redshift and Oracle is in how the data is stored or structured in the database. An understanding of this is vital in designing a performant data warehouse. With Oracle we have shared storage (SAN or local disk) attached to a pool of processors (single machine or a cluster); Redshift, however, uses a share-nothing architecture, that is, the storage is tied to the individual processor cores of the nodes. As with Oracle, data is stored in blocks, however the Redshift block size is much larger (1MB) than the usual Oracle block sizes; the real difference is how tables are stored in the database: Redshift stores each column separately and optionally allows one of many forms of data compression. Tables are also distributed across the node slices so that each CPU core has its own section of the table to process. In addition, data in the table can be sorted on a sort column, which can lead to further performance benefits; I will discuss this in the section on tables.

Not all of the database features we come to expect in an Oracle data warehouse are available to us in Redshift. The Redshift Developer Guide has the full rundown on what is available, but for now here is a short list of common DW features that are not going to be available to us.

  • Tablespaces
  • Indexes
  • Partitions
  • Constraints
    • Check
    • Primary, Unique, Foreign Key (all usable by optimizer but not enforced)
  • Spatial (Locator) functionality
  • Sequences (although there is an AUTO NUMBER column type)
  • MERGE – we have to code as UPDATE and INSERT in two steps
  • In-database PL/SQL-like language
    • Triggers
    • User defined functions
    • Procedures
  • Timestamps (with timezone)
  • XML types
  • Pseudo columns
  • Various SQL functions (not a full list, but functions I often use in ETL processes)
    • Regular expressions
    • Regression functions
    • SUBSTR
    • TRANSLATE
    • ROW_NUMBER
    • TO_TIMESTAMP

In addition, data types may not be exactly the same as those used in Oracle; for example DATE in Oracle has a resolution of one second, whereas DATE in Redshift has a resolution of one day.

Tables

The basic Oracle syntax to create a table works (as does CTAS, Create Table As Select), however there are additional items we can, and should, specify at table creation.

By default the data distribution style is EVEN, that is, data is distributed between node slices in a round-robin fashion. For performance we may wish to specify a distribution key column to control how data is distributed; this is a similar concept to Oracle hash partitioning, and has the same sort of performance characteristics. We aim to create an even distribution of rows per slice (otherwise one slice will take longer than the others to process its data), and by applying the same distribution to other tables that are commonly joined we can benefit from improved table joining performance, as the rows to be joined are stored on the same node slice. Sometimes it is more appropriate to replicate the whole table to each slice so that the data is always available to join without the need to move data to the same slice before joining; in such cases we set the distribution style to ALL.

The second thing we can set on a table is the SORTKEY; this specifies one or more columns on the table by which the data is ordered on data load (it can be the same column as the distribution key). Redshift maintains information on the minimum and maximum values of the sort key in each database block and at query time uses this information to skip blocks that do not contain data of interest.

Finally, we can elect to compress columns in the database. If we do not specify compression, the default of RAW (i.e. uncompressed) is used. For compressed data we can specify the compression algorithm used; different algorithms are better for certain data types and values. Compression may be data-block based (DELTA, BYTE-DICTIONARY, RUN LENGTH, TEXT255 and TEXT32K) or value based (LZO and the MOSTLY compressions). This sounds daunting, but there are two ways we can get compression suggestions from the database: using the ANALYZE COMPRESSION command on a loaded table, or the auto-compress feature of the COPY command (this, however, requires an empty, non-compressed target table). COPY is the Redshift equivalent of SQL*Loader and takes a flat file and inserts it into the database.
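A minimal sketch of both approaches (the table name, S3 path and credentials are placeholders):

-- Ask Redshift to suggest column encodings for an already-loaded table
ANALYZE COMPRESSION my_table;

-- Or let COPY choose encodings automatically while loading an empty,
-- uncompressed table from a flat file staged in S3
COPY my_table
FROM 's3://my-bucket/ORA_STG/my_table.csv'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
DELIMITER ','
COMPUPDATE ON;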

Let’s consider a simple table T1 with three columns, C1, C2 and C3. We can create this using a simple piece of DDL:

CREATE TABLE T1
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL,
C3 DATE
);

I have not used any of the Redshift nice-to-have features for sorting, distribution, and compression of data. Note too that I am using NOT NULL constraints; this is the only constraint type enforced in the database. This simple create statement creates database objects on each slice of the cluster, with one block per column per slice (1 slice = 1 CPU core); see the following diagram. Note there is no single table object stored in the database; the table is in effect a collection of columns.

Table slice

Without specifying a distribution key, data is evenly spread across all slices. When a 1MB block for a column is full, a new block is created for subsequent inserts on the slice. An empty table will occupy block size * number of columns * number of cores; as our block size is 1MB, this works out at (columns * cores) megabytes.

Using a distribution key effectively hashes the data on the key column across the cores. Adding a sort key declares that the rows in the table are ordered and potentially allows block elimination to kick in. If our sort key is, say, transaction date, it is likely that our data loads occur in transaction date order; however, if we sorted on product code we might find each data load has data that needs to be inserted between existing rows. This does not happen: the data is still appended to the table, and the table then needs to be reorganised to put the rows in order. There are two ways to achieve this: the VACUUM command, which does an online reorg of the table, and the potentially faster route of creating a copy table, populating it, and then dropping the original and renaming the copy; of course this gives a little downtime when the original table is not available for access.
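As a rough sketch of that second route (table and column names are illustrative; the simpler online alternative is just VACUUM my_table;):

-- Rebuild the table in sort-key order, then swap it for the original.
-- There is a brief window where the original table is unavailable.
CREATE TABLE my_table_sorted (LIKE my_table);   -- inherits column, sort and distribution settings

INSERT INTO my_table_sorted
SELECT * FROM my_table
ORDER BY sort_col;                              -- insert in sortkey order

DROP TABLE my_table;
ALTER TABLE my_table_sorted RENAME TO my_table;

ANALYZE my_table;                               -- refresh optimiser statistics after the reorg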

Applying compression, sort and distribution we get a DDL statement like:

CREATE TABLE T2 
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL SORTKEY DISTKEY,
C3 DATE ENCODE DELTA
);

This table uses column C2 as both the sort key and the distribution key; column C3 is compressed using delta compression – an efficient compression algorithm when most dates are within ±127 days of the date in the previous row. If we wanted to use a multi-column sort key the DDL syntax would be like:

CREATE TABLE T1 
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL DISTKEY,
C3 DATE 
)
SORTKEY (C3,C2);

Multi-column distribution keys are not supported.

Designing for Performance

Redshift is designed for query and bulk insert operations; we can optimise query performance by structuring data so that less data is transferred between nodes in join operations, or less data is read from disk in a table scan. Choosing the right sortkeys and distkeys is vital in this process. Ideally these key columns should not be compressed. Adding primary and foreign keys to the tables tells the optimizer about the data relationships and thus improves the quality of the query plan being generated. Of course, up-to-date table stats are a given too; tables must be ANALYZEd whenever the contents change significantly, and certainly after initial load. I feel that we should collect stats after each data load.

For a FACT + DIMENSIONS data model (such as in the performance layer of Oracle’s Reference Data Warehouse Architecture) it would be appropriate to distribute data on the dimension key of the largest dimension on both the dimension and the fact tables, this will reduce the amount of data being moved between slices to facilitate joins.
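A sketch of what that might look like in DDL (table and column names are illustrative):

-- Distribute both tables on the customer key so fact-to-dimension joins are co-located,
-- and sort the fact on the date key commonly used to filter queries
CREATE TABLE dim_customer
(
cust_key     INTEGER       NOT NULL SORTKEY DISTKEY,
cust_name    VARCHAR(100)  NOT NULL,
cust_region  VARCHAR(50)
);

CREATE TABLE fact_sales
(
cust_key     INTEGER       NOT NULL DISTKEY,
date_key     INTEGER       NOT NULL SORTKEY,
sales_amount DECIMAL(18,2)
);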

For optimal performance we should always ensure we include both the distribution keys and the sort keys in any query, even if they appear to be redundant. The presence of these keys forces the optimizer to access the tables in an efficient way.

For best data load performance we insert rows in bulk and in sortkey order. Amazon claim the best performance comes from using the COPY command to load from flat files, and second best from the bulk-insert SQL commands such as CTAS and INSERT INTO T1 (SELECT * FROM T2). Where Redshift performs less well is when we use certain kinds of ETL steps in our process, particularly those that involve updating rows or single-row activities. In addition, loading data without respecting the sort key leads to performance problems on data query. If data update is essential we have two real options: we move our ETL processes to a conventional database hub server (perhaps using ODI) and just use Redshift to store pre-transformed data; or we revise our ETL processes to minimize update activity on the Redshift platform. There is some scope to optimize updates by distributing data on the update key, but another approach is to use temporary tables to build the results of the update and to replace the table with the results of the merge. This requires a bit of inventiveness with the ETL design, but fortunately many of our required SQL constructs, including analytic functions, are there to help us.
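For instance, with MERGE unavailable, a common pattern is to stage the incoming rows and apply them as an update followed by an insert in one transaction (a sketch only; table and key names are hypothetical):

BEGIN;

-- Update rows that already exist in the target
UPDATE dim_customer
SET    cust_name   = stg.cust_name,
       cust_region = stg.cust_region
FROM   stg_customer stg
WHERE  dim_customer.cust_key = stg.cust_key;

-- Insert rows that are genuinely new
INSERT INTO dim_customer (cust_key, cust_name, cust_region)
SELECT stg.cust_key, stg.cust_name, stg.cust_region
FROM   stg_customer stg
LEFT JOIN dim_customer tgt ON tgt.cust_key = stg.cust_key
WHERE  tgt.cust_key IS NULL;

COMMIT;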

Categories: BI & Warehousing

Building an Amazon Redshift Data Warehouse using ODI and Attunity Part 2 : Using ODI and OBIEE with Redshift as a Target/Source

Rittman Mead Consulting - Wed, 2014-02-19 03:00

In my post yesterday we stepped through the initial set up and staging load for a data warehouse using Amazon Redshift and Attunity, for eventual use with OBIEE and ODI. Now that we have our source data in Redshift, let’s look at transforming it into a star schema using ODI, by initially looking how we set up the connection to Redshift in ODI’s Topology Navigator.

As I mentioned in yesterday’s post, and in a blog article we wrote on non-Oracle databases a few months ago, Redshift is based on ParAccel technology but uses PostgreSQL ODBC and JDBC drivers to connect to it. Therefore, we can create a Physical Architecture connection in ODI through to it using the PostgreSQL technology type, like this:

Odi config 1a

For the Redshift JDBC URL you’ll need, refer to the AWS cluster summary page back on the Amazon AWS Console.

Odi config 2a
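The URL takes this general form (the cluster endpoint and database name below are illustrative; 5439 is the default Redshift port):

jdbc:postgresql://mycluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dev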

We can then add our schemas:

Odi config 3a

Then, if your connection to the Redshift database is working correctly, you should see a list of schemas in the drop down menus:

Odi config 4a

Once all the schemas have been added, when then go through the familiar ODI process of logical architecture and context mapping:

Odi config 5a

Next, we can create a model and reverse engineer in our data stores -

Odi config 6a

It’s a good idea at this point to review each datastore and ensure that all attributes have been correctly assigned a datatype. To save a little reading time, a set of dimension and fact tables were created in the performance schema, and these were also reverse engineered in:

Odi config 7a

So from here on out, the process of creating and using mappings, packages, variables etc to populate your target tables is fairly straightforward, one which ODI Developers will be familiar with. You add your source and target data stores and implement the transformation logic.

Odi config 8a

For the initial load, the SQL Control Append IKM was used, and for the most part this KM worked well without customisation. However, one of the early issues encountered during execution was with unordered outer joins – this appears to be an issue with the PostgreSQL technology. This was resolved by amending the mapping and setting an order for the joins:

Odi config 9a

Merging / Incremental loading

One of the things to be aware of with Redshift at the time of writing is the absence of MERGE functionality (we’re so spoilt in Oracle DBMS land ;) ). You can of course achieve this by issuing separate insert and update statements. There is also the SQL Incremental Update IKM in ODI, but you may want to review the KM task steps against your requirements before proceeding, as invariably some customisation will be required to improve performance. The same applies to SCD functionality – you should consider utilising analytic window functions (Redshift supports several aggregate and ranking window functions) to achieve your desired format.

Additionally, as mentioned previously Attunity supports a variety of replication tasks – you can do one off initial loading, an initial load + apply changes, and/or collect deltas into a change tracking table in the Redshift database. This in itself means you have the basic building blocks to create and maintain a Foundation layer as part of your Data Warehouse Design.

So, once we have data in our target fact and dimension tables, we can model these in OBIEE and create some analyses.

OBIEE Configuration

 As per the AWS documentation for Redshift, the recommended ODBC driver was downloaded from the Amazon AWS website, and an ODBC data source was created on the OBIEE server:

Obiee config 1a

For the server entry, you can use either the public or private leader node IP, depending on your network/security configuration. Just backtracking a little here – you may recall that the usr_dw account was created after the initial Redshift cluster build. In order for it to be able to access the tables in the performance schema, we need to grant the required select privileges.
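For example (a sketch only; the exact grants will depend on your own security model):

grant usage on schema performance to usr_dw;
grant select on all tables in schema performance to usr_dw;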

Once the ODBC connection is defined, we can create an RPD, define our database connection properties, import our tables and start modelling.

Obiee config 2a

Obiee config 3a

 

When it came to setting the features I used the query DBMS function and then went with what it set. Bear in mind that because PostgreSQL and Redshift aren’t supported source databases for OBIEE, depending on the analyses you are creating you may or may not hit functionality issues, so you may find yourself tweaking these settings.

Obiee config 4a

So if your connection is defined correctly and there are no other issues, you should now be able to import the tables:

Obiee config 5a

Once you’ve made your selection, you can then define the relationships between the fact and dimension tables (if you have primary and foreign keys defined these should propagate through), create any aliases etc., and then you’re about ready to start building up the Business Model and Presentation areas.

Obiee config 6a

Once the RPD modelling is completed, we can upload it and create some content.

Obiee config 8a

Obiee config 7a

Admittedly the analyses created for this dashboard were fairly basic, so no functionality issues were encountered during creation and execution – you might hit some with more edge cases such as analytic functions or nested subtotals, or you might end up hitting performance issues when you’ve got lots of concurrent users, as the SQL issued by OBIEE might not be as efficient as it would be for a supported data source.

Performance Considerations for Redshift

Pete Scott will be covering the ins and outs of designing for performance in a blog post tomorrow, for which I’ll add a link to this post once it’s up. However, one thing that should be mentioned is the importance of setting sort and distribution keys appropriately on the fact and dimension tables, as well as primary and foreign keys (whilst these constraints are not enforced, they are used by the optimiser when determining execution plans). The sort and distribution keys determine how the data is stored on disk and how it is distributed across the compute nodes, and they can make a big impact on query response times – quite apart from any issues around the lack of official support in OBIEE for Redshift as a data source.

So to conclude – whilst not officially supported, it is possible to create and maintain a data warehouse in Redshift and use the Oracle product set to drive ETL and reporting. Some of the features that make Redshift an attractive DB platform are its ease of management, its scalability and sizing options, and the ability to get a cluster up and running in hours rather than days. Combined with ODI and OBIEE, and therefore a transferable skills base, it makes for an intriguing DW solution.

Categories: BI & Warehousing

Building an Amazon Redshift Data Warehouse using ODI and Attunity Part 1 : Configuring Redshift, and CDC Using Attunity

Rittman Mead Consulting - Tue, 2014-02-18 12:36

One of our longstanding Oracle customers recently asked us to put together a proof-of-concept DW system using Amazon Redshift as the data warehouse database, rather than Oracle Database. The main driver for this was the economics of running Redshift in the cloud vs. Oracle on-premise, or using Amazon RDS, and they were also interested in the potential performance benefits of running their data warehouse on a column-store database. They were also interested in trying out Attunity Replicate with Cloudbeam as the CDC solution, a product conceptually similar to Oracle GoldenGate but hosted on the Amazon AWS platform, and licensed by data volume and month rather than per CPU, perpetual, as GoldenGate is.

Because the client is a big user of Oracle ETL technology though – currently OWB, moving to ODI in the near future – they still wanted to use ODI to do the main data integration work, so an interesting question for us was whether ODI was a suitable tool for use with Redshift, which isn’t an officially supported source or target platform for ODI. It’s important to understand what “not supported” or “not certified” means in this context – it doesn’t mean it won’t work, it just means you’re on your own if it doesn’t. What gave us hope though was that Redshift uses Postgres-compatible ODBC and JDBC drivers, and we got it working with OBIEE a few months ago, so the premise for the PoC seemed reasonable to us. The diagram below shows the architecture we went with at the start of the project.

Architecture

 

A little about the non-Oracle components:

  • Amazon Redshift – Amazon’s fully managed data warehouse in the cloud offering. Based on ParAccel technology (which itself has roots in PostgreSQL), the service utilises MPP, columnar storage and compression principles to enhance query performance.
  • Attunity Replicate & Cloudbeam – Attunity provide log based Change Data Capture and replication functionality, which includes Redshift as a target database. Attunity supports on going source to target replication as well as storing change data in audit / change tracking tables. In order to replicate data to Redshift, you need to set up an account with Attunity and then sign up for a CloudBeam subscription. 
  • Amazon S3 – Amazon’s cloud storage service, which is used by Attunity to stage replication/change data prior to loading into the Redshift database.

Initial Configuration and Load

Source Database

The source RDBMS used for this proof of concept was an Oracle EBS database running on 11.1.0.7. In particular, we used the GL schema as this housed some sizeable data. The database requirements for Attunity Replicate are listed below (with a SQL sketch of the relevant commands after the list):

  • Add supplemental logging (for this experiment full logging was added on the GL tables)
  • Switch the database into archive log mode
  • Optionally create a dedicated database account for Attunity – requires select privs on various V$ tables, the GL schema tables and execute on the LogMiner packages. It’s at this point you may want to touch base with your resident DBA :) 
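For reference, the source-side preparation looks roughly like this (a sketch only; the GL table shown is just an example – in the trial, full logging was added on the GL tables being replicated):

-- Enable archivelog mode (run as SYSDBA, with the database mounted but not open)
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;

-- Minimal database-level supplemental logging, plus full logging on a GL table
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER TABLE gl.gl_je_lines ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;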

S3 Storage

The Attunity CloudBeam service initially stages files in an S3 bucket before loading them into the target database schema. In order to create this, browse to the S3 section of the AWS Console and click Create Bucket. Once the bucket has been created, you then need to create a folder within it; in this instance the folder was named ORA_STG.

S3 1ds

 

Redshift Cluster

You deploy a Redshift database from the AWS management console. From the Redshift page, select a Region, and then Launch Cluster. From here, fill out the cluster details, for example:

Redshift config 1a

Leaving the database name blank will create a database named ‘dev’.  

After clicking Continue, the next page allows you to size your cluster.  For our test case, we opted for a two node cluster (which in fact has 3 nodes – 1 Leader node and 2 Compute nodes):
 

Redshft config 2a

The next configuration screen allows you to set various security and network options:

Redshift config 3a

Finally you are presented with a summary screen confirming your configuration options. The important thing to note here is that as soon as the cluster is launched and running, you will start being charged.

Reshift config 4a

After you click Launch, the cluster build will begin, and after some time the Status will switch from creating to Available. Clicking the Cluster Name hyperlink will show you a summary screen, detailing security config, cluster status, information about backups and URLs for connecting to your cluster. 

Redshift config 5a

At the bottom of the screen is a list of public and private IPs for the Leader and Compute nodes of the cluster:

Redshift config 6a

We used the public IP of the Leader node for our inbound client connections.

The final step in terms of cluster configuration was to configure a list of IPs that were authorised to access the cluster. This is handled from within the security screen on the Redshift Management Console.
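If you would rather script this step, the same authorisation can be done with the AWS CLI – a sketch, assuming your cluster uses the default cluster security group, and substituting your own client IP:

aws redshift authorize-cluster-security-group-ingress \
    --cluster-security-group-name default \
    --cidrip 203.0.113.10/32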

Redshift config 7a

Once you’ve authorised your machine, you can connect to the Redshift database via the JDBC/ODBC URLs shown on the cluster summary screen previously. Amazon recommend using SQL Workbench, and also link to JDBC and ODBC drivers which you should install. So, the next step was to enter the connection details, connect to the database and create some target users and schemas.

The connection details were defined using the downloaded jdbc driver, and the jdbc url from the AWS summary screen:

Redshift config 8a

Once connected, users and schemas were created with the following statements:

create schema performance;
create schema staging;
create schema foundation;
 
create user usr_stg with password '<password>';
create user usr_fnd with password '<password>';
create user usr_dw with password '<password>';
create user odi_work with password '<password>';

NB: in Redshift, passwords must contain at least one uppercase character and a number, and be at least 8 characters long.

Once the users and schemas are created, we can then grant privileges to the odi_work account so that it can create objects across the schemas:

grant usage,create on schema staging to odi_work;
grant usage,create on schema performance to odi_work;
grant usage,create on schema foundation to odi_work;

Of course, how you would manage security and database privileges in your own environment is entirely up to you. You can configure object level privileges as your database security guidelines dictate. 

Once your users are set up, you can create objects and grant privileges as required. 

Attunity

Once the Attunity Replicate software is installed, it’s a case of adding the source and target databases.

Source: Oracle EBS database

 

Clicking Manage Databases and then Add Database brings up the following dialogue, in which you fill in the details of the database. The Advanced tab contains settings controlling how Attunity implements and executes CDC/log mining.

Add db 1a

Attunity 1a

Target: Redshift Cluster

You follow the same process as above, but select Amazon Redshift as the Type and specify it as a target. Within this screen, you enter details of the cluster endpoint address, the port (refer back to the AWS cluster summary screen), and a user name and password to access the database. If you have entered the details correctly and you can reach the cluster, you should be able to drop down the Database Name list and see the dev database. In the next section of this dialog you enter details of your Attunity CloudBeam subscription, and then your S3 bucket and folder. It should be noted here that in your environment you may not want to use your master database user account, but instead opt for a slightly less privileged user.

Attunity 2a

Once you have filled out each section, it is worth clicking the Test button on the bottom left of the dialog to ensure all details are correct.

So, now that we have our source and target databases configured, the next step is to create a Replication Task. The task can be a combination of the following options: 

  • Full Load – replicate a table from source to target
  • Apply Changes – combined with the above, maintains the target copy with changes that occur on source
  • Store Changes – CDC deltas stored in a separate table 

Once you’ve decided on the type of replication task, you drag and drop the source and targets onto the palette, configure which tables you want to replicate from source, and specify any transformation rules e.g. to move to a different target schema. In our test case, we created an initial load task and a rule was created to map the GL schema in the source to the staging schema in the target. 

Attunity 4a

Attunity 6a

Within the task settings, you can choose to either create new target tables, or use pre-created ones. This is useful if you want to create the tables with certain options, e.g. specifying the sort and distribution keys. 
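For example, a pre-created staging table might be declared along the following lines – the column names and the choice of distribution and sort keys here are purely illustrative:

create table staging.gl_je_lines
(
  je_header_id   integer,
  je_line_num    integer,
  effective_date date,
  entered_dr     decimal(18,2),
  entered_cr     decimal(18,2)
)
distkey (je_header_id)
sortkey (effective_date);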

Attunity 5a

Once you’ve finished reviewing and tweaking settings, you can run the task. Once kicked off, you can review progress from the Monitor screen:

Attunity 7a

Any errors will be flagged up and can be reviewed in the Messages section. When the load completes, you can verify the existence of the tables in the Redshift database: 

Attunity 8a

We now need to grant select privileges on the above objects to odi_work (unless you configured Attunity to use the odi_work account) before it can access them: although they reside in a schema in which odi_work can create objects, they are owned by a different user, so permissions still need to be granted explicitly.

grant select on <schema>.<object> to odi_work;

 

So the first stage of getting our data from source to Redshift is complete. Tomorrow, we will look at transforming our initial load data into a target star schema using ODI.

Categories: BI & Warehousing

Using Groovy instead of WLST for OBIEE Systems Management

Rittman Mead Consulting - Sun, 2014-02-16 22:33

I remember some years back when Rittman Mead first received the OBIEE 11g beta and I tried installing it for the first time. It was a nightmare. If you think the 11.1.1.3 release was challenging, you should have tried working with any one of the betas. It was with these first few releases that Weblogic was injected into my world. Some time around the first month of starting the beta, I recall Mark Rittman saying something along the lines of: “all the Fusion Middleware stuff looks interesting, but I’m glad we won’t have to know anything about that.” You see… I just wanted a nice career in BI, but here I am writing a blog post on working with JMX MBeans.

JMX MBeans are complicated, meaning that careers probably exist for working with JMX MBeans and little else. I’m truly sorry to hear that. For the layperson (and by this, I mean people who work in BI), trying to understand MBeans probably ranks somewhere between getting decent service at a restaurant in the UK, and understanding why the Anomaly needs to join with the Source. It’s complex enough that most vendors provide utilities to simplify the process… which means using anything other than Java to work with them. For Weblogic, we have Weblogic Scripting Tool (WLST).

WLST is really just a specific implementation of Jython designed into a command-line utility that in many ways feels like SQL-Plus or RMAN. It’s designed to work interactively or in scripted fashion, and Rittman Mead has offered up some free WLST scripts to the community over the years via GitHub. If you take a look at our deploy_rpd.py script for example, you can make some sense of how WLST works. I’ve reworked that script slightly in this post to use hardcoded values instead of variables, to remove the exception handling, etc., to make the logic easier to follow. It’s shown here in two parts… the first part which connects to the Weblogic Administration Server and locks the domain:

import sys
import os

connect('weblogic', 'welcome1', 't3://localhost:7001');
domainCustom()
cd ('oracle.biee.admin')
cd ('oracle.biee.admin:type=BIDomain,group=Service')

print 'Locking the configuration...'
objs = jarray.array([], java.lang.Object)
strs = jarray.array([], java.lang.String)
invoke('lock', objs, strs)

This lock method is equivalent to the “Lock and Edit” button inside Fusion Middleware Control. We use the cd command to navigate to the BIDomain MBean because the lock method (as well as commit and rollback) is in that MBean. The invoke command is used to execute MBean methods in WLST and to pass parameters to those methods. To do this we use generic Object and String arrays, because the invoke command is generic and needs to support an ever-changing series of parameters. The rest of the pared-down script is below:

cd ('..')
cd ('oracle.biee.admin:type=BIDomain.BIInstance.ServerConfiguration,biInstance=coreapplication,group=Service')
# Set the parameters
params =  jarray.array(['/mnt/hgfs/stewart/scripts/sugarcrm.rpd','Admin123'],java.lang.Object)
# Set the parameters Signs
sign =  jarray.array(['java.lang.String', 'java.lang.String'],java.lang.String)
# Invoke the procedure
invoke( 'uploadRepository', params, sign)

cd ('..')
cd ('oracle.biee.admin:type=BIDomain,group=Service')
objs = jarray.array([], java.lang.Object)
strs = jarray.array([], java.lang.String)
invoke('commit', objs, strs)

In the second part of this script, we navigate to the ServerConfiguration MBean, which contains the uploadRepository method. We populate the params array with the RPD path and RPD password, we upload the repository, and then navigate back to the BIDomain MBean to perform the commit.

WLST strikes me as an attempt to take a programmatic interface, one that is usually accessed through Java, and make it accessible to non-developers. In this regard, I see it as a major failure. The MBean methods have been abstracted to the point of being almost useless. This seems like the exact opposite of simple to me. It seems overly complex to have to construct arrays and populate them just to pass parameters to methods. I would prefer to just pass parameters as, well… parameters. Additionally… notice how we had to cd to the BIDomain MBean, cd away to go use the ServerConfiguration MBean, and then cd back again. I would prefer to use the MBeans the way they were meant to be used… as objects, so we would be able to instantiate and hold multiple MBean objects at the same time.

If WLST is indeed easier to use than Java, then this is quite an indictment of Java. It works pretty well when we are writing processes that are JMX specific… when the commands that we need to issue are all calls to MBeans and nothing else. But what happens if we want to write a process that involves interacting with MBeans as well as some other processes or APIs, such as calls to any of the Oracle BI Server XML API, in a unified script? WLST is terrible at this. In my first attempt at writing something like this, I used Perl to execute both validaterpd commands as well as MBean commands. I ended up having to maintain a separate Jython script along with the Perl script, which didn’t sit well with me as a developer. What we really need is a language that is groovy to write and maintain, compiles to pure Java (just like Jython), and has the ability to encapsulate MBeans natively in a similar fashion to WLST. What we need is Groovy.

Groovy is a dynamic language that can be used to write scripts, or write object-oriented packages and classes, and can also be used to compile to Java byte code. Because it’s simple to write and has native MBean object handling, it’s easier (in my opinion) than WLST. How easy? Well, let’s go through our deploy_rpd.py example, and see if it’s easier to write and understand. Here’s our upload repository script in Groovy:

import javax.management.remote.JMXConnectorFactory
import javax.management.remote.JMXServiceURL
import javax.naming.Context

def h = new Hashtable()
h.put(Context.SECURITY_PRINCIPAL, 'weblogic')
h.put(Context.SECURITY_CREDENTIALS, 'welcome1')
h.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote")
jmxConnector = JMXConnectorFactory.connect(new JMXServiceURL("service:jmx:t3://localhost:7001/jndi/weblogic.management.mbeanservers.domainruntime"), h)
server = jmxConnector.getMBeanServerConnection()

def biDomain = new GroovyMBean(server,'oracle.biee.admin:type=BIDomain,group=Service')
def servConfig = new GroovyMBean(server, 'oracle.biee.admin:type=BIDomain.BIInstance.ServerConfiguration,biInstance=coreapplication,group=Service')

biDomain.lock()
servConfig.uploadRepository('/mnt/hgfs/stewart/scripts/sugarcrm.rpd', 'Admin123')
biDomain.commit()

That’s the whole script. There are a few things to point out here. First… establishing connectivity to the Weblogic Admin Server is slightly harder in Groovy. WLST hides a good bit of the complexity with connectivity, but the JMXConnectorFactory class in Groovy is a close second. However, once the connectivity is done, Groovy wins the rest of the way. Notice how we are able to configure two different MBean objects: biDomain for BIDomain and servConfig for ServerConfiguration. We don’t have to cd back and forth across the MBean hierarchy to work with the different MBeans; we can simply instantiate as many MBeans at once as we need to. Also, notice that we don’t have the generic invoke command to call the different methods in MBeans. We just call the method directly: if we want to call lock(), or commit(), or uploadRepository(), we just call it, and pass the parameters that it accepts. There is no need to build the generic Object and String arrays for passing in to invoke.

Because Groovy makes it very easy to execute external command-line processes (such as biserverxmlexec or biserverxmlgen), it’s a great choice for unified OBIEE processes that need to work with executables and JMX MBeans together in a single script. Also… Groovy accepts Java syntax, so if we can’t find the Groovy code samples we need on the internet, we can just plug in Java samples instead.
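As a quick illustration of that point, here’s a sketch of shelling out to a BI Server command-line utility from within the same Groovy script – the utility path, flags and RPD password are illustrative, not a definitive invocation:

// run an external BI Server utility (the arguments shown here are illustrative)
def cmd = ['sh', '-c', 'biserverxmlexec -D /home/oracle/mdsxml_repo -P Admin123 -O /tmp/generated.rpd']
def proc = cmd.execute()

// stream stdout/stderr back to the console and wait for the process to finish
proc.waitForProcessOutput(System.out, System.err)
println "biserverxmlexec finished with exit code ${proc.exitValue()}"

With that in place, MBean calls and command-line steps can sit side by side in one script, which was the whole point of moving away from the Perl-plus-Jython combination.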

Categories: BI & Warehousing

OBIEE 11g Presentation Server Catalog Replication Across Redundant Servers

Rittman Mead Consulting - Fri, 2014-02-14 08:44

The thing about OBIEE 11g is that there is no single way to get things “up and running”. Over the past year I have performed countless OBIEE installs and almost every one has been different to the last, due to infrastructure challenges, scale requirements or requirements for high-availability and server redundancy.

There is little we can do about infrastructure challenges, typically around policies which are in place across the IT estate and must be adhered to. Scale is simple; more users means more or bigger servers. The last point on the list is HA & server redundancy, and this is arguably the biggest area we need to address before punching in “./runInstaller” at the command-prompt.

Mark Rittman is currently pulling together a wider blog post around options for HA & server redundancy, so I’m not about to get into that in this post. What I will share in my first Rittman Mead blog post is something I have been working on recently that is sometimes implemented as part of an HA/redundancy solution: BI Presentation Server catalog replication.

OBIEE webcat sync

BI Presentation Server catalog replication is the Oracle-supported method of moving a BI Presentation Server catalog from one running OBIEE instance to another. This is useful in many situations, namely:

  • Release cycle management when specific aspects of the BI Presentation Server catalog need to be moved from one OBIEE environment to another, such as development to testing.
  • Full BI Presentation Server catalog synchronization between OBIEE instances to provide two OBIEE instances with exactly the same BI Presentation Server catalog for DR.

The 2nd point above is the scenario I am looking to cover here – using this utility we can make sure the BI Presentation Server catalogs on two distinct OBIEE instances are the same. In a DR event we can either point users at the secondary server, or reverse this utility to pump the data from the DR server back to the primary server once the fault on the primary has been cleared. The RPD and other changes, such as user roles, would be managed as part of the release process and copied to the DR server when they are pushed to the production server. It is only the BI Presentation Server catalog which is constantly changing as the OBIEE instance is used, and therefore only the BI Presentation Server catalog which needs to be constantly replicated.

To set up BI Presentation Server catalog replication, we can use a command-line utility which ships with Oracle Business Intelligence 11g, called either “sawrepaj.bat” or “sawrepaj.sh” depending on your operating system (the two are functionally identical, and differ only in the way they are invoked from the command-line). “sawrepaj” has options to synchronise catalogs between servers, or to simply export a catalog for backups or as part of the release lifecycle; in addition, we can control the grain of what is exported down to an individual catalog folder or file.

The utility runs from the command-line and accepts a number of options, or commands, to specify how the replication takes place, using the following syntax:

sawrepaj.sh [/C path] command [command parameters]
  • mark – instruct the OBIEE server to ‘watch’ catalog files and record changes.
  • run – run the batch, as defined in the config.xml file.

Before using the utility, we need to create a config.xml file which instructs the OBIEE Presentation Services to record changes, and tells sawrepaj what to do when it executes.

The config.xml file can be placed anywhere in the filesystem and is referenced using the /C switch when calling sawrepaj – more about this /C switch later. The config file tells the sawrepaj utility which Oracle BI Presentation Servers are involved in the replication, and how to access them. The example below includes a source and target Oracle BI Presentation Server – however you could add countless other BI Presentation Servers depending on your environment.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Oracle BI Presentation Services Replication Agent Configuration File -->
<Config>
<General>
<ExportDirectory>smb://path/to/directory</ExportDirectory>
<LogExpiresHours>48</LogExpiresHours>
</General>
<Server name="serverA" user="weblogic" pwd="Password01">
<Property name= "oracle.bi.presentation.url"
value ="http://servera:9704/analytics/saw.dll"/>
<Property name="oracle.bi.presentation.usesoaptcptransport" value="N"/>
</Server>
<Server name="serverB" user="weblogic" pwd="Password01">
<Property name= "oracle.bi.presentation.url"
value ="http://serverb:9704/analytics/saw.dll"/>
<Property name="oracle.bi.presentation.usesoaptcptransport" value="N"/>
</Server>
<Folderset name="all">
<Folder>/</Folder>
</Folderset>
<ReplicationTask destination="serverA" source="serverB" folders="all" />
</Config>

The three sections of interest in the above file are Server, FolderSet and ReplicationTask:

  • Server tag points to an OBIEE Presentation Services instance, with credentials for a user which has SOAP access.
  • FolderSet tag contains a list of Folder tags which detail the folders within the webcat to replicate.
  • ReplicationTask contains the instruction of what to do when we run sawrepaj. For the above example, serverA is the target and ServerB is the source.

Before any replication can begin, we need to enable each BI Presentation Server component for replication. Add the following to the instanceconfig.xml file between the <catalog> tags. This needs to be done for all Presentation Server components, whether source or target for the replication.

<Replication>
<Enabled>true</Enabled>
<ReadLogRecordsSinceHoursAgo>120</ReadLogRecordsSinceHoursAgo>
<RecordsInFileLimit>4000</RecordsInFileLimit>
</Replication>

Now, back to the sawrepaj commands:

mark – The mark command is used to tell the BI Presentation Services which webcat files to monitor for changes. We can be as specific as a single file, or as vague as an entire webcat. You will need to run this on a BI server with the following command:

sawrepaj mark all /

run – The final part of the process, the run command instructs sawrepaj to run all the tasks in the config.xml file. Below is an example of the command. If your config.xml file is not in the root path of the sawrepaj tool, you will need to specify its location with the /C switch. This command can be added to a cronjob or a Windows scheduler, or can be run manually or as part of another script.

sawrepaj /C /my/config/file/config.xml run
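For example, to run the replication nightly from cron (the install path for sawrepaj.sh and the log location are illustrative):

0 2 * * * /u01/app/obiee/sawrepaj/sawrepaj.sh /C /my/config/file/config.xml run >> /var/log/sawrepaj.log 2>&1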

That’s it! Once completed, the above steps will give you all you need to set up full catalog replication between multiple BI servers.

Categories: BI & Warehousing

Date formatting in OBIEE 11g – setting the default Locale for users

Rittman Mead Consulting - Wed, 2014-02-12 15:35

A short and sweet blog post this one, simply to plug a gap in Google that I found when trying to do achieve this today.

How users see things like date formats in OBIEE is determined by their locale, which is usually related to language but not always the same.

The classic example, and the one I was trying to resolve, was that users were seeing dates presented in the ‘wrong’ format. The users are British, and instead of seeing dates in the correct format of DD/MM/YYYY (19/01/2014) they were seeing MM/DD/YYYY (01/19/2014). (side note: this lighthearted article in the Guardian refers to this ridiculous format as “Middle-Endianness”…)

In this example I have got the same date value in two columns. One column I have explicitly formatted (“DATE_VERBOSE”) so there is no ambiguity about the date. The second column (“DATE_LOCALE”) is the same date value, but formatted according to the locale setting. It is this default locale setting that I want to demonstrate how to set – manually changing all date columns to a particular format is not a sustainable way to develop reports…

Oracle’s documentation is not very clear in this respect, and Google is a mishmash of half solutions and rubbish, so I humbly offer this short instruction for how to ensure users have the correct locale applied.

Solution 1 – manually

Each user can configure their chosen locale, from the My Account dialog:

Whenever changing the locale, you need to logout and log back in to OBIEE for it to take effect.

This setting, if they override it from the default, is stored in the Presentation Catalog under the user’s userprefsxmlstore.xml file:

<sawprefs:userPreference prefKey="Locale">en-gb</sawprefs:userPreference>

So for a single user, this method is fine – but for more than one existing user, or any new users, it doesn’t fly. We don’t want to go hacking Presentation Catalog files, that is bad, mmm’kay?

Solution 2 – The USERLOCALE System Session Variable

There is a System Session Variable called USERLOCALE which you can set per user, and will define their locale. Easy! All you have to watch out for is that its value (thanks Christian) is case sensitive. This cost me some hours today, so I’ll say it again – the value you specify for USERLOCALE is case sensitive and OBIEE will not help you out if you specify it wrong (it’ll just ignore it).

You can hardcode the value to be the same for all users, or if you want to get fancy you could drive it from a database table so that it varies depending on the user’s location.
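As an illustration, the table-driven version of the initialisation block query might look something like this – the table and column names are hypothetical, the result gets mapped to the USERLOCALE variable, and the LOWER() is there to protect you from the case-sensitivity gotcha described above:

SELECT LOWER(locale_code)
FROM   obiee_user_prefs
WHERE  UPPER(username) = UPPER(':USER')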

The Session Variable and example Initialisation Block are shown here:

To validate what is going on with the session variable, you can use a Static Text view with the following content to show the values of the relevant session and presentation variables:

[u][b]System Session Variables[/b][/u]
<p align=left>
[b]NQ_SESSION.USERLOCALE: [/b]@{biServer.variables['NQ_SESSION.USERLOCALE']}[br/] 
[b]NQ_SESSION.WEBLANGUAGE: [/b] @{biServer.variables['NQ_SESSION.WEBLANGUAGE']}[br/]
</p>

[u][b]Predefined Presentation Variables[/b][/u][br/]
<p align="left">
[b]session.language:[/b] @{session.language}[br/]
[b]session.locale:[/b] @{session.locale}[br/]
</p>

In my test report the Static Text view is to the right of the data table, and I can now see that the USERLOCALE session variable has been populated. Crucially, the session.locale presentation variable is inheriting the correct value, which in turn is driving the desired formatting of the DATE_LOCALE column.

Method 3 – the hack

I’m including this here for completeness, and because you will come across it in plenty of places on the web so it is good to understand why it is not the correct solution.

Within the OBIEE installation folder (FMW_HOME) you will find the ORACLE_HOME folder, Oracle_BI1, in which all the application’s binaries and internal configuration are stored. This is not a “user serviceable” folder, and Oracle are at liberty to change any of the files within it whenever you patch or upgrade the software.

Within Oracle_BI1 there is a file called localemappings.xml and this file sets the default locale used if one is not explicitly configured. If you change the “catch all” line in this XML file to your desired locale, it will set the default.

<when matches="*"><localeDefinition name="en-gb"/></when>

But, this is not the correct way to do it, and there is no need to because the USERLOCALE method above works just fine.

What about AllowedLocales in instanceconfig.xml?

In combing through the documentation (RTFM, after all), you will come across reference to the Localization tags for the Presentation Services configuration file instanceconfig.xml. Within this tag you can specify AllowedLocales. However, all this does is provide a way to restrict the dropdown locale list (shown in method 1 above). So this is a nice thing to do for your user base if there is a finite number of locales they will want to use (save them wading through many options), but it does not influence the default locale – even if you set it to a single value, your desired default.

Summary

If you want to set the default locale, use the system session variable USERLOCALE. Make sure you specify your value in lowercase, otherwise it won’t work.

Categories: BI & Warehousing

Filtered Charts

Tim Dexter - Tue, 2014-02-11 17:19

A customer question came in this week regarding filtering a chart. They have a report with a bunch of criteria with monetary values but, rather than show all of the criteria in a pie chart, they just want to show a few. For example:

 This ...
 rather than this

 There are a couple of ways to tackle this:

1. Filter the chart data in the chart definition. Using an XPATH expression you can filter out all of the criteria you do not want to see. Open the chart definition and update it: you will need to update the RowCount, RowLabels and DataValues attributes in the chart definition, adding in the following XPATH expression:

    [DEPARTMENT_NAME='Accounting' or DEPARTMENT_NAME='Marketing' or DEPARTMENT_NAME='Executive']

so the DataValues value becomes:

    <DataValues><xsl:for-each-group select=".//G_1[DEPARTMENT_NAME='Accounting' or 
                                                     DEPARTMENT_NAME='Marketing' or 
                                                        DEPARTMENT_NAME='Executive']" ...

2. Create a variable in the template to hold just the values you want to chart.

    <?variable: filterDepts; /DATA_DS/LIST_G_1/G_1[DEPARTMENT_NAME='Accounting' or 
                                                     DEPARTMENT_NAME='Marketing' or 
                                                       DEPARTMENT_NAME='Executive']?>

Then update the chart definition with the variable for the same three attributes above, the RowCount, RowLabels and DataValues. For example:

    <DataValues><xsl:for-each-group select="$filterDepts" ...

These both work admirably, but they both require some manual updating of the chart definition, which can get fiddly and a pain to maintain. I'm also just filtering for three departments; when you get up to 5 or 6, the XPATH starts to become a pain to maintain. Option 2 alleviates this somewhat because you only need to define the filter once to create the filtered variable.
A better option may be ...

3. Force the effort down into the data layer. Create another query in the report that just pulls the data for the chart.

LIST_G2/G_2 holds the data for the chart. Then all you need to do is create a vanilla chart on that particular section of the data.

Yes, there is some overhead to re-fetch the data, but this is going to be about the same as, if not less than, the extra processing required in the template with options 1 and 2. This has another advantage: you can parametrize the criteria for the user. You can create a parameter to allow the user to select, in my case, the department(s) they want to chart.


It's simple enough to create the multi-select parameter and modify the query to filter based on the values chosen by the user.
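The data set query then ends up looking something like the sketch below, with the multi-select parameter (here called P_DEPTS, a name I've made up) bound into the WHERE clause so that the chosen values get expanded into the IN list:

select d.department_name,
       sum(e.salary) as total_salary
from   departments d
join   employees e on e.department_id = d.department_id
where  d.department_name in (:P_DEPTS)
group by d.department_name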

Sample report (including data model and layout template) here – just un-archive it into your catalog.
RTF Template plus sample data available here.




Categories: BI & Warehousing

Rittman Mead BI Forum 2014 Abstract Scoring Now Live – For 1 Week Only!

Rittman Mead Consulting - Tue, 2014-02-11 04:48

The call for papers for the Rittman Mead BI Forum 2014 closed at the end of January, and we’ve had some excellent submissions on topics ranging from OBIEE, Visualizations and data discovery through to in-memory analytics, big data and data integration. As always, we’re now opening up the abstract submission list for scoring, so that anyone considering coming to either the Brighton or Atlanta events can have a say in what abstracts are selected.

The voting forms, and event details, are below:

In case you missed it, we also announced the speaker for the Wednesday masterclass the other day – Lars George from Cloudera, who’ll be talking about Hadoop, HBase, Cloudera and how it all applies to the worlds of analytics, BI and DW – something we’re all really excited about.

Voting is open for just one week, and will close at 5pm PST on Tuesday, 18th Feb. Shortly afterwards we’ll announce the speaker line-up, and open-up registrations for both events. Keep an eye on the blog for more details as they come.

Categories: BI & Warehousing

Alternate Tray Printing

Tim Dexter - Mon, 2014-02-10 17:18

Ever since we introduced support for check printing PCL escape sequences in 11.1.1.7, i.e. being able to set the MICR font or change the print cartridge to the magnetic ink for that string, I have wanted to test out other PCL commands, particularly changing print trays. Say you have letter-headed, pre-printed or colored paper in Tray 2 but only want to use it for the first page, a specific page or a separator page; the rest can come out of plain ol' Tray 1 with its copier paper.

I have had a couple of inquiries recently and so I finally took some time to test out the theory. I should add here that the dev team thought it would work but were not 100% sure. The feature was built for the check printing requirements alone, so they could not support any other commands. I was hopeful though!
In short, it works!



I can generate a document and print it with embedded PCL commands to change from Tray 1 (&l4H) to Tray 2 (&l1H) - yep, makes no sense to me either. I got the codes from here, a useful site with a host of other possibilities to test.

For the test, I just created a department-employee listing that broke the page when the department changed. Just inside the first grouping loop I included the PCL string to set Tray 1.

<pcl><control><esc/>&l4H </control> </pcl>

Note: this has to be in clear text; you cannot use a formfield.
I then created a dummy insert page using a template and called it from just within the closing department group field (InsertPAGE field). At the beginning of the dummy page I included the PCL string to get the paper from Tray 2:

<pcl><control><esc/>&l1H</control> </pcl>

When you run this to PDF you will see the PCL string. I played with this and hid it using a white font and it worked great, assuming you have white paper :)

When you set up the printer in the BIP admin console, you need to ensure you have picked the 'PDF to PCL Filter' for the printer.



If you don't want to have PCL enabled all the time, you can have multiple definitions for the same printer with/without the PCL filter. Users just need to pick the appropriate printer instance. Using this filter ensures that those PCL strings will be preserved into the final PCL that gets sent to the printer.

Example files here. Official documentation on the PCL string here.

Happy Printing!





Categories: BI & Warehousing

Testing Oracle Direct Connector for HDFS as an Alternative to Hive ODBC for OBIEE11g

Rittman Mead Consulting - Thu, 2014-02-06 05:39

In a post on the blog a couple of weeks ago, I looked at loading up a set of flight delays data into Apache Hadoop, then analysing it using Apache Hive, Cloudera Impala and OBIEE. In this scenario, OBIEE connects to the Hadoop cluster using Hive and Impala ODBC drivers, and then either Hive (through MapReduce jobs) or Impala (through its in-memory distributed query engine) returns the data to OBIEE, for display on the dashboard.

In my initial tests, as you’d expect, Hive was fairly slow (as it’s optimised towards large (TB+) batch jobs), whilst Impala was fast, on a par with regular Oracle database queries, albeit with a much more limited set of SQL functions supported. One of the questions I got asked after I posted the blog article, though, was how this approach compared to accessing the Hadoop data via Oracle’s “big data connectors” – either Oracle Loader for Hadoop (OLH) or Oracle Direct Connector for HDFS? Could these give us similar performance to Hive or Impala, but make the Hadoop data more “Oracle-compatible” and suitable for querying with OBIEE?

For anyone not all that familiar with Oracle’s big data connectors, there are two that are particularly relevant to BI/DW use-cases; Oracle Loader for Hadoop (OLH) is a bulk-loader between Hadoop and the Oracle Database that uses MapReduce on the Hadoop side to prepare, sort and transform data before efficiently loading it into an Oracle database. There are already lots of ways you can move data in and out of Oracle from a Hadoop source – sqoop for example – but presumably Oracle feel OLH is a particularly efficient way of doing it, exploiting Oracle direct/bulk-loading capabilities, parallelism and so forth – but it’s a loader, not a transparent reading mechanism, so it’s only really appropriate for ETL-type situations where you want to copy the data out of Hadoop and into an Oracle data warehouse.

Oracle Direct Connector for HDFS (ODCH) is a bit different though, in that it uses the Oracle external table feature to map an Oracle data dictionary table onto HDFS files, or optionally Apache Hive tables (and therefore onto HDFS files, NoSQL databases or whatever). So you could potentially use ODCH to create Oracle database table representations of your Hive tables / HDFS files, then map these into the OBIEE repository and work with them just like any other Oracle (external) table. What this means then is that you’re not reliant on Hive or Impala ODBC drivers, or the BI Server having to generate HiveQL or ImpalaQL queries (with Impala of course not yet being officially supported), making your work from the OBIEE side a lot easier than if you’re trying to work with Hive or Impala directly.

So how well does it work then? I decided to give it a try using the new BigDataLite VM I mentioned on the blog last week, along with the Airline Delays dataset I used in the post on Impala and Hive the other week. My starting point then was a Hive database with four tables – flight performance (19m rows), dest and origin (2k rows each) and carrier (1.5k rows), like this:

NewImage

Now obviously Oracle’s big data connectors aren’t installed as part of Hadoop by default, but the BigDataLite VM does have them pre-installed, along with an Oracle Database 12c database, so a lot of the work in setting things up is done for you. By default, an install of the big data connectors connects to the Hadoop environment on the same machine, so all of the setup commands I use will assume that the Hive server and so on are on the same server – localhost in this case – although of course you can configure them to connect to a remote Hadoop server or cluster.

The way ODCH works is that it uses the Oracle external table “pre-processor” feature to stream HDFS file content into your query session, with the pre-processor in this scenario being a utility provided by the big data connectors to connect the two environments together. This blog post from Oracle explains the process in a bit more detail, but the key things to understand are that it’s conceptually similar to accessing regular file data via external tables, giving us the benefit of fairly seamless access to this external data (vs. using SQL*Loader, or OLH in this case), but it’s also not “real” Oracle table data, so there are no indexes or any other performance structures that can make queries run faster – though, like regular external tables, you can run ODCH loading in parallel.

So let’s use the setup in the BigDataLite VM to create some external tables in the Oracle database that map to my Hive tables, and underlying HDFS data files. Before we do this though we need to set up a few things; first, we need to create a temporary directory on the Linux filesystem to hold the metadata files created by ODCH, and we also need to specify where the ODCH HDFS file streamer utility can be found. Logging in as oracle/welcome1 on the VM, I first create the temporary directory, and then set an environment variable the rest of the process will be looking for:

mkdir bi_airlines_dir
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*

Now I log into SQL*Plus and create a database user that I’ll create the external tables in, and grant it access to the USERS tablespace:

sqlplus / as sysdba
 
create user bi_airlines identified by bi_airlines quota unlimited on users;
grant connect, resource to bi_airlines;

Next, I need to create two database directory objects; one to point to where the ODCH HDFS stream pre-processor lives, and the other to point to my temporary directory:

create or replace directory bi_airlines_dir as '/home/oracle/bi_airlines_dir';
grant read, write on directory bi_airlines_dir to bi_airlines;
 
create or replace directory osch_bin_path as '/u01/connectors/osch/bin';
grant read, write, execute on directory osch_bin_path to bi_airlines;

Now, re-purposing the scripts on the BigDataLite VM provided to create external tables over the Moviework Hive database, each of my tables requires two setup files; one, a shell script, to call the ODCH utility and reference an XML parameter file, and the second, the XML file, which contains the definition of the external table. To take an example, the shell script (genloc_origin_hive.sh) to create an external table over my “origin” Hive table looks like this:

# Add HIVE_HOME/lib* to HADOOP_CLASSPATH. This cannot be done
# in the login profiles since this breaks Pig in the previous lab.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*

hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-conf /home/oracle/movie/moviework/osch/origin_hive.xml \
-createTable

whereas the accompanying XML file (origin_hive.xml) looks like this:

<configuration>

<property>
 <name>oracle.hadoop.exttab.tableName</name>
 <value>BI_AIRLINES.ORIGIN_EXT_TAB_HIVE</value> 
</property>

<property> 
 <name>oracle.hadoop.exttab.sourceType</name>
 <value>hive</value> 
</property>

<property> 
 <name>oracle.hadoop.exttab.hive.tableName</name>
 <value>origin</value> 
</property>

<property> 
 <name>oracle.hadoop.exttab.hive.databaseName</name>
 <value>bi_airlines</value> 
</property>

<property>
 <name>oracle.hadoop.connection.url</name>
 <value>jdbc:oracle:thin:@localhost:1521:orcl</value> 
</property>

<property>
 <name>oracle.hadoop.connection.user</name>
 <value>BI_AIRLINES</value> 
</property>

<property>
 <name>oracle.hadoop.exttab.defaultDirectory</name>
 <value>bi_airlines_dir</value> 
</property>

</configuration>

Within the XML file, we specify the name of the external table that the utility will create in Oracle for you (ORIGIN_EXT_TAB_HIVE within the BI_AIRLINES schema); that it’s a Hive (rather than HDFS file) source; and the Hive database name and table name to source data from. With the operating system directory that you pass to it, the utility stores metadata within it pointing to the individual HDFS files that contain your data. This means that whilst the HDFS data itself isn’t copied to Oracle, a hard reference to the file is, so if you add more files to the Hive table’s HDFS directory you’ll need to re-run the utility to register them – it’s not completely automatic beyond this point, but it’s more or less OK if you’ve got a single file providing the data, as I have in this slightly non-representative case.

So let’s run the shell script and create one of the external tables, passing in the Oracle database password for the “bi_airlines” user when prompted:

[oracle@bigdatalite osch]$ sh genloc_origin_hive.sh
Oracle SQL Connector for HDFS Release 2.3.0 - Production

Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved.

[Enter Database Password:]
14/02/06 11:59:58 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/02/06 11:59:58 INFO hive.metastore: Trying to connect to metastore with URI thrift://bigdatalite.localdomain:9083
14/02/06 11:59:58 INFO hive.metastore: Waiting 1 seconds before next connection attempt.
14/02/06 11:59:59 INFO hive.metastore: Connected to metastore.
The create table command succeeded.

CREATE TABLE "BI_AIRLINES"."ORIGIN_EXT_TAB_HIVE"
(
"ORIGIN" VARCHAR2(4000),
"ORIGIN_DEST" VARCHAR2(4000),
"ORIGIN_CITY" VARCHAR2(4000),
"ORIGIN_STATE" VARCHAR2(4000),
"AIRPORT_ID" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( 
TYPE ORACLE_LOADER
DEFAULT DIRECTORY "BI_AIRLINES_DIR"
ACCESS PARAMETERS
(
RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'7C'
MISSING FIELD VALUES ARE NULL
(
"ORIGIN" CHAR(4000) NULLIF "ORIGIN"=0X'5C4E',
"ORIGIN_DEST" CHAR(4000) NULLIF "ORIGIN_DEST"=0X'5C4E',
"ORIGIN_CITY" CHAR(4000) NULLIF "ORIGIN_CITY"=0X'5C4E',
"ORIGIN_STATE" CHAR(4000) NULLIF "ORIGIN_STATE"=0X'5C4E',
"AIRPORT_ID" CHAR(4000) NULLIF "AIRPORT_ID"=0X'5C4E'
)
)
LOCATION
(
'osch-20140206120000-1947-1'
)
) PARALLEL REJECT LIMIT UNLIMITED;

The following location files were created.

osch-20140206120000-1947-1 contains 1 URI, 150606 bytes

150606 hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/bi_airlines.db/origin/geog_origin.txt

Note the HDFS file reference at the end – this is the metadata created by the utility, which we’d need to update if more data files are added to the underlying Hive table.
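My understanding is that you don’t need to drop and recreate the external table to pick up new files; the same ExternalTable utility can be re-run with the -publish option to refresh the location files, along these lines:

hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-conf /home/oracle/movie/moviework/osch/origin_hive.xml \
-publish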

So let’s fire-up SQL*Plus and take a look at the tables:

[oracle@bigdatalite osch]$ sqlplus bi_airlines/bi_airlines

SQL*Plus: Release 12.1.0.1.0 Production on Thu Feb 6 13:20:12 2014

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Last Successful login time: Thu Feb 06 2014 12:21:52 -05:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select table_name from user_tables;

TABLE_NAME
--------------------------------------------------------------------------------
FLIGHT_PERF_EXT_TAB_HIVE
ORIGIN_EXT_TAB_HIVE
DEST_EXT_TAB_HIVE
CARRIER_EXT_TAB_HIVE

SQL> set timing on
SQL> select count(*) from flight_perf_ext_tab_hive;

COUNT(*)
----------
19648958

Elapsed: 00:01:09.74

Not bad, but just over a minute to do a quick row-count on the table. Let’s try the same thing using Hive itself – note that ODCH and Hive aren’t using the same access technique to the underlying data, and one might make more sense than the other for particular query situations.

[oracle@bigdatalite ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/oracle/hive_job_log_2b7797b2-1ef8-4ffc-b8b1-ab4f77b19cea_423831297.txt
hive> select count(*) from bi_airlines.flight_performance;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
 set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
 set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
 set mapred.reduce.tasks=<number>
Starting Job = job_201402052208_0001, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201402052208_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201402052208_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-02-06 13:26:48,223 Stage-1 map = 0%, reduce = 0%
2014-02-06 13:27:23,402 Stage-1 map = 51%, reduce = 0%
2014-02-06 13:27:40,487 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 16.59 sec
2014-02-06 13:27:41,512 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:42,523 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:43,535 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:44,549 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:45,558 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:46,569 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:47,582 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:48,596 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:49,608 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:50,616 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:51,625 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:52,641 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
MapReduce Total cumulative CPU time: 35 seconds 50 msec
Ended Job = job_201402052208_0001
MapReduce Jobs Launched: 
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 35.05 sec HDFS Read: 524501756 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 35 seconds 50 msec
OK
19648958
Time taken: 76.525 seconds

Ok, a bit longer, 76 seconds vs. 69 seconds – but as I said, they’re different access mechanisms and you’d probably get different results if you joined the data, filtered it etc.

Let’s try it in Impala now:

[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Connected to bigdatalite.localdomain:21000
Server version: impalad version 1.2.3 RELEASE (build 1cab04cdb88968a963a8ad6121a2e72a3a623eca)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.2.3 (1cab04c) built on Fri Dec 20 19:39:39 PST 2013)
[bigdatalite.localdomain:21000] > select count(*) from bi_airlines.flight_performance;
Query: select count(*) from bi_airlines.flight_performance
+----------+
| count(*) |
+----------+
| 19648958 |
+----------+
Returned 1 row(s) in 1.09s
[bigdatalite.localdomain:21000] >

Ah, that’s more like it – the count this time was returned in just over a second, which is what you’d expect, as Impala is optimised for ad-hoc queries, Hive for larger, batch-style SQL access and transformations. Moving over to OBIEE though, after creating a TNSNAMES connection through to the BigDataLite Oracle Database 12c database, I can import the external tables just like any regular Oracle source:

NewImage

Building the RPD against these tables is thereafter the same as any Oracle database source – no need to worry about Hive drivers, SQL dialects, Impala not being supported and so on.

NewImage

But … when you come to run queries, it’s slow. As slow as accessing Hadoop data via Hive, in the order of minutes to return a result set to the dashboard.

NewImage

And this is what you’d expect if you connected OBIEE to a file data source, or an Oracle external table. There are no indexes on the underlying tables, no aggregates and so forth, and you’re not benefiting from the query optimisations you’d get with technologies such as Impala, Stinger from Hortonworks or whatever. In reality, if what you’re after is the convenience of connecting to Oracle tables rather than Hive or Impala ones, you’d just use the external tables you created to load the data into regular Oracle tables, and then query those.
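That last step is just a CTAS away – something along these lines (the target table name is made up), after which you can index, partition and aggregate the copy like any other warehouse table:

create table bi_airlines.flight_performance_ora
nologging parallel
as
select * from bi_airlines.flight_perf_ext_tab_hive;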

What ODCH is really for is data extraction and ETL; it’s not a high-performance ad-hoc query tool, and realistically you’d only really use it like this for initial prototyping, or if your use-case really suited direct query access to HDFS file data. So – back to Impala then for this type of Hadoop access, though it’s a neat feature to be aware of, particularly in the context of ODI and bulk-loading Hadoop data into Oracle.

Categories: BI & Warehousing

New OTN Article: Making the Move from Oracle Warehouse Builder to Oracle Data Integrator 12c

Rittman Mead Consulting - Mon, 2014-02-03 23:28

One of the reasons I’ve been silent on the blog lately regarding new Oracle Data Integrator 12c content (even though I promised it some months ago) is because I’ve been writing an article focused on migration options from Oracle Warehouse Builder (OWB). Published recently on OTN, this article hopefully helps customers understand some of the options Oracle has included to make good on their recent Statement of Direction for Data Integration.

Making the Move to ODI 12c

While the article is focused primarily on OWB customers looking to make the move, there’s plenty of value here for organizations already familiar with ODI 11g. I walk through an Enterprise Installation of ODI 12c on Linux, including Fusion Middleware domain creation and JEE agent configuration. Following that, I explore two new technical capabilities for OWB shops included in the new tool. First is Runtime Integration, which allows the execution of OWB mappings and process flows directly from ODI with complete auditing and load plan integration. Secondly, I investigate the new Migration Utility, which will migrate the majority of customer mappings directly from OWB to ODI 12c. I then embark on a functional assessment of what it means to add value to a data integration project, and how organizations can use these new features to achieve it. I present three different approaches for OWB shops to implement ODI 12c: the Side-by-Side Approach, the Big Bang Approach, and the Phased Approach. Understanding that no one size fits all, I’m sure organizations all over the spectrum will recognize a fit in at least one of these solutions.

As always, early access and beta participation means that Rittman Mead is prepared to implement an OWB migration project immediately using any of these three approaches. If you have any questions, please comment here on the blog, or reach out to us directly.

Categories: BI & Warehousing

Memory Guard

Tim Dexter - Mon, 2014-02-03 12:38

Happy New ... err .. Chinese Year! Yeah, it's been a while, it's also been danged busy and we're only in February, just! A question came up on one of our internal mailing lists concerning out-of-memory errors. Pieter, support guru extraordinaire, jumped on it with reference to a support note covering the relatively new 'BI Publisher Memory Guard'. Sounds grand eh?

As many a BIP user knows, at BIP's heart lives an XSLT engine. XSLT engines are notoriously memory hungry. Oracle's wee beastie has come a long way in terms of taming its appetite for bits and bytes since we started using it. BIP allows you to take advantage of this 'scalable mode.' It's a check box on the data model which essentially says 'XSLT engine, stop stuffing your face with memory doughnuts and get on with the salad and chicken train for this job', i.e. it gets a limited memory stack within which to work and makes use of disk, if needed - think Windows' 'virtual memory'.

Now that switch is all well and good, for a known big report that you would typically mark as 'schedule only.' You do not want users sitting in front of their screen waiting for a 10,000 page document to appear, right? How about those reports that are borderline 'big' or you have a potentially big report but expect users to filter the heck out of it and they choose not to? It would be nice to be able to set some limits on reports in case a user kicks off a monster donut binge session. Enter 'BI Publisher Memory Guard'!

It essentially lets you set those limits on memory and report size so that users can not run a report that brings the server to its knees. More information on the support web site, search for 'BI Publisher Memory Guard a New Feature to Prevent out-of-memory Errors (Doc ID 1599935.1)' or you can get Leslie's white paper covering the same here.

Categories: BI & Warehousing

Automated RPD Builds with OBIEE

Rittman Mead Consulting - Mon, 2014-02-03 06:23
Introduction

Some time ago, I wrapped up a four-part series on using MDS XML as a basis for RPD development. Of course… you can (re-) read the blog posts to get up to speed on the concept… but I’ll offer a quick refresher here. Using an XML-based “data-store” for our RPD allows us the ability to treat our RPD the way Java developers treat their class files. At the end of the day… when it’s text, we can work with it. I paired this new(-ish) RPD storage format with the distributed version-control system (VCS) Git to highlight some truly radical development strategies. I speak on this approach quite often… and I always include a live demo of working on multiple branches of development at the same time, and using the power of Git at the end of it to merge the branches together to build a single, consistent RPD file, or an OBIEE RPD “release”. The question I always get asked: why Git? Can we do this with Subversion? My answer: sort of. The Git-difference should be appreciated by RPD developers: it’s all about metadata. Anyone who has set up a Subversion repository knows that branching was an afterthought. Our ability to “branch” relies on the manual creation of a directory called “branches” (right between “trunk” and “tags”) that plays no special part in the Subversion architecture. But Git has branching functionality built way down in the metadata. When it comes time to merge two branches together, Git knows ahead of time whether a conflict will arise, and has already devised a strategy for pulling the closest ancestor between the two branches to do a three-way merge.

What do we Get with Git?

Let’s run through the spoils. First… developers work on feature, or “topic”, branches to complete their work. This is an isolated, “virtual” copy of the repository, branched off from the master branch at a certain point in time. Below, we can see an illustration of the “Git Flow” methodology, our preferred RPD multi-user development strategy at Rittman Mead:

Git Branch Strategy

Developers no longer have to concern themselves with conflict resolution the way they have always had to when using MUDE. The merging is done after the fact… at the time when releases are assembled. And this brings me to the second–and most impactful–spoil: we are able to easily “cherry-pick” (yes… that is a technical term) features to go into our releases. For Agile methodologies that focus on delivering value to their users often… cherry-picking is a requirement, and it simply is not possible with either serialized, online development… or with OBIEE’s built-in MUDE. As depicted in the Git Flow image above, we are able to cherry-pick the features we want to include in a particular release and merge them into our develop branch, and then prepare them for final delivery in the master branch, where we tag them as releases.
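In Git terms, assembling a release this way is nothing more exotic than a handful of merges and a tag – a sketch, with hypothetical feature branch names, and using “cherry-pick” in the loose sense of choosing which feature branches make the cut:

git checkout develop
git merge --no-ff feature/sales-dashboard     # first feature chosen for the release
git merge --no-ff feature/inventory-facts     # second feature chosen for the release

git checkout master
git merge --no-ff develop
git tag -a v1.4 -m "OBIEE RPD release 1.4"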

Is it Really So Radical?

Earlier, I described this methodology as “radical”, which certainly requires context. For BI developers using OBIEE, Cognos, MicroStrategy, etc., this is indeed radical. But this is all old hat to Java developers, or Groovy developers, or Ruby on Rails developers, etc. For those folks, SDLC and methodologies like Git Flow are just how development is done. What is it about BI developers that makes these concepts so foreign to us? Why are proper software development lifecycle (SDLC) strategies always an after-thought in the BI tools? All of them–Red Stack or otherwise–seem to “bolt-on” custom revisioning features after the fact instead of just integrating with proven solutions (Git, SVN, etc.) used by the rest of the world. When I speak to customers who are “uneasy” about stepping away from the OBIEE-specific MUDE approach, I reassure them that actually, branch-based SDLC is really the proven standard, while bespoke, siloed alternatives are the exception. Which gets us thinking… what else is the rest of the world doing that we aren’t? The easy answer: automated builds.

Should We Automate Our Builds?

The answer is yes. The development industry is abuzz right now with discussions about build automation and continuous integration, with oft-mentioned open-source projects such as Apache Maven, Jenkins CI, Grunt and Gradle. Although these open source projects are certainly on my radar (I’m currently testing Gradle for possible fit), an automated build process doesn’t have to be nearly that fancy, and for most environments could be produced with simple scripting using any interpreted language. To explain how an automated build would work for OBIEE metadata development, let’s first explore the steps required if we were building a simple Java application. The process goes something like this:

  1. Developers use Git (or other VCS) branches to track feature development, and those branches are used to cherry-pick combinations of features to package into a release.
  2. When the release has been assembled (or “merged” using Git terminology), then an automated process performs a build.
  3. The end result of our build process is typically an EAR file. As builds are configurable for different environments (QA, Production, etc.), our build process would be aware of this, and be capable of generating builds specific to multiple environments simultaneously.

Taking these steps and digesting them in an OBIEE world, using the built-in features we have for comparing and patching, it seems our build process would look something like the following:

[Image: diagram of the OBIEE build and deploy process]

We’ll explain this diagram in more detail as we go, because it’s truly pivotal to understanding how our build process works. But in general… we version control our MDS XML repository for development, and use different versions of that MDS XML repository to generate patch files each time we build a release. That patch file is then applied to one or more target binary RPD files downstream in an automated fashion. Make sense?

To prepare ourselves for using MDS XML for automated builds, we have to start by generating binary RPD files for the different target environments we want to build for, in our case, QA and Production versions. We’ll refer to these as our “baseline” RPD files: initially, they look identical to our development repository, but exist in binary format. These target binary RPD files will also be stored in Git to make it easy to pull back previous versions of our target RPDs, as well as to make it easy to deploy these files to any environment. We only have to generate our baselines once, so we could obviously do that manually using the Admin Tool, but we’ll demonstrate the command-line approach here:
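A rough sketch of that step follows, reusing the MDS XML path, target RPD locations and Admin123 password that appear in the build log later in this post; treat the exact paths as illustrative:

# generate a binary baseline RPD for each target environment from the development MDS XML repository
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -P Admin123
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123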

Once we have our binary QA and Production RPDs ready, we'll need to make any environment-specific configuration changes required for the different target environments. The obvious example of this kind of configuration is our connection pools. This is why we don't simply generate an RPD file from our MDS XML repository and use it for all the downstream environments: specifically, we want to be able to manage connection pool differences; more generally, we want to preserve any environment-specific configuration, which is what controlled, automated builds make possible. This is another step that only has to be performed once (or whenever our connection pool information needs updating, such as for regular password changes), so it too can be performed manually using the Admin Tool.

Once our baselines are set, we can enable our build processes. We’ll walk through the manual steps required to do a build, and after that, we’ll have a quick look at Rittman Mead’s automated build process.

Staging the Prior Version

After we've completed development on one or more new features, those branches will be merged into our develop branch (or the master branch if we aren't using Git Flow), giving us a consistent MDS XML repository that we are ready to push to our target environments. We'll need to compare our current MDS XML repository with the last development version we produced, so we can generate a patch file to drive our build process. There's a very important distinction to make here: we aren't going to compare our development repository directly against our QA or Production repositories, which is the approach I often see recommended in blogs and other media. Hopefully, the reason will be clear in a moment.

So where are we going to find the previous version of our MDS XML repository? Since we are using version control, we can check out our previous build using either a branch or a tag. I'll demonstrate this using a branch, but it makes little difference from a process perspective, and most VCSs (including Git) use the same command for both. To make this clear, I've created two branches: current and previous. The current branch is the one we've been diligently working on with new content… the previous branch is the one we used to drive the build process last time around. So we need to "stage" a copy of the prior MDS XML repository, which we do by checking out the previous branch and extracting that repository to a location outside of our Git repository. Notice that we'll extract it as a binary RPD file; this isn't a requirement, it's just easier to deal with a single file for this process:

[Image: command-line output from staging the prior RPD version]
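For reference, the staging step boils down to the commands below, mirroring the DEBUG output shown later in this post (the /stage location and Admin123 password come from that example):

# check out the branch used for the previous build, extract it as a binary RPD outside the Git repository, then return
git checkout previous
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /stage/gcbc.rpd -P Admin123
git checkout current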

Generating the Patch File

Now that we have staged our previous RPD file and returned to the current development branch, we are ready to proceed with generating a patch file. The patch file is generated by simply doing a comparison between our development MDS XML repository and the staged binary RPD file. The important thing to note about this approach: this patch file will not have any environment-specific information in it, such as connection pool configurations, etc. It only contains new development that’s occurred in the development environment between the time of the last release and now:

[Image: command-line output from generating the patch file with comparerpd]
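Again as a sketch, using the same example paths as the build log further down: -C points at the current development MDS XML repository, -G at the staged prior RPD, and -D at the patch file to be written:

# compare the current MDS XML repository against the staged prior binary RPD and write the differences to a patch file
comparerpd -P Admin123 -C /home/oracle/source/gcbc/rpd/gcbc -W Admin123 -G /stage/gcbc.rpd -D /home/oracle/source/gcbc/patch/patch.xml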

Applying Patch File to our Target(s)

With the staged, prior version of the binary RPD, and our patch file, we have all we need to apply the latest release to our target binary RPD files. In our case, we’ll be applying the changes to both the QA and Production binary RPD files. Compared to everything we’ve done up until this point, this is really the easiest part. Below is a sample patchrpd process for our production RPD:

[Image: command-line output from patching the Production RPD]
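The equivalent command, lifted straight from the example paths in the build log below, looks like this; the QA build is identical apart from the target RPD path:

# apply the patch to the Production baseline: -C is the target RPD, -I the patch file, -G the staged prior RPD used as the original, -O the patched output
patchrpd -C /home/oracle/source/gcbc/rpd/production/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123 -Q Admin123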

There are a lot more options we could incorporate into this process from a patching perspective. We could use the -D option to introduce a decision file, which gives us the ability to highly customize this process. What we've demonstrated is a fairly simple option but, to be honest, in a controlled SDLC approach it's rare that we would ever need anything more than this.

Automated Builds with the Rittman Mead Delivery Framework

As you can imagine, with all the brain-power floating around at Rittman Mead, we've packaged up a lot of automated processes over the years, and we have active development underway to refactor those processes into the Rittman Mead Delivery Framework. As these disparate pieces get incorporated into the Framework, we tend to rewrite them in either Groovy or Python, languages we've adopted because of their strategic place within Oracle's products. I personally wrote our automated OBIEE build tool, and chose Groovy because of its existing hooks into Gradle, Maven, Ant and all the rest. Also, the domain-specific language (DSL) capability of Groovy (used in both Grails and Gradle) is perfect for writing task-specific processes such as builds. The build process in our framework is driven by the buildConf.groovy config file, which is demonstrated below:

[Image: the buildConf.groovy configuration file]

With Groovy, most things Java are pretty easy to incorporate, so I added Apache log4j functionality. Running the automated build process first in a logging mode of INFO, we see the following standard output:

[Image: INFO-level standard output from the automated build process]

And now, in DEBUG mode:

[oracle@oracle obiee]$ groovy BuildOBI.groovy 
[main] DEBUG common.GitRepo - currentBranch: current 
[main] DEBUG common.Util - command: git checkout previous 
[main] DEBUG common.GitRepo - currentBranch: previous 
[main] DEBUG common.Util - command: biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /stage/gcbc.rpd -P Admin123 
[main] DEBUG common.Util - command: git checkout current 
[main] DEBUG common.GitRepo - currentBranch: current 
[main] INFO obiee.Build - /stage/gcbc.rpd staged 
[main] DEBUG common.Util - command: comparerpd -P Admin123 -C /home/oracle/source/gcbc/rpd/gcbc -W Admin123 -G /stage/gcbc.rpd -D /home/oracle/source/gcbc/patch/patch.xml 
[main] INFO obiee.Build - /home/oracle/source/gcbc/patch/patch.xml created 
[main] DEBUG common.Util - command: patchrpd -C /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -P Admin123 -Q Admin123 
[main] INFO obiee.Build - /home/oracle/source/gcbc/rpd/qa/gcbc.rpd patched 
[main] DEBUG common.Util - command: patchrpd -C /home/oracle/source/gcbc/rpd/production/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123 -Q Admin123 
[main] INFO obiee.Build - /home/oracle/source/gcbc/rpd/production/gcbc.rpd patched 
[oracle@oracle obiee]$

The real power that Git brings (and frankly, other VCSs such as Subversion offer this too) is the ability to configure commit hooks, so that scripts are executed whenever revisions are committed. You'll notice that the buildConf.groovy configuration file has a list option called validBranches, which restricts the build to run only when the current branch appears in that list. We have the list unpopulated at the moment (meaning the build is valid for all branches), but if we populated it, the build would only run for the branches named there. This allows us to hang the build process off a commit hook (either pre- or post-commit) but only have it actually run when commits land on particular branches, such as the develop branch or the master branch, where code is being prepared for release.
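As a simple sketch of that wiring (the working directory is assumed; BuildOBI.groovy is the build script shown in the log output above), a post-commit hook is nothing more than an executable script dropped into .git/hooks:

#!/bin/sh
# .git/hooks/post-commit -- make it executable with: chmod +x .git/hooks/post-commit
# run the automated build after every commit; BuildOBI.groovy decides for itself,
# via the validBranches list in buildConf.groovy, whether the current branch warrants a build
cd /home/oracle/source/obiee && groovy BuildOBI.groovy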

Conclusion

So what do we think? We've put a lot of thought into implementing enterprise SDLC for our customers, and this process works very well. It isn't just for big teams, either: I use our Delivery Framework in single-user development scenarios (when I'm creating metadata repositories for presentations, for instance) because the tools are repeatable and they just work. We also have additional processes that complete the build by uploading the finished binary RPDs to the appropriate servers and restarting the services.

Categories: BI & Warehousing