
Rittman Mead Consulting

Delivering Oracle Business Intelligence

Thoughts on Using Amazon Redshift as a Replacement for an Oracle Data Warehouse

Thu, 2014-02-20 03:00

Recently, my colleague, Pete Carpenter, described a proof of concept we carried out using Amazon Redshift as the data warehouse storage layer in a system capturing data from Oracle E-Business Suite (EBS) using Attunity CloudBeam in conjunction with Oracle Data Integrator (ODI) for specialised ETL processing and Oracle Business Intelligence (OBI) as the reporting tool.

In this post I will look at Amazon Redshift and how it compares with a more traditional data warehouse approach, using Oracle as my example. I am not going to talk about performance in absolute terms, as your mileage is going to vary.

What is Redshift?

Redshift is the Amazon cloud data warehousing server; it can interact with Amazon EC2 and S3 components but is managed separately using the Redshift tab of the AWS console. As a cloud-based system it is rented by the hour from Amazon, and broadly the more storage you hire the more you pay. Currently there are two families of Redshift servers: the traditional hard-disk based nodes, and the recently introduced SSD family, which has less storage but far more processing power and faster CPUs. For our trials we looked at the traditional disk-based storage on a 2-node cluster, giving us 4TB of disk spread across 4 CPU cores. Apart from single-node configurations, Redshift systems consist of a leader node and two or more database nodes; the leader node is supplied free of charge (you only pay for the storage nodes) and is responsible for acting as the query parser, coordinating the results from the database nodes, and being a central network address for user access.

The Redshift product has its origins in ParAccel, which in turn has its roots in Postgres, and it thus supports ANSI SQL and the Postgres ODBC and JDBC drivers. In basic terms it is a shared-nothing, parallel-processing, columnar-store database that supports columnar compression.

At the cluster level all sorts of robustness features come into play to handle routine hardware failures such as the loss of a node or disk; regular automatic backups occur, and on-demand backups can be made to S3 storage for DR or replication to other AWS networks. It is possible to dynamically change the number and/or type of Redshift nodes in use: in effect a new cluster is spun up and the data copied from the existing system to the new one before the old system is dropped. The original database remains open for query (but not update) during the scale-out (or scale-down) process. As Pete Carpenter described, creating a new Redshift instance is a simple matter of completing a few web forms and waiting for the cluster to come up. Once it is up you can connect to the database using the master credentials you specified at cluster creation and then create databases, users, and schemas as required.

Databases, users, schemas and security

Although it is possible to run a Redshift database using the master user and the default database, good practice suggests that we do a bit more than this. In some ways Redshift is a little like the Oracle 12c database in that we can create additional databases within the master database, much in the style of pluggable databases; a major difference comes with the concept of a USER. In Oracle 12c a user belongs to a pluggable database; in Redshift all users belong to the master (container) database and can see any of the contained databases (subject to grants). Schemas are logical groupings for objects and need not be aligned to database user names. Standard object and role grants allow users to access specific databases, schemas, and tables, or to have role rights such as administrator. The final aspect of security is outside the database and is in effect a firewall rule to permit nominated AWS users or specified IP addresses to speak to the database listener; by default the rule is no inbound access. The diagram below is a block representation of how databases, users, schemas and the firewall interrelate. Note that the user names are descriptive and not valid names!

Screen Shot 2014 02 15 at 13 12 34
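As a minimal sketch of this kind of setup (the database, schema, user and password names here are illustrative, not taken from the PoC), the statements look much like standard Postgres:

CREATE DATABASE dwh;
CREATE USER etl_user WITH PASSWORD 'Secret123';
CREATE SCHEMA staging;
GRANT USAGE, CREATE ON SCHEMA staging TO etl_user;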

Database Design

A key point of difference between Amazon Redshift and Oracle is in how the data is stored, or structured, in the database. An understanding of this is vital to designing a performant data warehouse. With Oracle we have shared storage (SAN or local disk) attached to a pool of processors (a single machine or a cluster); Redshift, however, uses a shared-nothing architecture, that is, the storage is tied to the individual processor cores of the nodes. As with Oracle, data is stored in blocks, although the Redshift block size is much larger (1MB) than the usual Oracle block sizes. The real difference is how tables are stored in the database: Redshift stores each column separately and optionally allows one of many forms of data compression. Tables are also distributed across the node slices so that each CPU core has its own section of the table to process. In addition, data in the table can be sorted on a sort column, which can lead to further performance benefits; I will discuss this in the section on tables.

Not all of the database features we come to expect in an Oracle data warehouse are available to us in Redshift. The Redshift Developer Guide has the full rundown on what is available, but for now here is a short list of common DW features that are not going to be available to us.

  • Tablespaces
  • Indexes
  • Partitions
  • Constraints
    • Check
    • Primary, Unique, Foreign Key (all usable by optimizer but not enforced)
  • Spatial (Locator) functionality
  • Sequences (although there is an AUTO NUMBER column type)
  • MERGE – we have to code as UPDATE and INSERT in two steps
  • In-database PL/SQL-like language
    • Triggers
    • User defined functions
    • Procedures
  • Timestamps (with timezone)
  • XML types
  • Pseudo columns
  • Various SQL functions (not a full list, but functions I often use in ETL processes)
    • Regular expressions
    • Regression functions
    • SUBSTR
    • TRANSLATE
    • ROW_NUMBER
    • TO_TIMESTAMP

In addition, data types may not be exactly the same as those used in Oracle; for example, DATE in Oracle has a resolution of one second, whereas DATE in Redshift has a resolution of one day.
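As a hedged sketch (the column names are mine for illustration), an Oracle DATE that carries a time component would normally map to a Redshift TIMESTAMP, with DATE reserved for genuinely day-level values:

CREATE TABLE T_DATES
(
LOADED_AT TIMESTAMP, -- sub-day resolution, the closer match for an Oracle DATE with a time part
BUSINESS_DATE DATE   -- day-level resolution only
);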

Tables

The basic Oracle syntax to create a table works (as does CTAS, Create Table As Select), however there are additional items we can, and should, specify at table creation.

By default the data distribution style is EVEN, that is, data is distributed between node slices in a round-robin fashion. For performance we may wish to specify a distribution key so that a particular column controls how data is distributed; this is a similar concept to Oracle hash partitioning, and has the same sort of performance characteristics. We aim to create an even distribution of rows per slice (otherwise one slice will take longer than the others to process its data), and by applying the same distribution to other tables that are commonly joined we can benefit from improved join performance, as the joining rows are stored on the same node slice. Sometimes it is more appropriate to replicate the whole table to each node so that the data is always available to join without the need to move data between slices before joining; in such cases we set the distribution style to ALL.
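For example, a small, frequently joined dimension might be declared with the ALL distribution style – a minimal sketch, with illustrative names:

CREATE TABLE D_SMALL_DIM
(
DIM_KEY INTEGER NOT NULL,
DIM_DESCRIPTION VARCHAR(50)
)
DISTSTYLE ALL;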

The second thing we can set on a table is the SORTKEY; this specifies one or more columns by which the data is ordered on data load (it can be the same column as the distribution key). Redshift maintains the minimum and maximum values of the sort key for each database block and at query time uses this information to skip blocks that do not contain data of interest.

Finally, we can elect to compress columns in the database. If we do not specify compression, the default of RAW (i.e. uncompressed) is used. For compressed data we can specify the compression algorithm used; different algorithms are better for certain data types and values. Compression may be data-block based (DELTA, BYTE-DICTIONARY, RUN LENGTH, TEXT255 and TEXT32K) or value-based (LZO and the MOSTLY compressions). This sounds daunting, but there are two ways we can get compression suggestions from the database: using the ANALYZE COMPRESSION command on a loaded table, or the auto-compress feature of the COPY command, which requires an empty, non-compressed target table; COPY is the Redshift equivalent of SQL*Loader and takes a flat file and inserts it into the database.
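As a hedged sketch of both routes (the table name, S3 path and credentials are placeholders):

-- ask Redshift to suggest column encodings for an already loaded table
ANALYZE COMPRESSION T1;

-- or let COPY choose encodings automatically while loading an empty, uncompressed table
COPY T1 FROM 's3://my-bucket/data/t1.csv'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV
COMPUPDATE ON;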

Let’s consider a simple table T1 with three columns, C1, C2 and C3. We can create this using a simple piece of DDL:

CREATE TABLE T1
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL,
C3 DATE
);

I have not used any of the Redshift nice-to-have features for sorting, distribution, and compression of data. Note too that I am using NOT NULL constraints; this is the only constraint type enforced in the database. This simple create statement creates database objects on each slice of the cluster, with one block per column per slice (1 slice = 1 CPU core) – see the following diagram. Note there is no table object stored in the database; it is a collection of columns.

Table slice

Without specifying a distribution key, data is evenly spread across all slices. When a 1MB block for a column is full, a new block is created for subsequent inserts on the slice. An empty table will therefore occupy block size * number of columns * number of cores; with our 1MB block size this is (columns * cores) megabytes – for our three-column table on a four-core cluster, that is 12MB before any rows are loaded.

Using a distribution key effectively hashes the data on the key column across the cores. Adding a sort key declares that the rows in the table are ordered, which potentially allows block elimination to kick in. If our sort key is, say, transaction date, it is likely that our data loads occur in transaction date order; however, if we sorted on product code we might find each data load has data that needs to be inserted between existing rows. This does not happen: the data is still appended to the table, and the table then needs to be reorganised to put the rows back in order. There are two ways to achieve this: the VACUUM command, which does an online reorganisation of the table, and the potentially faster route of creating a copy table, populating it, then dropping the original and renaming the copy; of course this gives a little downtime when the original table is not available for access.
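A minimal sketch of the two approaches (table names are illustrative):

-- option 1: online reorganisation of the table
VACUUM T1;

-- option 2: a "deep copy" - rebuild the table in sort order, then swap the tables over
CREATE TABLE T1_NEW (LIKE T1);
INSERT INTO T1_NEW SELECT * FROM T1;
DROP TABLE T1;
ALTER TABLE T1_NEW RENAME TO T1;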

Applying compression, sort and distribution we get a DDL statement like:

CREATE TABLE T2 
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL SORTKEY DISTKEY,
C3 DATE ENCODE DELTA
);

This table uses column C2 as both the sort key and the distribution key; column C3 is compressed using delta compression – an efficient compression algorithm where most values are within ±127 days of the previous row's date. If we wanted to use a multi-column sort key, the DDL syntax would look like:

CREATE TABLE T1 
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL DISTKEY,
C3 DATE 
)
SORTKEY (C3,C2);

Multi-column distribution keys are not supported.

Designing for Performance

Redshift is designed for query and bulk insert operations; we can optimise query performance by structuring data so that less data is transferred between nodes in join operations, or less data is read from disk in a table scan. Choosing the right sort keys and distribution keys is vital in this process. Ideally these key columns should not be compressed. Adding primary and foreign keys to the tables tells the optimizer about the data relationships and thus improves the quality of the query plans generated. Of course, up-to-date table statistics are a given too; tables must be ANALYZEd whenever the contents change significantly, and certainly after the initial load. I feel that we should collect stats after each data load.
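Collecting statistics is a one-liner – a sketch, with an illustrative table name:

ANALYZE T1;  -- refresh statistics for a single table
ANALYZE;     -- or for every table in the current database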

For a FACT + DIMENSIONS data model (such as in the performance layer of Oracle's Reference Data Warehouse Architecture) it would be appropriate to distribute data on the dimension key of the largest dimension on both the dimension and the fact tables; this will reduce the amount of data being moved between slices to facilitate joins.

For optimal performance we should always ensure we include both the distribution keys and the sort keys in any query, even if they appear to be redundant. The presence of these keys forces the optimizer to access the tables in an efficient way.

For best data load performance we insert rows in bulk and in sort key order. Redshift claims best performance comes from using the COPY command to load from flat files, and second best from bulk-insert SQL commands such as CTAS and INSERT INTO T1 (SELECT * FROM T2). Where Redshift performs less well is when we use certain kinds of ETL steps in our process, particularly those that involve updating rows or single-row activities. In addition, loading data without respecting the sort key leads to performance problems at query time. If data update is essential we have two real options: we move our ETL processes to a conventional database hub server (perhaps using ODI) and just use Redshift to store pre-transformed data; or we revise our ETL processes to minimize update activity on the Redshift platform. There is some scope to optimize updates by distributing data on the update key, but another approach is to use temporary tables to build the results of the update and then replace the original table with the results of the merge. This requires a bit of inventiveness with the ETL design, but fortunately many of the SQL constructs we need, including analytic functions, are there to help us.
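As a hedged sketch of the UPDATE-plus-INSERT pattern that stands in for MERGE (reusing the T1 example with a hypothetical STG_T1 staging table):

BEGIN;

-- apply changes to rows that already exist in the target
UPDATE T1
SET C2 = S.C2, C3 = S.C3
FROM STG_T1 S
WHERE T1.C1 = S.C1;

-- then add the rows that are new
INSERT INTO T1
SELECT S.*
FROM STG_T1 S
LEFT JOIN T1 ON T1.C1 = S.C1
WHERE T1.C1 IS NULL;

COMMIT;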

Categories: BI & Warehousing

Building an Amazon Redshift Data Warehouse using ODI and Attunity Part 2 : Using ODI and OBIEE with Redshift as a Target/Source

Wed, 2014-02-19 03:00

In my post yesterday we stepped through the initial set up and staging load for a data warehouse using Amazon Redshift and Attunity, for eventual use with OBIEE and ODI. Now that we have our source data in Redshift, let’s look at transforming it into a star schema using ODI, by initially looking at how we set up the connection to Redshift in ODI’s Topology Navigator.

As I mentioned in yesterday’s post and in a blog article we wrote on non-Oracle databases a few months ago, Redshift is based on ParAccel technology, but uses PostgreSQL ODBC and JDBC drivers to connect to it. Therefore, we can create a Physical Architecture connection in ODI through to it using the PostgreSQL technology type, like this:

Odi config 1a

For the Redshift JDBC URL you’ll need, refer to the AWS Cluster summary page back on the Amazon AWS Console.

Odi config 2a

We can then add our schemas:

Odi config 3a

Then, if your connection to the Redshift database is working correctly, you should see a list of schemas in the drop down menus:

Odi config 4a

Once all the schemas have been added, we then go through the familiar ODI process of logical architecture and context mapping:

Odi config 5a

Next, we can create a model and reverse-engineer in our data stores:

Odi config 6a

It’s a good idea at this point to review each datastore and ensure that all attributes have been correctly assigned a datatype. To save a little reading time, a set of dimension and fact tables were created in the performance schema, and these were also reverse engineered in:

Odi config 7a

So from here on out, the process of creating and using mappings, packages, variables etc. to populate your target tables is fairly straightforward, and one that ODI developers will be familiar with. You add your source and target data stores and implement the transformation logic.

Odi config 8a

For the initial load, the SQL Control Append IKM was used, and for the most part this KM worked well without customisation. However, one of the early issues encountered during execution was with unordered outer joins – this appears to be an issue with the PostgreSQL technology. This was resolved by amending the mapping and setting an order for the joins:

Odi config 9a

Merging / Incremental loading

One of the things to be aware of with Redshift at the time of writing is the absence of MERGE functionality (we’re so spoilt in Oracle DBMS land ;) ). You can of course achieve this by issuing separate insert and update statements. There is also the SQL Incremental Update IKM in ODI, but you may want to review the KM task steps against your requirements before proceeding, as invariably some customisation will be required to improve performance. The same applies to SCD functionality – you should consider utilising analytic window functions (Redshift supports several aggregate and ranking window functions) to achieve your desired format.
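As a hedged sketch (table and column names are illustrative), a ranking window function can be used to pick out the current record per business key when building slowly changing dimensions:

SELECT customer_id,
       customer_name,
       effective_from,
       RANK() OVER (PARTITION BY customer_id ORDER BY effective_from DESC) AS recency_rank
FROM staging.customer_history;
-- recency_rank = 1 identifies the latest version of each customer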

Additionally, as mentioned previously, Attunity supports a variety of replication tasks – you can do a one-off initial load, an initial load plus apply changes, and/or collect deltas into a change tracking table in the Redshift database. This in itself means you have the basic building blocks to create and maintain a Foundation layer as part of your data warehouse design.

So, once we have data in our target fact and dimension tables, we can model these in OBIEE and create some analyses.

OBIEE Configuration

 As per the AWS documentation for Redshift, the recommended ODBC driver was downloaded from the Amazon AWS website, and an ODBC data source was created on the OBIEE server:

Obiee config 1a

For the server entry, you can use either the public or private leader node IP, depending on your network/security configuration. Just backtracking a little here – you may recall that the usr_dw account was created after the initial Redshift cluster build. In order for it to be able to access the tables in the performance schema, we need to grant the required select privileges.  

Once the ODBC connection is defined, we can create an RPD, define our database connection properties, import our tables and start modelling.

Obiee config 2a

Obiee config 3a

 

When it came to setting the database features, I used the “Query DBMS” function and then went with what it set. Bear in mind that because PostgreSQL and Redshift aren’t supported source databases for OBIEE, depending on the analyses you are creating you may or may not hit functionality issues, so you may find yourself tweaking these settings.

Obiee config 4a

So if your connection is defined correctly and there are no other issues, you should now be able to import the tables:

Obiee config 5a

Once you’ve made your selection, you can then define the relationships between the fact and dimension tables (if you have primary and foreign keys defined, these should propagate through), create any aliases etc., and then you’re about ready to start building up the Business Model and Presentation areas.

Obiee config 6a

Once the RPD modelling is completed, we can upload it and create some content.

Obiee config 8a

Obiee config 7a

Admittedly the analyses created for this dashboard were fairly basic, so no functionality issues were encountered during creation and execution – you might hit some with more “edge cases” such as analytic functions or nested subtotals, or you might end up hitting performance issues when you’ve got lots of concurrent users, as the SQL issued by OBIEE might not be as efficient as it would be for a supported data source.

Performance Considerations for Redshift

Pete Scott will be covering the ins and outs of designing for performance in a blog post tomorrow, and I’ll add a link to this post once it’s up. However, one thing that should be mentioned is the importance of setting sort and distribution keys appropriately on the fact and dimension tables, as well as primary and foreign keys (whilst these constraints are not enforced, they are used by the optimiser when determining execution plans). The sort and distribution keys determine how the data is stored on disk and how it is distributed across the compute nodes, and they can make a big impact on query response times – arguably more so than any issues arising from the lack of official support in OBIEE for Redshift as a data source.

So to conclude – whilst not officially supported, it is possible to create and maintain a data warehouse in Redshift and use the Oracle product set to drive ETL and reporting. Some of the features that make Redshift an attractive DB platform are its ease of management, its scalability and sizing options, and the ability to get a cluster up and running in hours rather than days. Combined with ODI and OBIEE, and therefore a transferable skills base, it makes for an intriguing DW solution.

Categories: BI & Warehousing

Building an Amazon Redshift Data Warehouse using ODI and Attunity Part 1 : Configuring Redshift, and CDC Using Attunity

Tue, 2014-02-18 12:36

One of our longstanding Oracle customers recently asked us to put together a proof-of-concept DW system using Amazon Redshift as the data warehouse database, rather than Oracle Database. The main driver for this was the economics of running Redshift in the cloud vs. Oracle on-premise or on Amazon RDS, and they were also interested in the potential performance benefits of running their data warehouse on a column-store database. They were keen as well to try out Attunity Replicate with CloudBeam as the CDC solution, a product conceptually similar to Oracle GoldenGate but hosted on the Amazon AWS platform, and licensed by data volume per month rather than the perpetual, per-CPU model that GoldenGate uses.

Because the client is a big user of Oracle ETL technology though – currently OWB, moving to ODI in the near future – they still wanted to use ODI to do the main data integration work, so an interesting question for us was whether ODI was a suitable tool for use with Redshift, which isn’t an officially supported source or target platform for ODI. It’s important to understand what “not supported” or “not certified” means in this context – it doesn’t mean it won’t work, it just means you’re on your own if it doesn’t. What gave us hope though was that Redshift uses Postgres-compatible ODBC and JDBC drivers, and we got it working with OBIEE a few months ago, so the premise for the PoC seemed reasonable to us. The diagram below shows the architecture we went with at the start of the project.

Architecture

 

A little about the non-Oracle components:

  • Amazon Redshift – Amazon’s fully managed, cloud-based data warehouse offering. Based on ParAccel technology (which itself has roots in PostgreSQL), the service utilises MPP, columnar storage and compression principles to enhance query performance.
  • Attunity Replicate & CloudBeam – Attunity provide log-based Change Data Capture and replication functionality, which includes Redshift as a target database. Attunity supports ongoing source-to-target replication as well as storing change data in audit / change tracking tables. In order to replicate data to Redshift, you need to set up an account with Attunity and then sign up for a CloudBeam subscription.
  • Amazon S3 – Amazon’s cloud storage service, which is used by Attunity to stage replication/change data prior to loading into the Redshift database.

Initial Configuration and Load

Source Database

The source RDBMS used for this proof of concept was an Oracle EBS database running on 11.1.0.7. In particular, we used the GL schema as this housed some sizeable data. The database requirements for Attunity Replicate are: 

  • Add supplemental logging (for this experiment full logging was added on the GL tables)
  • Switch the database into archive log mode
  • Optionally create a dedicated database account for Attunity – requires select privs on various V$ tables, the GL schema tables and execute on the LogMiner packages. It’s at this point you may want to touch base with your resident DBA :) 

S3 Storage

The Attunity CloudBeam service initially stages files in an S3 bucket before loading into the target database schema. In order to create this, browse to the S3 section of the AWS Console, and click Create Bucket. Once the bucket has been created, you then need to create a folder within this, in this instance the folder was named ORA_STG.

S3 1ds

 

Redshift Cluster

You deploy a Redshift database from the AWS management console. From the Redshift page, select a Region, and then Launch Cluster. From here, fill out the cluster details, for example:

Redshift config 1a

Leaving the database name blank will create a database named ‘dev’.  

After clicking Continue, the next page allows you to size your cluster.  For our test case, we opted for a two node cluster (which in fact has 3 nodes – 1 Leader node and 2 Compute nodes):
 

Redshft config 2a

The next configuration screen allows you to set various security and network options:

Redshift config 3a

Finally you are presented with a summary screen confirming your configuration options. The important thing to note here is that as soon as the cluster is launched and running, you will start being charged.

Reshift config 4a

After you click Launch, the cluster build will begin, and after some time the Status will switch from creating to Available. Clicking the Cluster Name hyperlink will show you a summary screen, detailing security config, cluster status, information about backups and URLs for connecting to your cluster. 

Redshift config 5a

At the bottom of the screen is a list of public and private IPs for the Leader and Compute nodes of the cluster:

Redshift config 6a

We used the public IP of the Leader node for our inbound client connections.

The final step in terms of cluster configuration was to configure a list of IPs that were authorised to access the cluster. This is handled from within the security screen on the Redshift Management Console.

Redshift config 7a

Once you’ve authorised your machine, you can connect to the Redshift database via the JDBC/ODBC URLs shown on the cluster summary screen previously. Amazon recommend using SQL Workbench, and also link to JDBC and ODBC drivers which you should install. So, the next step was to enter the connection details, connect to the database and create some target users and schemas.

The connection details were defined using the downloaded JDBC driver and the JDBC URL from the AWS summary screen:

Redshift config 8a

Once connected, users and schemas were created with the following statements:

create schema performance;
create schema staging;
create schema foundation;
 
create user usr_stg with password '<password>';
create user usr_fnd with password '<password>';
create user usr_dw with password '<password>';
create user odi_work with password '<password>';

NB: in Redshift, passwords must contain one uppercase character and a number, and be at least 8 characters long.

Once the users and schemas are created, we can then grant privileges  to the odi_work account to be able to create objects across the schemas:

grant usage,create on schema staging to odi_work;
grant usage,create on schema performance to odi_work;
grant usage,create on schema foundation to odi_work;

Of course, how you would manage security and database privileges in your own environment is entirely up to you. You can configure object level privileges as your database security guidelines dictate. 

Once your users are set up, you can create objects and grant privileges as required. 

Attunity

Once the Attunity Replicate software is installed, it’s a case of adding the source and target databases.

Source: Oracle EBS database

 

Clicking Manage Databases and then Add Database brings up the following dialogue, in which you fill in the details of the database. The Advanced tab contains details around how Attunity implements and executes CDC/log Mining. 

Add db 1a

Attunity 1a

Target: Redshift Cluster

You follow the same process as above, but select Amazon Redshift as the Type and specify it as a target. Within this screen, you enter details of the cluster endpoint address, the port (refer back to the AWS cluster summary screen), and a user name and password to access the database. If you have entered the details correctly and you can reach the cluster, you should be able to drop down the Database Name list and see the dev database. The next section on this dialog is to enter details about your Attunity CloudBeam subscription, and then your S3 bucket and folder. It should be noted here that in your environment you may not want to use your master database user account, but instead opt for a slightly less privileged user.

Attunity 2a

Once you have filled out each section, it is worth clicking the Test button on the bottom left of the dialog to ensure all details are correct.

So, now that we have our source and target databases configured, the next step is to create a Replication Task. The task can be a combination of the following options: 

  • Full Load – replicate a table from source to target
  • Apply Changes – combined with the above, maintains the target copy with changes that occur on source
  • Store Changes – CDC deltas stored in a separate table 

Once you’ve decided on the type of replication task, you drag and drop the source and targets onto the palette, configure which tables you want to replicate from source, and specify any transformation rules e.g. to move to a different target schema. In our test case, we created an initial load task and a rule was created to map the GL schema in the source to the staging schema in the target. 

Attunity 4a

Attunity 6a

Within the task settings, you can choose to either create new target tables, or use pre-created ones. This is useful if you want to create the tables with certain options, e.g. specifying the sort and distribution keys. 

Attunity 5a

Once you have finished reviewing and tweaking settings, you can run the task. When it has kicked off, you can review progress from the Monitor screen:

Attunity 7a

Any errors will be flagged up and can be reviewed in the Messages section. When the load completes, you can verify the existence of the tables in the Redshift database: 

Attunity 8a

We now need to grant select privileges on the above objects to odi_work (unless you configured Attunity to use the odi_work account) before it can access them: despite residing in a schema odi_work can create objects in, they are owned by a different user and therefore permissions still need to be granted.

grant select on <schema>.<object> to odi_work;

 

So the first stage of getting our data from source to Redshift is complete. Tomorrow, we will look at transforming our initial load data into a target star schema using ODI.

Categories: BI & Warehousing

Using Groovy instead of WLST for OBIEE Systems Management

Sun, 2014-02-16 22:33

I remember some years back when Rittman Mead first received the OBIEE 11g beta and I tried installing it for the first time. It was a nightmare. If you think the 11.1.1.3 release was challenging, you should have tried working with any one of the betas. It was with these first few releases that Weblogic was injected into my world. Some time around the first month of starting the beta, I recall Mark Rittman saying something along the lines of: “all the Fusion Middleware stuff looks interesting, but I’m glad we won’t have to know anything about that.” You see… I just wanted a nice career in BI, but here I am writing a blog post on working with JMX MBeans.

JMX MBeans are complicated, meaning that careers probably exist for working with JMX MBeans and little else. I’m truly sorry to hear that. For the layperson (and by this, I mean people who work in BI), trying to understand MBeans probably ranks somewhere between getting decent service at a restaurant in the UK, and understanding why the Anomaly needs to join with the Source. It’s complex enough that most vendors provide utilities to simplify the process… which means using anything other than Java to work with them. For Weblogic, we have Weblogic Scripting Tool (WLST).

WLST is really just a specific implementation of Jython designed into a command-line utility that in many ways feels like SQL-Plus or RMAN. It’s designed to work interactively or in scripted fashion, and Rittman Mead has offered up some free WLST scripts to the community over the years via GitHub. If you take a look at our deploy_rpd.py script for example, you can make some sense of how WLST works. I’ve reworked that script slightly in this post to use hardcoded values instead of variables, to remove the exception handling, etc., to make the logic easier to follow. It’s shown here in two parts… the first part which connects to the Weblogic Administration Server and locks the domain:

import sys
import os

connect('weblogic', 'welcome1', 't3://localhost:7001');
domainCustom()
cd ('oracle.biee.admin')
cd ('oracle.biee.admin:type=BIDomain,group=Service')

print 'Locking the configuration...'
objs = jarray.array([], java.lang.Object)
strs = jarray.array([], java.lang.String)
invoke('lock', objs, strs)

This lock method is equivalent to the “Lock and Edit” button inside Fusion Middleware Control. We use the cd command to navigate to the BIDomain MBean because the lock method (as well as commit and rollback) is in that MBean. The invoke command is used to execute MBean methods in WLST and accepts parameters for those methods. To do this we use generic Object and String arrays, because the invoke command is generic and needs to support an ever-changing series of parameters. The rest of the pared-down script is below:

cd ('..')
cd ('oracle.biee.admin:type=BIDomain.BIInstance.ServerConfiguration,biInstance=coreapplication,group=Service')
# Set the parameters
params =  jarray.array(['/mnt/hgfs/stewart/scripts/sugarcrm.rpd','Admin123'],java.lang.Object)
# Set the parameters Signs
sign =  jarray.array(['java.lang.String', 'java.lang.String'],java.lang.String)
# Invoke the procedure
invoke( 'uploadRepository', params, sign)

cd ('..')
cd ('oracle.biee.admin:type=BIDomain,group=Service')
objs = jarray.array([], java.lang.Object)
strs = jarray.array([], java.lang.String)
invoke('commit', objs, strs)

In the second part of this script, we navigate to the ServerConfiguration MBean, which contains the uploadRepository method. We populate the params array with the RPD path and RPD password, we upload the repository, and then navigate back to the BIDomain MBean to perform the commit.

WLST strikes me as an attempt to take a programmatic interface, one that is usually accessed through Java, and make it accessible to non-developers. In this regard, I see it as a major failure. The MBean methods have been abstracted to the point of being almost useless. This seems like the exact opposite of simple to me. It seems overly complex to have to construct arrays and populate them just to pass parameters to methods. I would prefer to just pass parameters as, well… parameters. Additionally… notice how we had to cd to the BIDomain MBean, cd away to go use the ServerConfiguration MBean, and then cd back again. I would prefer to use the MBeans the way they were meant to be used… as objects, so we would be able to instantiate and hold multiple MBean objects at the same time.

If WLST is indeed easier to use than Java, then this is quite an indictment of Java. It works pretty well when we are writing processes that are JMX specific… when the commands that we need to issue are all calls to MBeans and nothing else. But what happens if we want to write a process that involves interacting with MBeans as well as some other processes or APIs, such as calls to any of the Oracle BI Server XML API, in a unified script? WLST is terrible at this. In my first attempt at writing something like this, I used Perl to execute both validaterpd commands as well as MBean commands. I ended up having to maintain a separate Jython script along with the Perl script, which didn’t sit well with me as a developer. What we really need is a language that is groovy to write and maintain, compiles to pure Java (just like Jython), and has the ability to encapsulate MBeans natively in a similar fashion to WLST. What we need is Groovy.

Groovy is a dynamic language that can be used to write scripts, or write object-oriented packages and classes, and can also be used to compile to Java byte code. Because it’s simple to write and has native MBean object handling, it’s easier (in my opinion) than WLST. How easy? Well, let’s go through our deploy_rpd.py example, and see if it’s easier to write and understand. Here’s our upload repository script in Groovy:

import javax.management.remote.JMXConnectorFactory
import javax.management.remote.JMXServiceURL
import javax.naming.Context

def h = new Hashtable()
h.put(Context.SECURITY_PRINCIPAL, 'weblogic')
h.put(Context.SECURITY_CREDENTIALS, 'welcome1')
h.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote")
jmxConnector = JMXConnectorFactory.connect(new JMXServiceURL("service:jmx:t3://localhost:7001/jndi/weblogic.management.mbeanservers.domainruntime"), h)
server = jmxConnector.getMBeanServerConnection()

def biDomain = new GroovyMBean(server,'oracle.biee.admin:type=BIDomain,group=Service')
def servConfig = new GroovyMBean(server, 'oracle.biee.admin:type=BIDomain.BIInstance.ServerConfiguration,biInstance=coreapplication,group=Service')

biDomain.lock()
servConfig.uploadRepository('/mnt/hgfs/stewart/scripts/sugarcrm.rpd', 'Admin123')
biDomain.commit()

That’s the whole script. There are a few things to point out here. First… establishing connectivity to the Weblogic Admin Server is slightly harder in Groovy. WLST hides a good bit of the complexity with connectivity, but the JMXConnectorFactory class in Groovy is a close second. However, once the connectivity is done, Groovy wins the rest of the way. Notice how we are able to configure two different MBean objects: biDomain for BIDomain and servConfig for ServerConfiguration. We don’t have to cd back and forth across the MBean hierarchy to work with the different MBeans; we can simply instantiate as many MBeans at once as we need to. Also, notice that we don’t have the generic invoke command to call the different methods in MBeans. We just call the method directly: if we want to call lock(), or commit(), or uploadRepository(), we just call it and pass the parameters that it accepts. There is no need to build the generic Object and String arrays for passing in to invoke.

Because Groovy makes it very easy to execute external command-line processes (such as biserverxmlexec or biserverxmlgen), it’s a great choice for unified OBIEE processes that need to work with executables and JMX MBeans together in a single script. Also… Groovy accepts Java syntax, so if we can’t find the Groovy code samples we need on the internet, we can just plug in Java samples instead.

Categories: BI & Warehousing

OBIEE 11g Presentation Server Catalog Replication Across Redundant Servers

Fri, 2014-02-14 08:44

The thing about OBIEE 11g is that there is no single way to get things “up and running”. Over the past year I have performed countless OBIEE installs and almost every one has been different to the last, due to infrastructure challenges, scale requirements or requirements for high availability and server redundancy.

There is little we can do about infrastructure challenges, typically around policies which are in place across the IT estate and must be adhered to. Scale is simple; more users means more or bigger servers. The last point on the list is HA & server redundancy, and this is arguably the biggest area we need to address before punching in “./runInstaller” at the command-prompt.

Mark Rittman is currently pulling together a wider blog post around options for HA & server redundancy, so I’m not about to get into that in this post. What I will share in my first Rittman Mead blog post is something I have been working on recently that is sometimes implemented as part of an HA/redundancy solution: BI Presentation Server catalog replication.

OBIEE webcat sync

BI Presentation Server catalog replication is the Oracle-supported method of moving a BI Presentation Server catalog from one running OBIEE instance to another. This is useful in many situations, namely:

  • Release cycle management when specific aspects of the BI Presentation Server catalog need to be moved from one OBIEE environment to another, such as development to testing.
  • Full BI Presentation Server catalog synchronization between OBIEE instances to provide two OBIEE instances with exactly the same BI Presentation Server catalog for DR.

The second point above is the scenario I am looking to cover here – using this utility we can make sure the BI Presentation Server catalogs on two distinct OBIEE instances are the same. In a DR event we can either point users at the secondary server, or reverse this utility to pump the data from the DR server back to the primary server after the fault has been cleared. The RPD and other changes, such as user roles, would be managed as part of the release process and copied to the DR server when they are pushed to the production server. It is only the BI Presentation Server catalog which is constantly changing as the OBIEE instance is used, and therefore only the BI Presentation Server catalog which needs to be constantly replicated.

To set up BI Presentation Server catalog replication, we can use a command-line utility which ships with Oracle Business Intelligence 11g, called either “sawrepaj.bat” or “sawrepaj.sh” depending on your operating system (the two are functionally identical, and differ only in the way they are invoked from the command-line). “sawrepaj” has options to synchronise catalogs between servers, or to simply export a catalog for backups or as part of the release lifecycle; in addition, we can control the grain of what is exported down to an individual catalog folder or file.

The utility runs from the command-line and accepts a number of options, or commands, to specify how the replication takes place, using the following syntax:

sawrepaj.sh [/C path] command [command parameters]
  • mark – instruct the OBIEE server to ‘watch’ catalog files and record changes.
  • run – run the batch, as defined in the config.xml file.

Before using the utility, we need to create a config.xml file which instructs the OBIEE Presentation Services to record changes, and tells sawrepaj what to do when it executes.

The config.xml file can be placed anywhere in the filesystem and is referenced using the /C switch when calling sawrepaj – more about this /C switch later. The config file tells the sawrepaj utility which Oracle BI Presentation Servers are involved in the replication, and how to access them. The example below includes a source and target Oracle BI Presentation Server – however you could add countless other BI Presentation Servers depending on your environment.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Oracle BI Presentation Services Replication Agent Configuration File -->
<Config>
<General>
<ExportDirectory>smb://path/to/directory</ExportDirectory>
<LogExpiresHours>48</LogExpiresHours>
</General>
<Server name="serverA" user="weblogic" pwd="Password01">
<Property name= "oracle.bi.presentation.url"
value ="http://servera:9704/analytics/saw.dll"/>
<Property name="oracle.bi.presentation.usesoaptcptransport" value="N"/>
</Server>
<Server name="serverB" user="weblogic" pwd="Password01">
<Property name= "oracle.bi.presentation.url"
value ="http://serverb:9704/analytics/saw.dll"/>
<Property name="oracle.bi.presentation.usesoaptcptransport" value="N"/>
</Server>
<Folderset name="all">
<Folder>/</Folder>
</Folderset>
<ReplicationTask destination="serverA" source="serverB" folders="all" />
</Config>

The three sections of interest in the above file are Server, FolderSet and ReplicationTask:

  • Each Server tag points to an OBIEE Presentation Services instance, with credentials for a user which has SOAP access.
  • The FolderSet tag contains a list of Folder tags which detail the folders within the webcat to replicate.
  • The ReplicationTask tag contains the instruction of what to do when we run sawrepaj. For the above example, serverA is the target and serverB is the source.

Before any replication can begin, we need to enable each BI Presentation Server component for replication. Add the below to the instanceconfig.xml file between the <catalog> tags. This needs to be done for all Presentation Server components, whether source or target for the replication.

<Replication>
<Enabled>true</Enabled>
<ReadLogRecordsSinceHoursAgo>120</ReadLogRecordsSinceHoursAgo>
<RecordsInFileLimit>4000</RecordsInFileLimit>
</Replication>

Now, back to the sawrepaj commands:

mark – The mark command is used to tell the BI Presentation Services which webcat files to monitor for changes. We can be as specific as a single file, or as vague as an entire webcat. You will need to run this on a BI server with the following command:

sawrepaj mark all /

run – The final part of the process, the run command instructs sawrepaj to run all the tasks in the config.xml file. The below is an example of the command. If your config.xml file is not in the root path of the sawrepaj tool, you will need to specify its location with the /C switch. This command can be added to a cron job or a Windows scheduler, or can be run manually or as part of another script.

sawrepaj /C /my/config/file/config.xml run

That’s it! Once completed, the above steps will give you all you need to set up full catalog replication between multiple BI servers.

Categories: BI & Warehousing

Date formatting in OBIEE 11g – setting the default Locale for users

Wed, 2014-02-12 15:35

A short and sweet blog post this one, simply to plug a gap in Google that I found when trying to achieve this today.

How users see things like date formats in OBIEE is determined by their locale, which is usually related to language but not always the same.

The classic example, and the one I was trying to resolve, was that users were seeing dates presented in the ‘wrong’ format. The users are British and instead of seeing dates in the correct format of DD/MM/YYYY (19/01/2014) they were seeing MM/DD/YYYY (01/19/2014). (Side note: this lighthearted article in the Guardian refers to this ridiculous format as “Middle-Endianness”…)

In this example I have got the same date value in two columns. One column I have explicitly formatted (“DATE_VERBOSE”) so there is no ambiguity about the date. The second column (“DATE_LOCALE”) is the same date value, but formatted according to the locale setting. It is this default locale setting that I want to demonstrate how to set – manually changing all date columns to a particular format is not a sustainable way to develop reports…

Oracle’s documentation is not very clear in this respect, and Google is a mishmash of half solutions and rubbish, so I humbly offer this short instruction for how to ensure users have the correct locale applied.

Solution 1 – manually

Each user can configure their chosen locale, from the My Account dialog:

Whenever changing the locale, you need to logout and log back in to OBIEE for it to take effect.

This setting, if they override it from the default, is stored in the Presentation Catalog under the user’s userprefsxmlstore.xml file

<sawprefs:userPreference prefKey="Locale">en-gb</sawprefs:userPreference>

So for a single user, this method is fine – but for more than one existing user, or any new users, it doesn’t fly. We don’t want to go hacking Presentation Catalog files, that is bad, mmm’kay?

Solution 2 – The USERLOCALE System Session Variable

There is a System Session Variable called USERLOCALE which you can set per user, and will define their locale. Easy! All you have to watch out for is that its value (thanks Christian) is case sensitive. This cost me some hours today, so I’ll say it again – the value you specify for USERLOCALE is case sensitive and OBIEE will not help you out if you specify it wrong (it’ll just ignore it).

You can hardcode the value to the same for all users, or if you want to get fancy you could drive it from a database table to make it variable dependent upon the user’s location that you store in a table.

The Session Variable and example Initialisation Block are shown here:
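As a minimal sketch, the initialisation block query behind USERLOCALE might look like one of the following – the obiee_user_locales lookup table is hypothetical, and ':USER' is the usual token for the logged-in user in initialisation block SQL:

-- simplest case: hardcode the same locale for every user
select 'en-gb' from dual;

-- or drive it per user from a lookup table
select lower(locale) from obiee_user_locales where username = ':USER';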

To validate what is going on with the session variable, you can use a Static Text view with the following content to show the values of the relevant session and presentation variables:

[u][b]System Session Variables[/b][/u]
<p align=left>
[b]NQ_SESSION.USERLOCALE: [/b]@{biServer.variables['NQ_SESSION.USERLOCALE']}[br/] 
[b]NQ_SESSION.WEBLANGUAGE: [/b] @{biServer.variables['NQ_SESSION.WEBLANGUAGE']}[br/]
</p>

[u][b]Predefined Presentation Variables[/b][/u][br/]
<p align="left">
[b]session.language:[/b] @{session.language}[br/]
[b]session.locale:[/b] @{session.locale}[br/]
</p>

In my test report the Static Text view is to the right of the data table, and I can now see that the USERLOCALE session variable has been populated. Crucially, the session.locale presentation variable is inheriting the correct value, which in turn is driving the desired formatting of the DATE_LOCALE column.

Solution 3 – the hack

I’m including this here for completeness, and because you will come across it in plenty of places on the web so it is good to understand why it is not the correct solution.

Within the OBIEE installation folder (FMW_HOME) you will find the ORACLE_HOME folder, Oracle_BI1, in which all the application’s binaries and internal configuration are stored. This is not a “user serviceable” folder, and Oracle are at liberty to change any of the files within it whenever you patch or upgrade the software.

Within Oracle_BI1 there is a file called localemappings.xml and this file sets the default locale used if one is not explicitly configured. If you change the “catch all” line in this XML file to your desired locale, it will set the default.

<when matches="*"><localeDefinition name="en-gb"/></when>

But, this is not the correct way to do it, and there is no need to because the USERLOCALE method above works just fine.

What about AllowedLocales in instanceconfig.xml?

In combing through the documentation (RTFM, after all), you will come across reference to the Localization tags for the Presentation Services configuration file instanceconfig.xml. Within this tag you can specify AllowedLocales. However, all this does is provide a way to restrict the dropdown locale list (shown in method 1 above). So this is a nice thing to do for your user base if there is a finite number of locales they will want to use (save them wading through many options), but it does not influence the default locale – even if you set it to a single value, your desired default.

Summary

If you want to set the default locale, use the system session variable USERLOCALE. Make sure you specify your value in lowercase, otherwise it won’t work.

Categories: BI & Warehousing

Rittman Mead BI Forum 2014 Abstract Scoring Now Live – For 1 Week Only!

Tue, 2014-02-11 04:48

The call for papers for the Rittman Mead BI Forum 2014 closed at the end of January, and we’ve had some excellent submissions on topics ranging from OBIEE, Visualizations and data discovery through to in-memory analytics, big data and data integration. As always, we’re now opening up the abstract submission list for scoring, so that anyone considering coming to either the Brighton or Atlanta events can have a say in what abstracts are selected.

The voting forms, and event details, are below:

In case you missed it, we also announced the speaker for the Wednesday masterclass the other day – Lars George from Cloudera, who’ll be talking about Hadoop, HBase, Cloudera and how it all applies to the worlds of analytics, BI and DW – something we’re all really excited about.

Voting is open for just one week, and will close at 5pm PST on Tuesday, 18th Feb. Shortly afterwards we’ll announce the speaker line-up, and open-up registrations for both events. Keep an eye on the blog for more details as they come.

Categories: BI & Warehousing

Testing Oracle Direct Connector for HDFS as an Alternative to Hive ODBC for OBIEE11g

Thu, 2014-02-06 05:39

In a post on the blog a couple of weeks ago, I looked at loading up a set of flight delays data into Apache Hadoop, then analysing it using Apache Hive, Cloudera Impala and OBIEE. In this scenario, OBIEE connects to the Hadoop cluster using Hive and Impala ODBC drivers, and then either Hive (through MapReduce jobs) or Impala (through its in-memory distributed query engine) returns the data to OBIEE, for display on the dashboard.

In my initial tests, as you’d expect, Hive was fairly slow (as it’s optimised towards large (TB+) batch jobs), whilst Impala was fast, on a par with regular Oracle database queries, albeit with a much more limited set of SQL functions supported. One of the questions I got asked after I posted the blog article, though, was how this approach compared to accessing the Hadoop data via Oracle’s “big data connectors” – either Oracle Loader for Hadoop (OLH) or Oracle Direct Connector for HDFS. Could these give us similar performance to Hive or Impala, but make the Hadoop data more “Oracle-compatible” and suitable for querying with OBIEE?

For anyone not all that familiar with Oracle’s big data connectors, there are two that are particularly relevant to BI/DW use-cases. Oracle Loader for Hadoop (OLH) is a bulk loader between Hadoop and the Oracle Database that uses MapReduce on the Hadoop side to prepare, sort and transform data before efficiently loading it into an Oracle database. There are already lots of ways you can move data in and out of Oracle from a Hadoop source – Sqoop, for example – but presumably Oracle feel OLH is a particularly efficient way of doing it, exploiting Oracle direct/bulk-loading capabilities and parallelism, and so forth. It is a loader, though, not a transparent reading mechanism, so it’s only really appropriate for ETL-type situations where you want to copy the data out of Hadoop and into an Oracle data warehouse.

Oracle Direct Connector for HDFS (ODCH) is a bit different though, in that it uses the Oracle external table feature to map an Oracle data dictionary table onto HDFS files, or optionally Apache Hive tables (and therefore onto HDFS files, NoSQL databases or whatever). So you could potentially use ODCH to create Oracle database table representations of your Hive tables / HDFS files, then map these into the OBIEE repository and work with them just like any other Oracle (external) table. What this means then is that you’re not reliant on Hive or Impala ODBC drivers, or the BI Server having to generate HiveQL or ImpalaQL queries (with Impala of course not yet being officially supported), making your work from the OBIEE side a lot easier than if you’re trying to work with Hive or Impala directly.

So how well does it work then? I decided to give it a try using the new BigDataLite VM I mentioned on the blog last week, along with the Airline Delays dataset I used in the post on Impala and Hive the other week. My starting point then was a Hive database with four tables – flight performance (19m rows), dest and origin (2k rows each) and carrier (1.5k rows), like this:

NewImage

Now obviously Oracle’s big data connectors aren’t installed as part of Hadoop by default, but the BigDataLite VM does have them pre-installed, along with an Oracle Database 12c database, so a lot of the work in setting things up is done for you. By default, an install of the big data connectors connects to the Hadoop environment on the same machine, so all of the setup commands I use will assume that the Hive server and so on are on the same server – localhost in this case – although of course you can configure them to connect to a remote Hadoop server or cluster.

The way ODCH works is that it uses the Oracle external table “pre-processor” feature to stream HDFS file content into your query session, with the pre-processor in this scenario being a utility provided by the big data connectors to connect the two environments together. This blog post from Oracle explains the process in a bit more detail, but the key things to understand are that it’s conceptually similar to accessing regular file data via external tables, giving us the benefit of fairly seamless access to this external data (vs. using SQL*Loader, or OLH in this case), but it’s also not “real” Oracle table data, so there are no indexes or any other performance structures that can make queries run faster – though, like regular external tables, you can run ODCH loading in parallel.

So let’s use the setup in the BigDataLite VM to create some external tables in the Oracle database that map to my Hive tables, and underlying HDFS data files. Before we do this though we need to set up a few things; first, we need to create a temporary directory on the Linux filesystem to hold the metadata files created by ODCH, and we also need to specify where the ODCH HDFS file streamer utility can be found. Logging in as oracle/welcome1 on the VM, I first create the temporary directory, and then set an environment variable the rest of the process will be looking for:

mkdir bi_airlines_dir
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*

Now I log into SQL*Plus and create a database user that I’ll create the external tables in, and grant it access to the USERS tablespace:

sqlplus / as sysdba
 
create user bi_airlines identified by bi_airlines quota unlimited on users;
grant connect, resource to bi_airlines;

Next, I need to create two database directory objects; one to point to where the ODCH HDFS stream pre-processor lives, and the other to point to my temporary directory:

create or replace directory bi_airlines_dir as '/home/oracle/bi_airlines_dir';
grant read, write on directory bi_airlines_dir to bi_airlines;
 
create or replace directory osch_bin_path as '/u01/connectors/osch/bin';
grant read, write, execute on directory osch_bin_path to bi_airlines;

Now, re-purposing the scripts provided on the BigDataLite VM to create external tables over the Moviework Hive database, each of my tables requires two setup files: a shell script that calls the ODCH utility and references an XML parameter file, and the XML file itself, which contains the definition of the external table. To take an example, the shell script (genloc_origin_hive.sh) to create an external table over my “origin” Hive table looks like this:

# Add HIVE_HOME/lib* to HADOOP_CLASSPATH. This cannot be done
# in the login profiles since this breaks Pig in the previous lab.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*

hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-conf /home/oracle/movie/moviework/osch/origin_hive.xml \
-createTable

whereas the accompanying XML file (origin_hive.xml) looks like this:

<configuration>

<property>
 <name>oracle.hadoop.exttab.tableName</name>
 <value>BI_AIRLINES.ORIGIN_EXT_TAB_HIVE</value> 
</property>

<property> 
 <name>oracle.hadoop.exttab.sourceType</name>
 <value>hive</value> 
</property>

<property> 
 <name>oracle.hadoop.exttab.hive.tableName</name>
 <value>origin</value> 
</property>

<property> 
 <name>oracle.hadoop.exttab.hive.databaseName</name>
 <value>bi_airlines</value> 
</property>

<property>
 <name>oracle.hadoop.connection.url</name>
 <value>jdbc:oracle:thin:@localhost:1521:orcl</value> 
</property>

<property>
 <name>oracle.hadoop.connection.user</name>
 <value>BI_AIRLINES</value> 
</property>

<property>
 <name>oracle.hadoop.exttab.defaultDirectory</name>
 <value>bi_airlines_dir</value> 
</property>

</configuration>

Within the XML file, we specify the name of the external table that the utility will create in Oracle for you (ORIGIN_EXT_TAB_HIVE within the BI_AIRLINES schema); that it’s a Hive (rather than HDFS file) source; and the Hive database name and table name to source data from. Within the operating system directory that you pass to it, the utility stores metadata pointing to the individual HDFS files that contain your data. So whilst the HDFS data itself isn’t copied to Oracle, a hard reference to each file is, and if you add more files to the Hive table’s HDFS directory you’ll need to re-run the utility to register them. It’s not completely automatic beyond this point, then, but it’s more or less fine if you’ve got a single file providing the data, as I have in this slightly non-representative case.

So let’s run the shell script and create one of the external tables, passing in the Oracle database password for the “bi_airlines” user when prompted:

[oracle@bigdatalite osch]$ sh genloc_origin_hive.sh
Oracle SQL Connector for HDFS Release 2.3.0 - Production

Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved.

[Enter Database Password:]
14/02/06 11:59:58 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/02/06 11:59:58 INFO hive.metastore: Trying to connect to metastore with URI thrift://bigdatalite.localdomain:9083
14/02/06 11:59:58 INFO hive.metastore: Waiting 1 seconds before next connection attempt.
14/02/06 11:59:59 INFO hive.metastore: Connected to metastore.
The create table command succeeded.

CREATE TABLE "BI_AIRLINES"."ORIGIN_EXT_TAB_HIVE"
(
"ORIGIN" VARCHAR2(4000),
"ORIGIN_DEST" VARCHAR2(4000),
"ORIGIN_CITY" VARCHAR2(4000),
"ORIGIN_STATE" VARCHAR2(4000),
"AIRPORT_ID" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( 
TYPE ORACLE_LOADER
DEFAULT DIRECTORY "BI_AIRLINES_DIR"
ACCESS PARAMETERS
(
RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'7C'
MISSING FIELD VALUES ARE NULL
(
"ORIGIN" CHAR(4000) NULLIF "ORIGIN"=0X'5C4E',
"ORIGIN_DEST" CHAR(4000) NULLIF "ORIGIN_DEST"=0X'5C4E',
"ORIGIN_CITY" CHAR(4000) NULLIF "ORIGIN_CITY"=0X'5C4E',
"ORIGIN_STATE" CHAR(4000) NULLIF "ORIGIN_STATE"=0X'5C4E',
"AIRPORT_ID" CHAR(4000) NULLIF "AIRPORT_ID"=0X'5C4E'
)
)
LOCATION
(
'osch-20140206120000-1947-1'
)
) PARALLEL REJECT LIMIT UNLIMITED;

The following location files were created.

osch-20140206120000-1947-1 contains 1 URI, 150606 bytes

150606 hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/bi_airlines.db/origin/geog_origin.txt

Note the HDFS file reference at the end – this is the metadata created by the utility, which we’d need to update if more data files are added to the underlying Hive table.
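
If more data files do get added to the underlying HDFS directory later on, the Big Data Connectors documentation describes a -publish option for the same ExternalTable tool that refreshes the location files without dropping and re-creating the external table. As a hedged sketch, re-using the same XML configuration file (check that your connector version supports the option before relying on it):

# refresh the external table's location files after new files land in the
# Hive table's HDFS directory (-publish is taken from the connector
# documentation; verify it against your OSCH release)
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-conf /home/oracle/movie/moviework/osch/origin_hive.xml \
-publish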

So let’s fire-up SQL*Plus and take a look at the tables:

[oracle@bigdatalite osch]$ sqlplus bi_airlines/bi_airlines

SQL*Plus: Release 12.1.0.1.0 Production on Thu Feb 6 13:20:12 2014

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Last Successful login time: Thu Feb 06 2014 12:21:52 -05:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select table_name from user_tables;

TABLE_NAME
--------------------------------------------------------------------------------
FLIGHT_PERF_EXT_TAB_HIVE
ORIGIN_EXT_TAB_HIVE
DEST_EXT_TAB_HIVE
CARRIER_EXT_TAB_HIVE

SQL> set timing on
SQL> select count(*) from flight_perf_ext_tab_hive;

COUNT(*)
----------
19648958

Elapsed: 00:01:09.74

Not bad, but just over a minute to do a quick row-count on the table. Let’s try the same thing using Hive itself – note that ODCH and Hive aren’t using the same access technique to the underlying data, and one might make more sense than the other for particular query situations.

[oracle@bigdatalite ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/oracle/hive_job_log_2b7797b2-1ef8-4ffc-b8b1-ab4f77b19cea_423831297.txt
hive> select count(*) from bi_airlines.flight_performance;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
 set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
 set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
 set mapred.reduce.tasks=<number>
Starting Job = job_201402052208_0001, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201402052208_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201402052208_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-02-06 13:26:48,223 Stage-1 map = 0%, reduce = 0%
2014-02-06 13:27:23,402 Stage-1 map = 51%, reduce = 0%
2014-02-06 13:27:40,487 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 16.59 sec
2014-02-06 13:27:41,512 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:42,523 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:43,535 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:44,549 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:45,558 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:46,569 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:47,582 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:48,596 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:49,608 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:50,616 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:51,625 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:52,641 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
MapReduce Total cumulative CPU time: 35 seconds 50 msec
Ended Job = job_201402052208_0001
MapReduce Jobs Launched: 
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 35.05 sec HDFS Read: 524501756 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 35 seconds 50 msec
OK
19648958
Time taken: 76.525 seconds

Ok, a bit longer, 76 seconds vs. 69 seconds – but as I said, they’re different access mechanisms and you’d probably get different results if you joined the data, filtered it etc.

Let’s try it in Impala now:

[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Connected to bigdatalite.localdomain:21000
Server version: impalad version 1.2.3 RELEASE (build 1cab04cdb88968a963a8ad6121a2e72a3a623eca)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.2.3 (1cab04c) built on Fri Dec 20 19:39:39 PST 2013)
[bigdatalite.localdomain:21000] > select count(*) from bi_airlines.flight_performance;
Query: select count(*) from bi_airlines.flight_performance
+----------+
| count(*) |
+----------+
| 19648958 |
+----------+
Returned 1 row(s) in 1.09s
[bigdatalite.localdomain:21000] >

Ah, that’s more like it – the count this time was returned in just over a second, which is what you’d expect as Impala is optimised for ad-hoc queries, Hive for larger, batch-style SQL access and transformations. Moving over to OBIEE though, after creating a TNSNAMES connection through to the BigDataLite Oracle Database 12c database, I can import the external tables just like any regular Oracle source:

NewImage

Building the RPD against these tables is thereafter the same as any Oracle database source – no need to worry about Hive drivers, SQL dialects, Impala not being supported and so on.

NewImage

But … when you come to run queries, it’s slow. As slow as accessing Hadoop data via Hive, in the order of minutes to return a result set to the dashboard.

NewImage

And this is what you’d expect if you connected OBIEE to a file data source, or an Oracle external table. There are no indexes on the underlying tables, no aggregates and so forth, and you’re not benefiting from the query optimisations you’d get with technologies such as Impala, Stinger from Hortonworks or whatever. In reality, if what you’re after is the convenience of connecting to Oracle tables rather than Hive or Impala ones, you’d use the external tables you created to load the data into regular Oracle tables, and then query those.
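
If you went down that route, a minimal sketch might look like the following – the BI_AIRLINES credentials and ORIGIN_EXT_TAB_HIVE external table are the ones created earlier in this post, while the ORIGIN_DIM target table name is just my own invention:

sqlplus bi_airlines/bi_airlines <<'EOF'
-- copy the HDFS-backed external table into a regular Oracle table that can
-- then be indexed, partitioned and pointed at from the OBIEE RPD;
-- ORIGIN_DIM is a hypothetical target table name
create table origin_dim as select * from origin_ext_tab_hive;
EOF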

What ODCH is really for is data extraction and ETL; it’s not a high-performance ad-hoc query tool, and realistically you’d only use it like this for initial prototyping, or if your use-case genuinely suited direct query access to HDFS file data. So – back to Impala then for this type of Hadoop access, though ODCH is a neat feature to be aware of, particularly in the context of ODI and bulk-loading Hadoop data into Oracle.

Categories: BI & Warehousing

New OTN Article: Making the Move from Oracle Warehouse Builder to Oracle Data Integrator 12c

Mon, 2014-02-03 23:28

One of the reasons I’ve been silent on the blog lately regarding new Oracle Data Integrator 12c content (even though I promised it some months ago) is that I’ve been writing an article focused on migration options from Oracle Warehouse Builder (OWB). Published recently on OTN, this article hopefully helps customers understand some of the options Oracle has included to make good on their recent Statement of Direction for Data Integration.

Making the Move to ODI 12c

While the article is focused primarily on OWB customers looking to make the move, there’s plenty of value here for organizations already familiar with ODI 11g. I walk through an Enterprise Installation of ODI 12c on Linux, including Fusion Middleware domain creation and JEE agent configuration. Following that, I explore two new technical capabilities for OWB shops included in the new tool. The first is Runtime Integration, which allows the execution of OWB mappings and process flows directly from ODI with complete auditing and load plan integration. The second is the new Migration Utility, which will migrate the majority of customer mappings directly from OWB to ODI 12c. I then embark on a functional assessment of what it means to add value to a data integration project, and how organizations can use these new features to achieve it. I present three different approaches for OWB shops to implement ODI 12c: the Side-by-Side Approach, the Big Bang Approach, and the Phased Approach. Understanding that no one size fits all, I’m sure organizations all over the spectrum will recognize a fit in at least one of these solutions.

As always, early access and beta participation means that Rittman Mead is prepared to implement an OWB migration project immediately using any of these three approaches. If you have any questions, please comment here on the blog, or reach out to us directly.

Categories: BI & Warehousing

Automated RPD Builds with OBIEE

Mon, 2014-02-03 06:23
Introduction

Some time ago, I wrapped up a four-part series on using MDS XML as a basis for RPD development. Of course… you can (re-) read the blog posts to get up to speed on the concept… but I’ll offer a quick refresher here. Using an XML-based “data-store” for our RPD allows us to treat our RPD the way Java developers treat their class files. At the end of the day… when it’s text, we can work with it. I paired this new(-ish) RPD storage format with the distributed version-control system (VCS) Git to highlight some truly radical development strategies. I speak on this approach quite often… and I always include a live demo of working on multiple branches of development at the same time, and using the power of Git at the end of it to merge the branches together to build a single, consistent RPD file, or an OBIEE RPD “release”.

The question I always get asked: why Git? Can we do this with Subversion? My answer: sort of. The Git difference should be appreciated by RPD developers: it’s all about metadata. Anyone who has set up a Subversion repository knows that branching was an afterthought. Our ability to “branch” relies on the manual creation of a directory called “branches” (right between “trunk” and “tags”) that plays no special part in the Subversion architecture. But Git has branching functionality built way down in the metadata. When it comes time to merge two branches together, Git knows ahead of time whether a conflict will arise, and has already devised a strategy for pulling the closest ancestor between the two branches to do a three-way merge.

What do we Get with Git?

Let’s run through the spoils. First… developers work on feature, or “topic”, branches to complete their work. This is an isolated, “virtual” copy of the repository branched off from the master branch at a certain point in time. Below, we can see an illustration of the “Git Flow” methodology, our preferred RPD multi-user development strategy at Rittman Mead:

Git Branch Strategy

Developers no longer have to concern themselves with conflict resolution the way they have always had to when using MUDE. The merging is done after the fact… at the time when releases are assembled. And this brings me to the second–and most impactful–spoil: we are able to easily “cherry-pick” (yes… that is a technical term) features to go into our releases. For Agile methodologies that focus on delivering value to their users often… cherry-picking is a requirement, and it simply is not possible with either serialized, online development… or with OBIEE’s built-in MUDE. As depicted in the Git Flow image above, we are able to cherry-pick the features we want to include in a particular release and merge them into our develop branch, and then prepare them for final delivery in the master branch, where we tag them as releases.

Is it Really So Radical?

Earlier, I described this methodology as “radical”, which certainly requires context. For BI developers using OBIEE, Cognos, MicroStrategy, etc., this is indeed radical. But this is all old hat to Java developers, or Groovy developers, or Ruby on Rails developers, etc. For those folks, SDLC and methodologies like Git Flow are just how development is done. What is it about BI developers that makes these concepts so foreign to us? Why are proper software development lifecycle (SDLC) strategies always an afterthought in the BI tools? All of them – Red Stack or otherwise – seem to “bolt on” custom revisioning features after the fact instead of just integrating with proven solutions (Git, SVN, etc.) used by the rest of the world. When I speak to customers who are “uneasy” about stepping away from the OBIEE-specific MUDE approach, I reassure them that branch-based SDLC is actually the proven standard, while bespoke, siloed alternatives are the exception. Which gets us thinking… what else is the rest of the world doing that we aren’t? The easy answer: automated builds.

Should We Automate Our Builds?

The answer is yes. The development industry is abuzz right now with discussions about build automation and continuous integration, with oft-mentioned open-source projects such as Apache Maven, Jenkins CI, Grunt and Gradle. Although these open source projects are certainly on my radar (I’m currently testing Gradle for possible fit), an automated build process doesn’t have to be anywhere near that fancy, and for most environments could be produced with simple scripting in any interpreted language. To explain how an automated build would work for OBIEE metadata development, let’s first explore the steps required if we were building a simple Java application. The process goes something like this:

  1. Developers use Git (or other VCS) branches to track feature development, and those branches are used to cherry-pick combinations of features to package into a release.
  2. When the release has been assembled (or “merged” using Git terminology), then an automated process performs a build.
  3. The end result of our build process is typically an EAR file. As builds are configurable for different environments (QA, Production, etc.), our build process would be aware of this, and be capable of generating builds specific to multiple environments simultaneously.

Taking these steps and digesting them in an OBIEE world, using the built-in features we have for comparing and patching, it seems our build process would look something like the following:

deploy

We’ll explain this diagram in more detail as we go, because it’s truly pivotal to understanding how our build process works. But in general… we version control our MDS XML repository for development, and use different versions of that MDS XML repository to generate patch files each time we build a release. That patch file is then applied to one or more target binary RPD files downstream in an automated fashion. Make sense?

To prepare ourselves for using MDS XML for automated builds, we have to start by generating binary RPD files for the different target environments we want to build for, in our case, QA and Production versions. We’ll refer to these as our “baseline” RPD files: initially, they look identical to our development repository, but exist in binary format. These target binary RPD files will also be stored in Git to make it easy to pull back previous versions of our target RPDs, as well as to make it easy to deploy these files to any environment. We only have to generate our baselines once, so we could obviously do that manually using the Admin Tool, but we’ll demonstrate the command-line approach here:
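
As a rough sketch of that command-line step – reconstructed from the biserverxmlexec call you’ll see in the build log later in this post, so the GCBC paths and Admin123 password belong to the sample project and should be substituted for your own:

# compile the development MDS XML repository into a binary baseline RPD, once
# per target environment, and commit the results to Git
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -P Admin123
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123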

Once we have our binary QA and Production RPDs ready, we’ll need to make any environment-specific configuration changes necessary for the different target environments. One obvious example of this kind of configuration is our connection pools. This is why we don’t simply generate an RPD file from our MDS XML repository and use it for all the downstream environments: specifically, we want to be able to manage connection pool differences, but generically, we are really describing any environment-specific configurations, which is possible with controlled, automated builds. This is another step that only has to be performed once (or whenever our connection pool information needs updating, such as regular password updates) so it too can be performed manually using the Admin Tool.

Once our baselines are set, we can enable our build processes. We’ll walk through the manual steps required to do a build, and after that, we’ll have a quick look at Rittman Mead’s automated build process.

Staging the Prior Version

After we’ve completed development of one or more new features, those branches will be merged into our develop branch (or the master branch if we aren’t using Git Flow), giving us a consistent MDS XML repository that we are ready to push to our target environments. We’ll need to compare our current MDS XML repository with the last development version that we produced, so we can generate a patch file to drive our build process. There’s a very important distinction to make here: we aren’t going to compare our development repository directly against our QA or Production repositories, which is the approach I often see people recommending in blogs and other media. Hopefully, the reason will be clear in a moment.

So where are we going to find the previous version of our MDS XML repository? Since we are using version control, we are able to check out our previous build using either a branch or a tag. I’ll demonstrate this using a branch, but it makes little difference from a process perspective, and most VCSs (including Git) use the same command for either approach. To make this clear, I’ve created two branches: current and previous. The current branch is the one we’ve been diligently working on with new content… the previous branch is the branch we used to drive the build process last time around. So we need to “stage” a copy of the prior MDS XML repository, which we do by checking out the previous branch and extracting that repository to a location outside of our Git repository. Notice that we’ll extract it as a binary RPD file; this is not a requirement, it’s just easier to deal with a single file for this process:

stage-rpd-mark
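
The staging commands themselves look something like this – reconstructed from the build tool’s DEBUG output further down, so treat the branch names and paths as examples:

# switch to the branch (or tag) that drove the previous release
git checkout previous
# compile the previous MDS XML repository into a single staged binary RPD
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /stage/gcbc.rpd -P Admin123
# return to the branch containing the new development
git checkout current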

Generating the Patch File

Now that we have staged our previous RPD file and returned to the current development branch, we are ready to proceed with generating a patch file. The patch file is generated by simply doing a comparison between our development MDS XML repository and the staged binary RPD file. The important thing to note about this approach: this patch file will not have any environment-specific information in it, such as connection pool configurations, etc. It only contains new development that’s occurred in the development environment between the time of the last release and now:

create-patch-mark
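
Again drawing on the DEBUG output shown later, the comparerpd call looks something like this, with the development MDS XML repository as the current (-C) source and the staged binary RPD as the original (-G):

# compare the development MDS XML repository against the staged prior RPD and
# write the differences out as a patch file
comparerpd -P Admin123 -C /home/oracle/source/gcbc/rpd/gcbc -W Admin123 -G /stage/gcbc.rpd -D /home/oracle/source/gcbc/patch/patch.xml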

Applying Patch File to our Target(s)

With the staged, prior version of the binary RPD, and our patch file, we have all we need to apply the latest release to our target binary RPD files. In our case, we’ll be applying the changes to both the QA and Production binary RPD files. Compared to everything we’ve done up until this point, this is really the easiest part. Below is a sample patchrpd process for our production RPD:

Patching Production RPD
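
In command-line terms (again lifted from the DEBUG output later in the post), the Production patch step looks something like this:

# apply the patch (-I) to the Production baseline RPD (-C), using the staged prior
# RPD (-G) as the original, and write the result back over the baseline (-O)
patchrpd -C /home/oracle/source/gcbc/rpd/production/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123 -Q Admin123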

There are a lot more options we could incorporate into this process from a patching perspective. We could use the -D option to introduce a decision file, which gives us the ability to highly customize this process. What we’ve demonstrated is a fairly simple option, but to be honest, in a controlled, SDLC approach, it’s rare that we would ever need anything more than this simple process.

Automated Builds with the Rittman Mead Delivery Framework

As you can imagine, with all the brain-power floating around at Rittman Mead, we’ve packaged up a lot of automated processes over the years, and we have active development underway to refactor those processes into the Rittman Mead Delivery Framework. As these disparate pieces get incorporated into our Framework as a whole, we tend to rewrite them in either Groovy or Python, languages adopted because of their strategic place in Oracle’s products. I personally wrote our automated OBIEE build tool, and chose Groovy because of its existing hooks into Gradle, Maven, Ant and all the rest. Also, the domain-specific language (DSL) capability of Groovy (used in both Grails and Gradle) is perfect for writing task-specific processes such as builds. The build process in our framework is driven by the buildConf.groovy config file, which is demonstrated below:

buildConf-mark

With Groovy, most things Java are pretty easy to incorporate, so I added Apache log4j functionality. Running the automated build process first in a logging mode of INFO, we see the following standard output:

build-mark

And now, in DEBUG mode:

[oracle@oracle obiee]$ groovy BuildOBI.groovy 
[main] DEBUG common.GitRepo - currentBranch: current 
[main] DEBUG common.Util - command: git checkout previous 
[main] DEBUG common.GitRepo - currentBranch: previous 
[main] DEBUG common.Util - command: biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /stage/gcbc.rpd -P Admin123 
[main] DEBUG common.Util - command: git checkout current 
[main] DEBUG common.GitRepo - currentBranch: current 
[main] INFO obiee.Build - /stage/gcbc.rpd staged 
[main] DEBUG common.Util - command: comparerpd -P Admin123 -C /home/oracle/source/gcbc/rpd/gcbc -W Admin123 -G /stage/gcbc.rpd -D /home/oracle/source/gcbc/patch/patch.xml 
[main] INFO obiee.Build - /home/oracle/source/gcbc/patch/patch.xml created 
[main] DEBUG common.Util - command: patchrpd -C /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -P Admin123 -Q Admin123 
[main] INFO obiee.Build - /home/oracle/source/gcbc/rpd/qa/gcbc.rpd patched 
[main] DEBUG common.Util - command: patchrpd -C /home/oracle/source/gcbc/rpd/production/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123 -Q Admin123 
[main] INFO obiee.Build - /home/oracle/source/gcbc/rpd/production/gcbc.rpd patched 
[oracle@oracle obiee]$

The real power that Git brings (and frankly, other VCS’s such as Subversion do as well) is the ability to configure commit hooks, such that scripts are executed whenever revisions are committed. You’ll notice that the buildConf.groovy configuration file has a list option called validBranches. This allows us to configure the build to run only when the current branch is included in that list. We have that list unpopulated at the moment (meaning it’s valid for all branches), but if we populated that list, then the build process would only run for certain branches. This would allow us to execute the build process as a commit process (either pre or post commit), but only have the build process run when being committed to particular branches, such as the develop branch or the master branch, when the code is prepared to be released.
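
As a rough illustration only – the hook and script paths below are assumptions, and our framework itself isn’t something you can download – a Git post-commit hook to trigger the build might be as simple as:

#!/bin/sh
# .git/hooks/post-commit -- hypothetical hook that kicks off the OBIEE build after
# each commit; BuildOBI.groovy itself decides (via validBranches in buildConf.groovy)
# whether the current branch should actually be built
cd /home/oracle/source/obiee && groovy BuildOBI.groovy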

Conclusion

So what do we think? We’ve put a lot of thought into implementing enterprise SDLC for our customers, and this process works very well. Indeed, I use our Delivery Framework even in single-user development scenarios (when I’m creating metadata repositories for presentations, for instance) because the tools are repeatable and they just work. We have additional processes that allow us to complete the build by uploading the completed binary RPDs to the appropriate servers and restarting the services.

Categories: BI & Warehousing

Rittman Mead BI Forum 2014 Call for Papers Closing Soon – And News on This Year’s Masterclass

Thu, 2014-01-30 03:00

It’s a couple of days to go until the call for papers for the Rittman Mead BI Forum 2014 closes, with suggested topics this year including OBIEE (of course), Essbase, Endeca, Big Data, Visualizations, In-Memory analysis and data integration. So far we’ve had some excellent submissions but we’re still looking for more – so if you’re considering putting an abstract in, do it now before we close the process late this Friday night!

I’m also very excited to announce that this year’s optional one-day masterclass on the Wednesday before each event will be presented by Lars George from Cloudera, who will be talking about Hadoop, Cloudera’s distribution of Hadoop and their management and real-time query tools, and how these relate to the world of Oracle BI&DW. Lars is a Cloudera Solutions Architect and Head of Services for them in EMEA, and is also an HBase committer and author of the book “HBase: The Definitive Guide”.

You’ll probably have seen a lot on big data, and Cloudera, on this blog over the past few months, and I’m particularly grateful to Justin Kestelyn who used to run OTN and the Oracle ACE Program, but now does a similar role over at Cloudera, for making it happen. Thanks Justin and Lars, and we’ll look forward to seeing Lars in Brighton and Atlanta in May this year.

Once the call for papers closes, we’ll do the usual vote to allow potential attendees to influence the paper selection, and then we’ll announce the agendas and open the events up for registration later in February. Until then though – get your abstracts in now before it’s too late… 

Categories: BI & Warehousing

Exploring OBIEE Web Services through Python

Wed, 2014-01-29 16:21

My very first blog post for Rittman Mead was on the subject of Web Services in BI Publisher 11g, and now I want to return to the subject here, looking at the Web Services of OBIEE itself.

Web Services are basically a way of enabling an application or service to offer up an API to as-yet-undefined clients. This may be an API for fetching data, sending data, or invoking a service or action. In this age of The Cloud, systems being able to interact in this abstracted manner (i.e. without requiring one to be aware of the other at design time) is a Good Thing, and an application that doesn’t support it is rather frowned upon.

For a simple explanation of Web Services, and details of how to explore and test them using SoapUI then refer to my article Web Services in BI Publisher 11g.

OBIEE itself has supported Web Services for a long time now, but they are exposed through SOAP, an HTTP-based protocol that relies on carefully formed XML messages. This is fine, but it had meant that to actually utilise them beyond tinkering within SoapUI you had to create a heavyweight JDeveloper project or similar. One of the many great things about working at Rittman Mead is that as a company we use Macs, meaning that a unix-like command line, and Python, is just a click away. Recently I discovered that Python has a library called SUDS. This makes calling a web service as simple as:

from suds.client import Client
client = Client('http://obiee-server:9704/analytics-ws/saw.dll/wsdl/v7')
sessionid = client.service['SAWSessionService'].logon('weblogic','Password01')

With this code I have just logged into OBIEE and returned a session ID token that I can now use in subsequent web service calls. In the background, SUDS sorts out the forming of the XML SOAP messages to send to the web service, and the parsing of the returned XML SOAP message into a matching Python object. So, now I can actually start exploring and programming using the Web Services straight from my command line…nice.

The Web Services discussed below are all session based, meaning that you have to first authenticate and retrieve a session ID, which is then used for subsequent Web Service calls.

Checking a user’s Subject Area grants

What prompted me to try again with exploring OBIEE’s Web Services was my discussions with my colleagues over ways to Regression Test OBIEE, above and beyond what I have already covered here and here, where I use nqcmd and Catalog Manager to work with the Logical SQL for an analysis. What about security – how can we test what a user can and cannot see from the point of view of an RPD’s Presentation Layer permissions?

Once authenticated, we can call a Web Service which shows the Subject Areas that the user may see. The MetadataService and getSubjectAreas are both straight from the documentation, transposed into SUDS/Python syntax.

sa_list=client.service['MetadataService'].getSubjectAreas(sessionid)
print sa_list

This returns:

[(SASubjectArea){
   name = ""Sales - Fact Sales""
   displayName = "Sales - Fact Sales"
   description = None
 }, (SASubjectArea){
   name = ""Sales - Store Quality""
   displayName = "Sales - Store Quality"
   description = None
 }]

And now for each subject area, examine the available tables and columns within:

for sa in sa_list:
    print '\t%s'%(sa.name)
    sa_contents=client.service['MetadataService'].describeSubjectArea(sa.name,'IncludeTablesAndColumns',sessionid)
    for table in sa_contents.tables:
        print '\t\t%s' % (table.name)
        for col in table.columns:
            print '\t\t\t%s' % (col.name)

Which returns:

"Sales - Fact Sales"
        "Fact Sales"
                "Cost"  
                "Revenue"
                "Sale Amount" 
                "Sale Amount Month Ago"
                "Sale Amount Target" 
        "Dim Times"
                "Month YYYYMM" 
                "Month" 
                "Quarter"
                "Quarter YYYYQ"
[...]

Nice huh? And with a bit of Python scripting magic we could easily dump that out to disk for comparison again once an RPD change has been made.
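
For example, wrapping the snippets above in a hypothetical script (list_sa.py here is my own invention, not something that ships with OBIEE), the before/after comparison could be as simple as:

# dump the subject area / table / column listing before and after the RPD change,
# then diff the two files to spot any Presentation Layer regressions
python list_sa.py > sa_grants.before.txt
# ...deploy the RPD change, then...
python list_sa.py > sa_grants.after.txt
diff sa_grants.before.txt sa_grants.after.txt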

Listing users

Listing out the available RPD objects for a user is nice, but how do we know which users there are? Not everyone’s going to be logging in as weblogic, right? (right?!). There is a SecurityService Web Service that offers a getAccounts method, but how do we call it? Unlike the above MetadataService.getSubjectAreas call, where we just passed in the session ID token, here we need to form a proper object describing the account query that we want to run. SUDS offers a way to do this, using an object ‘factory’:

accountlist = client.factory.create('Account')

From the interactive Python shell we can see what an “Account” object looks like:

>>> print accountlist
(Account){
   name = None
   accountType = None
   guid = None
   displayName = None
 }

And this matches the AccountStructure in the documentation (displayName doesn’t do anything in this method). We can set the object’s attributes to define the accounts we want to see – any account name (* is wildcard), and show users (accountType=0) only, not application roles:

accountlist.name = '*'
accountlist.accountType = 0

And now call the getAccounts method, storing the results in an object:

accounts=client.service['SecurityService'].getAccounts(accountlist,sessionid)

Dumping the object’s contents to the screen in the interactive shell shows:

>>> print accounts
[(Account){
   name = "BISystemUser"
   accountType = 0
   guid = "208001E0834C11E3AF35BD098D6B48E3"
   displayName = "BISystemUser"
 }, (Account){
   name = "OracleSystemUser"
   accountType = 0
   guid = "E33C7570834111E3BF03EF3A9BA6EB6D"
   displayName = "OracleSystemUser"
 }, (Account){
   name = "author_user"
   accountType = 0
   guid = "8FB2B570873F11E3BF851386414CD489"
   displayName = "author_user"
 }, (Account){
   name = "consumer_user"
   accountType = 0
   guid = "A38EEFF0873F11E3BF851386414CD489"
   displayName = "consumer_user"
 }, (Account){
   name = "weblogic"
   accountType = 0
   guid = "E33CC390834111E3BF03EF3A9BA6EB6D"
   displayName = "weblogic"
 }]

Iterating through user accounts

Now we have an object that gives us each user defined on the system. We can iterate through this object, and use Impersonation (which I have written about in detail here) to log in as each user. Once logged in as the user, we could use the above example to iterate through the subject areas available to the user and record their actual RPD security privileges, and so on.

for account in accounts:
        print '\n----\n\n'
        this_acc = account.name
        try:   
                print 'Logging in as %s' % (this_acc)
                imp_client = Client('http://obiee-server:9704/analytics-ws/saw.dll/wsdl/v7')
                new_session = imp_client.service['SAWSessionService'].impersonate('weblogic','Password01',this_acc)
                print 'New session ID %s' % (new_session)
                try:   
                        user=imp_client.service['SAWSessionService'].getCurUser(new_session)
                        print 'Logged in as %s'%(user)
                        #
                        # Do cool stuff as impersonated user here
                except:
                        print 'failed to get user session'
        except:
                print 'failed to login as %s' % (this_acc)

NB to use impersonation you need to have the oracle.bi.server.impersonateUser Application Policy granted, which in a default 11.1.1.7 environment is not granted even to the BIAdministrators Application Role

SUDS logging

Use this little snippet to dump out the raw request/response XML SOAP messages to help debug issues you may^H^H^H will encounter:

import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)

See the SUDS documentation for more detail.

Toolkit

Web Services, even through SUDS, cannot be blagged nor approached through the standard code-by-Google method. You have to be extremely precise with your syntax and object formations, which means you’ll want the OBIEE Web Services documentation (and probably SoapUI) close to hand.

To install SUDS on a *nix platform with Python installed already (which it is by default on Macs, and most Linux distros), use Python’s easy_install program:

sudo easy_install suds

And off you go!

This is all very cool, but is it useful?

(Well, I think it’s cool, but then I laughed at this so go figure)

Being able to tap directly into core OBIEE functionality with a few lines of script, callable natively from any Linux OBIEE server (and any Windows server with Python installed) with minimal dependencies, is very useful. It opens up much of OBIEE’s functionality as – in effect – a Python library. Whether it’s for building out some form of testing suite, automating web catalog management, auditing user privileges, or tasks like a light-touch way to kick off an Agent (or create one from scratch), it’s another option handy to have up one’s sleeve.

Categories: BI & Warehousing

Looking at the ODI12c Hadoop Demos in the New Oracle BigDataLite VM

Wed, 2014-01-29 13:21

In a post earlier today on the blog I took a look at the new Oracle BigDataLite virtual machine that’s now downloadable from OTN, and walked through some of the Cloudera Hadoop tools that come with the VM. At the end of the post I mentioned that there was also an install of ODI12c on the VM, and it comes with a couple of Hadoop integration examples already set up for you. So what do these examples do, and how do they use the Hadoop tools and servers on the VM?

Let’s start with some background first. Hadoop is a framework for executing simple selection, filtering and aggregation batch jobs in a fault-tolerant way across horizontal clusters of servers (the BigDataLite VM is just a single node cluster, but the process is the same). When you load data into a Hadoop cluster for analysis, it’s put into what’s called HDFS (Hadoop Distributed File System), a Unix-like filesystem that spreads data across all nodes in the cluster and has built-in redundancy and fault tolerance – basically, Hadoop server nodes are designed with cheap, higher-failure-rate hardware in mind, and the Hadoop parallel query process detects failed nodes and works around them. In ODI terms, you’d often find data of interest sitting in HDFS, most probably because someone has done some prior processing or analysis using a tool like R, and now you want to load the results into a regular Oracle data warehouse.

NewImage

Then, so that SQL-based tools such as ODI can access these files, another technology called Hive provides a SQL-like access layer over the files, very similar to how Oracle accesses files through external tables, with a Hive metastore playing the role of the Oracle data dictionary in terms of arranging files into tables, columns and databases.

NewImage

Then, in the background, when you query the Hive tables, the Hive Server creates MapReduce jobs on the fly to return your data, splitting the job into various mapper and reducer activities which then run across the Hadoop cluster. Hive isn’t really (these days) designed for BI-type ad-hoc queries, but it’s great for batch access to Hadoop data which is why ODI uses it.
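
You can see this for yourself from the VM’s command line – any aggregate HiveQL query gets compiled into a MapReduce job behind the scenes. The database and table name below are assumptions based on the movie demo data described later, so adjust them to whatever exists in your metastore:

# a simple HiveQL aggregate; the console output shows the MapReduce job that
# Hive launches to answer it (database and table names are assumptions)
hive -e 'select count(*) from default.movieapp_log_avro;'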

In addition there are a bunch of Oracle utilities that Oracle provide for connecting Hadoop to the Oracle database, collectively called Oracle’s Big Data Connectors. One of them, Oracle Loader for Hadoop, extracts data from Hadoop by pushing all of the data transformation work into MapReduce jobs, allowing you to leverage the power of the Hadoop cluster whilst easily loading data into Oracle tables. One of the ODI Hadoop knowledge modules uses this utility, along with another one called Oracle Direct Connector for HDFS. The diagram below shows the architecture behind Oracle Loader for Hadoop, and how it leverages MapReduce to do the “heavy lifting” around the data transformation.

NewImage

Oracle Direct Connector for HDFS is even more conceptually familiar, and allows you to create a special type of external table in Oracle to connect to HDFS files, as opposed to regular filesystem files.

NewImage

The last piece of the puzzle is an add-in to Oracle Data Integrator, called Oracle Data Integrator Application Adaptor for Hadoop. Available for both ODI11g and 12c, this provides a number of new knowledge modules designed for accessing Hadoop data along with connectivity to Hive and HDFS, and is a pre-requisite for the connectivity we’ll see in this posting.

NewImage

The knowledge modules that this application adapter provides are:

  • IKM File to Hive (Load Data) – for loading file data into an existing Hive table
  • IKM Hive Control Append – for loading data to-and-from Hive tables, for in-Hive-database ETL
  • IKM Hive Transform – for transforming Hive data using more complex expressions and SerDes
  • IKM File-Hive to Oracle (OLH) – for loading data into an Oracle table from Hive, using OLH/ODCH
  • CKM Hive – for applying static and flow controls to Hive tables
  • RKM Hive – for reverse-engineering Hive metadata into the ODI repository

So let’s take a closer look at what’s in the ODI12c examples in the BigDataLite VM, starting with the Topology. If you take a look at the Topology tab in ODI Studio you’ll see the Hive technology, and if you drill into it further, connections to the Hive server on the VM and the various Hive databases.

NewImage

What this is connecting to is a service within the Hadoop cluster called HiveServer2 – this is an improvement over the old HiveServer1 that came with earlier distributions of Cloudera Hadoop, which could only reliably support a single connection, whereas HiveServer2 can support many concurrent connections. If you go over to Cloudera Manager and look at the Services tab, you’ll see it listed alongside the Hive Metastore server under the main Hive service.

NewImage

Note that most out-of-the-box Cloudera Hadoop 4 distributions don’t have HiveServer2 enabled and running, so you’ll need to add it from the Services menu if you’re creating your own Hadoop setup.

HiveServer2 runs on port 10000, and ODI connects to it via a JDBC connection. The files and tables that ODI is then going to work with exist in a Hive database, which you can see by looking at Hue and clicking on the Metastore Manager icon. The tables ODI will be working with are movieapp_log_avro and movieapp_log_odistage. In the background, these Hive tables map onto HDFS files, with the movieapp_log_avro one using the Apache Avro data serialisation tool to parse log data into separate “columns” of data.

NewImage

So the ODI project does two things (or three, to be precise):

1. It uses the reusable mapping feature to load data from the avro-format log file into a staging table, also in Hive
2. It then takes that Hive data and loads it into an Oracle database table

All of these are then wrapped up into an ODI package, which calls the first step (and the reusable mapping), and then the second step.

NewImage

The Model section within the Designer navigator shows the two Hive tables as data sources and targets we can work with, with the avro file’s parsing specification turning the log file into a set of columns we can extract from.

NewImage

So looking at the first mapping, it reads from the reusable mapping over the avro table, then transforms and loads the data into another Hive table.

NewImage

Switching to the Physical tab in the mapping editor, you can see that all the work is taking place within a single execution unit – because the transformation is all internal to Hive.

NewImage

Looking at the target properties, you can see the IKM Hive Control Append knowledge module was used.

NewImage

Running the mapping shows you the various steps in the process, and as this is an internal Hive transformation, all you see is HiveQL – the SQL dialect used by Hive.

NewImage

The second mapping then takes this Hive staging table and loads its data into Oracle, with two execution units shown in the Physical mapping view.

NewImage

Then, looking at the target object’s properties, you can see that the IKM File-Hive to Oracle (OLH-OSCH) knowledge module is used to move data out of Hadoop and into Oracle.

NewImage

And when you execute the mapping, you can see the Oracle Loader for Hadoop mapping file being created, the utility run, and then the data moved through the usual staging table and into the target Oracle table.

NewImage

So there you have it. There’s more you can do with ODI and the ODI Application Adaptor for Hadoop but these are two nice examples – take a look if you get a chance.

Categories: BI & Warehousing

Oracle “BigDataLite” VM Now Available for Download on OTN

Wed, 2014-01-29 04:20

Oracle released a new developer VM for download on OTN yesterday called “bigdatalite” – if you’re interested in big data, Hadoop and some of the SQL-on-Hadoop technologies I’ve been looking at recently on the blog, this is something you’ll want to download as soon as possible and play around with. I’ve had access to an earlier version of this VM back from 2012 because of some development work I did with ODI and these technologies, but up until now there’s not been a publicly downloadable version I could point people to. Now there is, so I just wanted to walk through what’s in it, and how you can start to play around with some of the features.

Once you’ve downloaded the various archive files and imported the VM into Virtualbox, log in as oracle/welcome1 and you’ll see a (strangely militaristic-looking) desktop and some links to start an Oracle database, open a browser and so on:

NewImage

Give the various services a few seconds to start up, and then click on the “Start Here” link on the desktop to open your browser.

The getting started page lists out the various products that are installed on the VM, which you can group as:

  • Hadoop and big data products from Cloudera – Cloudera Manager, their equivalent to Enterprise Manager; Cloudera’s distribution of Hadoop (similar to how Red Hat and SuSE distribute their own versions of Linux); and Cloudera Impala and Search, their add-ons to Hadoop that make querying and searching faster
  • Oracle’s Big Data Connectors, a set of technologies that link the Oracle database to Hadoop, allowing you to query Hadoop from Oracle, and load and unload data between the two platforms
  • Oracle Data Integrator 12c, with a couple of Hadoop integration examples pre-created
  • Oracle Database 12c, to use with the Big Data Connectors and ODI
  • Oracle NoSQL database, a key/value database similar to Apache HBase
  • A bunch of other related Oracle tools such as Jdeveloper, SQL Developer, and Oracle’s R Distribution – with R Studio and additional R packages separately installable

So it’s a great place to start playing around with Hadoop in general, a way to get some experience with Impala and Hive if you’re an OBIEE developer, and also a great way to try out the integration pieces between the Oracle Database and Hadoop, including ODI’s capabilities in this area.

If you click on the Cloudera Manager link (http://localhost:7180/cmf/login) you’ll be taken to Cloudera Manager. This web UI allows you to see the state of the various services managed by Cloudera Manager, including

  • HDFS (the distributed filesystem that holds the datafiles then typically analysed using Hive and Impala); 
  • Hive and Impala (two technologies for issuing SQL-type queries over HDFS files); 
  • MapReduce (the core data-processing technology within Hadoop that splits operations into mapping, shuffling and reducing (aggregating) data and automatically parallelises it over nodes in the Hadoop cluster)
  • Sqoop (for loading data into and out of Hadoop from relational databases)
  • Hue (a web UI for all of the above, that we’ll look at in a moment)

NewImage

Hue is the other main web interface you’ll want to look at, and this is more of a developer-focused web app that allows you to create and view HDFS files, create Hive tables and then query them using Hive and Impala.

NewImage

I covered Hue and the process of uploading files to create Hive tables in a couple of blog posts the other week, and once you’ve done that you can query the tables from tools such as OBIEE using the 11.1.1.7 release’s Hive connectivity.

If you’re more from the database side, there’s some tutorials available on the big data connectors and so forth – there doesn’t appear to be any separate tutorials for ODI though so you’ll need to “reverse-engineer” the two examples in ODI Studio to work through how they’ve been created. I’ll try and do this soon and post it on the blog, if anyone’s interested.

NewImage

Anyway, the VM is downloadable now with supporting materials available on OTN here. There are plenty of earlier posts on our blog that might be of interest to you if you’re looking to try OBIEE and ODI with this platform.

Categories: BI & Warehousing

Automated Regression Testing for OBIEE

Thu, 2014-01-23 15:00

In the first article of this series I explored what regression testing is, why it matters, and, by breaking down the OBIEE stack into its constituent parts, where it is possible to do it for OBIEE. In this posting, I explore some approaches that lend themselves well to automation, for testing that existing analyses and dashboards are not affected by RPD changes.

Easy Automated RPD regression testing – it’s all about the numbers

“Bring me solutions not problems” goes the mantra, and in the first article all I did was rain on the parade of the de facto regression testing approach, looking at the front end using functional testing tools such as Selenium. So if not at the front end, then where should we focus our automated regression testing of OBIEE? Answer: the data.

The data should arguably be what is most important to our users. If it’s wrong, that’s bad, and if it’s right, hopefully they’re going to be happy. Obviously there are other factors in making users happy, not least performance and the visual appearance of the data. But a system that gives users wrong data, or no data, is fundamentally a failed one.

Looking at the following diagram of a request/response through the OBIEE stack we can see that so far as data is concerned, it is the BI Server doing all the work, handling both logical and physical SQL and data sets:

The data that it passes back up to Presentation Services for rendering in the user’s web browser is the raw data that feeds into what the user will see. Obviously the data gets processed further in graphs, pivot tables, narrative views, and so on – but the actual filtering, aggregation and calculation applied to data is all complete by the time that it leaves the BI Server.

How the BI Server responds to data requests (in the form of Logical SQL queries) is governed by the RPD, the metadata model that abstracts the physical source(s) into a logical Business model. Because of this abstraction, all Logical queries (i.e. analysis/dashboard data requests) are compiled into Physical SQL for sending to the data source(s) only at runtime. Whenever the RPD changes, the way in which any Logical query is handled may change.

So if we focus on regression testing the impact that changes to the RPD have on the data alone then the available methods become clearer and easier. We can take the Logical SQL alone that Presentation Services generates for an analysis and sends to the BI Server, and we can run it directly against the BI Server ourselves to get the resulting [logical] dataset. This can be done using any ODBC or JDBC client, such as nqcmd (which is supplied with OBIEE at installation).

Faith in reason

If:

  • the Logical SQL remains the same (i.e. the analysis has not changed nor Presentation Services binaries changed – but see caveat below)
  • the data returned by the BI Server as a result of the Logical SQL before and after the RPD change is made is the same

Then we can reason (through the above illustration of the OBIEE stack) that

  • the resulting report shown to the user will remain the same.

Using this logic we can strip away the top layer of testing (trying to detect if a web page matches another) and test directly against the data without having to actually run the report itself.

In practice

To use this method in practice, the process is as follows:

  1. Obtain the Logical SQL for each analysis in your chosen dashboards
  2. Run the Logical SQL through BI Server, save the data
  3. Make the RPD changes
  4. Rerun the Logical SQL through BI Server, save the data
  5. Compare data before & after to detect if any changes occurred

The Logical SQL can be obtained from Usage Tracking, for example:

SELECT QUERY_TEXT   
FROM   S_NQ_ACCT   
WHERE  START_TS > SYSDATE - 7  

or you can take it directly from the nqquery.log. For more than a few analyses, Usage Tracking is definitely the more practical option.


You can also generate Logical SQL directly from an analysis in the catalog using runcat.sh – see later in this post for details.

If you don’t know which dashboards to take the Logical SQL from, ask yourself which are going to cause the most upset if they stop working, as well as making sure that you have a representative sample across all usage of your RPD’s Subject Areas.

Give the Logical SQL of each analysis an ID and have a log book of which dashboard it is associated with, when it was taken, etc. Then run it through nqcmd (or alternative) to return the first version of the data.

nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s analysis01.lsql -o analysis01.before.csv  

where:

  • -d is the BI Server DSN. For remote testing this is defined as part of the configuration/installation of the client. For testing local to the server it will probably be AnalyticsWeb on Linux and coreapplication_OHxxxxxx on Windows
  • -u and -p are the credentials of the user under whose ID the test should be executed
  • -s specifies the Logical SQL script to run
  • -o the output file to write the returned data to.

For more information about nqcmd see the manual.

Once you’ve run the initial data sample, make your RPD changes, and then rerun the data collection with the same command as before (except to a different output file). If the nqcmd command fails then it’s an indication that your RPD has failed regression testing already (because it means that the actual analysis will presumably also fail).

An important point here is that if your underlying source data changes or any time-based filter results change then the test results will be invalid. If you are running an analysis looking at “Sales for Yesterday”, and the regression test takes several days then “Yesterday” may change (depending on your init-block approach) and so will the results.

A second important point to note is that you must take into account the BI Server cache. If enabled, either disable it during your testing, or use the DISABLE_CACHE_HIT request variable with your Logical SQL statements.
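
If you take the request variable route, one option (a minimal sketch, assuming GNU sed and the analysisNN.lsql file naming from the nqcmd example above) is to prepend the variable to each Logical SQL script before it is run:

# prepend the DISABLE_CACHE_HIT request variable so the BI Server bypasses its cache
# (file naming is illustrative - adjust to your own layout)
for f in analysis*.lsql; do
  sed -i '1i SET VARIABLE DISABLE_CACHE_HIT=1;' "$f"
done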

Having taken the before and after data collections for each analysis, it’s a simple matter of comparing the before/after for each and reporting any differences. Any differences are typically going to mean a regression has occurred, and you can use the before/after data files to identify exactly where. On Linux the diff command works perfectly well for this.
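
For example, comparing the before and after files captured with nqcmd earlier:

diff analysis01.before.csv analysis01.after.csv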

In the example I ran, the ‘after’ test failed with a missing table error, which showed up immediately in the diff output. If both files are identical (meaning there is no difference in the data before and after the RPD change), there is no output from diff.

Tools like diff are not pretty, but in practice you wouldn’t be running all this manually – it would be scripted, reporting on exceptions only.

So a typical regression test suite for an existing RPD would be a set of these nqcmd calls to an indexed list of Logical SQL statements, collecting the results for comparison with the same executions once the RPD changes have been made.
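
As a rough sketch of what such a script might look like (the DSN, credentials and analysisNN.lsql / .before.csv naming follow the examples above and are illustrative; depending on your version you may need to grep nqcmd’s output for errors rather than rely on its exit code):

#!/bin/bash
# Run each Logical SQL script through nqcmd and compare the resulting dataset
# with the previously captured baseline (analysisNN.before.csv).
DSN=AnalyticsWeb
BIUSER=weblogic
BIPWD=Password01

for lsql in analysis*.lsql; do
  id=${lsql%.lsql}
  if ! nqcmd -d $DSN -u $BIUSER -p $BIPWD -s "$lsql" -o "${id}.after.csv"; then
    echo "REGRESSION: $id - nqcmd failed"
    continue
  fi
  if diff -q "${id}.before.csv" "${id}.after.csv" > /dev/null; then
    echo "OK: $id"
  else
    echo "REGRESSION: $id - datasets differ"
  fi
done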

Tips
  • Instead of collecting actual data, you could run the results of nqcmd directly through md5 and store just the hash of each resultset, making for faster comparisons (see the sketch after this list). The drawback of this approach would be that to examine any discrepancies you’d need to rerun both the before & after tests. There is also the theoretical risk of a hash collision (where the same hash is generated for two non-matching datasets) to be aware of.
  • diff sets a shell return code depending on whether there is a difference in the data (RC=1) or not (RC=0), which makes it handy for scripting into if/then/else shell script statements
  • nqcmd uses stdout and stderr, so instead of specifying -o for an output file, you can redirect the output of the Logical SQL for each analysis to a results file (file descriptor 1) and an error file (file descriptor 2), making spotting errors easier:
    nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s analysis01.lsql 1>analysis01.out 2>analysis01.err  
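
A minimal sketch of the hashing idea from the first tip (md5sum and the file naming are illustrative):

nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s analysis01.lsql -o analysis01.tmp
md5sum analysis01.tmp > analysis01.before.md5
rm analysis01.tmp
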
Taking it one step further

We can strip away the layers even further, in two additional stages:

  • Instead of examining the data returned by a generated physical SQL query, simply compare the generated SQL query itself, before and after an RPD change. If the query is the same, then the data returned will be the same, and therefore the report will be the same.
    One thing to watch for is that the Physical query logged in Usage Tracking (although not in nqquery.log) has a Session ID embedded at the front of it which will make direct comparison more difficult.
  • The Physical SQL is dependent on the RPD; if the RPD changes then the Physical SQL may change. However, if neither the RPD nor the inbound Logical SQL has changed and only the underlying data source has changed (for example, a schema modification or database migration) then we can ignore the OBIEE stack itself and simply test the results of the Physical SQL statement(s) associated with the analysis/dashboard in question and make sure that the same data is being returned before and after the change.
The fly in the ointment – Logical SQL generation

This may all sound a bit too good to be true; and there is indeed a catch of which you should be aware.

Presentation Services does not save the Logical SQL of an analysis, but rather regenerates it at execution time. The implication of this is that the above nqcmd method could be invalid in certain circumstances where the generated Logical SQL changes even when the analysis and patch level remain unchanged. If an analysis’ Logical SQL changes then we cannot use the same before/after dataset comparison as described above — because the ‘after’ dataset would not actually match what would be returned. In reality, if the Logical SQL changes then the corresponding Logical resultset is also going to be different.

Two factors that will cause Presentation Services to generate different Logical SQL for an analysis without changing the analysis at all are modifications to an RPD Logical Column’s Sort order column or Descriptor ID column configuration.

As an example, consider a standard “Month” column. By default, the column will be sorted alphabetically, so it starts with April rather than January.

The Logical SQL for this can be seen in the analysis’ Advanced tab.

Now without modifying the analysis at all, we change the RPD to add in a Sort order column:

Reload the RPD in Presentation Services (Reload Files and Metadata) and reload the analysis and examine the Logical SQL. Even though we have not changed the analysis at all, the Logical SQL has changed:

When executed, the analysis results are now sorted according to the Month_YYYYMM column, i.e. chronologically:

The same happens with the Descriptor ID Column setting for a Logical Column – the generated Logical SQL will change if this is modified. Changes to Logical Dimensions can also affect the Logical SQL that is generated if an analysis is using hierarchical columns. For example, if the report has a hierarchical column expanded down one level, and that level is then deleted from the logical dimension in the RPD, the analysis will instead show the next level down when next run.

Regression Testing Logical SQL generation

It is important that Logical SQL is considered as part of regression testing if we are using this targeted approach – it is the price to pay for selectively testing elements of the stack using reasoning to exclude others. In this case, if the Logical SQL changes then we cannot compare datasets (because the source query will have changed when the actual analysis is run). In addition, if the Logical SQL changes then this is a regression in itself. Consider the above Sort order column example – if that were removed from an RPD where it had been present, users would see the effect and quite rightly raise it as a regression.

There are at least two ways to get the Logical SQL for an analysis programmatically: the generateReportSQL web service, and the Presentation Services Catalog Manager tool. We will look at the latter option here. The Catalog Manager can be run interactively through a GUI, or from the command line. As a command line utility it offers a rich set of tools for working with objects in the Presentation Catalog, including generating the Logical SQL for a given analysis. The logical outline for using it would be as shown below. If the Logical SQL is not identical, or runcat.sh fails to generate Logical SQL for the analysis after the RPD is changed, then a regression has occurred. If the Logical SQL has remained the same then the testing can proceed to the nqcmd method to compare resulting datasets.

rt32

runcat.sh is a powerful utility but a bit of a sensitive soul for syntax. First off, it’s easiest to call it from its home folder:

cd $FMW_HOME/instance1/bifoundation/OracleBIPresentationServicesComponent/coreapplication_obips1/catalogmanager

To see all the things that it can do, run

 ./runcat.sh -help

Or further information for a particular command (in our case, we’re using the report command):

./runcat.sh -cmd report -help

So to generate the Logical SQL for a given analysis, call it as follows:

./runcat.sh -cmd report -online http://server:port/analytics/saw.dll -credentials creds.txt -forceOutputFile output.lsql -folder "/path/to/analysis" -type "Analysis" "SQL"

Where you need to replace:

  • server:port with your OBIEE server name and the Managed Server port number (for example, biserver:9704)
  • creds.txt is a file with your credentials in, see below for further details
  • output.lsql is the name of the file to which the Logical SQL will be written. Remove the ‘force’ prefix if you want runcat.sh to abort if the file exists already rather than overwrite it
  • /path/to/analysis is the full path to the analysis (!), which you can get from both OBIEE (Catalog -> Object -> Properties) and from the Catalog Manager in GUI mode (Object -> Properties). In the screenshots here the full path is /shared/Standard/Analyses/Sales Report

Having called runcat.sh once, you then make the RPD change, reload the RPD in Presentation Services, and then call runcat.sh again and compare the generated Logical SQL (e.g. using diff) – if it’s the same then you can be sure that when the analysis runs it is going to do so with the same Logical SQL, and you can thus use the nqcmd method above for comparing before/after datasets.
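
A minimal sketch of that sequence, reusing the runcat.sh syntax from above (server, credentials file and catalog path are as per the earlier example and should be adjusted for your environment):

# capture the Logical SQL before the RPD change
./runcat.sh -cmd report -online http://server:port/analytics/saw.dll -credentials creds.txt -forceOutputFile before.lsql -folder "/shared/Standard/Analyses/Sales Report" -type "Analysis" "SQL"

# ...make the RPD change and reload it in Presentation Services...

# capture it again and compare
./runcat.sh -cmd report -online http://server:port/analytics/saw.dll -credentials creds.txt -forceOutputFile after.lsql -folder "/shared/Standard/Analyses/Sales Report" -type "Analysis" "SQL"

if diff before.lsql after.lsql > /dev/null; then
  echo "Logical SQL unchanged - safe to compare datasets with nqcmd"
else
  echo "REGRESSION: generated Logical SQL has changed"
fi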

To call runcat.sh you need the credentials in a flatfile that looks like this:

login=weblogic
pwd=Password01

If the plaintext password makes you uneasy then consider the partial workaround that is proposed in a blog post that I wrote last year: Make Use of OBIEE’s Command Line Tools with Reduced Exposure of Plain Text Passwords

Bringing it together

Combining both nqcmd and runcat.sh gives us a logic flow as follows.

  1. Run Logical SQL through nqcmd to generate initial dataset
  2. Run analysis to which the Logical SQL corresponds through runcat.sh to generate initial Logical SQL
  3. Make RPD changes
  4. Run analysis through runcat.sh again. If it fails, or the Logical SQL doesn’t match the previous version, then a regression has occurred.
  5. Rerun nqcmd, and compare the before/after datasets. If they don’t match, or nqcmd fails, then a regression has occurred.

You may wonder why we have two Logical SQL statements present – that for use with nqcmd, and that from runcat.sh. The Logical SQL for use with nqcmd will typically come from actual analysis execution (nqquery.log / Usage Tracking) with filter values present. To compare generated Logical SQL, the source analysis needs to be used.

Summary

So to summarise, automated regression testing of OBIEE can be done using just the tools that are shipped with OBIEE (and a wee bit of scripting to automate them). In this article I’ve demonstrated how this can be done, and suggested how it should be done if the changes are just in the RPD. Working directly with the BI Server, Logical SQL and resultset is much more practical and easier to automate at scale. Understanding the caveat to this approach – that it relies on the Logical SQL remaining the same – and understanding in what situations this may apply is important. I demonstrated another automated method that could be used to automatically flag any tests for which the dataset comparison results would be invalid.

Testing the data that analyses request and receive from the BI Server can be done using nqcmd by passing in the raw Logical SQL, and this Logical SQL can itself be programmatically validated using the Catalog Manager tool in command line mode, runcat.sh.

Looking back at the diagram from the first post in this series, you can see the opportunities for regression testing in the OBIEE stack, with the point being that a clear comprehension of this would allow one to accurately target testing, rather than assuming that it must take place at the front end:

If we now add to this diagram the tools that I have discussed in the article, it looks like this:

I know I’ve not covered Selenium, but I’ve included it in the diagram for completeness. The other tool that I plan to cover in a future posting is the OBIEE Web Services as they also could have a role to play in testing for specific regressions.

Conclusion

Taking a step back from the detail, I have shown here a viable and pragmatic approach to regression testing OBIEE in a manner that can actually be implemented and automated at scale. It is important to be aware that this is not 100% test coverage. For example, it omits important considerations such as testing for security regressions (can people now see what they shouldn’t?). However, I would argue that some regression testing is better than none, and regression testing “with one’s eyes open” to the limitations that can be addressed manually is a sensible approach.

Regression testing doesn’t have to be automated. A sensible mix of automation and manual checking is a good idea to try and maximise the test coverage, perhaps following an 80/20 rule. The challenges around regression testing the front end mean that it is sensible to explore more focussed testing further down the stack where possible. Just because the front end regression testing can’t be automated, it doesn’t mean that the front end shouldn’t be regression tested – but perhaps it is more viable to spend time visually checking and confirming something than investing orders of magnitude more hours in building an automated solution that will only ever be less accurate and less flexible.

Regression Testing OBIEE is not something that can be solved by throwing software at it. By definition, software must be told what to do and OBIEE is too flexible, too complex, to be able to be constrained in such a manner that a single software solution will be able to accurately detect all regressions.

Successfully planning and executing regression testing for OBIEE is something that needs to be not only part of a project from the outset but is something that the developers must take active responsibility for. They have the user interviews, the functional specs, they know what the system should be doing, and they know what changes they have made to it – and so they should know what needs testing and in what way. A siloed approach to development where regression testing is “someone else’s problem” is never going to be as effective (in terms of accuracy and time/money) as one in which developers actively participate in the design of regression tests for the specific project and development tasks at hand.

Many thanks to Gianni Ceresa for his thoughts and assistance on this subject.

Categories: BI & Warehousing

Successful BI Apps Implementation Part 2: BI Apps 7.9.6

Wed, 2014-01-22 17:58

Welcome to Part 2. If you missed Part 1, where I give an introduction to BI Apps and discuss the project life cycle, the link is below.

Successful BI Apps Implementation Part 1: Introduction and the Project Life Cycle

Successful BI Apps Implementation Part 2: BI Apps 7.9.6

In this post, I get technical and take a deeper look at BI Apps 7.9.6. Here goes!

Customising the Data Model

As defined by Oracle, customisations fall into the following categories:

  • Category 1 customisations where we add a new column to an existing fact or dimension table.
  • Category 2 customisations where we create new fact or dimension tables.
  • Category 3 customisations where we add an additional data source to an existing fact or dimension table.
  • Other trivial customisations such as changing filter conditions.

Typically, with the trivial and category 1 customisations, there isn’t much to go wrong. We may need to think about auxiliary change capture (see below) if the column comes from a table other than the base table in the OLTP.

With the category 2 customisations, we need to consider whether new fact tables are suitable for being added to an existing subject area or (more commonly) whether they merit a subject area of their own. I sometimes see subject areas with multiple fact tables and many non-conformed dimensions. This happens when a subject area is iteratively expanded again and again until the usability of the subject area is severely diminished. Things have probably gone wrong if either a) it is possible to create queries that the BI server refuses to interpret or b) the development process constantly involves setting metrics to total over non-conformed dimensions.

Regarding category 2 customisations involving new dimensions, another complication that arises is the modelling of many-to-many (M:M) relationships from the OLTP. Take this example: a user looks at an opportunity in Siebel and as well as seeing all the fields that are visible in OBIEE, they also see a field called Sales Team. The user asks for this to be added to the Sales – CRM Pipeline subject area. This is non-trivial to model in OBIEE as the Siebel Sales Team field actually opens a shuttle applet that displays the multiple sales team members for the opportunity. It might seem obvious that Sales Team can’t be modelled as an attribute of the opportunity dimension but what should be done instead? 99% of the time, my advice would be to examine the business reason behind the request to add the column and find an alternative. It may be that just the primary sales team member is required for reporting purposes or that the Created By or Last Updated By users could be used instead. In the remaining 1% of cases we can consider the use of bridge tables, but considering the number of OOTB metrics that would have to be remodelled to accommodate this change, the functional advantage gained from the change may not be enough to justify such dramatic technical changes. In situations like this, knowing when not to customise can be important for both ease of use and ease of upgrade.

Category 3 customisations do not strictly require RPD changes but we should at least have a column to tell us which source system the row of data originated from. Also, if factual rows from any particular source system cannot be reconciled with certain dimensions (due to the lack of any foreign key column), it can be beneficial to add a row to that dimension (with ROW_WID equal to -1 for example) to cater for these fact rows. It allows us to differentiate between rows from an Oracle source system that do not have an associated dimension row and rows from a non-Oracle source system that cannot have an associated dimension row.
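
As an illustration only (the dimension, column names and connection details below are hypothetical – substitute your own), seeding such a row is a one-off statement against the warehouse:

sqlplus -s dw_user/password@OBAW <<'SQL'
-- hypothetical custom dimension: an 'Unspecified' row for fact rows that have no foreign key
INSERT INTO WC_SALES_REGION_D (ROW_WID, REGION_NAME) VALUES (-1, 'Unspecified');
COMMIT;
SQL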

Data Lineage / Tracking Customisations

One of the questions that I sometimes get asked by end users is ‘where does this column come from in EBS / Siebel?’. When dealing with OOTB BI Apps, this is usually quite easy to answer because:

  • Even though our semantic model may be extremely vast, it is usually quite simple and
  • Oracle provides good data lineage documentation and more importantly, descriptive text in the RPD which manifests as hover-over text in the UI.

However, I often find that after a couple of years of customising, the question isn’t so easy to answer. From an IT point-of-view, it’s usually quite easy to determine what is vanilla and what is custom, but from an end-user point-of-view, the exact functional definition of custom columns is not always that obvious. The simple solution is to follow the same standards that Oracle adheres to for the vanilla material: give new subject areas, presentation tables and presentation columns useful descriptions; keep subject areas small and intuitive (1 or 2 fact tables per area); and maintain a data lineage spreadsheet that maps columns between source applications and the Oracle data warehouse.

This documentation becomes very useful when we come to upgrade BI Apps. Typically, BI Apps upgrades involve a combination of automated and manual code merging – using a 3-way repository merge on the RPD and some more manual steps to re-apply customisations to the new vanilla Informatica metadata. When testing the merged RPD and customising copies of the new Informatica mappings, the new code should be functionally identical but may be technically different due to data model changes between BI Apps versions. At this point, the above documentation becomes invaluable.

Indexing and Partitioning

In my previous post, I talked about building performance testing and monitoring into the project lifecycle. Here we will focus more on the design and development stages.

In the above section, I describe category 1 customisations as extremely simple. Well, they are, but the one mistake that people often make is not considering whether an index should be created along with the new column. This can lead to a slow but steady increase in average query times as more and more non-indexed columns are used to group and filter data. Generally, if any dimensional attribute is going to be used for grouping or filtering, it will probably benefit from an index. The decision about whether to make it a bitmap index depends on data density. For example, for the field ‘Customer Priority’ (high/med/low), use a bitmap and for ‘Customer ID’, don’t. Also, if a new column is used as a key in an update statement, make sure that the corresponding index has a type of ‘ETL’ in DAC so that it is not dropped as part of the ETL.
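
For example (a sketch only – the custom column names and connection details are illustrative, and in practice the new indexes should also be registered in DAC):

sqlplus -s dw_user/password@OBAW <<'SQL'
-- low-cardinality custom attribute: a bitmap index is appropriate
CREATE BITMAP INDEX WC_ORG_D_CUST_PRIORITY ON W_ORG_D (X_CUSTOMER_PRIORITY);
-- high-cardinality attribute such as an ID: use a normal b-tree index instead
CREATE INDEX WC_ORG_D_CUST_ID ON W_ORG_D (X_CUSTOMER_ID);
SQL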

Partitioning is an interesting topic in the context of BI Apps. Typically, OOTB BI Apps reports and dashboards perform well even with large data volumes, due to the thousands of indexes that are provided with the OBAW. Therefore, why partition? I can think of 2 main reasons:

  • We have extremely large data volumes in some of our fact tables and have decided to partition by year and include a corresponding filter (i.e. current year) in one of our logical table sources.
  • We have performed a category 3 customization or similar and again have a logical table source filter that could be used as a partition key. For example, our revenue fact contains rows sourced from both Siebel and another system, and only the Siebel data should be visible in the OOTB subject areas.

In both of the above scenarios, we know that the partition key will be used in queries generated by the BI server due to the RPD customisations that accompany the partitioning. So what about scenarios that don’t involve the corresponding RPD changes? Should we consider partitioning to speed up a certain group of reports that have certain filters applied? In my opinion, no, we should only consider it as a viable option when we can guarantee that a) the partition key will be used in queries and b) we expect little or no movement of rows across partitions. Even if these criteria are met, we should only be looking at partitioning after we are convinced that our indexing strategy is perfect. Why? Because badly chosen partition keys can make performance worse! Even the addition of a partition that speeds up some reports can have a negative impact on those that do not reference the partition key due to the increased number of I/O operations involved in reading indexes on multiple partitions.

One important point to note is that bitmap indexes must be defined as LOCAL on partitioned tables. This means that we have to change how DAC creates its OOTB bitmap indexes if we partition an OOTB table. This can be done using Index Actions in DAC but should serve as another deterrent to unnecessary partitioning!
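
To make the second scenario above concrete, here is a sketch of a revenue fact list-partitioned by source system with a LOCAL bitmap index (table and column names are illustrative, and a real implementation would be coordinated with DAC Index Actions rather than run ad hoc):

sqlplus -s dw_user/password@OBAW <<'SQL'
CREATE TABLE WC_REVN_F_PART
PARTITION BY LIST (DATASOURCE_NUM_ID)
( PARTITION p_siebel VALUES (1),
  PARTITION p_other  VALUES (DEFAULT) )
AS SELECT * FROM W_REVN_F;

-- bitmap indexes on a partitioned table must be created LOCAL
CREATE BITMAP INDEX WC_REVN_F_PART_OPTY ON WC_REVN_F_PART (OPTY_WID) LOCAL;
SQL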

Customising the ETL

There are two main types of customisations that we make to the OOTB DAC and Informatica metadata. Firstly, we may take a copy of an OOTB mapping and make some minor changes. Typically, these will include new columns, sources and lookups, depending on the type of mapping. Secondly, we can create some custom mappings to extract from new sources or load into new targets. Before I give any advice about making these changes, let me first make a point about ETL vs ELT.

In OBIA 7.x, the Informatica server is the ETL engine. Compare this with Oracle Data Integrator where we typically use the Oracle Database as the ELT engine. With ODI, a fact mapping can be used to generate a single SQL statement that takes data from a staging table and loads into a target, populating the necessary foreign keys and generating surrogate keys where necessary. With Informatica, this will instead involve multiple SQL statements just to extract the staging data and data used in lookups, followed by a large amount of processing on the Informatica server itself. Finally, row-by-row insert / update statements will be issued to the target table. Clearly, this is less efficient than the ELT option regardless of how much optimisation we do on the Informatica server side.

The above weakness of the tool means that when we need a mapping to exclusively perform updates, it is often tempting to add some custom SQL to the end of a mapping / workflow rather than creating a whole new mapping that will a) take longer to develop and b) probably run slower. The problem with this approach is code readability. The same applies for the use of stored procedures within mappings (which is often not great for performance either). So my advice is to minimise the number of PLP mappings where possible (save them for populating aggregate tables) and to stick to standard Informatica development methods where possible.

Changed Data Capture

Changed data capture (CDC) is one of the topics that seem to create problems for BI Apps developers. It’s not an intrinsically difficult concept to master but I have seen it overlooked in both the design and testing phases of a couple of projects (mainly when working with Siebel). It’s worth pointing out that CDC is handled very differently for different source systems due to the varying degrees of metadata capture and audit trail functionality between systems.

CDC for EBS is pretty simple – EBS tables have nice metadata columns such as LAST_UPDATE_DATE that allow us to easily extract data that has changed since the last ETL run (minus a few days). Siebel is slightly more complicated and involves creating change capture tables and triggers in the OLTP database. We still have nice metadata columns like in EBS, but we have to worry more often about auxiliary change capture (see below) and tracking hard-deletes. When working with PeopleSoft, complications exist because update timestamps do not exist on some base tables.

Importantly, when customising the Oracle Business Analytics Warehouse and adding new source tables to the ETL, we must consider the CDC requirements for each new source. As an example, imagine we are loading opportunities from the S_OPTY table in Siebel. By default, our SDE mapping will extract records from the associated change capture view V_OPTY that has been left outer joined to a bunch of auxiliary tables. Now imagine that we need to extract a new custom column called X_IMPORTANCE from the auxiliary table S_OPTY_X. The steps to achieve this are obvious – add S_OPTY_X as a source in the SDE mapping and create a left outer join between S_OPTY and S_OPTY_X, then map the new column from source to target. However, what happens when the value of X_IMPORTANCE is updated in the OLTP? Do we see the update in the OBAW? The answer is ‘maybe’, it depends if the core opportunity record also got updated since the last ETL.

By default, the change in the auxiliary table will not lead to the corresponding opportunity appearing in our change capture view. If we want to see the update in OBIEE without relying on an update to the base opportunity record, we must create or extend a separate mapping that exists purely to populate the change capture view used in the SDE mapping. In this situation, there is a trade-off between functionality (seeing all source system updates in OBIEE) and ETL performance (due to the extra auxiliary change capture process) but I advise starting with the assumption that auxiliary change capture is required. I have seen faulty auxiliary change capture go unnoticed a couple of times so make sure that this functionality is tested!

That’s it for BI Apps 7.9.6. Keep an eye out for more posts in 2014 where I will be blogging about BI Apps 11g.

Categories: BI & Warehousing

Successful BI Apps Implementation Part 1: Introduction and the Project Life Cycle

Tue, 2014-01-21 04:00

Since I am new to the Rittman Mead blog, let me first introduce myself. I am Mike and I have been working at Rittman Mead for the last year, mainly focusing on Oracle BI Applications. I have a background in Siebel CRM and many of my BI projects have involved integration with Siebel. Being an OBIA consultant, my work combines CRM, ERP, ETL, data warehousing and analytics.

Over the next two postings I will be looking at what makes an OBIA project successful. Firstly, I will try to justify why BI Apps can add so much value to a business and how best to extract this value from the product. Most of this will be non-technical and will examine the BI Apps project lifecycle as a whole. In Part 2, I will look at BI Apps with Informatica and DAC, focusing on good technical implementation and aspects of customisation that are often overlooked. Looking ahead to 2014, I will look at BI Apps 11g and how our steps to a successful project are affected by the architectural changes to the product.

I will add the links in as the postings are published, but here are the topics and links for the other parts of the series:

Successful BI Apps Implementation Part 1: Introduction and the Project Life Cycle

Successful BI Apps Implementation Part 2: BI Apps 7.9.6

Introduction to BI Apps

Oracle Business Intelligence Applications (OBIA or BI Apps) is a packaged BI solution for use with Oracle source systems such as EBS, JD Edwards, Siebel and PeopleSoft. It uses OBIEE as a reporting platform and until recently, Informatica as the ETL tool. Now, Oracle also provides BI Apps versions that use Oracle Data Integrator as the ELT tool. In either case, data from Oracle source systems is loaded into the Oracle Business Analytics Warehouse (OBAW) using prebuilt Informatica / ODI metadata. Users then access prebuilt dashboards and subject areas in OBIEE that use the OBAW as a source. OBIA is modular and an individual Application is used to describe all of the Informatica / ODI and OBIEE metadata used for a particular functional area such as Financial Analytics or Sales Analytics. OBIA can be customised to reflect customisations in Oracle source systems or to introduce entirely new reporting functionality including new transactional sources.

BI Apps 7.9.x Architecture

If you have an Oracle transactional system, the TCO and time to production associated with implementing BI Apps should be significantly better than if you were to implement a custom BI solution. The technical quality, functional depth, ease-of-use and extensibility of the applications mean that businesses can gain very rapid insight into their processes without the need for extremely long IT projects.

Defining Success for a BI Apps Project

So let’s define our criteria for a successful OBIA project. My belief is that by the time that BI Apps is in production, the following five statements should hold true:

  • Data is accurate and importantly, users trust that the data is accurate!
  • The system performs ‘well’. I could devote an entire blog post to this topic so let’s say that when a user asks a question, they should get a response in no more than 10 seconds. We must also consider ETL performance.
  • The data model and dashboards presented to the user are intuitive and any data lineage information is self-contained.
  • Users understand their own data visibility rules and these rules are well aligned with the rules in the OLTP source systems (EBS, Siebel etc.).
  • The system is being actively used to drive business decisions, or as an integral part of a business process. Ideally, there should be some measurable ROI.

Now, how do we go about achieving these goals? Well, the answer is long enough to merit a few blog posts! Before we focus on anything too technical, let’s talk about how we handle requirements and design.

Don’t do it blind

Let’s start at the beginning of the Project Life Cycle. Often, big waterfall-style IT projects go through requirements and design phases before users begin to understand exactly how the final product will look and function. Personally, I don’t like this approach in general and I certainly think that there is a better approach for BI Apps projects (see a description of Extreme BI by Stewart Bryson for Agile methodology with OBIEE). With BI Apps, we have the ability to install the out-of-the-box (OOTB) product fairly quickly and allow users to see what the OOTB dashboards and subject areas will look like with their own data. There is some configuration required prior to performing an initial data load but the process of detailed requirements gathering and design becomes much easier if users can perform their own fit/gap analysis with realistic data. This approach allows us to ask users ‘how well aligned is this Subject Area to your needs?’ instead of ‘what would you like to report on?’. Typically, we get much more enthusiasm and useful feedback with this approach.

Great, I can see everyone’s data!

At this discovery and prototyping stage of the project, I believe that implementing enterprise security can be beneficial. Typically, we need to decide: who sees what, who does what and importantly, how we administer this. It may seem counter-intuitive to worry about boring old data security at this point in the project but in some industries, the OOTB security options for integration with applications such as EBS and Siebel may not be sufficient. I have found this to be particularly true in the banking industry and in implementations where OBIEE is customer-facing. Also, when dealing with EBS R12 integration, be aware that Oracle provides an optional EBS patch that provides more granular control of account visibility across different areas of the suite (GL, AR, AP etc.) and that implementing the same functionality in BI Apps requires customisation. Identifying at this stage if and how the BI Apps security model requires customising can save major headaches later on in the project lifecycle.

Design processes, not just dashboards

By this point in the project, we should have a good understanding of the changes required to the BI Apps data model, Dashboards and security configuration. So, there is no more need to speak to users then? Actually, it’s probably time to start thinking about Agents, Actions, and Guided Navigation. I see two main reasons for this.

Firstly, I think it makes a huge difference to initial user experience and user acceptance in the days and weeks immediately following a go-live. As an example, a sales manager may use an OOTB dashboard to perform some pipeline analysis on a daily basis. However, a better solution might be to create an exception report (such as ‘South East Opportunities pending approval for more than 2 days’) and an associated agent. Even better would be to create an Action allowing the manager to approve the opportunity with one click. This way, the sales manager can spend less time getting data and more time acting on it. In order to add this level of business value, we need to commit time to understanding exactly what users want to achieve – not just what data they want.

Secondly, examining existing or future processes in this level of detail should highlight any missing functionality from the customised data model. Using the above example, the custom metric ‘# of Days Pending Approval’ is intrinsic to the process and we would notice if it was missing from the data model.

Monitor, Evaluate, Educate

Now, let’s assume that we have a functioning system in production (see the next blog post for how to get to this stage). Dashboards perform well, functionality is well aligned with business processes and the IT guys are happy because the nightly load only takes 1 hour. Unfortunately, things can change! We should expect and plan for: data growth, a growth in BI content (such as reports and agents) and more users concurrently accessing the system. Importantly, we can and should monitor both query and ETL performance.

I believe that the best way to do this is via an Administrator Dashboard with some associated exception reports and agents.  This should highlight: long running queries, long running ETL jobs, high numbers of concurrent queries (sometimes caused by everyone scheduling their agents for 9am on a Monday) and data growth trends. Creating such a dashboard requires a combination of Usage Tracking (which is well documented) and reporting from our DAC and Informatica (or ODI) repositories. If you want a separate post on this, let me know in the comments.
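
In the meantime, as a flavour of the kind of query such a dashboard could be built on (a sketch against the standard S_NQ_ACCT Usage Tracking table; the connection details and the 60-second threshold are illustrative):

sqlplus -s obiee_ut/password@BIDB <<'SQL'
-- long-running queries over the last week
SELECT user_name, saw_dashboard, total_time_sec, start_ts
FROM   s_nq_acct
WHERE  total_time_sec > 60
AND    start_ts > SYSDATE - 7
ORDER  BY total_time_sec DESC;
SQL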

Some problems with query performance can be solved with a bit of user education (without training, some users like to export everything and then filter in Excel) and others might highlight areas where the data model is functionally lacking. An example of this is where users implement complex selection steps to derive categories that could be derived during the ETL. Therefore, be proactive about engaging with the business if you see large query times.

That’s it for the introduction and project lifecycle. Stay tuned for the next post where we will get technical and talk about: RPD modelling, best practices for Informatica and DAC; data accuracy; making code upgrade-proof and data lineage.

Categories: BI & Warehousing

OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 2 : Load Data into Hive Tables, Analyze using Hive & Impala

Sat, 2014-01-18 04:56

In yesterday’s post on analyzing Hadoop data using Cloudera CDH4, Amazon EC2 and OBIEE 11.1.1.7, I went through the setup process for Cloudera Manager Standard and then used it to set up a four-node Hadoop cluster in Amazon AWS’s EC2 service. In today’s post, we’ll use a tool called “Hue” to upload some flight delays stats from OBIEE’s SampleApp / Exalytics demos, create Hive tables over those files and then analyse them first using Hive, and then using Cloudera Impala. If you’ve not seen yesterday’s post then take a look at that first for the setup instructions, and then we’ll move on to today’s topics.

Using Hue to Create a Database and Upload Data to It

1. Hue is a web-based application that ships with Cloudera’s distribution of Hadoop, and is used to run queries against Hadoop and perform general data activities – think of it as SQL*Developer or ApEx for Hadoop, as compared to Cloudera Manager which is more like Oracle Enterprise Manager. To first navigate to Hue, go back to Cloudera Manager, select Services > hue1, then select Hue Web UI from the tabs over the Hue service details. You’re then taken to a setup screen where you can create a new Hue user for admin purposes – I’ll use the username/password “admin/admin”, and then press the Sign-Up button – which then takes me to Hue’s Quick Start Wizard, like this:

NewImage

Press Next, and then click on the All button under the Install all the application examples label. This installs demo data for Hive, Impala, Pig (a procedural tool used for PL/SQL-type data transformation) and Oozie (a workflow tool). We won’t use this example data in this exercise, but it’s handy to have around for playing around with later.

NewImage

On the Step 3: Users page, click on the User Admin button and create a new user called “airlines”, password “airlines” – we’ll use this in a moment when uploading data to the cluster. Then return to the quick start wizard, press Next and then click on the Hue Home button to go into Hue proper.

2. You’re now at the Hue homepage where you can upload and work with the HDFS filesystem, create Hive tables, use Sqoop 2 to connect to and load data from a relational database, and perform other tasks.

NewImage

Select admin > Sign Out, and then log in again this time as the “airlines” user. We’re going to now use this user to create a new Hive database called “bi_airlines”, and then create tables out of four pipe-delimited files exported earlier from an Oracle database, and that I’ve uploaded to Dropbox in case you want to use them too.

3. To create the new Hive database, click on the Tables link, and then the Databases link at the top of the page, like this:

NewImage

Then, when the next screen is displayed, click on the Create a new database link and call it “bi_airlines”, accepting the default location (in HDFS) for the files it uses. Press the Create Database button that’s then displayed, and check the log and the output to make sure it’s created OK. At that point then, you should see two databases displayed – the “bi_airlines” one you just created, and the default one.

4. Click on the bi_airlines database to select it; another page will then be displayed that will list the tables within that database (which at this point is of course empty), and links to create a new table from a file, or to create one manually.

NewImage

Now in the real world, you’d create your Hive tables manually as you’re most likely going to map them onto a directory of files (or set of directories, if you want to use Hive table partitioning), and you’re also likely to have done some processing using MapReduce, Pig, R or another tool before having the data in something resembling table file extracts. In this example though, we’re just going to use four pipe-delimited files and use Hue’s ability to upload a file and create a table from it automatically.

The four files we’ll be using are:

  • flight_performance_2008-10.txt : three years of flight delay stats including origin and destination airport, number of flights, distance, arrival and departure delay in minutes (524MB)
  • geog_dest.txt : destination airports, their state and city, as referenced in the flight stats (150KB)
  • geog_origin.txt : the same set of data as the destinations file (150KB)
  • carriers.txt : carrier (airline) codes and descriptions

Given the size of the cluster and the potential data available, you could easily use a larger dataset if you’ve got access to an Exalytics demo environment. As it is, there are about 20m flight legs in the main file, which is enough to give things a bit of a spin.

5. Still logged in as the “airlines” user, click on the Create a new table from a file link. Type in “flight_performance” as the table name, and then use the ellipses (“..”) button next to the file path area. When the Choose a File dialog is shown, press the Upload a File button and then navigate to the flight_performance_2008-10.txt file on your local machine. Double-click on the file to select it, and the Hue web UI will then upload the file to the Hadoop cluster, storing it on the HDFS distributed file-system (note that on a Mac I had to switch to Firefox to get this uploader to display).

NewImage

After a while the file will finish uploading; when it has, click on the link for it in the dialog to select it, and then move onto the next page in the wizard.

The Choose Delimiter page is then displayed. This file is pipe (“|”) delimited, so select Other as the Delimiter and key in “|” (no quotes); press Preview to then display the file data, which will look like this:

NewImage

Press Next, and then define the columns in the table like this:

  • Year (int)
  • uniquecarrier (string)
  • origin (string)
  • dest (string)
  • arrivaldelaymins (int)
  • depdelaymins (int)
  • flights (int)
  • distance (int)

Then, press the Create Table button, and the table will then be created and the file used to populate it. Once complete, review the Columns tab in the table display and then the Sample tab, which should output something like this:

NewImage

Then, repeat this process for the other three files, creating the following tables:

origin (based on geog_origin.txt)

  • origin (string)
  • origin_desc (string)
  • origin_city (string)
  • origin_state (string)
  • origin_airportid (string)

destination (based on geog_dest.txt)

  • dest (string)
  • dest_desc (string)
  • dest_city (string)
  • dest_state (string)
  • dest_airportid (string)

carriers (based on carriers.txt)

  • carrier (string)
  • carrier_desc (string)

We’ve now uploaded all the data to the Hadoop cluster; let’s take a look at it now before we move over to OBIEE.

6. When you upload files via Hue into Hadoop’s HDFS filesystem, they are normally placed in the home directory for that user in HDFS (for example /user/airlines/). If you choose to create a table from that file though, Hue and Hive move the file into Hive’s part of the HDFS filesystem, creating a sub-directory first for that new database. You can see where your files have gone by clicking on the File Browser button at the top of the Hue page, then navigating to /user/hive/warehouse/bi_airlines.db – you should see your files there (or more correctly, directories that contain your files). You can also map Hive tables to files outside of the /user/hive/warehouse directory (they’re called “external tables”), but this behaviour is the default, and we’ll leave the files where they are for now.

NewImage
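
If you prefer the command line to the Hue File Browser, the same checks can be made from any node with the Hadoop client configured (a sketch – the warehouse paths follow the bi_airlines example above, and the external table and its LOCATION are purely illustrative):

# list the directories Hive created for the bi_airlines database and the flight_performance table
hadoop fs -ls /user/hive/warehouse/bi_airlines.db
hadoop fs -ls /user/hive/warehouse/bi_airlines.db/flight_performance

# an external table, by contrast, leaves the files where they are (column list abbreviated)
hive -e "CREATE EXTERNAL TABLE flight_performance_ext (year INT, uniquecarrier STRING, origin STRING, dest STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/user/airlines/flight_performance';"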

So where are these files kept in this Hadoop cluster in the EC2 cloud? To find out, click on the flight_performance entry, and notice that the file (with the .txt extension) is actually contained within it – we’d actually clicked on a directory for that file. In fact, Hive tables can just as easily map onto a directory of files, so you could add in other years’ data here, or in fact thousands of files – this is usually how incoming data is received in big data-type applications. With the single flight_performance.txt file displayed, click on the file to view its contents, and then click on the View File link and then notice the First Block | Previous Block | Next Block | Last Block and bytes areas – HDFS in fact breaks the file into blocks, and stores the file in several (redundant) places on nodes in the cluster, to give us fault-tolerance and make it easy for multiple nodes to process the dataset in block chunks.

7. So let’s start by running a couple of queries in Hive, using Hue again. Click on the Beeswax (Hive UI) icon at the top of the page, ensure bi_airlines is selected as the Database, and then type in the following HiveQL query:

select count(*) from flight_performance;

Execute the query and then watch the log output. You’ll see Hive creating and then submitting for execution the MapReduce jobs to select your columns (the “map” part) and then aggregate the results (the “reduce” bit).

NewImage

The count should return about 19m rows, and the query should take around a minute and a half to run. Now let’s try something more interesting:

select sum(f.flights)
from flight_performance f join origin o on (f.origin = o.origin)
where o.origin = 'SFO'

This time the query takes a bit longer, and when it completes you can see links for the two MapReduce jobs that it used to sum the flight data, as shown in the screenshot below:

NewImage

Click on one of the MR Jobs links and you can see a bit more detail about the MapReduce job that provided that part of the dataset – in the example below, there were three mappers that ran initially, then another two to setup and cleanup the job, and then one reducer to aggregate the data. Clicking on the other link is a similar story – a single mapper for the main data selection, then a reducer and control mappers to control and aggregate the dataset.

NewImage

When you think about it, it’s pretty amazing what Hive does, compared to writing the MapReduce code yourself and then running it. And it’s probably fine for ETL-type access, where there’s most likely a lot more data to load than just this small fact table, but it wouldn’t really be good for BI-type queries as we’re talking 1, 2 or 3 minutes to return data. And that’s what Impala is for – access to the same data, using the same Hive catalog, but much-faster queries that don’t use MapReduce to retrieve the data.

8. So let’s run the same query using Impala. Click on the Cloudera Impala (TM) Query UI icon at the top of the Hue page, and select from the drop-down menu under Database – note how the bi_airlines database isn’t showing there. To have it show, go into the query editor area and type in:

invalidate metadata;

This will have Impala re-load the Hive catalog metadata, and the bi_airlines database should then be listed. Select it, and then try the same query as before:

NewImage

select sum(f.flights)
from flight_performance f join origin o on (f.origin = o.origin)
where o.origin = 'SFO';

This time when you run it, it returns in a couple of seconds. So now we’ve got some data and some options for querying it, let’s move over to OBIEE and try and connect it to the cluster.

Connecting OBIEE 11.1.1.7 to Cloudera Hadoop on Amazon EC2 using Hive and Impala

For simplicity’s sake, we’ll use OBIEE 11.1.1.7 running on Windows (Windows Server 2008 R2 64-bit, in my case), and we’ll use Cloudera’s own ODBC drivers to make the connection. Oracle’s recommendation is that you use Linux for Apache Hadoop / Hive connectivity though, and they provide their own drivers as part of the 11.1.1.7 install and on OTN; however I don’t think those connect to the HiveServer2 service that recent CDH4 installs use, whereas I know the Cloudera drivers do. So starting with a standard install of OBIEE 11.1.1.7 on Windows 64-bit, follow these steps to initially connect via Hive.

1. Start by downloading the Impala and Hive ODBC drivers from the Cloudera website, which at the time of writing can be found here:

Install the Hive one first, and we’ll try those before going over to the Impala ones. Run the MSI installer for Hive, and then open the 64-bit ODBC Data Source Administrator utility in Windows, so we can create the ODBC connection through to Hive.

2. Next we need to find the external EC2 DNS name for the virtual server we added the HiveServer2 service to in the previous article. Open up the Cloudera Manager website on the instance you created right back at the start of yesterday’s article to host Cloudera Manager, and navigate to the Hosts > All Hosts page, like this:

In my example, the virtual server that’s running HiveServer2 is displayed with its internal EC2 DNS name, like this:

NewImage

This name only works when you’re internal to the EC2 network though, so you’ll need to go over to the AWS Management Console, find the entry for that instance using the private DNS name, and then use that to retrieve the public one, like this:

NewImage

3. Now you’re good to go. Back in the Windows desktop, click on System DSN, and then Add. In the list of ODBC data source drivers, you should see Cloudera ODBC Driver for Apache Hive; select it and press Finish.

Then, when the Cloudera ODBC Driver for Apache Hive DSN Setup dialog is shown, enter the following details, substituting the host name that you just retrieved in the previous step, that’s running HiveServer2:

Data Source Name : hive_demo
Host : <your host name with HiveServer2 running>
Port : 10000
Database : bi_airlines
Hive Server Type : Hive Server 2
Authentication Mechanism : User Name
User Name : airlines

so that the dialog looks like this:

NewImage

Press Test, and check that the test results are successful (note I’m using a slightly older version of the drivers, so the dialog might look a bit different in the latest version). Then press OK, and OK again to close the dialog and save the system DSN.

Now, create a new RPD or log into an online one that’s also on this OBIEE host server, so that any online access can also use the ODBC drivers you just installed. When the RPD is open for editing, select File > Import Metadata .., and then when prompted, select the DSN you created a moment ago – in my case, “hive_demo”, enter the airlines/airlines username and password, and then press Next to proceed to the table import page.

Make sure “Tables” is still checked, press Next, and then select and bring across the bi_airlines database you created earlier, as shown in the Data source view:

NewImage

Press Finish to complete the metadata import.

4. Now click on the Hive physical database in the Physical panel in the BI Administration tool, to display the Database properties dialog. Change the Database type: from ODBC Basic to Apache Hadoop, and press No when asked if you’d like to edit the connection pool properties for this database, then press OK to close this dialog, then check in the changes to the RPD. To now check connectivity to the Hadoop cluster via Hive ODBC, right-click on one of the tables and select View Data…

NewImage

Once all is working OK, create keys on the origin.origin, destination.dest and carriers.carrier tables and then connect the fact table to them, so you’ve got a physical model that looks like this:

NewImage

Then, finally, pull the rest of the RPD together and create a simple report in answers; the key thing is that you get some data through, as I’ve managed to do in the screenshot below.

NewImage

But of course – it’s not very fast – queries typically take 2, 3 minutes to run, and these are just simple ones. As I said earlier – fine for ETL, particularly when the dataset is likely to be a lot bigger, but not great for ad-hoc BI queries. So let’s set up an Impala connection instead, and see how that goes.

5. Before this will work though, we’ll need to add the Impala port – 21050 – to the security group that Cloudera Manager created when it provisioned the Hadoop nodes yesterday. To do this, go back into the AWS Management Console, click on the Security Groups menu item and navigate to the security group set up by Cloudera Manager, in my case called “jclouds#impala-demo-cdh”. Click on it to select it, and use the Inbound tab to add an additional security rule like this:

Port Range : 21050
Source : 0.0.0.0/0

Then, press Add Rule and then Apply Rule Changes to add this additional port to the security group. Finally, check the list of ports now open for that security group to see that 21050 is now listed.

You can now go over to the Windows environment, install the Impala ODBC drivers and use the ODBC Administrator utility in Windows to set up the Impala ODBC connection. In my case, I use the following values:

Data Source Name : impala_demo
Host : ec2-46-137-25-2.eu-west-1.compute.amazonaws.com
Port : 21050
Database : bi_airlines
Mechanism : No Authentication

NewImage

Press Test to check that it’s all working OK, and then import the bi_airlines tables into the RPD as you did with the Hive import.

NewImage

Double-click on the new physical database and set the database type to “Apache Hadoop” again. In addition though, click on the Features tab in the Database Properties dialog and uncheck the ORDERBY_SUPPORTED checkbox – Impala SQL requires a LIMIT clause after each ORDER BY but OBIEE doesn’t currently provide this (Impala’s an unsupported source at this point in time, so it’s not unexpected), so by unchecking this property we get the BI Server to do the results ordering, and queries will then run OK.

NewImage

Right-click and select View Data… on one of the imported Impala tables to check that it returns data OK, and then build-out the rest of the RPD as you did with the Hive data. Now when results come back, they come back in a matter of seconds (and the results look more correct, too).

NewImage

So there you have it – a more-or-less step-by-step guide to setting up a Hadoop cluster in Amazon EC2, then analysing data on it using OBIEE and Hive / Impala. Hopefully it was useful – more on this topic over the next few weeks.

Categories: BI & Warehousing

OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 1 : Install and Set-up an EC2 Hadoop Cluster

Fri, 2014-01-17 11:47

I’ve been over in San Francisco this last week for BIWA Summit 2014, and one of the things I demo’d during the week was OBIEE 11.1.1.7 connecting to a Hadoop cluster running on Amazon EC2, and analysing the flight delays dataset that ships with recent SampleApps and Exalytics. There’s quite a few interesting steps and concepts in setting this up, so I thought it’d be interesting to go through them on the blog, so that others can have a try if they’re interested. Don’t take this as a definitive, 100%-complete set of steps you’ll need to work through to set up the example – I’m currently writing this in the BA lounge at SFO trying to get this written before my flight leaves, and I might have inadvertently missed a couple of steps – but this should give you the gist of what’s involved and show what’s possible.

What the example will do is create the following setup:

NewImage

In this setup, we’ll initially create an Amazon EC2 instance that we’ll then install the free version of Cloudera Manager 4.5 onto; Cloudera are a company that have created a distribution of Hadoop which they then sell alongside their own management tools (similar to how Red Hat took Linux, made it “enterprise” and sold software and services around it), but who also provide a freely-downloadable version of their tools (“Cloudera Standard”) that have special setup routines when run on Amazon EC2.

We’ll then use this install of Cloudera Manager to automatically create and provision four Amazon EC2 instances which we’ll then install Hadoop onto, along with other tools like Impala (for in-memory SQL access over the cluster), Hive, HDFS and so on. Then, in the second part of this two-part series, we’ll upload some data from the Flight Delays dataset into the cluster, connect OBIEE to it via the Cloudera Impala ODBC drivers, and analyse it from Answers. I’m assuming with this that you’ve got some familiarity with Amazon AWS, EC2 and the rest of their cloud platform, and that you’ve got yourself set up with an account, your secret access keys and so on – if not, do that first before you try any of these steps.

Let’s start by setting up the initial EC2 virtual server instance onto which we’ll install Cloudera Manager.

Installing the EC2 Hadoop Cluster

1. What we’re going to do before anything else is create what’s called a “security group”, a collection of firewall settings that we’ll apply to the Cloudera Manager virtual server so that it can then connect out to the nodes it’s going to set up to run Hadoop (and so that we can connect to it to run the web interface). To do this, log into the AWS Management Console, and from the Amazon Web Services menu navigate to EC2 > Network & Security > Security Groups.

Then when the Security Groups page is displayed, press the Create Security Group button, then enter the following details when prompted:

Name : CDH-Manager
Description : Security group for CDH4 Manager instance
VPC : No VPC

Then, with this new security group selected, use the Add Rule button to add the following inbound rules:

SSH  : 0.0.0.0/0
7180 : 0.0.0.0/0
7182 : 0.0.0.0/0
7183 : 0.0.0.0/0
7432 : 0.0.0.0/0
Custom ICMP rule : Echo Reply  0.0.0.0/0

Once you’ve done this, the security group area should look like this:

NewImage

Then, press the Apply Rule Changes button to register the security settings.
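
If you prefer working from the command line, roughly the same security group can be created with the AWS CLI – a sketch only, assuming you have the CLI installed and configured with your access keys, and showing just a couple of the rules for brevity:

# Create the security group for the Cloudera Manager instance
aws ec2 create-security-group --group-name CDH-Manager \
  --description "Security group for CDH4 Manager instance"

# Open SSH and the Cloudera Manager web UI port to the world
aws ec2 authorize-security-group-ingress --group-name CDH-Manager \
  --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name CDH-Manager \
  --protocol tcp --port 7180 --cidr 0.0.0.0/0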

2. Next we’ll create an Amazon EC2 virtual server instance to run Cloudera Manager on, using the security group we just created to ensure the right ports are open – we’ll then use that instance, and its install of Cloudera Manager, to set up the Hadoop cluster.

To do this, with the EC2 Dashboard web page still open, click on the Instances menu item on the left-hand side of the page, then press Launch Instance, noting at the same time the EC2 region you’ll be working in (for me, it’s the EU Ireland region).

For this initial virtual server instance, use the Ubuntu Server 12.04 LTS 64-bit image – Cloudera Manager 4.5 Free can install onto either Ubuntu or CentOS, and will adjust what it installs accordingly, so for now let’s select Ubuntu.

NewImage

Then, when prompted, select the m1.medium instance type, and on the Step 7: Review Instance Launch page, select the security group you created a moment ago for the instance’s security group settings. Once done, press the Launch button, create or select an SSH key pair, and then download that key pair to your local laptop or PC so you can connect to the virtual server once it’s spun up.
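
If you’d rather script the launch, roughly the same thing can be done with the AWS CLI – again just a sketch, where ami-xxxxxxxx is a placeholder for the current Ubuntu 12.04 LTS 64-bit AMI in your region and EC2-cluster is the key pair created above:

# Launch a single m1.medium instance for Cloudera Manager, using the security group from step 1
# (ami-xxxxxxxx is a placeholder - look up the Ubuntu 12.04 64-bit AMI ID for your region)
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m1.medium \
  --key-name EC2-cluster --security-groups CDH-Manager --count 1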

3. Now you need to SSH into this new EC2 virtual server and download the Cloudera Manager software to it, to then create the Hadoop cluster. To do this, first make a note of the instance name that the EC2 launch instance process gave you, like this:

NewImage

Click on that link to show the status of the virtual server and, more importantly, its public DNS address. Once the virtual server shows a status of “running”, you can then SSH into it and download and run the Cloudera Manager software; note that “EC2-cluster.pem” is the name of the key pair I created in the previous steps, and this file will need to be “chmod 400”-protected before EC2 and SSH will let you use it – see this blog article on setting up EC2 command-line access on the Mac for example details.
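
On Mac or Linux, the key-protection step itself is just a one-liner, run from the directory holding the downloaded key file:

# Restrict the key file permissions so that SSH will agree to use it
chmod 400 EC2-cluster.pem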

To SSH into the virtual server and install and run Cloudera Manager, type in the following (using your own SSH key file name and virtual server DNS address):

ssh -i EC2-cluster.pem ubuntu@ec2-54-216-126-144.eu-west-1.compute.amazonaws.com

Then, once you’re connected, download and install Cloudera Manager like this:

wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin 

You’ll then be walked through a wizard that will get you to agree to a couple of licenses, and then download and install the Cloudera Manager software for your instance type. Note that this is something CM does when it detects it’s running on Amazon EC2 – for other types of install it’s a slightly different process.

4. Once the Cloudera Manager software install has completed, give it a couple of minutes and then use your web browser to navigate to the Cloudera Manager website, at machine-name:7180, in my case:

http://ec2-54-216-126-144.eu-west-1.compute.amazonaws.com:7180
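
If the page doesn’t respond straight away, it’s worth checking from your SSH session that the Cloudera Manager server has finished starting up – a rough check, assuming the standard service name used by the installer:

# Check the Cloudera Manager server process is running...
sudo service cloudera-scm-server status

# ...and that something is listening on the web UI port
curl -sI http://localhost:7180 | head -1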

Log in as “admin/admin” and, when prompted, select the free Cloudera Standard option. Press Continue so that you’re then presented with the Provide Instance Specification page. Using this page, you can select the EC2 instance size and type, the number of nodes in your cluster, and a group name for your instances. In this example, we’ll create a four-node cluster using the Ubuntu 12.04 LTS 64-bit image: select m1.large as the instance type, and call the group “impala-demo-cdh”.

NewImage

Then, on the Provide Credentials page, paste in your AWS access key ID and Secret Access Key, let Cloudera Manager generate a new key pair for use with the cluster (or upload your own one from before), and then press the Start Installation button on the next page to have Cloudera Manager start provisioning the cluster instances. Once the instances are created, download the additional key file and place it with the other one, “chmod 400”-ing it as before so it’ll work with SSH into EC2.

5. Once the instance provisioning completes, Cloudera Manager will then install the relevant software onto the different nodes. The Installation in Progress page will show you the progress of these installs, with the screenshot below showing it mid-way through the process.

NewImage

Assuming all the cluster nodes install properly, walk through the rest of the steps to confirm what’s installed where, check all of the services are running OK and complete the process.

NewImage

Configuring and Setting up Hadoop

So assuming all of the install and service startup steps went OK, what you have now is a four-node Hadoop cluster running on Amazon EC2, with additional management tools and services provided by Cloudera – think of it like a Linux distribution by Red Hat or Suse, where the core is standard open-source software and the vendor provides other complementary tools, and tools they write themselves, to enhance the product. The screenshot below is the overall summary page for your cluster, as provided by Cloudera Manager – don’t worry too much about the warnings, they’re down to log file disk space and can be ignored for this particular exercise.

NewImage

If you select Services > All Services from the Cloudera Manager menu, you’ll see what’s been installed on your cluster:

NewImage

Some of the key services are:

  • HDFS – the cluster filesystem that Hadoop processes data on, and which we’ll use later on to upload text files containing the flight delays data we’re going to analyse. HDFS is unix-like in how you work with it (there’s a short command-line sketch of this after the list), but it stores data redundantly across all nodes in the cluster, enabling parallel operations and providing fault-tolerance.
  • HBase – a NoSQL database that we won’t use here, but that stores data in key/value pairs using the HDFS filesystem
  • Hive – a SQL-like access layer over Hadoop, typically used for ETL access, and currently by OBIEE
  • Impala – a faster alternative to Hive that processes queries in-memory and bypasses MapReduce job creation, the thing that slows Hive down
  • Hue – a web UI that we’ll be using later on to run Hive and Impala queries, and create tables in Hive’s HCatalog
  • MapReduce – the framework and server within Hadoop that typically crunches, filters and transforms the data
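
To get a feel for a couple of these services from the command line, you can SSH onto one of the cluster nodes and try a few basic commands – a sketch only, with the directory, file and database names being examples rather than anything created so far:

# HDFS behaves much like a unix filesystem, but the data is distributed across the cluster
hadoop fs -mkdir -p /user/admin/flight_delays
hadoop fs -put flight_delays.csv /user/admin/flight_delays/
hadoop fs -ls /user/admin/flight_delays

# Hive then lets you run SQL-like queries over files stored in HDFS
hive -e "SHOW TABLES;"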

Before we go into Hue to create some Hive tables, there’s one task we need to do if we’re to access this cluster via Hive – we need to install something called “HiveServer2”, a server process that the Hive ODBC drivers used by OBIEE need in order to connect to the cluster, but which isn’t installed by default.

To install HiveServer2, from the Cloudera Manager website select Services > hive1, and then click on the Instances tab. Then scroll across so that you can see the HiveServer2 column, locate the cluster node with the majority of services and the Hive Metastore Server installed on it, and check the checkbox to select that service for install.

NewImage

Press Continue, and then back on the Role Instances page, select the new hiveserver2 service, and select Actions for Selected > Start to start the service.
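
Once it’s started, you can sanity-check from an SSH session on that node that HiveServer2 is actually up and listening – a rough check, assuming the default HiveServer2 port of 10000 and that the Beeline client is available on the node:

# HiveServer2 listens on port 10000 by default
sudo netstat -ntlp | grep 10000

# Optionally, try a JDBC connection to it with Beeline
beeline -u jdbc:hive2://localhost:10000 -n admin -e "SHOW DATABASES;"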

Now we’re at the point where we can use Hue to set up a Hive database, upload some files and create some tables for analysis. Check back tomorrow for the second part in this series, where we’ll do just that.

Categories: BI & Warehousing

OBIEE Regression Testing – An Introduction

Fri, 2014-01-17 10:00

In this article I’m going to look at ways to test changes that you make to OBIEE to ensure that they don’t break existing functionality. In all but the simplest IT systems it’s common for one (planned) action to inadvertently cause another (unplanned).

What IS Regression Testing?

When we make a change to a system we use functional unit tests to ensure that it does do what it is supposed to do. We should also make sure that the same changes don’t do what they’re not supposed to, that is, cause functionality already existing in the system to change behaviour. If this does happen it is known as a regression and is something we want to ensure doesn’t happen without us knowing. Some examples of regressions seen in standard OBIEE development changes include:

  • Reports stop returning data, showing an error instead
  • Reports start to show the wrong data
  • Some combinations of dimensions and facts no longer show data, or show an error
  • Dashboards that reference a particular analysis stop working

As well as these, less common system changes can also cause regressions, for example:

  • An OBIEE version upgrade causes certain types of graph to render in a different way from the previous version
  • An OBIEE patch introduces a bug in the front-end user interface

What drives Regression Testing?

The requirement for regression testing OBIEE broadly comes from two different types of change:

  • New binaries – that is, an upgrade (or patch) of OBIEE
  • New “application code” – changes to the RPD, the underlying database schema, and so on.

These two requirements have the same aim – make sure nothing breaks when we make the change – but differ in ways that make how we address them important:

  1. Frequency : OBIEE may get patched once or twice a year, and upgraded every few years. Compare this to development changes made to the RPD et al, which users would often like to see happening on a frequent basis (sometimes daily at the beginning of an implementation). If these changes are happening with great regularity then (a) we don’t want to be the ones causing the bottleneck because we can’t regression test them and thus (b) we need to find a repeatable way to perform these tests accurately and quickly.
  2. Delta Visibility : When Oracle change the OBIEE code base, we are blind as to what has gone on under the covers. Sure, we know what the documentation says has changed, but as a starting point for “what might have broken” we can only assume everything has, and test accordingly. Conversely, in a planned development we know exactly what we changed and we can therefore work out the scope of the necessary testing.

These points are what I aim to address in this article. Regression testing OBIEE doesn’t have to mean one technique alone – it can be refined based on what we know has changed.

Why Regression Test?

If you don’t regression test then you place a wager that you’ll be able to fix any problems that arise. As soon as they arise. In Production. With angry users on the phone. And the project manager screaming blue murder because their change is getting blamed for breaking everything.

This is a recipe for compounding errors upon errors, not a stable system. Testing, in all flavours, is about gaining confidence about the impact of a proposed change to a system. Functional testing reassures us that the change will do what it was designed to do. Performance testing helps us understand how a system behaves from a response time and capacity perspective. Regression testing gives us the confidence that a change, whilst doing what it ought to, isn’t going to affect something else.

The confidence in what is (and isn’t) going to happen when we deploy a change enables us to make these changes more frequently as required by the users. Instead of a long development cycle with a huge number of changes bunched in together, and one big bang test and release, we can take a more rapid, flexible, and responsive approach to development and release because we have the confidence that an individual change is going to work.

In addition to giving confidence in further releases to newer deployments, a good regression testing framework gives us confidence in making changes to long-standing, big-ball-of-mud systems. So long as we understand the relevant interface points in OBIEE, we can build a pass/fail test framework on top of even the most complex RPD/schema.

Targeting Regression Testing Effectively

Regression testing is easy. You pay a troop of monkeys to sit at a set of computers and run every single dashboard, build every permutation of ad-hoc report, and – if you’ve just upgraded or patched OBIEE – go through the user interface with a fine-toothed comb. After the appropriate period of several weeks, any difference they find from before your change was made is a regression. Congratulations. All you need to do now is fix the problem – and then, of course, regression test your new change. So monkeys are one option, but they’re expensive (you should see the wholesale market peanut price these days), they’re not infallible (monkeys get distracted by YouTube too), and they are slow.

Better than monkeys is automated regression testing, targeted smartly at the area of OBIEE that has been changed. We will now take a look at which changes can cause regressions in which area, and from that derive a list of testing methods appropriate for each type of change made.

Regression testing points in the OBIEE stack

To understand how we can regression test OBIEE, let us look at where a regression can be detected. The following diagram illustrates the request/response flow through the components in the OBIEE stack. We can use it to see where regressions may expose themselves, and thus understand at what points we can consider testing for them.

Starting from the point of view of the actual end user:

  • The user interface may regress. These may be actual bugs that weren’t there before, or ‘regressions’ in the sense that functionality or icons/layout have changed. These changes would typically only come about through software changes (patching/upgrades).
    Regressions could also occur if you are manipulating the UI through the analysis itself (e.g. a narrative view) and the behaviour changes, but this type of UI modification is less common.
  • Regressions caused by changes to the underlying data, RPD or analyses are going to manifest themselves through a dashboard. This could be in the data or the presentation of the data (tables, graphs, etc).
  • Considering a dashboard by its constituent parts, an individual analysis could exhibit differences in its data or the presentation of the data

 

Next to consider is that each analysis sends a “Logical” SQL request to the BI Server. It is not common, but it is possible that a change to the binaries (version upgrade/patch) could introduce a regression that causes the Logical SQL to be generated incorrectly. Specific changes to the RPD can also cause the Logical SQL generation to change, potentially erroneously.

The Logical SQL that is generated is executed by the BI Server which in turn returns the requested logical resultset data. This resultset may expose a regression in how the BI Server is handling the logical request.

A “Logical” SQL request on the BI Server is parsed through the metadata layer, the RPD, and one or more “Physical” SQL statements are sent to the underlying data source(s). An error in the RPD could result in the Physical SQL being generated incorrectly.

Finally, each “Physical” SQL request at the data source returns data back to the BI Server. Any errors in changes to the physical sources will show themselves through the physical query failing, or the results being incorrect.

Regression testing opportunities

To summarise the previous section, our testing points for regression are as follows.

  1. The logical query generated by Presentation Services for an analysis
  2. The physical query/queries generated by the BI Server to retrieve the data from the data source(s)
  3. The data supplied by the data source to the BI server
  4. The data supplied by the BI server for an analysis (logical resultset)
  5. User interface, including the dashboard/analysis, taking into account both rendered data and presentation/UI.

Regression testing is based around comparing one state (before a planned change) to another (after the planned change). In considering how we are going to perform our testing, let’s take a very simplistic view on what we need to test:

  1. Does it look the same?
  2. Are the numbers the same?

Of these two, one is very easy to get a computer to do (and conversely, very laborious to perform manually), and the other is very difficult to explain to a computer (and relatively easy to do manually):

  • Telling a computer to fetch some data twice and compare the first result with the second is bread-and-butter automation (there’s a short sketch of exactly this after the list).
  • Trying to explain to a computer what a page “looks” like, or what a user interface “does” is extremely time consuming, and inevitably specific to the single item in question. Of course, we can programmatically compare the underlying code for a dashboard before and after a change, but the question I pose is whether we should.
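
As an illustration of the first point, here’s a rough sketch of the “fetch the data twice and compare” approach, using the nqcmd utility that ships with OBIEE to run a saved Logical SQL statement against the BI Server before and after a change – the DSN, credentials and file names are all assumptions:

# Run the same Logical SQL against the BI Server before the change...
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s report01.lsql -o report01_before.txt

# ...deploy the change, then run it again...
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s report01.lsql -o report01_after.txt

# ...and compare the two resultsets; any difference is a candidate regression
diff report01_before.txt report01_after.txt
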
Computers are blind

The user interface for an OBIEE end user is a web browser, and OBIEE builds its web pages through a set of languages and protocols that used to be quaintly referred to as “Web 2.0”. It uses HTML, CSS, XML, and JavaScript, taking plentiful advantage of asynchronous page loading and in-flight modifications to the Document Object Model (DOM) too. AJAX is a term which certainly covers some of the magic that goes on. The resulting user interface is pretty slick, with drop-down menus, expanding hierarchy trees, and partial dashboard rendering as data is returned rather than waiting for all analyses to complete. All of this omits the knockout blow that is Flash, used for rendering all graph objects in OBIEE and the subject of a notable UI bug in OBIEE 11.1.1.6. The “Developer Tools” option in modern web browsers gives us a glimpse into what is going on under the covers. We can see the number of resources that go into rendering a single page…


…and how many layers there are to the object model:


Getting a computer to interface with all of this, simulating a user interaction and parsing the response is possible with functional testing tools such as Selenium, Oracle Application Testing Suite, and HP’s QuickTest Professional. Each of these tools is capable of simulating a user (often by ‘recording’ a session as the starting point) and parsing the responses from OBIEE.

But there is a fundamental complication to using these tools. For all the AJAX/CSS/DOM magic to happen, the page that OBIEE generates is littered with element identifiers (so that the JavaScript code can identify the elements to manipulate). For example, the following table cell has, in this particular execution, the ID e_saw_14485_10_1_0_0:

Some of these IDs may change between report executions or sessions, but either way they cannot be relied on to be consistent. Therefore, getting our testing tool (such as Selenium) to compare the before/after results to detect a regression becomes a whole heap more tricky. It is possible to work out element paths based on their relative position within the page rather than an absolute ID, but that becomes even more page-specific and complex to implement. So, to compare a before and after page programmatically, we have to either

  • define a particular part of the page alone to check remains the same (and risk chucking the baby out with the bath water, that is, missing other genuine regressions elsewhere on the page)

or we have to

  • compile a list of elements that we expect may change but that we don’t count as a regression (i.e. exceptions).

The latter is going to be prone to causing false positives (i.e. failing regression tests that aren’t genuine regressions) because it relies on reverse-engineering the full Document Object Model of the OBIEE page. All of this is also without even taking into account software patching and upgrades – so far as Oracle are concerned, how a page is rendered is their own business, and they are at full liberty to completely change the internal structure of a page as they desire. Given the above complication, it becomes clear that building a test against a single page is time-consuming, and it will typically be specific to that page only. This becomes more of a problem the greater the scale of the deployment you are trying to test. Hardcoding the testing for one specific page might be fine, but given more than a handful of pages you risk ending up with a large, inflexible regression test code base (that itself may become error-prone and need regression testing when it’s changed…).

Conclusion

So we come back not to how we test the front end, but whether we should in every case. Given a finite amount of time, what are you going to get the most benefit from in your regression tests? In the next post I will demonstrate one of the ways you can get the most “bang for your buck” when regression testing OBIEE, by concentrating your automation efforts on the query part of the OBIEE stack rather than the front end. Stay tuned!

 

Many thanks to Gianni Ceresa for his thoughts and assistance on this subject.

Categories: BI & Warehousing