Rittman Mead Consulting

Delivering Oracle Business Intelligence
Loading, Updating and Deleting From HBase Tables using HiveQL and Python

Thu, 2015-05-21 14:32

Earlier in the week I blogged about a customer looking to offload part of the data warehouse platform to Hadoop, extracting data from a source system and then incrementally loading data into HBase and Hive before analysing it using OBIEE11g. One of the potential complications for this project was that the fact and dimension tables weren’t append-only; Hive and HDFS are generally considered write-once, read-many systems where data is inserted or appended into a file or table but generally then can’t be updated or overwritten without deleting the whole file and writing it again with the updated dataset.

To get around this problem we loaded our incoming data into HBase, a NoSQL key/value-store database that also runs on Hadoop and HDFS but permits update and delete operations on rows as well as selects and inserts. Later on we took the main fact table stored in Hive-on-HBase and copied its contents into Impala, to considerably improve the response time of queries against this table and the still-Hive-on-HBase dimension tables. But going back to the insert-update-delete operations on the HBase tables, how exactly does this work and what’s the most efficient way to do it?

Taking a step back for a moment, HBase is a NoSQL, key/value-type database where each row has a key (for example, “SFO” for San Francisco airport) and then a number of columns, grouped into column families. In the Flight Delays dataset that we used in the previous blog post, an HBase table of origin airports might have a few thousand entries with each entry, or row, keyed on a particular airport code like this:

NewImage

(Note that at the start, these key values won’t be there – they’re more for illustrative purposes)

At the time of HBase table definition, you specify one or more “column families”. These are group headers for columns you might add later, and in the case of my origin airport table I might just use the column family name “origin”, so that the HBase table DDL looks like this:

create 'geog_origin','origin'

and the conceptual view of the table would look like this:

NewImage

Now what’s neat about NoSQL-style databases like this (and Endeca Server is the same) is that you can define individual columns just by using them. For example, I could create columns for the airport name, airport city, airport state and airport code just by using their name in a data load, prefixing those column names with the name of a previously-defined column family. Using the HBase Shell, for example, I could issue the following PUT commands to insert the first row of data into this HBase table, like this:

put 'geog_origin','SFO','origin:airport_name','San Francisco, CA: San Francisco'
put 'geog_origin','SFO','origin:city','San Francisco, CA'
put 'geog_origin','SFO','origin:state','California'
put 'geog_origin','SFO','origin:id','14771'

Now my HBase table conceptually looks like this:

NewImage

If I then want to use another column under the “origin” column family for LAX, I can just do so by using it in the next set of PUT commands, like this:

put 'geog_origin','LAX','origin:airport_name','Los Angeles, CA: Los Angeles'
put 'geog_origin','LAX','origin:city','Los Angeles, CA'
put 'geog_origin','LAX','origin:state','California'
put 'geog_origin','LAX','origin:region','West Coast'
put 'geog_origin','LAX','origin:id','12892'

NewImage

Each column within a column family has its values individually set, retrieved and deleted using PUT, GET and DELETE commands, as long as you prefix the column name with one of the previously-defined column-family names and provide the key value for the row you’re interested in. HBase tables are very flexible, and were designed for simple product catalog-type applications running on hundreds of sharded server nodes for companies such as Amazon, Google and Facebook (see this HBase “Powered-by” page for more examples of organizations using HBase).
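To make the row key / column family / column idea concrete, here is a minimal in-memory sketch in Python. It is not the HBase client: ToyHBaseTable is a made-up class that only mimics the behaviour described above, where column families are fixed at table creation, columns come into existence when first written, and each cell is set, read or deleted individually.

```python
class ToyHBaseTable(object):
    """In-memory stand-in for an HBase table: row key -> {'family:qualifier': value}."""

    def __init__(self, name, *families):
        self.name = name
        self.families = set(families)   # fixed when the table is created
        self.rows = {}                  # columns appear only when first written

    def put(self, row_key, column, value):
        family = column.split(':', 1)[0]
        if family not in self.families:
            # Writes must use a column family declared at creation time
            raise KeyError('unknown column family: %s' % family)
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))

    def delete(self, row_key, column=None):
        if column is None:
            self.rows.pop(row_key, None)                   # drop the whole row
        else:
            self.rows.get(row_key, {}).pop(column, None)   # drop a single cell


geog_origin = ToyHBaseTable('geog_origin', 'origin')
geog_origin.put('SFO', 'origin:airport_name', 'San Francisco, CA: San Francisco')
geog_origin.put('SFO', 'origin:state', 'California')
geog_origin.put('LAX', 'origin:state', 'California')
geog_origin.put('LAX', 'origin:region', 'West Coast')   # new column, no DDL needed

print(sorted(geog_origin.get('SFO')))
print(sorted(geog_origin.get('LAX')))
```

The two rows end up with different sets of columns under the same “origin” family, which is the “jagged” shape the PUT commands above produce.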

But what HBase very much isn’t is a relational database like Oracle, Microsoft SQL Server or even Apache Hive, databases that we’re much more likely to store data warehouse-type data in. In the previous post I showed how Hive table structures can in fact be put over HBase tables, mapping HBase columns to Hive columns, and then HiveQL INSERT INTO TABLE … SELECT commands can be used to bulk-load these HBase tables with initial sets of data. So back to the original question – what’s the best way to then incrementally load and refresh these HBase tables, and can I still use HiveQL for this?

In my original post, I defined Hive tables over my HBase ones using the Hive-on-HBase (yum install hive-hbase) package and associated Hive storage handler; for example, the Hive table that provided SQL access over the flight_delays HBase table was defined like this:

ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler.jar;
ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;
ADD JAR /usr/lib/hive/lib/hbase-client.jar;
ADD JAR /usr/lib/hive/lib/hbase-common.jar;
ADD JAR /usr/lib/hive/lib/hbase-hadoop-compat.jar;
ADD JAR /usr/lib/hive/lib/hbase-hadoop2-compat.jar;
ADD JAR /usr/lib/hive/lib/hbase-protocol.jar;
ADD JAR /usr/lib/hive/lib/hbase-server.jar;
ADD JAR /usr/lib/hive/lib/htrace-core.jar;

 
CREATE EXTERNAL TABLE hbase_flight_delays
 (key string,
  year string,
  carrier string,
  orig string,
  dest string,
  flights string,
  late   string,
  cancelled string,
  distance string
) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES 
("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,measures:flights,measures:late,measures:cancelled,measures:distance")
TBLPROPERTIES ("hbase.table.name" = "test1_flight_delays");

The underlying HBase table is defined with a key and two column families, one for dimension columns and one for fact (measure) ones; the key is a sequence number that I added to the source dataset to give each row a unique identifier.

create 'test1_flight_delays','dims','measures'
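The hbase.columns.mapping string pairs up with the Hive column list by position, which is easy to get wrong when a table has many columns. As a quick sanity check, a small Python sketch like the one below (parse_columns_mapping is a hypothetical helper, not part of Hive or HBase) lines the two lists up and pulls out the column families the HBase table needs.

```python
def parse_columns_mapping(hive_columns, mapping):
    """Pair each Hive column with its HBase target, in positional order."""
    targets = [t.strip() for t in mapping.split(',')]
    if len(targets) != len(hive_columns):
        raise ValueError('expected %d mapping entries, got %d'
                         % (len(hive_columns), len(targets)))
    return list(zip(hive_columns, targets))


hive_columns = ['key', 'year', 'carrier', 'orig', 'dest',
                'flights', 'late', 'cancelled', 'distance']
mapping = (':key,dims:year,dims:carrier,dims:orig,dims:dest,'
           'measures:flights,measures:late,measures:cancelled,measures:distance')

pairs = parse_columns_mapping(hive_columns, mapping)

# Column families that must exist in the HBase table (":key" is the row key)
families = sorted(set(t.split(':')[0] for _, t in pairs if t != ':key'))

for hive_col, target in pairs:
    print('%-10s -> %s' % (hive_col, target))
print(families)
```

For the table above this yields the two families, dims and measures, used in the create statement.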

To initially populate the table, I’ve created another Hive table with the initial set of source data in it, and I just insert its values into the Hive-on-HBase table, like this:

insert into table hbase_flight_delays              
select * from flight_delays_initial_load;      
         
Total jobs = 1
...
Total MapReduce CPU Time Spent: 11 seconds 870 msec
OK
Time taken: 40.301 seconds

This initial load of 200,000 rows took 40 seconds; not bad, and certainly acceptable for this particular project. Imagine now that every day after this we typically add another 500 or so flight records; in regular Hive this would be straightforward and we’d use the LOAD DATA or INSERT INTO TABLE … SELECT commands to add new file data to the Hive table’s underlying HDFS directories. And we can do this with the Hive-on-HBase table too, with the INSERT INTO TABLE command adding the incoming data to new rows/cells in the HBase table. Checking the row count and min/max ID values in the Hive-on-HBase table at the start, like this:

select count(*), min(cast(key as bigint)) as min_key, max(cast(key as bigint)) as max_key
from hbase_flight_delays;
 
Total jobs = 1
...
Total MapReduce CPU Time Spent: 14 seconds 660 msec
OK
200000  1  200000
Time taken: 53.076 seconds, Fetched: 1 row(s)

I can see that there are 200,000 rows in the HBase table, starting at key value 1 and ending at key value 200,000. The table containing new data has key values going from 200,001 to 200,500, so let’s insert that new data into the Hive-on-HBase table:

insert into table hbase_flight_delays                                              
select * from flight_delays_daily_update_500_rows;     
                                 
Total jobs = 1
...
Total MapReduce CPU Time Spent: 3 seconds 870 msec
OK
Time taken: 26.368 seconds
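For a rough sense of throughput, the two “Time taken” figures reported so far can be turned into rows per second. This is only arithmetic on the numbers printed above; the reading that most of the 26 seconds is fixed job start-up cost rather than per-row work is an assumption, not something measured here.

```python
def rows_per_second(rows, elapsed_seconds):
    """Average load rate for one INSERT INTO TABLE ... SELECT job."""
    return rows / float(elapsed_seconds)


initial = rows_per_second(200000, 40.301)   # initial load
daily = rows_per_second(500, 26.368)        # 500-row daily append

print('initial load: %.0f rows/s' % initial)
print('daily append: %.0f rows/s' % daily)
```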

Not bad – 26 seconds for the 500 rows; nowhere near as fast per row as the initial load, but acceptable. Let’s also check that the data went in OK:

select count(*), min(cast(key as bigint)) as min_key, max(cast(key as bigint)) as max_key
from hbase_flight_delays;  
                                                      
Total jobs = 1
...
Total MapReduce CPU Time Spent: 13 seconds 580 msec
OK
200500   1   200500
Time taken: 44.552 seconds, Fetched: 1 row(s)

As I’d hoped, the number of rows has increased by 500 and the maximum key value is now 200,500. But how do we apply updates to the data in the table? I’ve got another source table that this time contains 1,000 randomly-selected rows from the initial data load dataset, where I’ve set the LATE column value to ‘999’:

hive> select * from flight_delays_daily_changes_1000_rows                                      
    > limit 5;
OK
21307  2008  WN  BDL  BWI  1  999  1  283
136461  2008  OO  ORD  TYS  0  999  1  475
107768  2008  WN  BWI  ORF  0  999  1  159
102393  2008  OO  SLC  ACV  0  999  1  635
110639  2008  WN  BOI  OAK  0  999  1  511
Time taken: 0.139 seconds, Fetched: 5 row(s)

In fact the way you apply these updates is just to INSERT INTO TABLE … SELECT again, and the incoming values create new versions of existing rows/cells if needed. Some versions of HBase automatically keep a number of versions of each cell value (typically 3 versions); however, the version of HBase that comes with CDH5.2 and higher only keeps one version by default (you can increase this number per table, or system wide, using the steps in the CDH5.2 release notes). Let’s try this out now, first using the HBase shell to see the values and timestamps currently held for one particular key value I know should be updated by the next dataset:

hbase(main):029:0> get 'test1_flight_delays', '102393'
COLUMN                                     CELL                                                                                                                       
 dims:carrier                              timestamp=1432236609421, value=OO                                                                                          
 dims:dest                                 timestamp=1432236609421, value=ACV                                                                                         
 dims:orig                                 timestamp=1432236609421, value=SLC                                                                                         
 dims:year                                 timestamp=1432236609421, value=2008                                                                                        
 measures:cancelled                        timestamp=1432236609421, value=1                                                                                           
 measures:distance                         timestamp=1432236609421, value=635                                                                                         
 measures:flights                          timestamp=1432236609421, value=0                                                                                           
 measures:late                             timestamp=1432236609421, value=0                                                                                           
8 row(s) in 0.0330 seconds

I’ll now use Hive to apply the updates, like this:

insert into table hbase_flight_delays                                              
select * from flight_delays_daily_changes_1000_rows;
 
Total jobs = 1
...
Total MapReduce CPU Time Spent: 4 seconds 340 msec
OK
Time taken: 24.805 seconds
 
select count(*), min(cast(key as bigint)) as min_key, max(cast(key as bigint)) as max_key
from hbase_flight_delays;                                                          
Total jobs = 1
...
Total MapReduce CPU Time Spent: 13 seconds 430 msec
OK
200500 1 200500
Time taken: 47.379 seconds, Fetched: 1 row(s)

Notice how this third INSERT didn’t create any new rows; the row count and max key ID in the follow-up query haven’t increased since the previous insert of new data. Querying one of the rows that I know was changed by this new table of data updates, I can see that the LATE column value has been changed:

select * from hbase_flight_delays where key = '102393';
Total jobs = 1
...
Total MapReduce CPU Time Spent: 3 seconds 600 msec
OK
102393  2008  OO  SLC  ACV  0  999  1  635
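The behaviour seen here is effectively an “upsert” keyed on the row key: an insert with a new key adds a row, and an insert with an existing key replaces that row’s values. A minimal Python sketch of the idea, using a plain dict in place of the HBase table and a cut-down, made-up row layout of (key, year, late):

```python
def insert_into(table, rows):
    """Write rows keyed on their first field; an existing key is overwritten."""
    for row in rows:
        table[row[0]] = tuple(row[1:])
    return table


table = {}

# Initial load: keys 1..5
insert_into(table, [(str(k), '2008', '0') for k in range(1, 6)])

# Daily append: new keys, so the row count grows
insert_into(table, [('6', '2008', '0'), ('7', '2008', '0')])
count_after_append = len(table)

# Daily changes: existing key, so the value is replaced and the count is unchanged
insert_into(table, [('3', '2008', '999')])

print(count_after_append)
print(len(table))
print(table['3'])
```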

Let’s go into the HBase shell now and take a look at the column cells for that same key ID:

hbase(main):030:0> get 'test1_flight_delays', '102393'
COLUMN                                     CELL                                                                                                                       
 dims:carrier                              timestamp=1432236723680, value=OO                                                                                          
 dims:dest                                 timestamp=1432236723680, value=ACV                                                                                         
 dims:orig                                 timestamp=1432236723680, value=SLC                                                                                         
 dims:year                                 timestamp=1432236723680, value=2008                                                                                        
 measures:cancelled                        timestamp=1432236723680, value=1                                                                                           
 measures:distance                         timestamp=1432236723680, value=635                                                                                         
 measures:flights                          timestamp=1432236723680, value=0                                                                                           
 measures:late                             timestamp=1432236723680, value=999                                                                                         
8 row(s) in 0.0800 seconds

Notice how the timestamp for each of the cells has now updated? If I had more than the default 1 version of each cell enabled, I could query the previous versions to see the old values and timestamps. So this works pretty well, and all I need to do is use HiveQL and INSERT INTO TABLE … SELECT to initially populate, append to and even update values in the table.

But what if I want to update HBase more “programmatically”, maybe as part of a process that reads directly from a source application (for example, Salesforce or a web service) and then writes directly into HBase without the intermediate step of landing the incoming data into a file? For this we can use the HBase Client API, for which there are libraries in many languages, the most popular being the Java API. If Java is too much though and you’d rather interact with HBase using a language such as Python, as this Cloudera blog post explains you can use either HBase’s REST API interface or its Thrift interface.

In my case, my preferred way of programmatically working with HBase is to use Python and a developer library called Happybase, where I can also bring in other libraries such as ones to work with Hive and even ones to work with OBIEE and Fusion Middleware and do my work at a much higher level of abstraction. To show how this might work, I’m going to use Python, the HBase Client API and Happybase to programmatically read from my update Hive tables (in real life I’d probably connect directly to a web service if going down this more complicated route) and write a routine to read rows from the Hive table and load them into HBase.

Again I’m using the Oracle Big Data Lite 4.1 VM which has Python 2.7.6 already installed, and to get ready to install the Happybase library I first need to install pip, the “preferred installer program” for Python. As per the pip installation instructions, first download pip and then install it from the command-line:

sudo python get-pip.py

Then use pip to install Happybase:

sudo pip install happybase

Whilst you’re there you might as well install “pyhs2”, another Python package that in this case lets us easily connect to Hive tables via the HiveServer2 interface found on CDH5+ and the Big Data Lite 4.1 VM.

sudo pip install pyhs2

Now I can put together a Python program such as the one below, that in this case creates a connection to a Hive table, selects all rows from it into a cursor and then PUTs these rows into the HBase table, via a batch process that sends data to HBase via the Thrift interface every 10,000 rows:

import pyhs2
import happybase

# Connect to HBase via the Thrift interface and open the target table
connection = happybase.Connection('bigdatalite')
flight_delays_hbase_table = connection.table('test1_flight_delays')

# Buffer the PUTs and send them to HBase every 10,000 rows
b = flight_delays_hbase_table.batch(batch_size=10000)

with pyhs2.connect(host='bigdatalite',
               port=10000,
               authMechanism="PLAIN",
               user='oracle',
               password='welcome1',
               database='default') as conn:
    with conn.cursor() as cur:

        # Execute query
        cur.execute("select * from flight_delays_initial_load")

        # Fetch table results; field 0 is the row key, the rest map to HBase columns
        for i in cur.fetch():
            b.put(str(i[0]),{'dims:year': i[1],
                             'dims:carrier': i[2],
                             'dims:orig': i[3],
                             'dims:dest': i[4],
                             'measures:flights': i[5],
                             'measures:late': i[6],
                             'measures:cancelled': i[7],
                             'measures:distance': i[8]})

# Send any rows still waiting in the final, partly-filled batch
b.send()

which I can then run from the command-line like this:

[oracle@bigdatalite ~]$ python ./load_update_flight_delays.py

As I said, using this approach I could just as easily connect to a web service or read in data via Flume or Kafka, and I can delete rows as well as insert/update them and add any other logic. From my testing it’s not all that much faster than going via HiveQL and INSERT INTO TABLE … SELECT scripts (most probably because I’m still going into HBase indirectly, via the Thrift interface) but it does offer the possibility of direct inserts into HBase (and therefore Hive) from the source application without the intermediate step of writing files to disk.
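As a sketch of how deletes might sit alongside the inserts and updates, the routine below queues both kinds of change on a batch object. Happybase batches expose put(row, data) and delete(row), so a real batch from table.batch() could be passed in; the RecordingBatch class, the apply_changes and row_to_put helpers, and the choice of key 21307 as a deleted row are all made up for illustration, so that the example runs without a cluster.

```python
COLUMNS = ['dims:year', 'dims:carrier', 'dims:orig', 'dims:dest',
           'measures:flights', 'measures:late',
           'measures:cancelled', 'measures:distance']


def row_to_put(row):
    """Turn one 9-field source row into (row_key, {column: value}), all strings."""
    return str(row[0]), dict(zip(COLUMNS, [str(v) for v in row[1:]]))


def apply_changes(batch, upserts, deleted_keys):
    """Queue puts for new/changed rows and deletes for removed keys."""
    for row in upserts:
        key, data = row_to_put(row)
        batch.put(key, data)
    for key in deleted_keys:
        batch.delete(str(key))


class RecordingBatch(object):
    """Hypothetical stand-in that records calls instead of talking to HBase."""

    def __init__(self):
        self.calls = []

    def put(self, row, data):
        self.calls.append(('put', row, data))

    def delete(self, row):
        self.calls.append(('delete', row))


batch = RecordingBatch()
apply_changes(batch,
              upserts=[(102393, 2008, 'OO', 'SLC', 'ACV', 0, 999, 1, 635)],
              deleted_keys=[21307])

for call in batch.calls:
    print(call[0], call[1])
```

Converting every value to a string before the put keeps the stored cells consistent with the all-string Hive table definition used in this post.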

So to finish this short series, tomorrow I’ll look at how well these Hive-on-HBase tables, and the Impala table I created in the previous example, work when queried from OBIEE11g. Back tomorrow.

Categories: BI & Warehousing

Using HBase and Impala to Add Update and Delete Capability to Hive DW Tables, and Improve Query Response Times

Tue, 2015-05-19 16:21

One of our customers is looking to offload part of their data warehouse platform to Hadoop, extracting data out of a source system and loading it into Apache Hive tables for subsequent querying using OBIEE11g. One of the challenges that the project faces though is how to handle updates to dimensions (and in their case, fact table records) when HDFS and Hive are typically append-only. Ideally writes to fact tables should only require INSERTs and filesystem appends, but in this case they wanted to use an accumulating fact snapshot table, whilst the dimension tables all used SCD1-type attributes that had their values overwritten when updates to those values came through from the source system.

The obvious answer then was to use Apache HBase as part of the design, a NoSQL database that sits over HDFS but allows updates and deletes to individual rows of data rather than restricting you just to append/inserts. I covered HBase briefly on the blog a few months ago when we used it to store webserver log entries brought into Hadoop via Flume, but in this case it makes an ideal landing point for data coming into our Hadoop system, as we can maintain a current-state record of the data brought in from the source system, updating and overwriting values if we need to. What was also interesting to me though was how well we could integrate this HBase data into our mainly SQL-style data processing; how much Java I’d have to use to work with HBase, and whether we could get OBIEE to connect to the HBase tables and query them directly (with a reasonable response time). In particular, could we use the Hive-on-HBase feature to create Hive tables over the HBase ones, and then query those efficiently using OBIEE, so that the data flow looked like this?

NewImage

To test this idea out, I took the Flight Delays dataset from the OBIEE11g SampleApp & Exalytics demo data [PDF] and created four HBase tables to hold the data from them, using the BigDataLite 4.1 VM and the HBase Shell. This dataset has four tables:

  • FLIGHT_DELAYS – around 220m US flight records listing the origin airport, destination airport, carrier, year and a bunch of metrics (flights, late minutes, distance etc)
  • GEOG_ORIGIN – a list of all the airports in the US along with their city, state, name and so on
  • GEOG_DEST – a copy of the GEOG_ORIGIN table, used for filtering and aggregating on both origin and destination 
  • CARRIERS – a list of all the airlines associated with flights in the FLIGHT_DELAYS table

HBase is a NoSQL, key/value-store database where individual rows have a key, and then one or more column families made up of one or more columns. When you define an HBase table you only define the column families, and the data load itself creates the columns within them in a similar way to how the Endeca Server holds “jagged” data – individual rows might have different columns to each other and like MongoDB you can define a new column just by loading it into the database.

Using the HBase Shell CLI on the BigDataLite VM I therefore create the HBase tables using just these high-level column family definitions, with the individual columns within the column families to be defined later when I load data into them.

hbase shell
 
create 'carriers','details'
create 'geog_origin','origin'
create 'geog_dest','dest'
create 'flight_delays','dims','measures'

To get data into HBase tables there’s a variety of methods you can use. Most probably for the full project we’ll write a Java application that uses the HBase client to read, write, update and delete rows that are read in from the source application (see this previous blog post for an example where we use Flume as the source). To set up some example data, though, we can use the HBase Shell and enter the HBase row/cell values directly, like this for the geog_dest table:

put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles'
put 'geog_dest','LAX','dest:city','Los Angeles, CA'
put 'geog_dest','LAX','dest:state','California'
put 'geog_dest','LAX','dest:id','12892'

and you can then use the “scan” command from the HBase shell to see those values stored in HBase’s key/value store, keyed on LAX as the key.

hbase(main):015:0> scan 'geog_dest'
ROW                                    COLUMN+CELL                                                                                                     
 LAX                                   column=dest:airport_name, timestamp=1432067861347, value=Los Angeles, CA: Los Angeles                           
 LAX                                   column=dest:city, timestamp=1432067861375, value=Los Angeles, CA                                                
 LAX                                   column=dest:id, timestamp=1432067862018, value=12892                                                            
 LAX                                   column=dest:state, timestamp=1432067861404, value=California                                                    
1 row(s) in 0.0240 seconds

For testing purposes though we need a large volume of rows and entering them all in by hand isn’t practical, so this is where we start to use the Hive integration that now comes with HBase. For the BigDataLite 4.1 VM all you need to do to get this working is install the hive-hbase package using yum (after first installing the Cloudera CDH5 repo into /etc/yum.repos.d), load the relevant JAR files when starting your Hive shell session, and then create a Hive table over the HBase table mapping Hive columns to the relevant HBase ones, like this:

hive
 
ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler.jar;
ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;
ADD JAR /usr/lib/hive/lib/hbase-client.jar;
ADD JAR /usr/lib/hive/lib/hbase-common.jar;
ADD JAR /usr/lib/hive/lib/hbase-hadoop-compat.jar;
ADD JAR /usr/lib/hive/lib/hbase-hadoop2-compat.jar;
ADD JAR /usr/lib/hive/lib/hbase-protocol.jar;
ADD JAR /usr/lib/hive/lib/hbase-server.jar;
ADD JAR /usr/lib/hive/lib/htrace-core.jar;
 
CREATE EXTERNAL TABLE hbase_carriers
 (key string,
  carrier_desc string
) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES 
("hbase.columns.mapping" = ":key,details:carrier_desc")
TBLPROPERTIES ("hbase.table.name" = "carriers");
 
CREATE EXTERNAL TABLE hbase_geog_origin
 (key string,
  origin_airport_name string,
  origin_city string,
  origin_state string,
  origin_id string
) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES 
("hbase.columns.mapping" = ":key,origin:airport_name,origin:city,origin:state,origin:id")
TBLPROPERTIES ("hbase.table.name" = "geog_origin");
 
CREATE EXTERNAL TABLE hbase_geog_dest
 (key string,
  dest_airport_name string,
  dest_city string,
  dest_state string,
  dest_id string
) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES 
("hbase.columns.mapping" = ":key,dest:airport_name,dest:city,dest:state,dest:id")
TBLPROPERTIES ("hbase.table.name" = "geog_dest");
 
CREATE EXTERNAL TABLE hbase_flight_delays
 (key string,
  year string,
  carrier string,
  orig string,
  dest string,
  flights tinyint,
  late   tinyint,
  cancelled bigint,
  distance smallint
) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES 
("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,measures:flights,measures:late,measures:cancelled,measures:distance")
TBLPROPERTIES ("hbase.table.name" = "flight_delays");

Bulk loading data into these Hive-on-HBase tables is then just a matter of loading the source data into a regular Hive table, and then running INSERT INTO TABLE … SELECT commands to copy the regular Hive rows into the HBase tables via their Hive metadata overlays:

insert into table hbase_carriers                           
select carrier, carrier_desc from carriers;
 
insert into table hbase_geog_origin
select * from geog_origin;
 
insert into table hbase_geog_dest
select * from geog_dest;
 
insert into table hbase_flight_delays
select row_number() over (), * from flight_delays;

Note that I had to create a synthetic sequence number key for the fact table, as the source data for that table doesn’t have a unique key for each row – something fairly common for data warehouse fact table datasets. In fact storing fact table data in an HBase table is not a very good idea for a number of reasons that we’ll see in a moment; bear in mind that HBase is designed for sparse datasets and low-latency inserts and row retrievals, so don’t read too much into this approach yet.
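The synthetic key is just a running number attached to each source row, which can be pictured in a few lines of Python. The sketch also shows one thing to watch for: the key is held as a string, so comparing keys compares them as text unless they are cast to a number first, which is why the row-count queries in the follow-up post use cast(key as bigint). The add_sequence_key helper and the sample rows are made up for illustration.

```python
def add_sequence_key(rows, start=1):
    """Prefix each row with a sequence number, stored as a string row key."""
    return [(str(n),) + tuple(row) for n, row in enumerate(rows, start)]


rows = [('2008', 'WN'), ('2008', 'OO')] * 6     # 12 made-up source rows
keyed = add_sequence_key(rows)
keys = [r[0] for r in keyed]

print(keyed[0])
print(max(keys))            # compared as text: '9' sorts after '12'
print(max(keys, key=int))   # compared as numbers
```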

So going back to the original reason for using HBase to store these tables, updating rows within them is pretty straightforward. Taking the geog_origin HBase table, if we get the row for SFO at the start using a Hive query over the HBase table, it looks like this:

hive> select * from hbase_geog_origin where key = 'SFO'; 
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
SFO   San Francisco, CA: San Francisco   San Francisco, CA   California   14771
Time taken: 29.126 seconds, Fetched: 1 row(s)

To update that row and others, I can load a new data file into the Hive table using HiveQL’s LOAD DATA command, or INSERT INTO TABLE … SELECT from another Hive table containing the updates, like this:

insert into table hbase_geog_origin    
select * from origin_updates;

To check that the value has in fact updated I can either run the same SELECT query against the Hive table over the HBase one, or drop into the HBase shell and check it there:

hbase(main):001:0> get 'geog_origin','SFO'
COLUMN                                 CELL                                                                                                           
 origin:airport_name                   timestamp=1432050681685, value=San Francisco, CA: San Francisco International                                  
 origin:city                           timestamp=1432050681685, value=San Francisco, CA                                                               
 origin:id                             timestamp=1432050681685, value=14771                                                                           
 origin:state                          timestamp=1432050681685, value=California                                                                      
4 row(s) in 0.2740 seconds

In this case the update file/Hive table changed the SFO airport name from “San Francisco” to “San Francisco International”. I can change it back again using the HBase Shell like this, if I want:

put 'geog_origin','SFO','origin:airport_name','San Francisco, CA: San Francisco'

and then checking it again using the HBase Shell’s GET command on that key value shows it’s back to the old value – HBase can store several versions of each cell, each with its own timestamp, but by default it shows you the current one:

hbase(main):003:0> get 'geog_origin','SFO'
COLUMN                                 CELL                                                                                                           
 origin:airport_name                   timestamp=1432064747843, value=San Francisco, CA: San Francisco                                                
 origin:city                           timestamp=1432050681685, value=San Francisco, CA                                                               
 origin:id                             timestamp=1432050681685, value=14771                                                                           
 origin:state                          timestamp=1432050681685, value=California                                                                      
4 row(s) in 0.0130 seconds
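The cell versioning mentioned above can be pictured with a small Python model. This is a simplification for illustration only: VersionedCell is a made-up class, and it trims old versions immediately on write, which may not match how HBase itself cleans them up internally. The timestamps are the ones shown in the two GET outputs above.

```python
class VersionedCell(object):
    """Toy cell that keeps at most max_versions (timestamp, value) pairs."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = []                       # kept newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(reverse=True)         # newest timestamp first
        del self.versions[self.max_versions:]    # discard anything beyond the limit

    def get(self, versions=1):
        return self.versions[:versions]


name = VersionedCell(max_versions=3)
name.put(1432050681685, 'San Francisco, CA: San Francisco International')
name.put(1432064747843, 'San Francisco, CA: San Francisco')

print(name.get())              # newest version only, as a plain GET shows
print(name.get(versions=3))    # history, newest first
```

With max_versions left at 1, the second put would simply replace the first and no history would be available.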

So, so far so good. We’ve got a way of storing data in Hive-type tables on Hadoop and a way of updating and amending records within them by using HBase as the underlying storage, but what are these tables like to query? Hive-on-HBase tables with just a handful of HBase rows return data almost immediately, for example when I create a copy of the geog_dest HBase table and put just a single row entry into it, then query it using a Hive table over it:

hive> select * from hbase_geog_dest2;
OK
LAX   Los Angeles, CA: Los Angeles   Los Angeles, CA   California   12892
Time taken: 0.257 seconds, Fetched: 1 row(s)

A regular Hive query would normally take 30 seconds or more to return even a single row. But when we move up to larger datasets such as the flight delays fact table itself, running a simple aggregation on the regular Hive table and then comparing that to the same query running against the Hive-on-HBase version shows a significant time penalty for the HBase version:

hive> select sum(cast(flights as bigint)) as flight_count from flight_delays;
Total jobs = 1
Launching Job 1 out of 1
...
Total MapReduce CPU Time Spent: 7 seconds 670 msec
OK
29483653
Time taken: 37.327 seconds, Fetched: 1 row(s)

compared to the Hive-on-HBase version of the fact table:

hive> select sum(cast(flights as bigint)) as flight_count from hbase_flight_delays;
Total jobs = 1
Launching Job 1 out of 1
...
Total MapReduce CPU Time Spent: 1 minutes 19 seconds 240 msec
OK
21473738
Time taken: 99.154 seconds, Fetched: 1 row(s)

And that’s to be expected; as I said earlier, HBase is aimed at low-latency single-row operations rather than full-table-scan, aggregation-type queries, so it’s no surprise that HBase performs badly here. The response time is even worse, though, if I try to join the HBase-stored Hive fact table to one or more of the dimension tables also stored in HBase.

In our particular customer example though these HBase tables were only going to be loaded once-a-day, so what if we copy the current version of each HBase table row into a snapshot Hive table stored in regular HDFS storage, so that our data loading process looks like this:

NewImage

and then OBIEE queries the snapshot of the Hive-on-HBase table joined to the dimension table still stored in HBase, so that the query side looks like this:

NewImage

Let’s try it out by taking the original Hive table I used earlier on to load the hbase_flight_delays table, and joining that to one of the Hive-on-HBase dimension tables. I’ll start first by creating a baseline response time, joining that source Hive fact table to the source Hive dimension table (also used earlier to load the corresponding Hive-on-HBase table):

select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from flight_delays f 
join geog_origin o on f.orig = o.origin                                                             
and o.origin_state = 'California'                                                                       
group by o.origin_airport_name; 
...
OK
17638	Arcata/Eureka, CA: Arcata
9146	Bakersfield, CA: Meadows Field
125433	Burbank, CA: Bob Hope
...
1653	Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field
Time taken: 43.896 seconds, Fetched: 27 row(s)

So that’s just under 44 seconds to do the query entirely using regular Hive tables. So what if I swap-out the regular Hive dimension table for the Hive-on-HBase version, how does that affect the response time?

hive> select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from flight_delays f       
    > join hbase_geog_origin o on f.orig = o.key                                                        
    > and o.origin_state = 'California'                                                                 
    > group by o.origin_airport_name;
...
OK
17638	Arcata/Eureka, CA: Arcata
9146	Bakersfield, CA: Meadows Field
125433	Burbank, CA: Bob Hope
...
1653	Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field
Time taken: 51.757 seconds, Fetched: 27 row(s)

That’s interesting – even though we used the (updatable) Hive-on-HBase dimension table in the query, the response time only went up by about eight seconds to just under 52, compared to just under 44 when we used only regular Hive tables. Taking it one step further though, what if we used Cloudera Impala as our query engine and copied the Hive-on-HBase fact table into a Parquet-stored Impala table, so that our inward data flow looked like this:

NewImage

By using the Impala MPP engine – running on Hadoop but directly reading the underlying data files, rather than going through MapReduce as Hive does – and in addition storing its data in column-store, query-orientated Parquet storage, we can take advantage of OBIEE 11.1.1.9’s new support for Impala and potentially bring the query response time down even further. Let’s go into the Impala Shell on the BigDataLite 4.1 VM, update Impala’s view of the Hive Metastore table data dictionary, and then create the corresponding Impala snapshot fact table using a CREATE TABLE … AS SELECT Impala SQL command:

[oracle@bigdatalite ~]$ impala-shell
 
[bigdatalite.localdomain:21000] > invalidate metadata;
 
[bigdatalite.localdomain:21000] > create table impala_flight_delays
                                > stored as parquet
                                > as select * from hbase_flight_delays;
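
Because the HBase tables are only loaded once a day, this snapshot step is a natural candidate for scripting. The following is a minimal Python sketch rather than what was used on the project: it assumes the table names shown above, assumes that dropping and recreating the snapshot on each run is acceptable, and only builds the impala-shell command lines (using its -q option) without running them. The function names are made up for illustration.

```python
def snapshot_statements(source_table, snapshot_table):
    """Return the Impala SQL needed to rebuild a Parquet snapshot table."""
    return [
        # Refresh Impala's view of the Hive Metastore first
        "invalidate metadata",
        # Assumption: the previous snapshot can simply be thrown away
        "drop table if exists {0}".format(snapshot_table),
        "create table {0} stored as parquet as select * from {1}".format(
            snapshot_table, source_table
        ),
    ]


def impala_shell_commands(statements):
    """Build one impala-shell invocation per statement (nothing is executed here)."""
    return [["impala-shell", "-q", statement] for statement in statements]


commands = impala_shell_commands(
    snapshot_statements("hbase_flight_delays", "impala_flight_delays")
)
for command in commands:
    print(command)
```

Each list could then be handed to subprocess.check_call from whatever scheduler runs the nightly load.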

Now let’s use the Impala Shell to join the Impala version of the flight delays table with data stored in Parquet files, to the Hive-on-HBase dimension table created earlier within our Hive environment:

[bigdatalite.localdomain:21000] > select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
                                > join hbase_geog_origin o on f.orig = o.key
                                > and o.origin_state = 'California'  
                                > group by o.origin_airport_name;
Query: select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
join hbase_geog_origin o on f.orig = o.key
and o.origin_state = 'California'
group by o.origin_airport_name
+--------------+------------------------------------------------------------------+
| flight_count | origin_airport_name                                              |
+--------------+------------------------------------------------------------------+
| 31907        | Fresno, CA: Fresno Yosemite International                        |
| 125433       | Burbank, CA: Bob Hope                                            |
...
| 1653         | Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field |
+--------------+------------------------------------------------------------------+
Fetched 27 row(s) in 2.16s

Blimey – 2.16 seconds, compared to the best time of 44 seconds we got earlier when we just used regular Hive tables, let alone when we joined to the dimension table stored in HBase. Let’s crank it up a bit and join another dimension table in, filtering on both origin and destination values:

[bigdatalite.localdomain:21000] > select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
                                > join hbase_geog_origin o on f.orig = o.key
                                > join hbase_geog_dest d on f.dest = d.key
                                > and o.origin_state = 'California'  
                                > and d.dest_state = 'New York'
                                > group by o.origin_airport_name;
Query: select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
join hbase_geog_origin o on f.orig = o.key
join hbase_geog_dest d on f.dest = d.key
and o.origin_state = 'California'
and d.dest_state = 'New York'
group by o.origin_airport_name
+--------------+-------------------------------------------------------+
| flight_count | origin_airport_name                                   |
+--------------+-------------------------------------------------------+
| 947          | Sacramento, CA: Sacramento International              |
| 3880         | San Diego, CA: San Diego International                |
| 4030         | Burbank, CA: Bob Hope                                 |
| 41909        | San Francisco, CA: San Francisco International        |
| 3489         | Oakland, CA: Metropolitan Oakland International       |
| 937          | San Jose, CA: Norman Y. Mineta San Jose International |
| 41407        | Los Angeles, CA: Los Angeles International            |
| 794          | Ontario, CA: Ontario International                    |
| 4176         | Long Beach, CA: Long Beach Airport                    |
+--------------+-------------------------------------------------------+
Fetched 9 row(s) in 1.48s

Even faster. So that’s what we’ll be going with as our initial approach for the data loading and querying; load data into HBase tables as planned at the start, taking advantage of HBase’s CRUD capabilities but bulk-loading and initially reading the data using Hive tables over the HBase ones; but then, before we make the data available for querying by OBIEE, we copy the current state of the HBase fact table into a Parquet-stored Impala table, using Impala’s ability to work with Hive tables and metadata and create joins across both Impala and Hive tables, even when one of the Hive tables uses HBase as its underlying storage.
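
To put the single-dimension join timings quoted above side by side, here is a small Python calculation. The figures are copied from the query output in this post; they come from one run each on a single VM, so treat the ratios as indicative only.

```python
# Elapsed times in seconds, as reported by the Hive CLI and Impala Shell above
timings_seconds = {
    "Hive fact + Hive dimension": 43.896,
    "Hive fact + Hive-on-HBase dimension": 51.757,
    "Impala Parquet fact + Hive-on-HBase dimension": 2.16,
}

baseline = timings_seconds["Hive fact + Hive dimension"]

# Ratio above 1 means faster than the all-Hive baseline, below 1 means slower
speedups = {name: baseline / seconds for name, seconds in timings_seconds.items()}

for name, ratio in speedups.items():
    print("{0}: {1:.1f}x the baseline speed".format(name, ratio))
```

On these numbers the Impala version comes out at roughly twenty times the speed of the all-Hive baseline, while swapping in the Hive-on-HBase dimension table on its own makes the Hive query a little slower.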

Categories: BI & Warehousing

OBIEE 11.1.1.9 Now Supports HiveServer2 and Cloudera Impala

Sun, 2015-05-17 15:52

As you all probably know I’m a big fan of Oracle’s BI and Big Data products, but something I’ve been critical of is OBIEE11g’s lack of support for HiveServer2 connections to Hadoop clusters. OBIEE 11.1.1.7 supported Hive connections using the older HiveServer1 protocol, but recent versions of Cloudera CDH4 and CDH5 use the HiveServer2 protocol by default and OBIEE 11.1.1.7 wouldn’t connect to them; not unless you switched to the Windows version of OBIEE and used the Cloudera ODBC drivers instead, which worked but weren’t supported by Oracle.

OBIEE 11.1.1.9 addresses this issue by shipping more recent DataDirect ODBC drivers for Hive that are compatible with the HiveServer2 protocol used by CDH4 and CDH5 (check out this other article by Robin on general new features in 11.1.1.9). Oracle only really support Hive connectivity for Linux installs of OBIEE, and the Linux version of OBIEE 11.1.1.9 comes with the DataDirect ODBC drivers already installed and configured for use; all you have to do then is set up the ODBC connection in the odbc.ini file on Linux and install the Cloudera Hive ODBC drivers on your Windows workstation for the Admin tool (the Hive ODBC drivers that Oracle supply on MOS still look like the old HiveServer1 version, though I could be wrong). To check that it all worked on this new 11.1.1.9 version of OBIEE11g I therefore downloaded and installed the Windows Cloudera Hive ODBC drivers and set up the System DSN like this:

NewImage

and set up a corresponding entry in the Linux OBIEE 11.1.1.9’s odbc.ini file, like this:

NewImage

with the key thing being to make sure you have matching DSN names on both the Windows workstation (for the Admin tool initial datasource setup and table metadata import) and the Linux server (for the actual online connection to Hive from the BI Server, and subsequent data retrieval). One thing I did notice was that whilst I could connect to the Hive database server and set up the connection in the Admin tool, I couldn’t view any Hive tables and had to manually create them myself in the RPD Physical Layer – this could just be a quirk on my workstation install though so I wouldn’t read too much into it. Checking connectivity in the Admin tool then showed it connecting properly and retrieving data from Hive on the Hadoop cluster. I didn’t test Kerberos-authentication connections but I’m assuming it’d work, as the previous version of OBIEE 11.1.1.7 on Linux just failed at this point anyway. The docs are here if you’d like to look into any more details, or check the full set of setup steps.

NewImage

For Cloudera Impala connections, you’re directed in the docs to download the Windows Cloudera Impala ODBC drivers as Oracle don’t even ship them on MOS, but again the Linux install of OBIEE 11.1.1.9 comes with DataDirect Impala drivers that are already set up and ready for use (note that if you upgrade from 11.1.1.7 to 11.1.1.9 rather than do the fresh install that I did for testing purposes, you’ll need to edit the opmn.xml file to register these updated DataDirect drivers). Then it’s a case of setting the Windows System DSN up for the initial metadata import, like this:

NewImage

then creating a corresponding entry in the Linux server’s odbc.ini file, like this:

NewImage

Note that the docs do mention the issue with earlier versions of Impala where the Impala server expects a LIMIT clause when using ORDER BY in Impala SQL queries, and give a couple of workarounds to stop Impala expecting this clause; for more recent (CDH5+) versions of Impala this requirement is in fact lifted and you can connect to and use Impala without needing to make the configuration change mentioned in the doc (or use the workaround I mentioned in this earlier blog post). Checking connectivity in the Admin tool then shows the connection is making its way through OK, from the Windows environment to the Linux server’s ODBC connection:

NewImage

and creating a quick report shows data returned as expected, and considerably quicker than with Hive.

NewImage

As I said, I’ve not really tested either of these two connections using Kerberos or any edge-case setups, but connectivity seems to be working and we’re now in a position where OBIEE11g can properly connect to both Hive, and Impala, on recent CDH installs and of course the Oracle Big Data Appliance. Good stuff, now what about Spark SQL or ElasticSearch..?

Categories: BI & Warehousing

Presentation Slides and Photos from the Rittman Mead BI Forum 2015, Brighton and Atlanta

Sat, 2015-05-16 13:41

It’s now the Saturday after the two Rittman Mead BI Forum 2015 events, last week in Atlanta, GA and the week before in Brighton, UK. Both events were a great success and I’d like to say thanks to the speakers, attendees, our friends at Oracle and my colleagues within Rittman Mead for making the two events so much fun. If you’re interested in taking a look at some photos from the two events, I’ve put together two Flickr photosets that you can access using the links below:

NewImage

We’ve also uploaded the presentation slides from the two events (where we’ve been given permission to share them) to our website, and you can download them including the Delivering the Oracle Information Management and Big Data Reference Architecture masterclass using the links below:

Delivering the Oracle Information Management & Big Data Reference Architecture (Mark Rittman & Jordan Meyer, Rittman Mead)

Brighton, May 7th and 8th 2015

Atlanta, May 14th and 15th 2015

Congratulations also to Emiel van Bockel and Robin Moffatt who jointly-won Best Speaker award at the Brighton event, and to Andy Rocha and Pete Tamsin who won Best Speaker in Atlanta for their joint session. It’s time for a well-earned rest now and then back to work, and hopefully we’ll see some of you at KScope’15, Oracle Openworld 2015 or the UKOUG Tech and Apps 2015 conferences later in 2015.

Categories: BI & Warehousing

What’s New in OBIEE 11.1.1.9 for Systems Administrators and Developers

Thu, 2015-05-14 11:42

After over two years since the last major release of OBIEE, Oracle released version 11.1.1.9 in May 2015. You can find the installers here and documentation here. 11.1.1.9 is termed the “terminal release” of the 11g line, and the 12c version is already out in closed-beta. We’d expect to see patchsets for 11g to continue for some time covering bugs and any security issues, but for new functionality in 11g I would hazard a guess that this is pretty much it as Oracle concentrate their development efforts on OBIEE 12c and BICS, particularly Visual Analyser.

For both the end user and backend administrator/developer, OBIEE 11.1.1.9 has brought with it some nice little touches, none of which are going to revolutionise the OBIEE world but many of which are going to make life with the tool just that little bit smoother. In this article we take a look at what 11.1.1.9 brings for the sysadmin & developer.

BI Server Query Instrumentation and Usage Tracking

There are some notable developments here:

  1. Millisecond precision when logging events from the BI Server
  2. Usage Tracking now includes the physical query hash, which is what is also visible in the database, enabling end-to-end tracing
  3. User sessions can be tracked and summarised more precisely because session ID is now included in Usage Tracking.
  4. The execution of initialisation blocks is now also recorded, in a new Usage Tracking table called S_NQ_INITBLOCK.

Millisecond precision in BI Server logs

OBIEE 11.1.1.9 writes the nqquery.log with millisecond precision for both the timestamp of each entry, and also the summary timings for a query execution (at last!). It also calls out explicitly “Total time in BI Server” which is a welcome addition from a time profiling/performance analysis point of view:

[2016-07-31T02:11:48.231-04:00 [...] Sending query to database named X0 - Airlines Demo Dbs (ORCL) (id: <<221516>>), connection pool named Aggr Connection, logical request hash 544131ec, physical request hash 5018e5db: [[  
[2016-07-31T02:12:04.31-04:00 [...] Query Status: Successful Completion  
[2016-07-31T02:12:04.31-04:00 [...] Rows 2, bytes 32 retrieved from database query id: <<221516>>  
[2016-07-31T02:12:04.31-04:00 [...] Physical query response time 2.394 (seconds), id <<221516>>  
[2016-07-31T02:12:04.31-04:00 [...] Physical Query Summary Stats: Number of physical queries 1, Cumulative time 2.394, DB-connect time 0.002 (seconds)  
[2016-07-31T02:12:04.31-04:00 [...] Rows returned to Client 2  
[2016-07-31T02:12:04.31-04:00 [...] Logical Query Summary Stats: Elapsed time 16.564, Total time in BI Server 16.555, Response time 16.564, Compilation time 0.768 (seconds), Logical hash 544131ec

One thing to notice here is the subsecond timestamp precision seems to vary between 2 and 3 digits, which may or may not be a bug.
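
With the timings now written at this precision, the summary line is worth pulling apart programmatically. A minimal Python sketch follows, assuming the line format is exactly as in the excerpt above (the regular expression is mine and has only been checked against that one line):

```python
import re

# Matches the "Logical Query Summary Stats" line as it appears in the excerpt above
SUMMARY_RE = re.compile(
    r"Logical Query Summary Stats: Elapsed time (?P<elapsed>[\d.]+), "
    r"Total time in BI Server (?P<bi_server>[\d.]+), "
    r"Response time (?P<response>[\d.]+), "
    r"Compilation time (?P<compilation>[\d.]+) \(seconds\)"
)


def parse_logical_summary(line):
    """Return the four timings (in seconds) as floats, or None if the line does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


sample_line = (
    "[2016-07-31T02:12:04.31-04:00 [...] Logical Query Summary Stats: "
    "Elapsed time 16.564, Total time in BI Server 16.555, Response time 16.564, "
    "Compilation time 0.768 (seconds), Logical hash 544131ec"
)
timings = parse_logical_summary(sample_line)
print(timings)
```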

Being able to see this additional level of precision is really important. Previously OBIEE recorded information by the second, which was fine if you were looking at query executions taking dozens of seconds or minutes – but hopefully our aspirations for systems performance are actually closer to the realms of seconds or subsecond. At this scale the level of precision in the timings really matters. On the assumption that OBIEE was rounding values to the nearest whole number, you’d see “0 seconds” for a Logical SQL compile (for example) that was maybe 0.499 seconds. Per query this is not so significant, but if those queries run frequently then cumulatively that time stacks up and would be useful to be properly aware of and target with optimisation if needed.
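
As a worked version of that arithmetic, here is a short Python sketch. The 0.499 second compile time is the example from the paragraph above; the figure of 1,000 executions a day is purely hypothetical, as is the assumption that older versions rounded to the nearest whole second.

```python
compile_seconds = 0.499       # example compile time from the text
executions_per_day = 1000     # hypothetical workload, for illustration only

# What a whole-second log would show for each execution
logged_seconds = round(compile_seconds)

# Time that never shows up in the log, summed over the day
hidden_per_day = executions_per_day * (compile_seconds - logged_seconds)

print("logged per query:", logged_seconds)
print("unreported per day: about {0:.0f} seconds".format(hidden_per_day))
```

That is over eight minutes a day of compile time that a whole-second log would report as zero.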

Usage Tracking changes

Usage Tracking has five new columns for each logical query recorded in S_NQ_ACCT:

  • ECID
  • TENANT_ID
  • SERVICE_NAME
  • SESSION_ID
  • HASH_ID

The presence of SESSION_ID is very useful, because it means that user behaviour can be more accurately analysed. For example, within a session, how many reports does a user run? What is the median duration of a session? Note that the session here is the session as seen by the BI Server, rather than Presentation Services.


ECID is also very useful for being able to link data in Usage Tracking back to more detailed entries in nqquery.log. Note that an ECID is multipart and concatenated with the RID, so you won’t necessarily get a direct hit on the ECID you find in Usage Tracking with that in nqquery.log, but rather a substring of it. In this example the root ECID is 11d1def534ea1be0:20f8da5c:14d4441f7e9:-8000-0000000000001891,0:1:103 and the varying component of the relationship (RID) is 1 and 3 respectively:

Usage Tracking:

select ecid,session_id,start_dt,start_hour_min ,saw_src_path from biee_biplatform.s_nq_acct

sa50208

nqquery.log:

[2015-05-12T08:58:38.704-04:00] [...] [ecid: 11d1def534ea1be0:20f8da5c:14d4441f7e9:-8000-0000000000001891,0:1:103:3] [...]  
-------------------- SQL Request, logical request hash:  
3fabea2b  
SET VARIABLE QUERY_SRC_CD='Report',SAW_DASHBOARD='/shared/02. Visualizations/_portal/2.11 Table Designs',SAW_DASHBOARD_PG='Conditional Format',SAW_SRC_PATH='/shared/02. Visualizations/Configured Visuals/Conditional Formats/CF based on a hidden column',PREFERRED_CURRENCY='USD';SELECT^M  
   0 s_0,^M  
[...]

In the above example note how the absence of a timezone in the Usage Tracking data is an impediment to accurate interpretation of the results, compared to nqquery.log which has a fully qualified timezone offset.
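
The ECID matching described above can be sketched in Python as follows. This assumes that both values carry a trailing colon-separated RID component, as in this example; the Usage Tracking value ending in :1 is inferred from the description above rather than copied from the screenshot, and the function names are made up.

```python
def root_ecid(ecid):
    """Drop the final colon-separated component (the varying RID part)."""
    root, separator, _rid = ecid.rpartition(":")
    # If there is no colon at all, leave the value untouched
    return root if separator else ecid


def same_request(usage_tracking_ecid, nqquery_log_ecid):
    """True when both values share the same root ECID."""
    return root_ecid(usage_tracking_ecid) == root_ecid(nqquery_log_ecid)


# Assumed Usage Tracking value (RID component 1)
usage_tracking_ecid = (
    "11d1def534ea1be0:20f8da5c:14d4441f7e9:-8000-0000000000001891,0:1:103:1"
)
# Value from the nqquery.log excerpt above (RID component 3)
nqquery_log_ecid = (
    "11d1def534ea1be0:20f8da5c:14d4441f7e9:-8000-0000000000001891,0:1:103:3"
)

print(root_ecid(nqquery_log_ecid))
print(same_request(usage_tracking_ecid, nqquery_log_ecid))
```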

Usage Tracking changes – Physical Hash ID

As well as additions to the logical query table, there are two new columns for each physical query logged in S_NQ_DB_ACCT:

  • HASH_ID
  • PHYSICAL_HASH_ID

The implications of this are important – there is now native support in OBIEE for tracing OBIEE workloads directly down to the database (as discussed for OBIEE < 11.1.1.9 here), because the PHYSICAL_HASH_ID is what OBIEE sets as the ACTION field when it connects to the database, and it is available in Oracle through AWR, V$ views, and DBMS_MONITOR. For example, in V$SESSION the ACTION field is set to the physical hash:

SQL> select username,program,action 
  from v$session where lower(program) like 'nqs%';

USERNAME PROGRAM                                          ACTION  
-------- ------------------------------------------------ ---------  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         5065e891  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         2b6148b2  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         8802f14e  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         206c8d54  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         c1c121a7

The ACTION is also available in many EM screens such as this one:

sa50210

Now with OBIEE 11.1.1.9 the physical hash – which was previously only available in the nqquery.log file – is available in S_NQ_DB_ACCT which can in turn be joined to S_NQ_ACCT to find out the logical request related to the physical query seen on the database. Cool huh!

SELECT PHYSICAL_HASH_ID,  
       USER_NAME,  
       SAW_SRC_PATH,  
       SAW_DASHBOARD,  
       SAW_DASHBOARD_PG  
FROM   BIEE_BIPLATFORM.S_NQ_DB_ACCT PHYS  
       INNER JOIN BIEE_BIPLATFORM.S_NQ_ACCT LOGL  
               ON LOGL.ID = PHYS.LOGICAL_QUERY_ID  
WHERE  PHYS.PHYSICAL_HASH_ID = '5065e891'

sa50207

This can be extended even further to associate AWR workload reports with specific OBIEE requests:

sa50209

One little grumble (no pleasing some people…) – it would have been nice if Usage Tracking also stored:

  • Timings at millisecond precision as well
  • The number of bytes (rather than just row count)
  • A proper TIMESTAMP WITH TIME ZONE (rather than the weird triplet of TS/DT/HOUR_MIN)
  • “Total time in BI Server”

Who knows, maybe in 12c?…

Footnote – START_TS in Usage Tracking in 11.1.1.9

As a note for others who may hit this issue, my testing has shown that Usage Tracking in 11.1.1.9 appears to have introduced a bug with START_TS (on both S_NQ_ACCT and S_NQ_DB_ACCT), in that it stores only the date, not date + time as it did in previous versions. For example:

  • 11.1.1.7:
    SELECT TO_CHAR(START_TS, 'YYYY-MM-DD HH24:MI:SS') AS START_TS, 
           TO_CHAR(START_DT, 'YYYY-MM-DD HH24:MI:SS') AS START_DT, 
           START_HOUR_MIN 
    FROM   S_NQ_ACCT 
    WHERE  ROWNUM < 2 
    
    START_TS            START_DT            START_HOUR_MIN   
    ------------------- ------------------- -----  
    2015-03-19 15:32:23 2015-03-19 00:00:00 15:32
  • 11.1.1.9:
    SELECT TO_CHAR(START_TS, 'YYYY-MM-DD HH24:MI:SS') AS START_TS, 
           TO_CHAR(START_DT, 'YYYY-MM-DD HH24:MI:SS') AS START_DT, 
           START_HOUR_MIN 
    FROM   S_NQ_ACCT 
    WHERE  ROWNUM < 2 
    
    START_TS            START_DT            START_HOUR_MIN   
    ------------------- ------------------- -----  
    2015-01-27 00:00:00 2015-01-27 00:00:00 10:41
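
Until that is fixed, one way to get a usable start time back is to combine START_DT with START_HOUR_MIN once the rows have been fetched. This is a workaround sketch of my own in Python, not something from the Oracle docs, and it only recovers minute precision; the function name is made up.

```python
from datetime import datetime


def rebuild_start_ts(start_dt, start_hour_min):
    """Combine START_DT (date at midnight) with START_HOUR_MIN ('HH:MM')."""
    hours, minutes = (int(part) for part in start_hour_min.split(":"))
    # Seconds are not recoverable from these two columns, so they are set to zero
    return start_dt.replace(hour=hours, minute=minutes, second=0, microsecond=0)


# Values from the 11.1.1.9 example above
rebuilt = rebuild_start_ts(datetime(2015, 1, 27), "10:41")
print(rebuilt.strftime("%Y-%m-%d %H:%M:%S"))
```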

Initialisation Block information in Usage Tracking

A new table, S_NQ_INITBLOCK, has been added to BIPLATFORM and holds details of when an init block ran, for which user, and importantly, how long it took. From a performance analysis point of view this is really valuable data and it’s good to see it being added to the diagnostic data captured to database with Usage Tracking.

From a glance at the data it looks like there’s a bit of bonus logging going on, with user sign in/sign out also recorded (“SIGNNING ON/SIGNED ON/SIGNED OFF”).

2015-05-13_22-56-30

Note that there is no MBean for Init Block Usage Tracking, so regardless of how you configure the rest of Usage Tracking, you need to go to NQSConfig.ini to enable this one.

Presentation Services Cursor Cache

Oracle have added some additional Administration functionality for viewing and managing sessions and the cursor cache in Presentation Services. These let you track and trace user sessions more precisely.

From the Administration Page in OBIEE the new options are:


  1. Set dynamic log level per session from manage sessions

  2. Filter cursor cache based on specific user sessions

  3. Change sort order of cursor cache

  4. Show Presentation Services diagnostics per cursor

  5. Download cursor cache list as CSV

Some of these are somewhat low-level and will not be used day-to-day, but the general move towards a more open diagnostics interface with OBIEE is really positive and I hope we see more of it in 12c… :-)

Command Line Aggregate Advisor

Only for use by those with an Exalytics licence, the Summary Advisor was previously available in the Windows Administration Tool only but can now be run from the command line:

[oracle@demo setup]$ nqaggradvisor -h

Usage:  
    nQAggrAdvisor -d <dataSource> -u <userName> -o <outputFile> -c <tupleInQuotes>  
                  [-p <password>] [-F <factFilter>] [-z <maxSizeAggr>] [-g <gainThreshold>]  
                  [-l <minQueryTime>] [-t <timeoutMinutes>] [-s <startDate>]  
                  [-e <endDate>] [-C <on/off>] [-M <on/off>] [-K <on/off>]

Options:  
    -d      : Data source name  
    -u      : User name  
    -o      : Output aggregate persistence script file name  
    -c      : Aggregate persistence target - tuple in quotes: Fully qualified Connection pool, fully qualified schema name, capacity in MB  
    -p      : Password  
    -F      : Fact filter file name  
    -z      : Max size of any single aggregate in MB  
    -g      : Summary advisor will run until performance improvement for new aggregates drops below this value, default = 1  
    -l      : The minimum amount of query time accumulated per LTS in seconds, before it is included for analysis, default = 0  
    -t      : Max run time in minutes - 0 for unlimited, default = 0  
    -s      : Statistics start date  
    -e      : Statistics end date  
    -C      : Prefer optimizer estimates - on/off, default = off  
    -M      : Only include measures used in queries - on/off, default = off  
    -K      : Use surrogate keys - on/off, default = on

Examples:  
    nQAggrAdvisor -d "AnalyticsWeb" -u "Administrator" -p "ADMIN" -o "C:\temp\aggr_advisor.out.txt"  
        -c "DW_Aggr"."Connection Pool","DW_Aggr".."AGGR",1000

    nQAggrAdvisor -d "AnalyticsWeb" -u "Administrator" -p "ADMIN" -o "C:\temp\aggr_advisor.out.txt" -F "C:\temp\fact_filter.txt" -g 10  
        -c "TimesTen_instance1"."Connection Pool","dbo",2000 -s "2011-05-02 08:00:00" -e "2011-05-07 18:30:00"  -C on -M on -K off

Note that in the BIPLATFORM schema S_NQ_SUMMARY_STATISTICS is now called S_NQ_SUMMARY_ADVISOR.
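
Being on the command line means the advisor can now be driven from a script. The Python sketch below only assembles an argument list from the options in the help text above; it does not run anything. The function name, the /tmp output path and the choice of optional flags are mine, and the connection values are the placeholders from Oracle’s own usage example.

```python
def aggr_advisor_args(data_source, user, output_file, target,
                      password=None, gain_threshold=None, max_minutes=None):
    """Build an nqaggradvisor argument list using the flags shown in its help text."""
    args = ["nqaggradvisor",
            "-d", data_source,    # data source name
            "-u", user,           # user name
            "-o", output_file,    # output aggregate persistence script
            "-c", target]         # connection pool, schema, capacity in MB
    if password is not None:
        args += ["-p", password]
    if gain_threshold is not None:
        args += ["-g", str(gain_threshold)]
    if max_minutes is not None:
        args += ["-t", str(max_minutes)]
    return args


args = aggr_advisor_args(
    "AnalyticsWeb",
    "Administrator",
    "/tmp/aggr_advisor.out.txt",   # hypothetical output location
    '"DW_Aggr"."Connection Pool","DW_Aggr".."AGGR",1000',
    gain_threshold=10,
    max_minutes=30,
)
print(args)
```

Passing the list to subprocess (rather than building a shell string) avoids having to escape the double quotes in the -c tuple.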

HTML5 images

In previous versions of OBIEE graph images were rendered in Flash by default, and PNG on mobile devices. You could force it to use PNG for all images but would lose the interactivity (tooltips etc). Now in OBIEE 11.1.1.9 you can change the default from Flash to HTML5. This removes the need for a Flash plugin and is generally the way that a lot of visualisations are done on the web nowadays. To my eye there’s no difference in appearance:


To use HTML5 graphs by default, edit instanceconfig.xml and under the <Charts> section add:

<DefaultWebImageType>html5</DefaultWebImageType>

Note that html5 is case-sensitive. The config file should look something like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>  
<WebConfig xmlns="oracle.bi.presentation.services/config/v1.1">  
   <ServerInstance>  
   [...]  
        <Views>  
        [...]  
            <Charts>  
                <DefaultWebImageType>html5</DefaultWebImageType>  
            [...]  
            </Charts>  
        [...]  
        </Views>  
    [...]  
   </ServerInstance>  
</WebConfig>

If Presentation Services doesn’t come back up when you restart it after making this change then check the stdout logfile console~coreapplication_obips1~1.log as well as the standard sawlog.log file, both of which you’ll find in $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/. The reason to check the console log file is that Presentation Services will refuse to start if the configuration supplied is invalid, and you’ll see an error message stating this here.
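
Since a broken config file stops Presentation Services from starting, a quick check before the restart can save some time. The Python sketch below is an illustration under stated assumptions: it checks only that the XML is well-formed and that the setting is present, not that the file is valid against Oracle’s schema, and the sample document is a cut-down stand-in for a real instanceconfig.xml.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the WebConfig element shown above
NS = "oracle.bi.presentation.services/config/v1.1"


def default_web_image_type(config_xml):
    """Return the DefaultWebImageType under Charts, or None if it is not set.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(config_xml)
    element = root.find(".//{%s}Charts/{%s}DefaultWebImageType" % (NS, NS))
    if element is None:
        return None
    return (element.text or "").strip()


# Cut-down stand-in for instanceconfig.xml, for illustration only
sample_config = """<WebConfig xmlns="oracle.bi.presentation.services/config/v1.1">
  <ServerInstance>
    <Views>
      <Charts>
        <DefaultWebImageType>html5</DefaultWebImageType>
      </Charts>
    </Views>
  </ServerInstance>
</WebConfig>"""

value = default_web_image_type(sample_config)
# The value is case-sensitive, so compare it exactly
print(value == "html5")
```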

NQS ODBC functions

One for the Neos amongst you, a quick call of NQSGetSQLProcedures (as seen in SampleApp dashboard 7.90 NQS ODBC Procedures) and comparison with 11.1.1.7.150120 shows the following new & changed NQS ODBC calls. If this means nothing to you then it probably doesn’t need to, but if you’re interested in exploiting OBIEE functionality from all angles, documented or not, then these might be of interest. It goes without saying, these are entirely undocumented and unsupported, completely liable to change or be removed at any time by Oracle.

  • Added
    • NQSGetUserDefinedFunctions
    • NQSPAFIntegration
    • NQSSearchPresentationObjects
    • NQS_GetAllCacheEntries
    • NQS_GetOverallCacheInfo
    • NQS_GetRepositories
    • NQS_LoadNewBaseRP
    • NQS_LoadNewRPVersion
    • NQS_LockSessionAgainstAutoRPSwitchOver
    • NQS_SetRPDReadOnlyMode
    • NQS_SwitchOverThisSessionToNewRP
    • SAPurgeCacheBySubjectArea
    • SAPurgeCacheEntryByIDVector
    • SAPurgeXSACache
    • SASeedXSACache
  • Modified
    • NQSGetQueryLogExcerpt (additional parameter)
    • SAPurgeInternalCache (additional enum)
  • Removed
    • NQSChangeSelfPassword

Web Services

Web Services are one of the best ways to integrate with OBIEE programmatically. You don’t need to be building heavy Java apps just to use them – you can create and send the necessary SOAP messages from Python or even just send them from bash with curl.
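
To give a flavour of how little is needed, here is a Python sketch that builds (but does not send) a session logon request. Treat the specifics as assumptions to verify against the WSDL on your own system: the nQSessionService endpoint path, the v6 namespace and the logon element names are written from memory of earlier OBIEE 11g releases, the host and credentials are placeholders, and your server may also expect a SOAPAction header.

```python
from urllib.request import Request
from xml.sax.saxutils import escape

# Assumed envelope shape for the session service's logon operation
SOAP_TEMPLATE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:v6="urn://oracle.bi.webservices/v6">
  <soapenv:Body>
    <v6:logon>
      <v6:name>{name}</v6:name>
      <v6:password>{password}</v6:password>
    </v6:logon>
  </soapenv:Body>
</soapenv:Envelope>"""


def build_logon_request(base_url, name, password):
    """Build a POST request carrying the logon SOAP message (nothing is sent)."""
    body = SOAP_TEMPLATE.format(name=escape(name), password=escape(password))
    return Request(
        base_url + "/analytics/saw.dll?SoapImpl=nQSessionService",  # assumed path
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


# Placeholder host and credentials
request = build_logon_request("http://obiee-host:9704", "weblogic", "not-a-real-password")
print(request.get_method(), request.full_url)
```

Sending it would be a call to urllib.request.urlopen(request), with the session ID then read from the response body.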

There are 2.5 new WSDLs – two new ones (v9, v10) plus v8 which has changed. The new services are:

  • KPIAssessmentService
  • ScorecardAssessmentService
  • ScorecardMetadataService
  • UserPersonalizationService

You’ll find documentation for the Web Services in the Integrator’s Guide.

User Image Upload

Users can now upload their own images for use in Title views, conditional formats, etc. From an administration point of view this means you’ll want to keep an eye on /root/shared/custom/images/ in the Presentation Catalog, either on disk or through the OBIEE Catalog View (switch to Admin and enable “Show Hidden Items”):

QUERY_LIMIT_WARNING_INSTEAD_OF_ERROR

This new setting in NQSConfig.ini will warn users when they’re breaching defined query limits, but it won’t abort the query.

Pointless hacks

If you’re a geek like me, part of the fun of a new tool is simply poking around and seeing what’s new – not necessarily what’s useful. There’s plenty of great new stuff in 11.1.1.9, but let’s take a look at the “under the hood”, just Because Geek.

It was John Minkjan who first blogged several years ago about the xsd configuration schema files, and it is from these that we can find all the things that Presentation Services might be able to do – not just what it definitely can do, and not just what Oracle have documented that it can do. I wrote about some of these options a while back, and there are a few new ones in 11.1.1.9.

ALL OF THESE ARE COMPLETELY UNDOCUMENTED AND UNSUPPORTED. DO NOT USE THEM.

  • EnableCloudBIEEHome sets the home page of OBIEE to be as it would be on BI Cloud Service (BICS). This is completely pointless since all the interesting stuff (Load Data, Model, Manage) is non-existent, even if it does give us a clue which application deployments are going to be supplying them (bimodeler and biserviceadministration respectively)

  • GridViews/ShowDataModels outputs a bunch of debug data in Answers Table or Pivot Views:

  • VirusScannerConfiguration – When a user uploads a custom image, this command will be called with it. For example, this simple script writes to a file the time and name of the file passed to it:

    echo '---' >> /tmp/log.txt  
    date >> /tmp/log.txt  
    echo $1 >> /tmp/log.txt

    If I save this as /tmp/test-script.sh and add it to instanceconfig.xml:

    <VirusScannerConfiguration>  
       <ScannerInvocationCommandLine>/tmp/test-script.sh</ScannerInvocationCommandLine>  
    </VirusScannerConfiguration>

    When I upload an image I get a row written to my log file. That in itself isn’t useful, but it could be a handy hook maybe from an auditing point of view, or indeed, virus scanning:

    [oracle@demo tmp]$ cat /tmp/log.txt  
    ---  
    Wed May 20 16:01:47 EDT 2015  
    /app/oracle/biee/instances/instance1/tmp/OracleBIPresentationServicesComponent/coreapplication_obips1/defaultpool/sawserver_8673_5553759a_2-1.tmp
Categories: BI & Warehousing

Some Oracle Big Data Discovery Tips and Techniques

Tue, 2015-05-12 05:49

I’ve been using Oracle Big Data Discovery for a couple of months now, and one of the sessions I’m delivering at this week’s Rittman Mead BI Forum 2015 in Atlanta is on Big Data Discovery Examples from the Field, co-presented with Tim Vlamis from Vlamis Software. Tim is going to concentrate on a data analysis and visualization example using a financial/trading dataset, and I’m going to look at some of the trickier, or less obvious, aspects of the BDD development process that we’ve come across putting together PoCs and demos for customers. I’ll start first with the data ingestion and transformation part of BDD.

There are two basic ways to get data into BDD’s DGraph engine; you can either use the Data Processing CLI command-line utility to sample, ingest and enrich Hive table data into the DGraph engine, or you can use the web-based data uploader to ingest data from a spreadsheet, text file or similar. For example, to load a Hive table called “bdd_test_tweets” into the DGraph engine using the command-line, you’d enter the commands:

[oracle@bigdatalite Middleware]$ cd BDD1.0/dataprocessing/edp_cli
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t bdd_test_tweets

Big Data Discovery would then read the Hive table metastore to get the table and column names, datatypes and file location, then spin-up a Spark job to sample, enrich and then load the data into the DGraph engine. If the Hive table has fewer than 1m rows the whole dataset gets loaded in, or the dataset is sampled if the number of Hive rows is greater than 1m. The diagram below shows the basic load, profile and enrich ingestion process.

NewImage

There are a few things to bear in mind when you’re loading data into BDD in this way:

  • You can only load Hive tables, not Hive views, as the Spark loading process only works with full table definitions in the Hive metastore
  • If your Hive table uses a SerDe other than the ones that ship with Base CDH5, you’ll need to upload the SerDe into BDD’s EDP JAR file area in HDFS and update some JAR reference files before the import will work, as detailed in Chapter 3 of the Big Data Discovery Data Processing Guide doc
  • If you’ve installed BDD on a laptop or a smaller-than-usual Hadoop setup, you’ll need to make sure the SPARK_EXECUTOR_MEMORY value you set in the bdd.conf file when you installed the product can be handled by the Hadoop cluster – by default SPARK_EXECUTOR_MEMORY is set to 48G for the install, but on my single laptop install I set it to 2G (after my first install the data ingestion process didn’t work, and I had to reinstall BDD with SPARK_EXECUTOR_MEMORY = 2G as the new setting)
  • If you installed an early copy of BDD you might also need to change the OLT_HOME value in the /localdisk/Oracle/Middleware/user_projects/domains/bdd_domain/bin/setDomainEnv.sh file so that OLT_HOME="/opt/bdd/olt" instead reads OLT_HOME="/opt/bdd/edp/olt" – recent updates to the install files and installer correct this problem, but if it’s set wrong then the noun extraction part of the ingestion process won’t work, either from the CLI or from the BDD Studio Transformation screen
  • There’s also no current way to refresh or reload a BDD DGraph dataset, apart from deleting it from BDD and then re-importing it. Hopefully this, and the lack of Kerberos support, will be addressed in the next release

Another thing you might want to consider when providing datasets for use with BDD is whether you leave quotes around the column values, and whether you pre-strip out HTML tags from any text. Take for example the text file below, stored in CSV-delimited format:

NewImage

The file contains three comma-separated fields per line; one with the IP address of the requesting user, the others with the page title and page content snippet, all three fields having quotes around their values due to the presence of commas in the content text. Loading this data into Hive using the Hue web interface gives us a table with quotes around all of the fields, as Hue (in this release) doesn’t strip quotes from CSV fields.

NewImage

When I ingest this table into BDD using the Data Processing CLI, I’ve got just these three columns, still with the quotes around the fields. I can easily remove the quotes by going into the Transformation screen and using Groovy transforms to strip the first and last characters from the fields, but this is more work for the user and I don’t benefit from the automatic enrichment that BDD can do when performing the initial ingest.

NewImage

If, however, I replace the comma separator with a pipe symbol, and remove the quotes, like this:

NewImage

and therefore use Hue’s ability to use pipe and other separators instead of commas (and quotes), my Hive table looks like this:

NewImage

Now, when we ingest this table into BDD, we get six more derived attributes as the enrichment part of the ingestion process recognises the fields as containing IP addresses, text and so on. Presumably in the future BDD will have an option to ignore quotes around field values, but for now I tend to strip out the quotes and use pipes instead for my BDD ingestion files.

NewImage
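
The quote-stripping and separator swap described above can be done before the file ever reaches Hue. The following is a minimal sketch in Python, assuming a quoted, comma-separated input like the one shown earlier; the function name `csv_to_pipe` and the sample row are made up for illustration and are not part of BDD or Hue.

```python
import csv

def csv_to_pipe(lines):
    """Yield pipe-delimited, unquoted lines from quoted CSV lines."""
    for row in csv.reader(lines):
        # A literal pipe inside a field would break the new format,
        # so this sketch swaps it for a space (an assumption, not a BDD rule).
        yield "|".join(field.replace("|", " ") for field in row)

# Hypothetical sample row: IP address, page title, content snippet.
sample = ['"10.0.0.1","Hello, world","Some content, with commas"']
converted = list(csv_to_pipe(sample))
print(converted[0])
```

The `csv` module handles the commas embedded inside quoted fields, which a plain string split on "," would not.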

Similarly, with Hive tables that contain fields with HTML content you can just load those fields into BDD as-is, and BDD will generally extract nouns and keywords and create derived fields for those. And whilst you can run Groovy transformations to strip out the HTML tags (mostly), you’re then stuck with these derived columns that include HTML tag names – img, h2 and so on – in the keywords list. What I tend to do then is re-export the dataset with the content field stripped of the HTML tags, then re-ingest that table so I get a new keyword field with the HTML tags removed. What would be simpler though would be to strip out the HTML tags before you load up the Hive table, so you don’t have to do this round-trip to get rid of the HTML tag names from the noun keyword lists that are automatically generated during the ingest enrichment process.
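
As a rough illustration of stripping the tags before the Hive load, here is a small Python sketch using only the standard library’s `html.parser`. The class and function names are hypothetical, and joining text fragments with a space is a simplification that suits keyword extraction rather than faithful rendering.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    parser = TextOnly()
    parser.feed(text)
    parser.close()
    # Join fragments with a space, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())

# Hypothetical content field from a page extract.
snippet = '<h2>Title</h2><p>Some <b>bold</b> text &amp; more</p><img src="x.png">'
print(strip_html(snippet))
```

Run over the content column before the file is loaded into Hive, tag names such as img and h2 never reach the enrichment step.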

Once you’ve got datasets loaded into BDD, something I didn’t quite get the value of when I first used BDD Studio was the “scratchpad” feature. To take an example, in the masterclass session I bring in a table of tweets referencing Rittman Mead, and one of the attributes in the resulting BDD dataset is for the first hashtag mentioned in the tweet. I can select this attribute and click on the “Add to Scratchpad” link to add it into the BDD Studio scratchpad, like this:

NewImage

The scratchpad then displays above the list of attributes for that dataset, and by default it shows a horizontal bar chart listing the number of times each hashtag in the dataset is referenced.

NewImage

I could then, should I wish to, use the Refine By button to the left of the chart to filter-down (or “refine” in BDD and Endeca-speak) the chart to include just those tweets by a subset of Twitter users – in this case myself, Robin, Michael, Jerome and Edel.

NewImage

I can also add other attributes to the scratchpad – for example, the Twitter handle for the person tweeting – so that we can turn the bar chart into a stacked bar chart with the Twitter handles used to show the breakdown of use of that hashtag by each of our staff.

NewImage

You can also use these Scratchpad visualisations as the start of your main BDD Studio “Discover” dashboards, by pressing the Add to Discover page button at the bottom right-hand corner of each visualization. In this way, rather than creating your Discover dashboards from scratch each time, you can seed them with some starter graphs and data visualizations right from the dataset attribute views.

NewImage

The last bit I’m going to talk about in the BI Forum session is “dataset views”; by default, each dataset you create within a BDD project has just its own attributes within it, and if you use one of them to create a visualization in the Discovery section, you’ll not be able to use any of the attributes from your other datasets (although the faceted search feature above every BDD Studio page searches all datasets in your project and in the BDD Catalog, just like the Endeca Information Discovery “value searches” that I talked about in this older article). To use attributes from more than one dataset in a BDD Studio visualisation component you need to join them, similar to how you’d join tables in the OBIEE RPD.

To take the Tweets, Page Views and Page Content datasets I use in the BI Forum masterclass, consider a situation where I’d like to list out all of the tweets that reference our website, along with details of the page title, page category and other attributes that I can get from a second dataset that I pull from the website itself. To link these two datasets together I join them in BDD Studio using their common URL attribute (in reality I had to massage the datasets so that both URLs featured a trailing forward-slash (“/”) to make them join properly, but that’s another story).

NewImage
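
The URL massaging mentioned above comes down to making both sides of the join key agree. A minimal Python sketch follows, assuming the only difference between the two datasets’ URLs is the trailing slash; the function name and the example URLs are hypothetical.

```python
def with_trailing_slash(url):
    """Return the URL with exactly one trailing forward-slash appended if missing."""
    return url if url.endswith("/") else url + "/"

# Hypothetical values: the same page as it might appear in each dataset.
tweet_url = "http://www.example.com/2015/05/some-post"
page_url = "http://www.example.com/2015/05/some-post/"

print(tweet_url == page_url)                                            # keys differ
print(with_trailing_slash(tweet_url) == with_trailing_slash(page_url))  # keys agree
```

Applying the same function to both datasets before ingestion avoids rows silently dropping out of the join. URLs with query strings or fragments would need more care than this sketch takes.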

If I then go to the Data Views tab within the Project Settings BDD Studio screen, I can see that two data views have been set up for this join. One (“rm_linked_tweets – linked”) leads on the RM Linked Tweets dataset (the tweets) and returns the 1547 tweets in that first dataset joined to pages in the Site Content dataset; the other (“site_content – linked”) starts from the 2229 records in the Site Content dataset and joins those records to the RM Linked Tweets dataset. You can then choose which one you want to use (or “drive off”) when you add components to the Discover dashboard part.

NewImage

Where it gets interesting is when you add a third or fourth dataset to the join. The way you join in the third table affects the number of rows returned by the join; if I join the web server logs dataset (“access_per_post_cat_authors”) to the Site Content dataset the resulting three-way join view returns the 2229 rows driven by the entries in the Site Content dataset, whereas if I join the tweets dataset to the web server logs dataset directly, so the tweets dataset joins first to the Site Content dataset, and then separately to the web server logs dataset, like this:

NewImage

then the resulting data view joining all three datasets returns a row count equal to the rows in the tweets dataset driving it all.

NewImage

The best way to work this all out in your head is to do what I did, and create a series of datasets with distinct row counts and join characteristics and then just test creating joins and viewing the resulting row count using the Preview button below the list of data views. To make things even more interesting you can choose, in the Discover page properties section, whether to left-outer join, equi-join or full-outer join a “primary” dataset used for the page with any it’s joined with, in our instance determining whether the full set of tweets are filtered by the list of pages they refer to (removing tweets that reference non-existent RM web pages in this example), or whether all tweets are returned regardless.

NewImage
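
The effect of the driving dataset and the join type on row counts can be tried out away from BDD with a few toy rows. The Python sketch below uses ordinary relational join semantics, which is an assumption: BDD Studio’s own join implementation may differ in detail. The datasets, column names and the `join` helper are all made up for illustration.

```python
def join(primary, other, key, keep_unmatched):
    """Join `primary` to `other` on `key`, driving off `primary`.

    keep_unmatched=True behaves like a left-outer join,
    keep_unmatched=False like an equi-join.
    """
    lookup = {}
    for row in other:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for row in primary:
        matches = lookup.get(row[key], [])
        for match in matches:
            joined.append({**match, **row})
        if not matches and keep_unmatched:
            joined.append(dict(row))
    return joined

# Hypothetical datasets: three tweets (one pointing at a missing page), four pages.
tweets = [{"url": "/a/", "tweet": "t1"}, {"url": "/b/", "tweet": "t2"},
          {"url": "/missing/", "tweet": "t3"}]
pages = [{"url": "/a/", "title": "A"}, {"url": "/b/", "title": "B"},
         {"url": "/c/", "title": "C"}, {"url": "/d/", "title": "D"}]

tweets_led = join(tweets, pages, "url", keep_unmatched=True)
pages_led = join(pages, tweets, "url", keep_unmatched=True)
tweets_equi = join(tweets, pages, "url", keep_unmatched=False)
print(len(tweets_led), len(pages_led), len(tweets_equi))
```

With this data the tweets-led outer join returns one row per tweet, the pages-led outer join one row per page, and the equi-join only the tweets that reference a page that exists.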

It’s actually pretty powerful stuff and you should avoid the temptation to pre-create all your joined datasets in Hive beforehand rather than use BDD Studio joins, as once you get your head around the concept it’s a flexible and straightforward way to join your datasets up in whatever way makes sense for your analysis – leading off of website hits for one type of analysis, and then pages referenced in tweets in the other, allowing you to easily swap what would be the fact and dimension tables in a regular relational database report.

That’s it for now though; it’s the day before the Atlanta Rittman Mead BI Forum and I need to get myself to the venue and get set up for tomorrow’s masterclass with Jordan Meyer. To those of you that are coming along to Atlanta I look forward to seeing you, otherwise normal service will be resumed next week.

Categories: BI & Warehousing

Final Agenda for the Rittman Mead BI Forum 2015 Atlanta – Running this Week!

Mon, 2015-05-11 02:17

It’s the Monday before the Atlanta Rittman Mead BI Forum 2015, and delegates will start to arrive in Atlanta tomorrow and during the rest of the week. The first event in Brighton was excellent, and we’re hoping for something even better in Atlanta this week!

Safe travels for everyone coming to Atlanta, the official Twitter hashtag for the event is #biforum2015, and in the meantime here’s the final agenda for Atlanta’s BI Forum 2015:

Rittman Mead BI Forum 2015
Renaissance Atlanta Midtown Hotel, Atlanta, GA, USA 

Wednesday 13th May 2015

  • 9.00am – 9.30am Registration for Masterclass attendees
  • 9.30am – 12.30pm Masterclass Part 1
  • 12.30pm – 1.30pm Lunch
  • 1.30pm – 4.30pm Masterclass Part 2
  • 5.30pm – 6.30pm Drinks Reception in Renaissance Midtown Hotel Atlanta
  • 6.30pm – 7.30pm Oracle Keynote – Chris Lynskey
  • 7.30pm – 9.30pm Dinner at Renaissance Midtown Hotel Atlanta

Thursday 14th May 2015

  • 8.45am – 9.00am Welcome and Opening Comments
  • 9.00am – 9.45am Chris Lynskey (Oracle Corporation) – Looking Ahead to Oracle BI 12c and Visual Analyzer
  • 9.45am – 10.30am Robin Moffatt (Rittman Mead) – Smarter Regression Testing for OBIEE
  • 10.30am – 11.00am Coffee
  • 11.00am – 11.45am Chris Lynskey (Oracle Corporation) – Big Data Discovery
  • 11.45am – 12.30pm Mark Rittman (Rittman Mead) and Tim Vlamis (Vlamis Software Solutions) – Big Data Discovery – Examples from the Field
  • 12.30pm – 1.30pm Lunch
  • 1.30pm – 2.30pm Day 1 Debate – “Self-Service BI – The Answer to Users’ Prayers, or the Path to Madness?”
  • 2.30pm – 3.15pm Tim German / Cameron Lackpour – Hybrid Mode – An Essbase Revolution
  • 3.15pm – 3.45pm Coffee
  • 3.45pm – 4.30pm Kevin McGinley (Red Pill Analytics) – Agile BI Applications: A Case Study
  • 6.00pm – 6.45pm Guest Speaker/Keynote – John Foreman – How Mailchimp used qualitative and quantitative analysis to build their next product
  • 7.00pm – 7.45pm Depart for dinner at restaurant
  • 8.00pm – 10.00pm Dinner at external venue

Friday 15th May 2015

  • 09.00am – 09.45am Stewart Bryson (Red Pill Analytics) – Supercharge BI Delivery with Continuous Integration
  • 09.45am – 10.30am Gabby Rubin (Oracle Corporation) – Solid Standing for Analytics in the Cloud
  • 10.30am – 11.15am Hasso Schaap (Qualogy) – Developing strategic analytics applications on OBICS PaaS
  • 11.15am – 11.30am Coffee
  • 11.30am – 12.15pm Andy Rocha and Pete Tamisin (Rittman Mead) – OBIEE Can Help You Achieve Your GOOOOOOOOOALS!
  • 12.15pm – 1.00pm Christian Screen (Sierra-Cedar) – 10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss
  • 1.00pm – 1.30pm Short Lunch
  • 1.30pm – 2.30pm Data Visualization Bake-off
  • 2.30pm – 3.15pm Sumit Sarkar (Progress Software) – NoSQL and OBIEE
  • 3.15pm – 3.45pm Closing Remarks, and Best Speaker Award

If you’re interested in coming along to the Rittman Mead BI Forum 2015 in Atlanta, GA, there are still spaces available with details of the event here, and the registration form here – otherwise we’ll look forward to seeing you all at the Renaissance Atlanta Midtown Hotel later this week.

Categories: BI & Warehousing

Just Under a Week to go Until the Atlanta BI Forum 2015 – Places Still Available!

Thu, 2015-05-07 09:38

The Rittman Mead Brighton BI Forum 2015 is now underway, with presentations from Oracle, Rittman Mead, partners and customers on a range of topics around Oracle BI, DW and Big Data. So far this week we’ve had a one-day masterclass from myself and Jordan Meyer on Delivering the Oracle Information Management & Big Data Reference Architecture, sessions from Oracle on OBIEE12c, the new SampleApp for OBIEE 11.1.1.9, Big Data Discovery, BI Cloud Service and Visual Analyzer. We’ve also had sessions from the likes of Emiel van Bockel, Steve Devine, Christian Screen and others on Exalytics, data visualization, Oracle BI Apps and other topics – and a very entertaining debate on self-service BI.

NewImage

… and we’re doing it all again in Atlanta, GA next week! If you’re interested in coming along to the Rittman Mead BI Forum 2015 in Atlanta, GA, there are still spaces available with details of the event here, and the registration form here. We’re running BI Forum 2015 in the Renaissance Hotel Midtown Atlanta, the masterclass with myself and Jordan Meyer runs on the Wednesday, with the event itself kicking-off with a reception, meal and keynote from Oracle on Wednesday evening, followed by the main event itself starting Thursday morning. Hopefully we’ll see some of you there…!

 

Categories: BI & Warehousing

One Day to the Brighton Rittman Mead BI Forum 2015 – Here’s the Agenda!

Tue, 2015-05-05 15:33

It’s the night before the Brighton Rittman Mead BI Forum 2015, and some delegates are already here ready for the masterclass tomorrow. Everyone else will either be arriving later in the day for the drinks reception, Oracle Keynote and dinner, or getting here early Thursday morning ready for the event proper. Safe travels for everyone coming down to Brighton, the official Twitter hashtag for the event is #biforum2015, and in the meantime here’s the final agenda for this week’s event:

Rittman Mead BI Forum 2015
Hotel Seattle, Brighton, UK 

Wednesday 6th May 2015

  • 10.00 – 10.30 Registration for Masterclass attendees
  • 10.30 – 12.30 Masterclass Part 1
  • 13.00 – 13.30 Lunch
  • 13.30 – 16.30 Masterclass Part 2
  • 18.00 – 19.00 Drinks Reception in Hotel Seattle Bar
  • 19.00 – 20.00 Oracle Keynote – Nick Tuson & Philippe Lions
  • 20.00 – 22.00 Dinner at Hotel Seattle

Thursday 7th May 2015

  • 08.45 – 09.00 Welcome and Opening Comments
  • 09.00 – 09.45 Steve Devine (Independent) – The Art and Science of Creating Effective Data Visualisations
  • 09.45 – 10.30 Chris Royles (Oracle Corporation) – Big Data Discovery
  • 10.30 – 11.00 Coffee
  • 11.00 – 11.45 Christian Screen (Sierra-Cedar) – 10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss
  • 11.45 – 12.30 Philippe Lions and Nick Tuson (Oracle Corporation) Looking Ahead to Oracle BI 12c and Visual Analyzer
  • 12.30 – 13.30 Lunch
  • 13.30 – 14.30 Day 1 Debate – “Self-Service BI – The Answer to Users’ Prayers, or the Path to Madness?”
  • 14.30 – 15.15 Emiel van Bockel (CB) Watch and see 12c on Exalytics
  • 15.15 – 15.45 Coffee
  • 15.45 – 16.30 Philippe Lions (Oracle Corporation) – Solid Standing for Analytics in the Cloud
  • 16.30 – 17.15 Manuel Martin Marquez (C.E.R.N.) – Governed Information Discovery: Data-driven decisions for more efficient operations at CERN
  • 18.00 – 18.45 Guest Speaker/Keynote – Reiner Zimmermann (Oracle Corporation) – Hadoop or not Hadoop …. this is the question
  • 19.00 – 20.00 Depart for dinner at restaurant
  • 20.00 – 22.00 Dinner at external venue

Friday 8th May 2015

  • 09.00 – 09.45 Daniel Adams (Rittman Mead) User Experience First: Guided information and attractive dashboard design
  • 09.45 – 10.30 André Lopes (Liberty Global) A Journey into Big Data and Analytics
  • 10.30 – 11.00 Coffee 
  • 11.00 – 11.45 Antony Heljula (Peak Indicators) – Predictive BI – Using the Past to Predict the Future
  • 11.45 – 12.30 Robin Moffatt (Rittman Mead) Data Discovery and Systems Diagnostics with the ELK stack
  • 12.30 – 13.00 Short Lunch
  • 13.00 – 14.00 Data Visualization Bake-off
  • 14.00 – 14.45 Gerd Aiglstorfer (G.A. itbs GmbH) Driving OBIEE Join Semantics on Multi Star Queries as User
  • 14.45 – 15.00 Closing Remarks, and Best Speaker Award

See you all at the Hotel Seattle, Brighton, tomorrow!

Categories: BI & Warehousing

So What’s the Real Point of ODI12c for Big Data Generating Pig and Spark Mappings?

Wed, 2015-04-29 00:30

Oracle ODI12c for Big Data came out the other week, and my colleague Jérôme Françoisse put together an introductory post on the new features shortly after, covering ODI’s new ability to generate Pig and Spark transformations as well as the traditional Hive ones. You can now select Apache Pig, or Apache Spark (through pySpark, the Python API for Spark), as the implementation language for an ODI mapping, and ODI will generate one of those languages instead of HiveQL commands to run the mapping.

NewImage

How this works is that ODI12c 12.1.3.0.1 adds a bunch of new component-style KMs to the standard 12c ones, providing filter, aggregate, file load and other features that generate pySpark and Pig code rather than the usual HiveQL statement parts. Component KMs have also been added for Hive, making it possible now to include non-Hive datastores in a mapping and join them all together, something it was hard to do in earlier versions of ODI12c where the Hive IKM expected to do the table data extraction as well.

But when you first look at this you may well be tempted to think “…so what?”, in that Pig compiles down to MapReduce in the end, just like Hive does, and you probably won’t get the benefits of running Spark for just a single batch mapping doing largely set-based transformations. To my mind where this new feature gets interesting is its ability to let you take existing Pig and Spark scripts, which process data in a different, dataflow-type way compared to Hive’s set-based transformations and which potentially also use Pig and Spark-specific function libraries, and convert them to managed graphical mappings that you can orchestrate and run as part of a wider ODI integration process.

Pig, for example, has the LinkedIn-originated DataFu UDF library that makes it easy to sessionize and further transform log data, and the Piggybank community library that extends Pig’s loading and saving capabilities to additional storage formats, and provides additional basic UDFs for timestamp conversion, log parsing and so forth. We’ve used these libraries in the past to process log files from our blog’s webserver and create classification models to help predict whether a visitor will return, with the Pig script below using the DataFu and Piggybank libraries to perform these tasks easily in Pig.

register /opt/cloudera/parcels/CDH/lib/pig/datafu.jar;
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar;

DEFINE Sessionize datafu.pig.sessions.Sessionize('60m');
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.9','0.95');
DEFINE VAR datafu.pig.VAR();
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

--------------------------------------------------------------------------------
-- Import and clean logs
raw_logs = LOAD '/user/flume/rm_logs/apache_access_combined' USING TextLoader AS (line:chararray);

-- Extract individual fields
logs_base = FOREACH raw_logs
GENERATE FLATTEN
(REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')) AS
(remoteAddr: chararray, remoteLogName: chararray, user: chararray, time: chararray, request: chararray, status: chararray, bytes_string: chararray, referrer:chararray, browser: chararray);

-- Remove bots
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp|Bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*');

-- Remove useless columns and convert timestamp
clean_logs = FOREACH logs_base_nobots GENERATE CustomFormatToISO(time,'dd/MMM/yyyy:HH:mm:ss Z') as time, remoteAddr, request, status, bytes_string, referrer, browser;

--------------------------------------------------------------------------------
-- Sessionize the data

clean_logs_sessionized = FOREACH (GROUP clean_logs BY remoteAddr) {
ordered = ORDER clean_logs BY time;
GENERATE FLATTEN(Sessionize(ordered))
AS (time, remoteAddr, request, status, bytes_string, referrer, browser, sessionId);
};

-- The following steps will generate a tsv file in your home directory to download and work with in R
store clean_logs_sessionized into '/user/jmeyer/clean_logs' using PigStorage('\t','-schema');

If you know Pig (or read my previous articles on this theme), you’ll know that Pig has the concept of an “alias”, a dataset you define using filters, aggregations, projections and other operations against other aliases, with a typical Pig script starting with a large data extract and then progressively whittling it down to just the subset of data, and derived data, you’re interested in. When it comes to script execution, Pig only materializes these aliases when you tell it to store the results in permanent storage (file, Hive table etc) with the intermediate steps just being instructions on how to progressively arrive at the final result. Spark works in a similar way with its RDDs: transformations create a new dataset based off of an existing one, and the results are only materialised when you run an “action”. So let’s see if ODI12c for Big Data can create a similar dataflow, based as much as possible on the script I’ve used above.
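
To make the “nothing runs until you store it” idea concrete, here is a small Python sketch that mimics the first two steps of the Pig script with generators. It is an analogy only, not how Pig or Spark are implemented: the regular expression is the one from the script, the bot list is shortened, the sample log lines are invented, and unparseable lines are simply skipped.

```python
import re

LOG_LINE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)
BOTS = re.compile(r"spider|robot|bot|slurp|Bot|monitis|HTTrack|Uptime|FeedFetcher|dummy")
FIELDS = ("remoteAddr", "remoteLogName", "user", "time", "request",
          "status", "bytes_string", "referrer", "browser")

stats = {"lines_read": 0}  # counts how many source lines have actually been pulled

def read_lines(lines):
    for line in lines:
        stats["lines_read"] += 1
        yield line

def parse(lines):
    """Like logs_base: split each line into named fields."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            yield dict(zip(FIELDS, match.groups()))

def drop_bots(records):
    """Like logs_base_nobots: discard requests from known crawlers."""
    for record in records:
        if not BOTS.search(record["browser"]):
            yield record

# Invented sample lines in Apache combined log format.
raw = [
    '1.2.3.4 - - [20/May/2015:16:01:47 -0400] "GET /blog/ HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
    '66.249.0.1 - - [20/May/2015:16:02:10 -0400] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

pipeline = drop_bots(parse(read_lines(raw)))   # defines the flow, reads nothing yet
read_before_store = stats["lines_read"]
stored = list(pipeline)                        # the equivalent of STORE: work happens here
print(read_before_store, stats["lines_read"], len(stored))
```

Building `pipeline` only describes the steps; the source lines are read when `list()` asks for the final result, much as Pig only runs when it reaches a STORE.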

… and in fact it can. The screenshot below shows the logical mapping to implement this same Pig dataflow: the data comes into the mapping as a Hive table, an expression operator creates the equivalent of a Pig alias based off of a filtered, transformed version of the original source data using the Piggybank CustomFormatToISO UDF, and the results are then run through an ODI table function that in the background transforms the data using Pig’s GENERATE FLATTEN command and a call to the DataFu Sessionize UDF.

NewImage

And this is the physical mapping to go with the logical mapping. Note that all of the Pig transformations are contained within a separate execution unit, which contains one operator for the expression that transforms and filters the initial dataset, and another for the table function.

NewImage

The table function operator runs the input fields through an arbitrary Pig Latin script, in this case defining another alias to match the table function operator name and using the DataFu Sessionize UDF within a FOREACH to first sort the rows, and then GENERATE FLATTEN the same columns plus a session ID, grouping requests from the same IP address that fall within 60 minutes of each other (the '60m' argument passed to Sessionize in the script above).

NewImage
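
For readers who haven’t used DataFu, the sessionizing step can be sketched in a few lines of Python. This is an approximation of the idea rather than DataFu’s code: it assumes a new session starts whenever the gap between consecutive requests from the same address exceeds the timeout, and the session ID format and sample events are made up.

```python
from datetime import datetime, timedelta
from itertools import groupby

def sessionize(events, timeout=timedelta(minutes=60)):
    """Attach a sessionId to each event, grouped by remoteAddr and ordered by time."""
    ordered = sorted(events, key=lambda e: (e["remoteAddr"], e["time"]))
    result = []
    for addr, visits in groupby(ordered, key=lambda e: e["remoteAddr"]):
        session_no, previous = 0, None
        for event in visits:
            # A gap longer than the timeout closes the current session.
            if previous is not None and event["time"] - previous > timeout:
                session_no += 1
            previous = event["time"]
            result.append({**event, "sessionId": f"{addr}-{session_no}"})
    return result

# Invented events: three hits from one address, one from another.
events = [
    {"remoteAddr": "1.2.3.4", "time": datetime(2015, 5, 20, 10, 0)},
    {"remoteAddr": "1.2.3.4", "time": datetime(2015, 5, 20, 12, 0)},
    {"remoteAddr": "1.2.3.4", "time": datetime(2015, 5, 20, 10, 30)},
    {"remoteAddr": "5.6.7.8", "time": datetime(2015, 5, 20, 10, 15)},
]
sessions = [e["sessionId"] for e in sessionize(events)]
print(sessions)
```

The sort followed by a per-address pass mirrors the GROUP, ORDER and FLATTEN(Sessionize(...)) sequence in the Pig script.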

If you’re interested in the detail of how this works and other usages of the new ODI12c for Big Data KMs, then come along to the masterclass I’m running with Jordan Meyer at the Brighton and Atlanta Rittman Mead BI Forums where I’ll go into the full details as part of a live end-to-end demo. Looking at the Pig Latin that comes out of it though, you can see it more or less matches the flow of the hand-written script and implements all of the key steps.

NewImage

Finally, checking the output of the mapping I can see that the log entries have been sessionized and they’re ready to pass on to the next part of the classification model.

NewImage

So that to my mind is where the value is in ODI generating Pig and Spark mappings. It’s not so much taking an existing Hive set-based mapping and just running it using a different language, it’s more about being able to implement graphically the sorts of data flows you can create with Pig and Spark, and being able to get access to the rich UDF and data access libraries that these two languages benefit from. As I said, come along to the masterclass Jordan and I are running, and I’ll go into much more detail and show how the mapping is set up, along with other mappings to create an end-to-end Hadoop data integration process.

Categories: BI & Warehousing

Setting up Security and Access Control on a Big Data Appliance

Tue, 2015-04-28 14:05

As with all Oracle Engineered Systems, Oracle’s field servicing and Advanced Customer Services (ACS) teams go on-site once a BDA has been sold to a customer and do the racking, installation and initial setup. They will usually ask the customer a set of questions such as “do you want to enable Kerberos authentication”, “what’s the range of IP addresses you want to use for each of the network interfaces”, “what password do you want to use” and so on. It’s usually enough to get a customer going, but in practice we’ve found most customers need a number of other things set up and configured before they use the BDA in development and production; for example:

  • Integrating Cloudera Manager, Hue and other tools with the corporate LDAP directory
  • Setting up HDFS and SSH access for the development and production support team, so they can log in with their usual corporate credentials
  • Coming up with a directory layout and file placement strategy for loading data into the BDA, and then moving it around as data gets processed
  • Configuring some sort of access control to the Hive tables (and sometimes HDFS directories) that users use to get access to the Hadoop data
  • Devising a backup and recovery strategy, and thinking about DR (disaster recovery)
  • Linking the BDA to other tools and products in the Oracle Big Data and Engineered Systems family; Exalytics, for example, or setting up ODI and OBIEE to access data in the BDA

The first task we’re usually asked to do is integrate Cloudera Manager, the web-based admin console for the Hadoop parts of the BDA, with the corporate LDAP server. By doing this we can enable users to log into Cloudera Manager with their usual corporate login (and restrict access to just certain LDAP groups, and further segregate users into admin ones and stop/start/restart services-type ones), and similarly allow users to log into Hue using their regular LDAP credentials. In my experience Cloudera Manager is easier to set up than Hue, but let’s look at a high level at what’s involved.

LDAP Integration for Hue, Cloudera Manager, Hive etc

In our Rittman Mead development lab, we have OpenLDAP running on a dedicated appliance VM and a number of our team set up as LDAP users. We’ve defined four LDAP groups, two for Cloudera Manager and two for Hue, with varying degrees of access for each product.

NewImage

Setting up Cloudera Manager is pretty straightforward, using the Administration > Settings menu in the Cloudera Manager web UI (note this option is only available for the paid, Cloudera Enterprise version, not the free Cloudera Express version). Hue security integration is configured through the Hue service menu, and again you can configure the LDAP search credentials, any LDAPS or certificate setup, and then within Hue itself you can define groups to determine what Hue features each set of users can use.

NewImage

Where Hue is a bit more fiddly (last time I looked) is in controlling access to the tool itself; Cloudera Manager lets you explicitly define which LDAP groups can access the tool with other users then locked out, but Hue either allows all authenticated LDAP users to log in to the tool or makes you manually import each authorised user to grant them access (you can then either have Hue check back to the LDAP server for their password each login, or make a copy of the password and store it within Hue for later use, potentially getting out of sync with their LDAP directory password). In practice what I do is use the manual authorisation method but then have Hue link back to the LDAP server to check each user’s password, and then map their LDAP groups into Hue groups for further role-based access control. There’s a similar process for Hive and Impala too, where you can configure the services to authenticate against LDAP, and also have Hive use user impersonation so their LDAP username is passed through the ODBC or JDBC connection and queries run as that particular user.

Configuring SSH and HDFS Access and Setting-up Kerberos Authentication

Most developers working with Hadoop and the BDA will either SSH (Secure Shell) into the cluster and work directly on one of the nodes, or connect to their workstation, which has been configured as a Hadoop client for the BDA. If they SSH in directly to the cluster they’ll need Linux user accounts there, and if they go in via their workstation the Hadoop client installed there will grant them access as the user they’re logged into the workstation as. On the BDA you can either set up user accounts on each BDA node separately, or more likely configure user authentication to connect to the corporate LDAP and check credentials there.

One thing you should definitely do, either when your BDA is initially set up by Oracle or later on post-install, is configure your Hadoop cluster as a secure cluster using Kerberos authentication. Hadoop normally trusts that each user accessing Hadoop services via the Hadoop Filesystem API (FS API) is who they say they are. Using the example above, I could easily set up an “oracle” user on my workstation and then access all Hadoop services on the main cluster without the Hadoop FS API actually checking that I am who I say I am – in other words the Hadoop FS API shell doesn’t check my password, it merely runs a “whoami” Linux command to determine my username and grants me access as that user.

The way to address this is to configure the cluster for Kerberos authentication, so that users have to have a valid Kerberos ticket before accessing any secured services (Hive, HDFS etc) on the cluster. I covered this as part of an article on configuring OBIEE11g to connect to Kerberos-secured Hadoop clusters last Christmas and you can either do it as part of the BDA install, or later on using a wizard in more recent versions of CDH5, the Cloudera Hadoop distribution that the BDA uses.

The complication with Kerberos authentication is that your organization needs to have a Kerberos KDC (Key Distribution Center) server set up already, which will then link to your corporate LDAP or Active Directory service to check user credentials when they request a Kerberos ticket. The BDA installation routine gives you the option of creating a KDC as part of the BDA setup, but that’s only really useful for securing inter-cluster connections between services as it won’t be checking back to your corporate directory. Ideally you’d set up a connection to an existing, well-tested and well-understood Kerberos KDC server and secure things that way – but beware that not all Oracle and other tools that run on the BDA are set up for Kerberos authentication. OBIEE and ODI are, for example, but the current 1.0 version of Big Data Discovery doesn’t yet support Kerberos-secured clusters.

Coming-up with the HDFS Directory Layout

It’s tempting with Hadoop to just have a free-for-all with the Hadoop HDFS filesystem setup, maybe restricting users to their own home directory but otherwise letting them put files anywhere. HDFS file data for Hive tables typically goes in Hive’s own filesystem area /user/hive/warehouse, but users can of course create Hive tables over external data files stored in their own part of the filesystem.

What we tend to do (inspired by Gwen Shapira’s “Scaling ETL with Hadoop” presentation) is create separate areas for incoming data, ETL processing data and process output data, with developers then told to put shared datasets in these directories rather than their own. I generally create additional Linux users for each of these directories so that these can own the HDFS files and directories rather than individual users, and then I can control access to these directories using HDFS’s POSIX permissions. A typical user setup script might look like this:

[oracle@bigdatalite ~]$ cat create_mclass_users.sh 
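# Create the Linux groups: one for the team, one for the application
sudo groupadd bigdatarm
sudo groupadd rm_website_analysis_grp
# Create the individual users, plus the application user that owns the shared data
sudo useradd mrittman -g bigdatarm
sudo useradd ryeardley -g bigdatarm
sudo useradd mpatel -g bigdatarm
sudo useradd bsteingrimsson -g bigdatarm
sudo useradd spoitnis -g bigdatarm
sudo useradd rm_website_analysis -g rm_website_analysis_grp
# Set initial passwords (placeholder values)
echo mrittman:welcome1 | sudo chpasswd
echo ryeardley:welcome1 | sudo chpasswd
echo mpatel:welcome1 | sudo chpasswd
echo bsteingrimsson:welcome1 | sudo chpasswd
echo spoitnis:welcome1 | sudo chpasswd
echo rm_website_analysis:welcome1 | sudo chpasswd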

whilst a script to set up the directories for these users, and the application user, might look like this:

[oracle@bigdatalite ~]$ cat create_hdfs_directories.sh 
set -x  # echo each command as it runs
#setup individual user HDFS directories, and scratchpad areas
sudo -u hdfs hadoop fs -mkdir /user/mrittman
sudo -u hdfs hadoop fs -mkdir /user/mrittman/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/ryeardley
sudo -u hdfs hadoop fs -mkdir /user/ryeardley/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/mpatel
sudo -u hdfs hadoop fs -mkdir /user/mpatel/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/bsteingrimsson
sudo -u hdfs hadoop fs -mkdir /user/bsteingrimsson/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/spoitnis
sudo -u hdfs hadoop fs -mkdir /user/spoitnis/scratchpad
 
#setup etl directories
sudo -u hdfs hadoop fs -mkdir -p /data/rm_website_analysis/logfiles/incoming
sudo -u hdfs hadoop fs -mkdir /data/rm_website_analysis/logfiles/archive/
sudo -u hdfs hadoop fs -mkdir -p /data/rm_website_analysis/tweets/incoming
sudo -u hdfs hadoop fs -mkdir /data/rm_website_analysis/tweets/archive
 
#change ownership of user directories
sudo -u hdfs hadoop fs -chown -R mrittman /user/mrittman
sudo -u hdfs hadoop fs -chown -R ryeardley /user/ryeardley
sudo -u hdfs hadoop fs -chown -R mpatel /user/mpatel
sudo -u hdfs hadoop fs -chown -R bsteingrimsson /user/bsteingrimsson
sudo -u hdfs hadoop fs -chown -R spoitnis /user/spoitnis
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/mrittman
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/ryeardley
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/mpatel
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/bsteingrimsson
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/spoitnis
 
#change ownership of shared directories
sudo -u hdfs hadoop fs -chown -R rm_website_analysis /data/rm_website_analysis
sudo -u hdfs hadoop fs -chgrp -R rm_website_analysis_grp /data/rm_website_analysis

This gives you a directory structure like this (with the directories for Hive, Impala, HBase etc removed for clarity):

NewImage
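The per-user commands in the directory script above all follow one pattern, so they can be generated rather than typed out by hand. Here is a minimal sketch in Python that only builds the command strings (a dry run) and never calls Hadoop itself. The function name hdfs_user_commands is made up for this example; the user and group names are the ones from the scripts above, and the sketch assumes that combining -mkdir -p and an owner:group -chown is acceptable in place of the separate calls.

```python
def hdfs_user_commands(users, group):
    """Build the 'hadoop fs' commands that give each user a home
    directory and a scratchpad, owned by that user and the given group."""
    prefix = "sudo -u hdfs hadoop fs"
    commands = []
    for user in users:
        home = f"/user/{user}"
        # -p creates the home directory and the scratchpad in one call
        commands.append(f"{prefix} -mkdir -p {home}/scratchpad")
        # chown accepts owner:group, so no separate chgrp call is needed
        commands.append(f"{prefix} -chown -R {user}:{group} {home}")
    return commands


if __name__ == "__main__":
    team = ["mrittman", "ryeardley", "mpatel", "bsteingrimsson", "spoitnis"]
    for command in hdfs_user_commands(team, "bigdatarm"):
        print(command)
```

Printing the commands first makes it easy to review them before pasting them into a shell script like the one above.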

In terms of Hive and Impala data, there are varying opinions on whether to create tables as EXTERNAL and store the data (including sub-directories for table partitions) in the /data/ HDFS area, or let Hive store them in its own /user/hive/warehouse area. I tend to let Hive store them within its area, as I then use Apache Sentry to control access to those tables’ data.

Setting up Access Control for HDFS, Hive and Impala Data

At its simplest level, access control can be set up on the HDFS directory structure by using HDFS’s POSIX security model:

  • Each HDFS file or directory has an owner, and a group
  • You can add individual Linux users to a group, but an HDFS object can only have one group owning it

What this means in practice, though, is that you have to jump through quite a few hoops to set up finer-grained access control to these HDFS objects. What we tend to do is give RW access to the /data/ directory and subdirectories to the application user account (rm_website_analysis in this case), and RO access to that user’s associated group (rm_website_analysis_grp). If users then want access to that application’s data we add them to the relevant application group; a user can belong to more than one group, making it possible to grant access to more than one application data area:

[oracle@bigdatalite ~]$ cat ./set_hdfs_directory_permissions.sh 
sudo -u hdfs hadoop fs -chmod -R 750 /data/rm_website_analysis
# -a appends the group; without it, -G replaces the user's other supplementary groups
usermod -a -G rm_website_analysis_grp mrittman

This makes it possible for the main application owner to write data to the directory, while group members only have read access. What you can also now do with more recent versions of Hadoop (CDH5.3 onwards, for example) is define access control lists to go with individual HDFS objects, but this feature isn’t enabled by default as it consumes more namenode memory than the traditional POSIX approach. What I prefer to do though is control access by restricting users to only accessing Hive and Impala tables, and using Apache Sentry, or Oracle Big Data SQL, to provide role-based access control over them.
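To make the 750 mode used in the chmod command concrete, here is a small sketch that decodes an octal mode with Python’s standard stat module. The helper names describe_mode and can_write are made up for this example, and it only interprets the permission bits; it doesn’t talk to HDFS.

```python
import stat


def describe_mode(octal_mode, is_dir=True):
    """Render an octal mode such as '750' in 'ls -l' style."""
    mode = int(octal_mode, 8)
    mode |= stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(mode)


def can_write(octal_mode, who):
    """Report whether 'owner', 'group' or 'other' has the write bit set."""
    write_bits = {
        "owner": stat.S_IWUSR,
        "group": stat.S_IWGRP,
        "other": stat.S_IWOTH,
    }
    return bool(int(octal_mode, 8) & write_bits[who])


if __name__ == "__main__":
    print(describe_mode("750"))        # drwxr-x---
    print(can_write("750", "owner"))   # True
    print(can_write("750", "group"))   # False
```

So with 750 the owning application user can read, write and traverse the directory, members of the owning group can read and traverse it but not write, and everyone else gets nothing.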

Apache Sentry is a project originally started by Cloudera and then adopted by the Apache Foundation as an incubating project. It aims to provide four main authorisation features over Hive, Impala (and more recently, the underlying HDFS directories and datafiles):

  • Secure authorisation, with LDAP integration and Kerberos prerequisites for Sentry enablement
  • Fine-grained authorisation down to the column-level, with this feature provided by granting access to views containing subsets of columns at this point
  • Role-based authorisation, with different Sentry roles having different permissions on individual Hive and Impala tables
  • Multi-tenant administration, with a central point of administration for Sentry permissions

As this Cloudera presentation on Sentry on Slideshare shows, Sentry inserts itself into the query execution process and checks access rights before allowing the rest of the Hive query to execute. Sentry is configured through security policy files, or through a new web-based interface introduced with recent versions of CDH5.

The other option for customers using Oracle Exadata, Oracle Big Data Appliance and Oracle Big Data SQL is to use the Oracle Database’s access control mechanisms to govern access to Hive (and Oracle) data, and also set up fine-grained access control (VPD), data masking and redaction to create a more “enterprise” access control system.

So these are typical tasks we perform when on-boarding an Oracle BDA for a customer. If this is of interest to you and you can make it to either Brighton, UK next week or Atlanta, GA the week after, I’ll be covering this topic at the Rittman Mead BI Forum 2015 as part of the one-day masterclass with Jordan Meyer on the Wednesday of each week, along with topics such as creating ETL data flows using Oracle Data Integrator for Big Data, using Oracle Big Data Discovery for faceted search and cataloging of the data reservoir, and reporting on Hadoop and NoSQL data using Oracle Business Intelligence 11g. Spaces are still available so register now if you’d like to hear more on this topic.

Categories: BI & Warehousing

Last Chance to Register for the Brighton Rittman Mead BI Forum 2015!

Tue, 2015-04-28 05:32

It’s just a week to go until the start of the Brighton Rittman Mead BI Forum 2015, with the optional one-day masterclass starting on Wednesday, May 6th at 10am and the event opening with a reception and Oracle keynote later in the evening. Spaces are still available if you want to book now, but we can’t guarantee places past this Friday so register now if you’re planning to attend.

As a reminder, here are some earlier blog posts and articles about events going on at the Brighton event, and at the Atlanta event the week after:

We’re also running our first “Data Visualisation Challenge” at both events, where we’re asking attendees to create their most impressive and innovative data visualisation within OBIEE using the Donors Choose dataset, with the rule being that you can use any OBIEE or related technology as long as the visualisation runs with OBIEE and can respond to dashboard prompt controls. We’re also opening it up to OBIEE running as part of Oracle BI Cloud Service (BICS), so if you want to give Visual Analyser a spin within BICS we’d be interested in seeing the results.

Registration is still open for the Atlanta BI Forum event too, running the week after Brighton on the 13th-15th May 2015 at the Renaissance Atlanta Midtown hotel. Full details of both events are on the event homepage, with the registration links for Brighton and Atlanta given below.

  • Rittman Mead BI Forum 2015, Brighton – May 6th – 8th 2015

We look forward to seeing you all in Brighton next week, or Atlanta the week after – but remember to book soon, before we close registration!

Categories: BI & Warehousing

Using the ELK Stack to Analyse Donor’s Choose Data

Sat, 2015-04-25 13:35

Donor’s Choose is an online charity in America through which teachers can post details of projects that need funding and donors can give money towards them. The data from the charity since it began in 2000 is available to download freely here in several CSV datasets. In this article I’m going to show how to use the ELK stack of data discovery tools from Elastic to easily import some data (the donations dataset) and quickly start analysing it to produce results such as this one:

I’m assuming you’ve downloaded and unzipped Elasticsearch, Logstash and Kibana and made Java available if not already. I did this on a Mac, but the tools are cross-platform and should work just the same on Windows and Linux. I’d also recommend installing Kopf, which is an excellent plugin for the management of Elasticsearch.

CSV Data Ingest with Logstash

First off we’re going to get the data into Elasticsearch using Logstash, after which we can do some analysis using Kibana.

Importing the data with Logstash requires a configuration file, which in this case is pretty straightforward. We’ll use the file input plugin, process it with the csv filter, set the date of the event to the donation timestamp (rather than now), cast a few fields to numeric, and then output it using the elasticsearch plugin. See the inline comments for an explanation of each step:

input {  
    file {  
        # This is necessary to ensure that the file is  
        # processed in full. Without it logstash will default  
        # to only processing new entries to the file (as would  
        # be seen with a logfile for a live application, but  
        # not static data like we're working with here)  
        start_position  => beginning  
        # This is the full path to the file to process.  
        # Wildcards are valid.  
        path =>  ["/hdd/ELK/data/opendata/opendata_donations.csv"]  
    }
}

filter {  
        # Process the input using the csv filter.  
        # The list of column names I took manually from the  
        # file itself  
        csv {separator => ","  
                columns => ["_donationid","_projectid","_donor_acctid","_cartid","donor_city","donor_state","donor_zip","is_teacher_acct","donation_timestamp","donation_to_project","donation_optional_support","donation_total","dollar_amount","donation_included_optional_support","payment_method","payment_included_acct_credit","payment_included_campaign_gift_card","payment_included_web_purchased_gift_card","payment_was_promo_matched","via_giving_page","for_honoree","donation_message"]}

        # Store the date of the donation (rather than now) as the  
        # event's timestamp  
        # 
        # Note that the data in the file uses formats both with and  
        # without the milliseconds, so both formats are supplied  
        # here.  
        # Additional formats can be specified using the Joda syntax  
        # (http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html)  
        date { match => ["donation_timestamp", "yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd HH:mm:ss"]}  
        # ------------
        # Cast the numeric fields to float (not mandatory but makes for additional analysis potential)
        mutate {
                convert => ["donation_optional_support","float"]
                convert => ["donation_to_project","float"]
                convert => ["donation_total","float"]
        }
}

output {  
        # Now send it to Elasticsearch which here is running  
        # on the same machine.  
        elasticsearch { host => "localhost" index => "opendata" index_type => "donations"}  
        }
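The date filter in the configuration above is given two patterns because the file mixes timestamps with and without milliseconds. As a rough illustration of that “try each pattern in turn” idea (this is a Python sketch, not Logstash’s own code), the same fallback can be written with the standard library. The function name and the sample timestamps are made up for this example.

```python
from datetime import datetime

# Python equivalents of the two Joda patterns used in the config above
FORMATS = ["%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"]


def parse_donation_timestamp(text):
    """Try each format in order and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {text!r}")


if __name__ == "__main__":
    # Illustrative values only, not taken from the dataset
    for sample in ["2011-12-25 09:30:15.123", "2011-12-25 09:30:15"]:
        print(sample, "->", parse_donation_timestamp(sample).isoformat())
```

If neither pattern matches, the sketch raises an error rather than silently falling back to the current time, which makes bad rows easy to spot when checking a file before loading it.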

With the configuration file created, we can now run the import:

./logstash-1.5.0.rc2/bin/logstash agent -f ./logstash-opendata-donations.conf

This will take a few minutes, during which your machine CPU will rocket as logstash processes all the records. Since logstash was originally designed for ingesting logfiles as they’re created it doesn’t actually exit after finishing processing the file, but you’ll notice your machine’s CPU return to normal, at which point you can hit Ctrl-C to kill logstash.

If you’ve installed Kopf then you can see at a glance how much data has been loaded:

Or alternatively query the index using Elasticsearch’s API directly:

curl -XGET 'http://localhost:9200/opendata/_status?pretty=true'

[...]  
    "opendata" : {  
      "index" : {  
        "primary_size_in_bytes" : 3679712363,  
      },  
[...]  
      "docs" : {  
        "num_docs" : 2608803,

Note that Elasticsearch will take more space than the source data (in total the 1.2GB dataset ends up taking c. 5GB).
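The raw numbers in the status output are easier to read once converted. A quick sketch of the arithmetic, using the primary_size_in_bytes and num_docs values shown above (the function name is made up for this example, and it assumes decimal gigabytes):

```python
def summarise_index(primary_size_in_bytes, num_docs):
    """Return the primary size in decimal gigabytes and the
    average number of bytes stored per document."""
    gigabytes = primary_size_in_bytes / 1e9
    bytes_per_doc = primary_size_in_bytes / num_docs
    return gigabytes, bytes_per_doc


if __name__ == "__main__":
    size_gb, per_doc = summarise_index(3679712363, 2608803)
    print(f"{size_gb:.2f} GB primary, about {per_doc:.0f} bytes per document")
```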

Data Exploration with Kibana

Now we can go to Kibana and start to analyse the data. From the Settings page of Kibana add the opendata index that we’ve just created:

Go to Discover and if necessary click the cog icon in the top right to set the index to opendata. The time filter defaults to the last 15 minutes only, and if your logstash has done its job right the events should have the timestamp of the actual donation, so you need to click on the time filter in the very top right of the screen to change the time period to, for example, Previous year. Now you should see a bunch of data:

Click the toggle on one of the events to see the full data for it, including things like the donation amount, the message with the donation, and geographical details of the donor. You can find details of all the fields on the Donor’s Choose website here.

Click on the fields on the left to see a summary of the data within, showing very easily that within that time frame and sample of 500 records:

  • two-thirds of donations were in the 10-100 dollar range
  • four-fifths included the optional donation towards the running costs of Donor’s Choose.

You can add fields into the table itself (which by default just shows the complete row of data) by clicking on add for the fields you want:

Let’s save this view (known as a “Search”), since it can be used on a Dashboard later:

Data Visualisation with Kibana

One of my favourite features of Kibana is its ability to aggregate data at various dimensions and grains with ridiculous ease. Here’s an example: (click to open full size)

Now let’s amend that chart to show the method of donation, or the donation amount range, or both: (click to open full size)

You can also change the aggregation from the default “Count” (in this case, number of donations) to other aggregations including sum, median, min, max, etc. Here we can compare cheque (check) vs paypal as a payment method in terms of amount given:

Kibana Dashboards

Now let’s bring the visualisations together along with the data table we saw in the Discover tab. Click on Dashboard, and then the + icon:

Select the visualisations that you’ve created, and then switch to the Searches tab and add in the one that you saved earlier. You’ve now got a data table showing all currently selected data, along with various summaries on it.

You can rearrange the dashboard by dragging each box around to suit. Once you’ve got the elements of the dashboard in place you can start to drill into your data further. To zoom in on a time period click and drag a selection over it, and to filter on a particular data item (for example, state in the “Top ten states” visualisation) click on it and accept the prompt at the top of the screen. You can also use the freetext search at the top of the screen (this is valid on the Discover and Visualize pages too) to search across the dataset, or within a given field.

Example Analysis

Let’s look at some actual data analyses now. One of the simplest is the amount given in donations over time, split by amount given to project and also as the optional support amount:

One of the nice things about Kibana is the ability to quickly change resolution in a graph’s time frame. By default a bar chart will use an “Auto” granularity on the time axis, updating as you zoom in and out so that you always see an appropriate level of aggregation. This can be overridden to show, for example, year-on-year changes:

You can also easily switch the layout of the chart, for example to show the percentage of the two aggregations relative to each other. So whilst the above chart shows the optional support amount increasing by the year, it’s actually remaining pretty much the same when taken as a percentage of the donations overall – which makes a lot of sense if you look into the definition of the field (“we encourage donors to dedicate 15% of each donation to support the work that we do.”).

Analysis based on text in the data is easy. You can use the Terms sub-aggregation; here we can see the top five states in terms of donation amount, with California consistently at the top of the table.

Since the Terms sub-aggregation shows the Top-x only, you can’t necessarily judge the importance of those values in relation to the rest of the data. To do this more specific analysis you can use the Filters sub-aggregation to use free-form searches to create buckets, such as here to look at how much those from NY and CA donated, vs all other states. The syntax is field:value to include it, and -field:value to negate it. You can string these expressions together using AND and OR.
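To show how these filter expressions fit together, here is a small sketch that builds the three bucket queries for the NY/CA comparison as plain strings. The helper names term, all_of and any_of are made up for this example; donor_state is the column name from the Logstash configuration earlier. The resulting strings are what you would type into the Filters boxes by hand.

```python
def term(field, value, negate=False):
    """Build 'field:value', or '-field:value' to exclude matches."""
    prefix = "-" if negate else ""
    return f"{prefix}{field}:{value}"


def all_of(*expressions):
    """Combine expressions so that every one of them must match."""
    return " AND ".join(expressions)


def any_of(*expressions):
    """Combine expressions so that at least one of them must match."""
    return " OR ".join(expressions)


if __name__ == "__main__":
    states = ["NY", "CA"]
    # One bucket per named state...
    buckets = [term("donor_state", s) for s in states]
    # ...and one bucket for everything that is in neither state
    buckets.append(all_of(*(term("donor_state", s, negate=True) for s in states)))
    for query in buckets:
        print(query)
```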

A lot of the analysis generally sits well in the bar chart visualisation, but the line chart has a role to play too. Donations are grouped according to the value range (<10, between 10 and 100, > 100), and these plot out nicely when considering the number of donations made (rather than total value). Whilst the total donation in a time period is significant, so is the engagement with the donors, hence the number of donations made is important to analyse:
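The grouping into value ranges amounts to a simple bucketing step followed by a count. A minimal sketch, assuming that amounts of exactly 10 and exactly 100 fall into the middle range (my reading of the ranges above, not something the dataset documentation confirms); the function names, labels and sample amounts are made up for this example.

```python
from collections import Counter


def value_range(amount):
    """Assign a donation amount to one of the three value ranges."""
    if amount < 10:
        return "under 10"
    if amount <= 100:
        return "10 to 100"
    return "over 100"


def count_by_range(amounts):
    """Count how many donations fall into each value range."""
    return Counter(value_range(a) for a in amounts)


if __name__ == "__main__":
    sample_amounts = [5, 10, 25.0, 100, 250]  # illustrative values only
    for label, count in count_by_range(sample_amounts).items():
        print(label, count)
```

Counting donations per range, rather than summing their value, is what the line chart plots.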

As well as splitting lines and bars, you can split charts themselves, which works well when you want to start comparing multiple dimensions without cluttering up a single chart. Here’s the same chart as previously but split out with one line per instance. Arguably it’s clearer to understand, and the relative values of the three items can be better seen here than in the clutter of the previous chart:

Following on from this previous graph, I’m interested in the spike in mid-value ($10-$100) donations at the end of 2011. Let’s pull the graph onto a dashboard and dig into it a bit. I’ve saved the visualisation and brought it in with the saved Search (from the Discover page earlier) and an additional visualisation showing payment methods for the donations:

Now I can click and drag the time frame to isolate the data of interest and we see that the number of donations jumps eight-fold at this point:

Clicking on one of the data points drills into it, and we eventually see that the spike was attributable to the use of campaign gift cards, presumably issued with a value > $10 and < $100.

Limitations

The simplicity described in this article comes at a cost, or rather, has its limits. You may well notice fields in the input data such as “_projectid”, and if you wanted to relate a donation to a given project you’d need to go and look that project code up manually. There’s no (easy) way of doing this in Elasticsearch – whilst you can easily bring in all the project data too and search on projectid, you can’t (easily) display the two (project and donation) alongside each other. That’s because Elasticsearch is a document store, not a relational database. There are some options discussed on the Elasticsearch blog for handling this, none of which to my mind are applicable to this kind of data discovery (but Elasticsearch is used in a variety of applications, not just as a data store for Kibana, so in other cases it is more relevant). Given that, if you wanted to resolve this relationship you’d have to go about it a different way, maybe using the Linux join command to pre-process the files and denormalise them prior to ingest with logstash. At this point you reach the “right tool/right job” decision – ELK is great, but not for everything :-)
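As an illustration of that pre-processing idea, here is a sketch of the denormalising step in Python rather than with the join command. It assumes both files have a header row and share the _projectid column; the project file’s title column and all the sample rows are hypothetical, and the function name denormalise is made up for this example.

```python
import csv
import io


def denormalise(donations_file, projects_file, key="_projectid"):
    """Yield each donation row with the matching project's columns
    added, prefixed with 'project_'. Donations with no matching
    project are passed through unchanged."""
    projects = {row[key]: row for row in csv.DictReader(projects_file)}
    for donation in csv.DictReader(donations_file):
        merged = dict(donation)
        for column, value in projects.get(donation[key], {}).items():
            if column != key:
                merged[f"project_{column}"] = value
        yield merged


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real CSV files
    donations = io.StringIO(
        "_donationid,_projectid,donation_total\n"
        "d1,p1,25.00\n"
        "d2,p9,10.00\n"
    )
    projects = io.StringIO("_projectid,title\np1,Classroom books\n")
    for row in denormalise(donations, projects):
        print(row)
```

The project lookup is held in memory, which is a simplification; for very large project files a sorted merge, as the join command does, would be the safer choice.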

Reprocessing

If you need to reload the data (for example, when building this I reprocessed the file in order to define the numerics as such, rather than the default string), you need to:

  • Drop the Elasticsearch data:
    curl -XDELETE 'http://localhost:9200/opendata'
  • Remove the “sincedb” file that logstash uses to record where it last read from in a file (useful for tailing changing input files; not so for us with a static input file)
    rm ~/.sincedb*

    (better here would be to define a bespoke sincedb path in the file input parameters so we could delete a specific sincedb file without impacting other logstash processing that may be using sincedb in the same path)
  • Rerun the logstash as above

 

Categories: BI & Warehousing

BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack

Fri, 2015-04-24 06:18

I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).

OBIEE Regression Testing

In Atlanta I’ll be talking about Smarter Regression testing for OBIEE. The topic of Regression Testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need to have a good way of rapidly testing changes made. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.

In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.

If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.

Data Discovery with the ELK Stack

My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets with data originating in Hadoop, and more recently have been analysing twitter activity using it. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:

The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:

By its nature of log streaming and full text search, ELK also lends itself well to near real time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.

The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!

Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.

Categories: BI & Warehousing