BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.2 : Hive Table Joins, Aggregation and Loading

Rittman Mead Consulting - Tue, 2014-06-10 02:49

In this series of posts, I’m going to be looking at an end-to-end ETL process on Oracle Big Data Appliance, using Hadoop technologies to do the data manipulation and Oracle Data Integrator 12c to orchestrate the process. Over the five posts, I’ll be landing web server log data on the Hadoop cluster using Flume, loading that data into a Hive table, transforming and enhancing the dataset, and then loading the final dataset into an Oracle database using one of the Oracle Big Data Connectors. The first post in the list below has a schematic for the overall process, and over the rest of the week I’ll be adding the links in for the remaining posts in the series.

So in today’s step, I want to take the initial Hive table containing the full set of log data, and then do three things: first, extract the page details out of the “request” column in the incoming Hive table; then join that to some reference data, also held on the BDA, that’ll allow me to add details of the post author and title of the post; and finally aggregate and project just certain columns from the dataset, so I’ve got a Hive table of just page accesses by IP address, with the author and title details.

In fact this is the easiest out of the five steps to set up and bring together, as we’re just using basic ODI functionality, albeit with Hive tables rather than regular database tables. In the screenshot below you can see the incoming weblog Hive table, with the individual columns split out in the previous step via the RegEx SerDe; I just join it to the table containing post and author details, apply a distinct on the join output, and then load the results into a third Hive table.

NewImage

The join between the two tables is a bit complex, because I need to extract just the URL details out of the request field (which also contains the transport method and other metadata), but it’s no different than joining two regular database tables. Also, I did need to check the Generate ANSI Syntax checkbox, as Hive expects table joins to be in this format rather than the “Oracle” format.

NewImage

Going over to the Physical part of the mapping, I set the KM to IKM Hive Control Append, turn on the feature to enable table truncation prior to load, and I’m ready to go.

NewImage

Then, when I execute the mapping, I can see its progress in the Operator navigator, and the code view shows me the HiveQL commands that the KM has generated to move data around the BDA.

NewImage
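The generated code itself isn’t reproduced here, but the net effect of an IKM Hive Control Append load with truncation enabled is broadly the HiveQL below – a sketch only, with the table and column names made up for illustration rather than taken from the actual mapping:

INSERT OVERWRITE TABLE log_page_accesses
SELECT DISTINCT l.host,
                l.page_url,
                p.author,
                p.title
FROM   ( SELECT host,
                regexp_extract(request, '\\S+\\s+(\\S+)\\s+\\S+', 1) AS page_url
         FROM   apache_access_combined ) l
JOIN   blog_posts p
  ON   l.page_url = p.post_url;

The inner query pulls just the URL out of the request field (which also holds the HTTP method and protocol), the join adds the author and title reference data, and the DISTINCT collapses the output down before it’s written into the target Hive table.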

So far so good, and as I said, this was the easy bit. For the next transformation I want to bring in some additional reference data from an Oracle database, so the question for me is whether I need to land that data into the BDA before doing the transformation, or whether I can create a join between the Oracle table and my Hive table? Check back tomorrow for the next installment…

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.1 : Flume to Initial Hive Table

Rittman Mead Consulting - Mon, 2014-06-09 07:18

A few months ago I posted an article on the ODI12c examples in the new Oracle Big Data Appliance, and over the past few weeks I’ve been writing about the various components within the Cloudera CDH Hadoop stack, including Hive, Pig, Spark and Flume. Since then I’ve built out a virtualized example of an Oracle Big Data Appliance using the Oracle Big Data 3.0 software set, and I thought it’d be interesting to create an example ETL flow through the system showing how ODI could be used to initiate and control the process. As with any situation where you actually build a demo to do something specific, as opposed to just playing around with the technology, you end up uncovering a few quirks and techniques that you wouldn’t have otherwise been aware of, so I’ll spend this week going through the various steps and calling out things others in the same situation might find useful – the steps I’ll go through are below, and I’ll add the links as the articles get published during the week:

As an overview, what I’ll be working with is a six-node Hadoop cluster running Cloudera CDH5, Oracle Big Data Connectors 3.0 and Oracle Data Integrator 12c, as you’d get with the current version of Oracle Big Data Appliance. Obviously BDA goes up to eighteen nodes as a maximum, and under the covers there’s lots more memory and much better networking than I was able to set up on VMWare ESXi, but what I’ve got is enough to prove the process. The diagram below shows the six nodes, and where the software is installed.

NewImage

I took a couple of short-cuts with the setup; obviously each node has a lot less RAM than BDA’s 64GB per node, but the main node in the cluster (bdanode1) running the HDFS NameNode and all the ODI software got bumped-up to 28GB, with the next two allocated 12GB and the others 8GB – enough to work through a demo at least. I also ran ODI Studio on bdanode1 itself, instead of setting it up on a separate client VM, mainly to avoid having to set up all the Hadoop and Hive libraries on another machine. Other than that though, it’s the same CDH distribution you get on BDA, the same version of ODI and the Big Data Connectors, and so on, with the following software versions and downloads being used:

Setup of the Hadoop cluster is really out of scope for these articles, except to say that with CDH5, I find it easier to select the (non-default, deprecated) Packages install type rather than the new Parcels type, as the newer method installs all of the Hadoop software on each node in a new place – /opt/cloudera – rather than the usual /usr/lib/hadoop, /usr/lib/hive and so forth, meaning that most configuration examples you’ll read point to the wrong place for your install. Parcels are Cloudera’s way forward for delivering software components (there are advantages in terms of patching across the cluster), but if you’re getting started with the Hadoop platform, installing in the old location usually makes things easier to follow. Other than that, the two bits of configuration you need to do are, firstly, to tell ODI Studio where to find the various Hadoop libraries and configuration files; because I’ve installed Studio directly on the main Hadoop node, I could then just add that node’s file locations to the Oracle user’s $HOME/.odi/oracledi/userlib/additional_path.txt file, so that it looked like this:
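The original listing isn’t reproduced here, but for a packages-based CDH5 install the file simply lists the standard Hadoop and Hive client library and configuration locations, one entry per line – something along these lines (treat the exact paths as illustrative, as they’ll vary with your install):

/usr/lib/hadoop/*
/usr/lib/hadoop/client/*
/usr/lib/hive/lib/*
/etc/hadoop/conf
/etc/hive/conf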

The second thing to do, in order to make use of Oracle Loader for Hadoop – one of the Oracle Big Data Connectors, and something we’ll use at the end to bulk-unload data from Hadoop to an Oracle database – is to set a couple of environment variables in the “oracle” user’s .bashrc profile file, pointing to where OLH is installed and where the Hadoop and Hive libraries and configuration files are:

[oracle@bdanode1 userlib]$ cat $HOME/.bashrc
# .bashrc
 
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
 
# User specific aliases and functions
export HADOOP_CLASSPATH=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/*:/etc/hive/conf
export OLH_HOME=/home/oracle/oracle/product/oraloader-3.0.0-h2

The scenario I’ll be working with is similar to the ones I covered on the blog recently, where I landed data from the Rittman Mead blog webserver into HDFS files using Flume, and then processed the files using Hive, Pig, Spark and so on. In this example though, I’ll use ODI to do the data manipulation where possible, but still use Hive and so forth under-the-covers to do the work. The diagram below shows the data flow that I’ll be looking to set up in the example:

NewImage

So in this scenario incoming data is being landed in the Hadoop cluster by Flume, using the process outlined in this previous blog post. Flume is just a transport mechanism; it doesn’t in itself have any processing ability (making it analogous to GoldenGate), and all it does is transport log file entries from one place to another, via Flume agents at either end. All we as developers need to be aware of is (a) where the log files will be landed, and (b) that Flume will keep continuously writing to these files until the source log file gets rotated – therefore I can’t assume a log file is completely written when I read from it.

What this means in practice is that if I want to do incremental loads, I need to consider the fact that a file I’m reading from might have more data in it later on. There are various solutions to this – principally having Flume write to HBase, rather than raw HDFS files, and then reading from the HBase database noting the last extraction point at the time – but to keep things simple I’ll just do a full load each time the ETL run takes place, meaning that I don’t need to think about incremental loads throughout the system.

So the first thing I need to do is have ODI take the incoming files and load them into a Hive table. To do this I set up Topology entries for the incoming HDFS files, and here’s the first “gotcha” – to create a connection to HDFS files, you use the regular File technology, but you leave the JDBC driver type blank and put the address of the HDFS NameNode into the JDBC URL. This is of course technically invalid, and won’t allow you to either test the connection or reverse-engineer the file names in, but it’s a “hack” used by the IKM File to Hive KM to get the address of your HDFS NameNode (if you choose to source the incoming file from HDFS rather than the local filesystem).

NewImage

Then, when you come to register the physical schemas to go with the new File technology data server, you’ll need to be aware that ODI just appends the final file name to the directory name when retrieving the file data – so if you want the connection to point to a directory, rather than just a file, you’ll need to set up the physical schema to be the directory “above” the one you’re interested in, and then set the file name later on to be that directory. In this example, I want the final file reference to point to hdfs://bdanode1.rittmandev.com/user/oracle/weblog_incoming_files/apache_access_combined, a whole directory (HDFS aggregates all the files in a directory if you reference just the directory in an operation) rather than just a single log file. You can see the directory and the log files it contains in the Hue screenshot below:

NewImage

I therefore set the physical schema to be hdfs://bdanode1.rittmandev.com/user/oracle/weblog_incoming_files, and the file reference in the datastore model is set to the final directory name, like this:

NewImage

If it seems a bit confusing, it’ll become clearer in a moment.

Now I need to go back to the Topology navigator and create a connection through to the Hive server on the Big Data Appliance VMs – in fact recent versions of CDH (CDH4, CDH5) swap out the old Hive Server for HiveServer2, so you’ll need to use the correct JDBC drivers (as supplied with CDH4/5) to connect to it, and also create the JDBC URL in the format jdbc:hive2://[machine name:port], as shown in the screenshot below:

NewImage

A quick note about security at this point: by default, most CDH4/5 clusters are set up as what’s called “unsecured”, which means that although you log in to Cloudera Manager with a proper username and password, and Hive and Impala request user credentials when you connect, those tools don’t actually check the password against anything such as your LDAP server. You can connect these tools to LDAP or Active Directory, and typically you’d combine this with Kerberos authentication between the various components and when you connect via Impala and Hive, and typically you’d also use Apache Sentry to provide role-based access control to the data within individual HDFS files. But by default, Hive will accept more or less anything as a password when you connect, and you may then hit issues later on when your HDFS files are owned by a different user to the one you connect as.

Where this manifests itself is when a Hive table has underlying HDFS files owned by, say, the “admin” user in CDH5 (because that’s how you connected to Hue to upload them), but you then connect as the “oracle” user through Hive to manipulate the Hive table contents. In practice, what I try to do is create any Hive tables (using Hue) as the same user I’m then going to connect to Hive as, which means you’ll most probably need to go into Hue and create a new user called “oracle” (if that’s what you’re running ODI as, and connecting through Hive as) before creating the Hive tables you’ll then import into the ODI topology.

So once you’ve done all that, you can go into the Designer navigator and reverse-engineer the definition of your Hive tables into datastore models. In my case, I’ve got a bunch of tables that I’ll be using throughout the whole ETL process.

NewImage

Now it’s time for the first ODI mapping, to take the raw log file directory and load it into a Hive table. As it stands though, these raw log files are rows of just a single “column” of data – the log file entry in Apache CombinedLogFormat format. To make them useful to the rest of the ETL process I’ll somehow need to parse them into the individual log file elements, so I create a target Hive table that contains an entry for the raw log entry, and then columns for the various log file elements:

NewImage

The way that I parse the log file is to use a feature within the IKM File to Hive (LOAD DATA) KM that allows me to specify a regular expression SerDe (Serializer-Deserializer) to parse the log file entry into its individual columns, like this (note that you’ll need to ensure the hive-contrib-* JAR file is available to all of your Hadoop nodes before using this SerDe):

NewImage
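For reference, if you were doing this by hand rather than through the KM options, the Hive table definition for Apache combined log format would look something like the well-known contrib RegexSerDe example below – the column names are illustrative, and the regex is the standard combined-log pattern from the Hive documentation:

CREATE EXTERNAL TABLE apache_access_combined (
  host STRING, identity STRING, user STRING, time STRING,
  request STRING, status STRING, size STRING, referer STRING, agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION '/user/oracle/weblog_incoming_files/apache_access_combined';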

In this instance, I want the KM to leave the source files in place when doing the data load (by default, Hive moves incoming source files into the /user/hive/warehouse directory area), as these files most probably haven’t finished being written to by Flume yet; so I leave the EXTERNAL_TABLE option set to TRUE (Hive external table, not Oracle external table) and make sure FILE_IS_LOCAL is set to FALSE, so that the KM knows to use the HDFS file location hack I set up in the topology earlier. Then, I just run the mapping and check that it’s worked OK:

NewImage

and I can check from the Model pane in the Designer navigator that I’ve now got a Hive table of individually split-up log entry columns to work with, for the rest of the downstream ETL process:

NewImage

So that’s the first stage done – next, I’ll be combining this Hive table with data from another one, using the IKM Hive Control Append KM.

Categories: BI & Warehousing

Rittman Mead at the Oracle Virtual Technology Summit 2014 – “The Architecture of Analytics: Big Time Big Data and Business Intelligence”

Rittman Mead Consulting - Mon, 2014-06-09 05:25

I’m pleased to announce that I’ll be presenting at the Oracle Technology Network (OTN) Virtual Technology Summit, an online event running over three days in early July. I’m speaking as part of the Middleware track, entitled “The Architecture of Analytics: Big Time Big Data and Business Intelligence”, along with fellow Oracle ACE Director Kevin McGinley, US BI Forum keynote speaker Richard Tomlinson, and Tom Plunkett, Lead Author of the Oracle Big Data Handbook. It’s free to attend, it’s running in three timezones over the three days, and here are the track details and session abstracts:

The Architecture of Analytics: Big Data and Business Intelligence

“More than just another hyped-up technological buzzword,  Big Data represents a dramatic shift not just in the amount of data to be dealt with, but also in how business intelligence is extracted from that data in order to inform critical business decisions. This OTN Virtual Technology Summit track will present a solution architect’s perspective on how business intelligence products in Oracle’s Fusion Middleware family and beyond fit into an effective big data architecture, and present insight and expertise from Oracle ACE Directors specializing in business Intelligence to help you meet your big data business intelligence challenges. Technologies covered in this track include Oracle Big Data Appliance, Oracle Data Integrator, Oracle Business Intelligence Enterprise Edition, Oracle Endeca, Oracle Data Warehouse, and Oracle Big Data Connectors, along with Hadoop, Hive, NoSQL, and MapReduce.”

Session Abstracts

1. Oracle Big Data Appliance Case Study: Using Big Data to Analyze Cancer-Genome Relationships [30:00]
by Tom Plunkett, Lead Author of the Oracle Big Data Handbook

This presentation takes a deep technical dive into the use of the Oracle Big Data Appliance in a project for the National Cancer Institute’s Frederick National Laboratory for Cancer Research.  The Frederick National Laboratory and the Oracle team won several awards for analyzing relationships between genomes and cancer subtypes with big data, including the 2012 Government Big Data Solutions Award, the 2013 Excellence.Gov Finalist for Innovation, and the 2013 ComputerWorld Honors Laureate for Innovation.

2. Getting Value from Big Data Variety [30:00]
by Richard Tomlinson, Director, Product Management, Oracle

Big data variety implies big data complexity. Performing analytics on diverse data typically involves mashing up structured, semi-structured and unstructured content. So how can we do this effectively to get real value? How do we relate diverse content so we can start to analyze it? This session looks at how we approach this tricky problem using Endeca Information Discovery.

3. How To Leverage Your Investment In Oracle Business Intelligence Enterprise Edition Within A Big Data Architecture [30:00]
By Oracle ACE Director Kevin McGinley

More and more organizations are realizing the value Big Data technologies contribute to the return on investment in Analytics.  But as an increasing variety of data types reside in different data stores, organizations are finding that a unified Analytics layer can help bridge the divide in modern data architectures.  This session will examine how you can enable Oracle Business Intelligence Enterprise Edition (OBIEE) to play a role in a unified Analytics layer and the benefits and use cases for doing so.

4. Oracle Data Integrator 12c As Your Big Data Data Integration Hub [90:00]
by Oracle ACE Director Mark Rittman

Oracle Data Integrator 12c (ODI12c), as well as being able to integrate and transform data from application and database data sources, also has the ability to load, transform and orchestrate data loads to and from Big Data sources. In this session, we’ll look at ODI12c’s ability to load data from Hadoop, Hive, NoSQL and file sources, transform that data using Hive and MapReduce processing across the Hadoop cluster, and then bulk-load that data into an Oracle Data Warehouse using Oracle Big Data Connectors.

We will also look at how ODI12c enables ETL-offloading to a Hadoop cluster, with some tips and techniques on real-time capture into a Hadoop data reservoir and techniques and limitations when performing ETL on big data sources.

Registration is free, and the form for all three events is online here. Look out for a preview of my big data ETL on BDA session over the next five days, and hopefully you’ll be able to make the VTS event and hear me talk about it in-person.

Categories: BI & Warehousing

Goodbye to Stewart…

Rittman Mead Consulting - Mon, 2014-06-09 05:00

We’re sad to announce that Stewart Bryson’s last day with Rittman Mead was last Friday, and he’s moving on now to a new challenge. Five years ago Stewart co-founded Rittman Mead America with us, and most recently he’s been our Chief Innovation Officer, handing over the day-to-day running of our US business to Charles Elliott, our US Practice Manager.

One of the main contributions Stewart has made over the last few years is helping us develop the Rittman Mead Delivery Framework and the Rittman Mead Way. The Delivery Framework is a set of software, accelerators and processes that determine how Rittman Mead consultants go about projects (the Rittman Mead Way). Stewart has contributed greatly to our work around Extreme BI, version control and release management. These processes are now embedded in our delivery management team and framework, and as such will live on through the DNA of Rittman Mead; in the meantime, we wish him well for the future.

Categories: BI & Warehousing

Data Integration Tips: GoldenGate on RAC – Action Script for Windows

Rittman Mead Consulting - Fri, 2014-06-06 14:29

I want to briefly interrupt my blog series on GoldenGate and ODI – A Perfect Match in 12c… to provide another Data Integration Tip, found during a recent client engagement. I was tasked with installing GoldenGate 11g into a 2-node RAC, which is a pretty straightforward process and well documented by the Oracle Data Integration product team. The challenge is that the client’s Oracle RAC environment was installed on Windows Server 2003.

The Scenario

Working through the GoldenGate setup and configuration process for RAC / Clusterware, I found that most of the same approach applies to either Unix/Linux or Windows (maybe a future blog post?). There is also an example action script towards the bottom, but it’s written in shell script… and that just won’t work in Windows! For those unfamiliar with Clusterware, when an application is added as a resource, Clusterware will use the action script to perform specific functions, or actions, against the application. These actions – Check, Start, Stop, etc. – will often run commands through GoldenGate’s GGSCI application: ensuring the Manager process is running, starting the Manager, and so on.

We decided to convert the shell script example to batch command script, as this is about as native as it gets in Windows. Everything was going well with the conversion until we reached this function and the highlighted code below.

# call_ggsci is a generic routine that executes a ggsci command
call_ggsci () {
  log "entering call_ggsci"
  ggsci_command=$1
  cd ${GGS_HOME}

  # run GGSCI, feeding it the command and an exit via a heredoc
  # (the terminating EOF must sit at the start of the line)
  ggsci_output=`${GGS_HOME}/ggsci << EOF
${ggsci_command}
exit
EOF`

  log "got output of : $ggsci_output"
}

The script will simply change directory to the GoldenGate home, run the GGSCI application, execute the command (passed in as an argument), and then exit GGSCI. This is all wrapped within two EOF “tags”, indicating that the formatting in the script, such as hard returns, etc., should remain when the external command is run. It’s known as a heredoc, and we found that it’s not really possible to do in a batch script.

When attempting similar code in the batch script, without the heredoc, we ran into an issue. GGSCI attempted to run all of the commands on the same line, which failed miserably. We needed another approach to ensure the script would execute properly.

Here’s the Tip…

The goal was to find a way to run a GGSCI command on a single line. The solution: use GGSCI to execute GoldenGate obey files. Rather than attempt to place hard returns in the batch command script, we simply placed all of the necessary commands in an obey file and passed the obey command and script file location into the function.

call :call_ggsci %GGS_HOME%\diroby\StartMgr.oby
 ...

:call_ggsci

 REM set ggsci_command variable with the passed-in value
 SET ggsci_command=%~1
 call :log "ggsci_command: %ggsci_command%"

 REM log whether or not the GGS_HOME path exists
 if exist %GGS_HOME% (
  call :log "%GGS_HOME% exists"
 ) else (
  call :log "%GGS_HOME% does not exist"
 )
 ...

The obey file could be as complex as you like, but in this case it simply starts the manager process.

start-mgr-obey
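As a rough sketch (the exact contents will depend on your environment), a minimal StartMgr.oby needs to hold little more than:

START MANAGER
STATUS MANAGER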

This approach works very well, but we did end up with the additional obey files to maintain. Next time around, I’ll probably use a scripting language such as Groovy or Python, as either should work just fine on Linux or Windows.

Look for more Data Integration Tips on the blog, as Rittman Mead is always coming up with innovative solutions to interesting challenges!

Categories: BI & Warehousing

Rittman Mead Announce New Partnerships with Cloudera and Amazon Web Services

Rittman Mead Consulting - Thu, 2014-06-05 06:00

To complement our strategic partnership with Oracle Corporation, Rittman Mead are very pleased to announce two new technology partnerships that extend our Oracle-based services: with Cloudera, the industry-leading Hadoop software vendor, and with Amazon Web Services, the infrastructure-as-a-service provider that’s become the standard for deploying highly-available, secure cloud applications. As business intelligence, analytics and data warehousing systems extend onto the cloud and encompass big data, these two new partnerships will allow us to offer best-of-breed services in these areas whilst complementing our main technology partnership with Oracle. So why Cloudera, why Amazon Web Services, and why now?

Extending BI from the EDW to the “Enterprise Data Hub”

If you’re a regular reader of this blog you’ll probably have read several articles by myself and other authors about cloud, Hadoop and big data. Whilst the vast majority of BI & DW systems we put together for clients today are based around regular database and application data sources, increasingly customers are asking us to help them bring non-traditional, NoSQL and Hadoop-based data sources into their BI platform, adding what are becoming termed “enterprise data hubs” alongside the more traditional enterprise data warehouses. We’re also seeing more interest in deploying these systems into the cloud – either driven by the economics of the cloud vs. on-premise systems, or because other key line-of-business systems are moving into the cloud and it makes sense to deploy your BI there too. And Oracle, of course, have noticed this trend, with support being added to OBIEE and ODI for reporting against, and loading data out of/into, Hadoop data sources, and Oracle themselves due to launch Oracle Business Intelligence Cloud Services in the next couple of months.

NewImage

Oracle of course already have some Hadoop-based products out in the market, including the very impressive Oracle Big Data Appliance and their Oracle Big Data Connectors, and we think these products have the potential to make a massive impact on the types of projects and solutions we can offer customers today. But at Rittman Mead we like to get ahead of the technology curve, and so we decided, in addition, to partner directly with Cloudera – makers of the Hadoop software within Big Data Appliance, but a significant software vendor in themselves – to ensure our team is at the forefront of delivering cutting-edge big data solutions based around Cloudera CDH, Apache Spark and Cloudera Impala.

Over the past few months we’ve started to deliver projects and PoCs for customers around Cloudera Hadoop technology and Oracle Engineered Systems, Database and BI tools, and what’s emerged as our core competence is the ability to bring together Oracle’s chosen Hadoop distribution and tools with the core Oracle technologies our customers use. We’ve developed systems using Oracle’s Big Data Connectors to, for example, analyse huge datasets across multiple Hadoop nodes using Oracle R Advanced Analytics for Hadoop, and we’re currently working with a couple of customers’ Oracle DBA teams who are in the process of adopting Oracle Big Data Appliance. Most importantly, as official Cloudera Partners we’ve got access to their technical and solutions architecture resources, giving us a similar level of technical backup to the one we have access to on our core Oracle projects.

BI Moves into the Cloud, for Improved Business Agility and TCO

The other big trend that’s driving a lot of innovation in the BI industry is “cloud”. If you were at one of our BI Forum events in Brighton and Atlanta this year, you’ll know that cloud is front-and-centre in Oracle’s product strategy at the moment, both in terms of making BI available as part of the wider Oracle Public Cloud, and as a way of accelerating innovation and making more of the product self-service. What’s now been officially named “Oracle BI Cloud Service” (BICS) was launched at last year’s Openworld, and we’re about to help beta-test the product for Oracle prior to its launch later in the year; we’re expecting BICS to be particularly attractive to existing OBIEE customers looking to quickly spin-up departmental BI environments without the need for IT to get involved.

But as I covered in a couple of blog posts earlier in the year, BICS functionality is likely to be limited in its initial incarnation and many customers are going to be looking to run “full” OBIEE in the cloud, along with a full Oracle database and an ETL infrastructure using tools such as ODI or Informatica, and for these types of customer a more complete cloud solution will be needed – which is why we’ve also partnered with Amazon Web Services, in our view by far the best cloud service provider on the market and the platform behind companies such as Netflix and Dropbox.

Logo amazon aws

We’ve been long-term users of Amazon AWS for around three or four years now, initially running our training servers on the platform but more recently undertaking internal and customer development work on their “virtual private cloud” platform as part of our new “Extreme BI in the Cloud” initiative. As techies, we appreciate the power and flexibility of the Amazon AWS API, which we use to script and automate much of our “DevOps” work, and we’re finding more and more of our customers are moving to AWS independently, due to their excellent uptime and levels of customer service, and the wide ecosystem of AWS-native and partner products such as Amazon Redshift, Amazon Elastic MapReduce, Amazon EC2 and S3, and Attunity Cloudbeam. Again, we’ve partnered officially with Amazon AWS so that we can train our team up and access technical and solutions architecture resources, and as with the Cloudera partnership, our particular specialisation is in deploying Oracle-based technologies onto the Amazon AWS platform.

So does this mean that Rittman Mead are de-emphasising our Oracle partnership, or otherwise taking our attention away from solutions built on OBIEE, ODI, Essbase and the Oracle Database? Far from it; the core of the company will always be around Oracle technology, but by partnering with two best-of-breed complementary organisations – Cloudera, for Oracle-centric Hadoop and big data solutions, and Amazon AWS, for cloud and hybrid deployments – we can continue to offer customers the most innovative and industry-leading solutions as BI evolves from its database and on-premise roots to cover big data and cloud deployments.

Look out on our blog over the coming months as we cover integrating Oracle technology with Cloudera’s Hadoop platform and Oracle Big Data Appliance, and look out for products and services where we combine Oracle and Cloudera technology in innovative ways, creating cutting-edge solutions for our customers whether on-premise, in the Amazon and Oracle clouds, or as hybrid on-premise/cloud deployments.

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 2: Journalizing Knowledge Module

Rittman Mead Consulting - Wed, 2014-06-04 16:35

This is the second post in a blog series on the integration between Oracle Data Integrator (ODI) 12c and GoldenGate 12c. The first post focused on the latest, yet-to-be-released, Oracle Information Management Reference Architecture and some high-level features new to the 12c versions of each product. We also began working through the details of the GoldenGate installation and configuration, specifically the JAgent setup, which is necessary for communication between ODI and GoldenGate during the setup of “online” journalizing. In this post, we’ll look at the new features of the Journalizing Knowledge Module “JKM Oracle to Oracle Consistent (OGG Online)” and get started on the modifications to the JKM that must occur for us to load the Raw Data Reservoir (RDR – acronym coined by Nick Hurt in the comments of my first post…I like it!) and Foundation Layer simultaneously.

ODI Journalizing

Before I get into editing the JKM, let me briefly go through an introduction to ODI Journalizing, aka Change Data Capture (CDC). In case you missed it earlier, ODI CDC is implemented using a Journalizing Knowledge Module (JKM). The JKM generates the infrastructure for the CDC, creating journal tables that store the change rows and views that provide access to the change rows for use in ODI Mappings. ODI CDC can be implemented using various capture processes, such as triggers on source tables, timestamps on rows, or mining of the database logs via Oracle Streams or, in our case, Oracle GoldenGate. A great explanation of ODI JKMs and how they work, written by Christophe Dupupet of the Oracle Data Integration A-Team, can be found here.

Import ODI JKMs

The integration between GoldenGate and ODI begins with a Journalized Knowledge Module, in our example we’ll be using the “JKM Oracle to Oracle Consistent (OGG Online)”. The JKM, delivered with ODI, will create the GoldenGate parameter files, configure GoldenGate process groups (and start them up), and generate the ODI CDC tables and views. The JKM, added to the source Model in 12c, uses the ODI metadata to generate the GoldenGate parameter file mappings. This alone saves quite a bit of manual work and reduces possible typos caused by human error.

JKM Oracle to Oracle Consistent (OGG Online)

In the previous post, I mentioned the new capability of the JKM that allows for an “online” integration between ODI and GoldenGate. But, there are many other new features that need to be described, so we’ll walk through those here.

ODI Tool: OdiOggCommand

The JKM uses an undocumented ODI Tool called OdiOggCommand in the target command of some tasks that are executed only when in “online” mode. This tool accepts different values for a parameter called OPERATION:

  • EXECUTECMD: executes various commands, such as Add Extract, within GGSCI
  • EXECUTEOBEY: runs the OBEY command against an obey file in GGSCI
  • DEFGEN: generates the source definitions file by executing the DEFGEN command
  • SAVEFILE: uploads the parameter and obey files to the GoldenGate installation directory

OdiOggCommand

I imagine the code behind this ODI Tool is simply executing command line calls to GGSCI, DEFGEN, etc. It would be great to see some Oracle documentation on this one!

GoldenGate Topology

The “online” aspect of the JKM requires that a Data Server, Physical Schema and Logical Schema all be set up under the GoldenGate technology in ODI. The Data Server contains the location and connection information for communicating with the JAgent on either the source or target GoldenGate server. Under that, a Physical Schema must be set up for each extract (the pump is included) and replicat process group that will be implemented on that server.

GoldenGate Physical Schema

In this screenshot, we have a Physical Schema with the process type set to Capture, also called the extract. Here we set up the directory path on the source to the location where captured transactions from the source database logs will be stored in GoldenGate’s own log files, called trail files. The remote trail file directory, setup on the target server and accessed by the pump process in order to move transactions from the source trail to the target trail, must also be added. Additional options such as trail file size (how large a trail file can get before rolling over to the next file) are also available to be set. Just below the Capture Process Properties are the Additional Options, parameters that can be added to the extract or replicat parameter files based on specific needs of the solution.

GoldenGate Physical Schema Additional Options

Here we can add options to handle different aspects of the extract or replicat, such as TRANLOGOPTIONS on the extract. Once I add this option, I get a template of the additional parameters I can configure. For example, I may want to access my transaction logs stored in ASM via the database rather than directly from ASM. The DBLOGREADER option lets me utilize the ASM API in the database, simplifying my access to the logs.

TRANLOGOPTIONS DBLOGREADER, DBLOGREADERBUFSIZE 2597152

When I add the additional option to the capture physical schema, it will be generated as a part of the extract parameter file. This helps to drastically reduce the amount of manual editing that had to occur after the parameter files were generated in the 11g version, and is a great addition to the JKM.

JKM Setup and Options

The JKM is actually applied to the source ODI Model, which is a grouping of logical table structures called Datastores. On the Journalizing tab in the Model, we first set the type (Consistent Set) and choose the appropriate JKM.

GoldenGate Process Selection

In the GoldenGate Process Selection section, we choose the Capture and Delivery Logical Schemas that were set up in the ODI Topology. If none have been created, but the Data Servers for the source and target GoldenGate installations exist, we can choose to create the Physical and Logical Schemas directly from the Model by clicking the Create button. This is a nifty way to separate the system administration role (setting up the Data Server) from the metadata management or developer role.

The JKM also has a set of Options with configurable values, some being required and others optional. The number of Options has been reduced by quite a bit in the 12c JKM. This makes sense, as we’ve seen that additional metadata such as the trail file location and trail file size are set elsewhere. We’ll go through these in more detail later on when setting the values in our example.

JKM Options

That covers the updated features in the JKM, now let’s talk Knowledge Module customization.

JKM Customization

In the first post, I described how we want to extract the source data once and replicate it into both the Foundation Layer and Raw Data Reservoir in parallel. The JKM is set up to load the ODI CDC Framework (the fully replicated table and J$ table) out of the box, but not the Foundation table. In case you’re unfamiliar with the purpose of the Foundation layer, the idea is to store all transactional history from the source tables by converting every transaction into an insert, and tracking the type of change (insert / update / delete), commit date, and commit SCN. With this information stored from the beginning of the data warehouse, it can be used for historical drill-throughs from the dimensional model, or to completely reload a star schema – including all history. With the stage set, let’s look at what we’re going to change in the JKM.

1. Add New Option “APPLY_FOUNDATION”
This option, when true, will allow the Start Journal process to generate the source-to-foundation mapping statement in the Replicat (apply) process.

2. Add New Option “FND_LSCHEMA”
The Logical Schema name for the Foundation layer schema.

3. Add New Task “Create apply prm (4) RM”
This task will create the source-to-foundation mapping code and add it to the replicat parameter file (see the sketch after this list for the kind of mapping it needs to produce).

4. Set Option on “Create apply prm (4) RM” Task
Set the execution Options to have APPLY_FOUNDATION as the only checked option. This will determine whether or not that task will execute when Start Journal is run.

5. Edit Task “Execute apply commands online RM”
We’ll comment out the “start replicat …” command, as we’ll first need to complete an initial load of the source data to the target.
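To make step 3 above more concrete, the kind of source-to-foundation mapping the new task needs to generate in the replicat parameter file looks roughly like the sketch below – the schema, table and column names are purely illustrative, not taken from the actual KM code:

MAP SRC_SCHEMA.ORDERS, TARGET EDW_FND.ORDERS,
  INSERTALLRECORDS,
  COLMAP (USEDEFAULTS,
    CHANGE_TYPE = @GETENV ("GGHEADER", "OPTYPE"),
    COMMIT_DATE = @GETENV ("GGHEADER", "COMMITTIMESTAMP"),
    COMMIT_SCN  = @GETENV ("TRANSACTION", "CSN"));

INSERTALLRECORDS is what turns every update and delete into an insert on the foundation table, and the @GETENV columns carry the change type, commit timestamp and commit SCN that the Foundation layer needs.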

Now, I did say I would get into the editing of the JKM in this post, but after describing the new features of the Knowledge Module, I’ll save the actual details for part 3 of the series! Up next: editing the JKM and building the necessary ODI objects so we can start journalizing.

Categories: BI & Warehousing

Fun with SQL - Silver Pockets Full

Chet Justice - Wed, 2014-06-04 16:13
Silver Pockets Full, send this message to your friends and in four days the money will surprise you. If you don't, well, a pox on your house. Or something like that. I didn't know what it was, I just saw this in my FB feed:



Back in November, I checked to see the frequency of having incremental numbers in the date, like 11/12/13 (my birthday) and 12/13/14 (kate's birthday). I don't want to hear how the rest of the world does their dates either, I know (I now write my dates like YYYY/MM/DD on everything, just so you know, that way I can sort it...or something).

Anyway, SQL to test out the claim of once every 823 years. Yay SQL.

OK, I'm not going to go into the steps necessary because I'm lazy (and I'm just lucky to be writing here), so here it is:
select *
from (
  select
    to_char( d, 'yyyymm' ) year_month,
    count( case
             when to_char( d, 'fmDay' ) = 'Saturday' then 1
             else null
           end ) sats,
    count( case
             when to_char( d, 'fmDay' ) = 'Sunday' then 1
             else null
           end ) suns,
    count( case
             when to_char( d, 'fmDay' ) = 'Friday' then 1
             else null
           end ) fris
  from (
    select to_date( 20131231, 'yyyymmdd' ) + rownum d
    from dual
    connect by level <= 50000
  )
  group by to_char( d, 'yyyymm' )
)
where fris = 5
  and sats = 5
  and suns = 5
So over the next 50,000 days, this happens 138 times. I'm fairly certain that doesn't rise to the once every 823 years claim. But it's cool, maybe.
YEAR_MONTH       SATS       SUNS       FRIS
---------- ---------- ---------- ----------
201408              5          5          5
201505              5          5          5
201601              5          5          5
201607              5          5          5
201712              5          5          5
128 more occurrences...
214607              5          5          5
214712              5          5          5
214803              5          5          5
214908              5          5          5
215005              5          5          5

138 rows selected
I'm not the only dork that does this either, here's one in perl. I'm sure there are others, but again, I'm lazy.
Categories: BI & Warehousing

How We Deliver Agile OBIEE Projects – Introducing ExtremeBI

Rittman Mead Consulting - Wed, 2014-06-04 04:56

Most OBIEE projects that we see are delivered through some sort of “waterfall” method, where requirements are defined up-front, there are several stages of development, one or more major releases at the end, and any revision to requirements takes the form of a change request. These work well where requirements can be defined upfront, and can be reassuring to customers when they want to agree a fixed price up-front with every subsequent change clearly costed. But, as with the development world in general, some customers are starting to look at “agile” methods for delivering BI projects, where requirements emerge over the course of a project and there isn’t so much of a fixed design or specification at the start; instead the project adds features or capabilities in response to what are called “user stories”, making it more likely in the end that what gets delivered is in line with what users want – and where changes and additions to requirements are welcomed, rather than treated as extra-cost change requests.

OBIEE naturally lends itself to working in an agile manner, through the three-layer nature of the repository (RPD); by separating the physical representation of the source data from how it is then presented to the end-users, you can start from the off with the dimensional model that’s your end goal, and then over time evolve the back-end physical layer from pointing directly at the source system to instead point at a data warehouse or OLAP cube. In fact, I covered this approach back in 2008 in a blog post called “A Future Oracle OBIEE Architecture”, where I positioned OBIEE’s BI Server as a “business logic layer”, and speculated that at some point in the future OBIEE might be able to turn the logical > physical mappings in the RPD into actual ODI mappings and transformations.

NewImage

In the end, although OBIEE’s aggregate persistence feature gave us the ability to spin off aggregate tables and cubes from the RPD logical model, full ETL “push-down” never came, although you can see traces of it if you have a good poke around the DLLs and directories under the BI Server component. What did happen, though, was Exadata; with Exadata, features such as SmartScan and its ability to do joins between normalised tables much faster than regular databases meant that it became possible to report directly against an OLTP schema, or an ODS-like foundation layer, only adding ETL to build a performance star schema layer if it was absolutely necessary. We covered this in a series of posts on Agile Data Warehousing with Exadata, and the focus of this method was performance – by adding Exadata, and the metadata flexibility in OBIEE’s RPD, we could deliver agile projects where Exadata gave us the performance even when we reported directly against a third normal form data source.

NewImage

And this approach worked well for our customers; if they’d invested in Exadata, and were open to the idea of agile, iterative development, we could typically deliver a working system in just a few months, and at all times what the users got was what they’d requested in their user story backlog. But there were still ways in which we could improve this method; not everyone has access to Exadata, for example, and reporting directly against a source system makes it tricky to add DW features like history and surrogate keys, so recently we introduced the successor to this approach, in the form of an OBIEE development method we called “ExtremeBI”. Building on our previous agile work, ExtremeBI introduced an integration element, using GoldenGate and ODI to replicate in real time any source systems we were interested in to the DW foundation layer, add the table metadata that DW systems expect, and then provide a means to transform the logical-to-physical RPD mappings into ODI ETL specifications.

NewImage

But in a way, all the technical stuff is by-the-by; what this means in practice for customers is that we deliver working systems from the first iteration – initially by reporting directly against a replicated copy of their source system (with replication and metadata enhancement by GoldenGate, and optionally ODI), and then over subsequent iterations adding more end-user functionality, or hardened ODI ETL code, all the while driven by end-user stories and not some technical design signed off months ago which no longer reflects what users actually want.

NewImage

What we’ve found though, from several ExtremeBI customer engagements, is that it’s not just down to the technology and how well ODI, OBIEE and GoldenGate work; the major factors in successful projects are, firstly, having the project properly pre-qualified at the start; not every project, and not every client, suits agile working, and agile works best if you’re “all in”, as opposed to just agreeing to work in sprints but still having a set-in-stone set of requirements which have to be met at a certain time. The second important success factor is proper project organisation; we’ve grown from just a couple of guys with laptops back in 2007 to a fully-fledged, end-to-end development organisation, with full-time delivery managers, a managed services desk and tools such as JIRA, and you need to have this sort of thing in place, particularly a project management structure that’s agile-friendly and a good relationship with the customer where they’re fully signed up to the agile approach. As such, we’ve found the most success where we’ve used ExtremeBI for fairly technically-savvy customers, for example an MIS department, who’ve been tasked with delivering something for a reasonable price and over a short number of months, who understand that not all requirements can be delivered, but really want their system to get adopted, delight their customer and focus its features on what’s important to end-users.

As well as processes and a method, we’ve also developed utilities and accelerators to help speed-up the initial setup, and ensure the initial foundation and staging layers are built consistently, with GoldenGate mappings already put in place, and ready for our developers to start delivering reports against the foundation layer, or use these foundation-layer tables as the basis of a data mart or warehouse build-out. The screenshot below shows this particular tool, built using Groovy and run from within the ODI Studio user interface, where the developer selects a set of source tables from an ODI model, and then the utility builds out the staging and foundation layers automatically, typically saving days over the manual method.

NewImage

We’ve also built custom KMs for ExtremeBI, including one that uses Oracle Database’s flashback query feature to pull historical transactions from the UNDO log, as an alternative to Oracle Streams or Oracle GoldenGate when these aren’t available on the project.
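As a minimal sketch of the underlying idea (not the KM code itself), a flashback version query along these lines pulls the change history for a table straight out of undo – the table name and the :last_extract_scn bind are illustrative, and how far back you can reach is limited by your undo retention:

SELECT versions_operation AS change_type,
       versions_starttime AS change_time,
       versions_startscn  AS change_scn,
       o.*
FROM   src.orders
       VERSIONS BETWEEN SCN MINVALUE AND MAXVALUE o
WHERE  versions_startscn > :last_extract_scn
ORDER  BY versions_startscn;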

All together, using Rittman Mead’s ExtremeBI method along with OBIEE, ODI and optionally GoldenGate has meant we’ve been able to deliver working OBIEE systems for customers over just a few months, typically for a budget of less than £50k. Coupled with cloud hosting, where we can get the customer up-and-running immediately rather than having to wait for their IT department to provision servers, we think this is the best way for most OBIEE 11g projects to be delivered in the future. If you’re interested, we’ve got more details on our “ExtremeBI in the Cloud” web page, or you can contact me via email – mark.rittman@rittmanmead.com – if you’d like to discuss it further.

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 1: Getting Started

Rittman Mead Consulting - Fri, 2014-05-30 14:13

Over the years, I’ve blogged quite a bit about integration between Oracle Data Integrator and GoldenGate, and how to make it all work with the Oracle Reference Architecture. With the release of the 12c versions of ODI and GoldenGate last October, and a soon-to-be-updated reference architecture, it’s time to write a few posts on the subject again.

Getting Started with 12c

First, let me describe the new Journalizing Knowledge Module (JKM) that has been introduced in ODI 12c for integration with GoldenGate: JKM Oracle to Oracle Consistent (OGG Online). This JKM now allows GoldenGate to be set up in an “online” mode, meaning the GoldenGate parameter files and process groups will be configured and installed on the source and target GoldenGate servers. ODI communicates with the GoldenGate JAgent to perform the installation. The “offline” mode still exists as it did in the 11g version of the JKM, in which the parameter files, etc. are created in a temporary location and then manually moved to the source and target. I’ll use the “online” JKM throughout this series of posts.

Another change to how Journalizing is implemented for GoldenGate in ODI is the Model to which the JKM is applied. In ODI 11g, the GoldenGate JKM was always applied to the Model containing the target tables, leaving the source table metadata completely out of the picture. This made sense, as GoldenGate handled everything on the source side. Now in ODI 12c, the source tables are reverse engineered and the JKM applied to the source Model. This allows the source table to be used in a single mapping for both the initial load and incremental load of the performance layer through the use of ODI 12c deployment specifications. The target, or fully replicated table, is no longer necessary in the metadata. We’ll talk through this concept in more detail later on.

Reference Architecture Update

Finally, before we get into the details, let’s go over the latest (yet to be released, I might add) version of the Oracle Information Management Reference Architecture. It was first presented by Stewart Bryson and Andrew Bond (Oracle) at the Rittman Mead BI Forum in Brighton (which is why I was given the OK to mention it pre-release!).

Information Management - Logical View

This latest reference architecture is not much different than previous versions. The main difference that pertains to this blog post is that the Staging Layer has been renamed the Raw Data Reservoir. If you look through the presentation by Stewart and Andrew, you’ll see that many of the principles remain the same. They also describe an Agile approach to implementing the reference architecture, using GoldenGate, OBIEE against transactional schemas, and eventually ETL development using ODI – a methodology we here at Rittman Mead use with our clients, and call ExtremeBI. Look for the official release of the latest Information Management Reference Architecture in the next couple of months.

In this blog series, we’ll look at how to load the Raw Data Reservoir and Foundation Layer using GoldenGate, and subsequently load the Access and Performance Layer with Oracle Data Integrator Mappings. If you recall from my 11g posts on the subject, we don’t need to load these layers in sequence; GoldenGate allows us to extract the source data once and replicate it to multiple targets in parallel.

load-raw-data-fnd

Now that we’ve gone through some of the updated concepts for 12c and the Reference Architecture, let’s look at the high-level steps that must be taken in order to implement GoldenGate and ODI integration.

  • Install GoldenGate on the source and target servers – including JAgent configuration
  • Edit the “JKM Oracle to Oracle Consistent (OGG Online)” Knowledge Module (enable Foundation Layer load)
  • Setup ODI Topology (database schema and GoldenGate connections)
  • Setup and start Journalizing on the source Model
  • Develop Mappings for initial load and incremental load
  • Perform initial load and start replication

There is quite a lot of detail to add to these steps, so let’s get right into it.

GoldenGate 12c Installation and JAgent Configuration

The install of GoldenGate 12c is pretty straightforward, so I don’t plan on going into much detail here. A step-by-step guide can be found on the DBASolved blog, with the installation setting up and starting the manager process and creating the necessary subdirectories. We then need to configure the JAgent on both the source and target GoldenGate installations, enabling ODI to communicate with GoldenGate during the “online” JKM start journalizing process, which will automatically configure and start the GoldenGate process groups. Setting up the JAgent for ODI integration is essentially the same as if you were setting up Oracle Enterprise Manager integration with GoldenGate.

First, you’ll notice that the file jagent.prm exists in the dirprm directory after installation completes. This parameter file will be used by the jagent process once started in GGSCI.

JAgent.prm

Next, we need to enable monitoring of GoldenGate by adding an entry, ENABLEMONITORING, to the GLOBALS file. Create the GLOBALS file (with no extension) in the GoldenGate home directory and open it in your favorite text editor. Simply add the line to enable monitoring, close and save the file.

Edit GLOBALS file

To allow secure communication between ODI and GoldenGate, we must create an Oracle Wallet with a password for JAgent. From the GoldenGate install directory, run the password agent. This will create the cwallet.sso file in the dirwlt subdirectory.

Create Oracle wallet

We’re almost there! Now we need to make a slight change to the Config.properties file in the cfg directory under the GoldenGate home. Edit this file and make the following changes:

Set the agent type to Oracle Enterprise Manager. If left as the default OGGMON, the JAgent will attempt to register with the Monitor server, which most likely is not installed.

agent.type.enabled=OEM

Under the JMX Username, add the line to signify that SSL is not being used by JAgent (unless SSL is actually being used!).

monitor.jmx.username=cmroot
jagent.ssl=false

Finally, ensure the JAgent port is unique across all installations, and on the server in general. In my example, I’m using a single Virtual Machine to host both the source and target GoldenGate installations, so I need to be careful about which ports are in use.

jagent.rmi.port=5571

Before starting the JAgent, go ahead and stop, then start the Manager process to ensure all changes have been initialized. Then, start the JAgent and check to ensure both Manager and JAgent are running. That completes the GoldenGate installation and JAgent configuration.

Start JAgent
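In GGSCI the sequence is something like this (output omitted, and the exact prompt will differ):

GGSCI> STOP MANAGER!
GGSCI> START MANAGER
GGSCI> START JAGENT
GGSCI> INFO ALL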

If you want to just skip all of this installation work and get right to it, you can always download the Prebuilt Machine for Oracle Data Integrator 12c. It’s a VirtualBox VM with ODI 12c and GoldenGate 12c already installed and configured to work with the Getting Started Guide. The JAgent is already configured on the source and target GoldenGate installations, making it easy to get up and running. This is a great resource that the Oracle Data Integration product team has provided, and it sounds like they plan to continue adding to it in the future.

In Part 2 of the blog post series, we’ll edit the JKM to enable parallel load of the Raw Data Reservoir and Foundation Layer, as well as begin setup of the ODI Topology and metadata.

Categories: BI & Warehousing

Visual Regression Testing of OBIEE with PhantomCSS

Rittman Mead Consulting - Fri, 2014-05-23 09:20

Earlier this year I wrote a couple of blog posts (here and here) discussing the topic of automated Regression Testing and OBIEE. One of the points that I was keen to make was that OBIEE is a stack of elements and, depending on the change being tested, it may be sensible to focus on certain elements in the stack instead of all of it. For example, if you are changing the RPD, there is little value in doing a web-based test when you can actually test for the vast majority of regressions using the nqcmd tool alone.

I also argued that testing the front end of OBIEE using tools such as Selenium is difficult to do comprehensively: it can be inflexible, time-consuming and in some cases just not a sensible use of effort. These tools are built around the idea of parsing the web page that is served up and checking for the presence (or absence) of a particular piece of text or an element on a web page. So for example, you could run a test and tell it to fail if it finds the text “Error” on the page, or you could say only pass the test if some known content is present, such as a report title or data figure. This type of testing is prone to a great deal of false-negatives, because to efficiently build any kind of test case you must focus on something to check for in the page, but you cannot code for every possible error or failure. It is also usually based heavily on the internal IDs of elements on the page in locating the ‘something’ to check for. As the OBIEE Document Object Model (DOM) is undocumented code, Oracle are presumably at liberty to change it whenever they feel like it, and thus any tests written based on it may fail. Finally, OBIEE 11g still defaults to serving up graphs as Flash objects, which Selenium et al just cannot handle, and so cannot be tested.

So, what do we do about regression testing the OBIEE front end?

What do we need to test in the front end?

There is still a strong case for regression testing the OBIEE front end. Analyses get changed, Dashboards break, permissions are updated – all these things can cause errors or problems for the end user, and they are things that testing further down the OBIEE stack (using something like nqcmd) will not cover.

Consider a simple dashboard:

If one of the dashboard pages that are linked to in the central section gets moved in the Presentation Catalog, then this happens:

OK, so “Invalid Link Path:” is pretty easy to code in as an error check in Selenium. But what if the permissions on an analysis used in the dashboard get changed and the user can no longer access it when running the dashboard?

This is a different problem altogether. We need to check for the absence of something. There’s no error, there just isn’t the analysis that ought to be present. One way around this would be to code for the presence of the analysis title text or content – but that is not going to scale nor be maintainable to do for every dashboard being tested.

Another thing that is important to check in the front end is that authorisations are enforced as they should be. That is, a user can see the dashboards that they should be able to, and cannot see the ones they shouldn’t. Changes made in the LDAP directory holding users and their groups, or a configuration change in the Application Roles, could easily mean that a user can no longer see the dashboards they should be able to. We could code for this specific issue using something like Web Services to programmatically check each and every actual permission – but that could well be overkill.

What I would like to introduce here is the idea of testing OBIEE for regressions visually - but automated, of course.

Visual Regression Testing

Driven by the huge number of applications that are accessed solely on the web (sorry, “Cloud”), a new set of tools has been developed to support the idea of testing web pages for regressions visually. Instead of ‘explaining’ to the computer specifically what to look for in a page (no error text, etc), visual regression testing uses a process to compare images of a web page, comparing a baseline to a sample taken afterwards. This means that the number of false-negatives (missing genuine errors because the test didn’t detect them) drops drastically because, instead of relying on coding a test program to parse the Document Object Model (DOM) of an OBIEE web page (which is extremely complex), it simply considers whether two snapshots of the resulting rendered page look the same.

The second real advantage of this method is that typically the tools (including the one I have been working with and will demonstrate below, PhantomCSS) are based on the actual engine that drives the web browsers in use by real end-users. So it’s not a case of parsing the HTML and CSS that the web server sends us and trying to determine if there’s a problem or not – it is actually rendering it the same as Chrome etc and taking a snapshot of it. PhantomCSS uses PhantomJS, which uses the engine that Safari is built on, WebKit.

Let’s Pretend…

So, we’ve got a tool – that I’ll demonstrate shortly – that can programmatically fetch and snapshot OBIEE pages, and compare the snapshots to check for any changes. But what about graphs rendered in Flash? These are usually a blindspot. Well, here we can be a bit cheeky. If you pretend (in the User-Agent HTTP request header) to be an iPhone or iPad (devices that don’t support Flash) then OBIEE obligingly serves up PNG graphs plus some JavaScript to do the hover tooltips. Because it’s a PNG image, it will be rendered correctly in our “browser”, and so included in the snapshot for comparison.
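As a sketch, spoofing the device is just a case of setting the user-agent when the Casper instance is created – the UA string below is one example iPad string, and any iOS user-agent should have the same effect:

var casper = require('casper').create({
    pageSettings: {
        // Pretend to be an iPad so OBIEE serves charts as PNG rather than Flash
        userAgent: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53'
    }
});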

CasperJS

Let’s see this scripting in action. Some clarification of the programs we’re going to use first:

  • PhantomJS is the core functionality we’re using, a headless browser sporting Javascript (JS) APIs
  • CasperJS provides a set of APIs on top of PhantomJS that make working with web page forms, navigation etc much easier
  • PhantomCSS provides the regression testing bit, taking snapshots and running code to compare them and report differences.

We’ll consider a simple CasperJS example first, and come on to PhantomCSS after. Because PhantomCSS uses CasperJS for its core interactions, it makes sense to start with the basics.

Here is a bare-bones script. It loads the login page for OBIEE, echoes the page title to the console, takes a snapshot, and exits:

var casper = require('casper').create();

casper.start('http://rnm-ol6-2:9704/analytics', function() {
  this.echo(this.getTitle());
  this.capture('casper_screenshots/login.png');
});

casper.run();

I run it from the command line:

$ casperjs casper_example_01.js
Oracle Business Intelligence Sign In
$

As you can see, it outputs the title of the page, and then in the screenshots folder I have this:

I want to emphasise again why this is so useful: I ran this from the command line only. I didn’t run a web browser, and I didn’t take any snapshots by hand – it was all automatic.

Now, let’s build a bit of a bigger example, where we login to OBIEE and see what dashboards are available to us:

// Set the size of the browser window as part of the 
// Casper instantiation
var casper = require('casper').create({viewportSize: {
        width: 800,
        height: 600
    }});

// Load the login page
casper.start('http://rnm-ol6-2:9704/analytics', function() {
  this.echo(this.getTitle());
  this.capture('casper_screenshots/login.png');
});

// Do login
casper.then(function(){
  this.fill('form#logonForm', { NQUser: 'weblogic' ,
                                NQPassword: 'Password01'
                              }, true);
}).
waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?bieehome',function(){
  this.echo('Logged into OBIEE','INFO')
  this.capture('casper_screenshots/afterlogin.png');
  });

// Now "click" the Dashboards menu
casper.then(function() {
  this.echo('Clicking Dashboard menu','INFO')
  casper.click('#dashboard');
  this.waitUntilVisible('div.HeaderPopupWindow', function() {
    this.capture('casper_screenshots/dashboards.png');
  });
});

casper.run();

So I now get a screenshot of after logging in:

and after “clicking” the Dashboard menu:

The only bit of the script above that isn’t self-explanatory is where I am referencing elements. The references are CSS3 selectors, easily found using something like Chrome Developer Tools. Where the click on Dashboards is simulated, there is a waitUntilVisible function, which is crucial for making sure that the page has rendered fully. A user clicking the menu would obviously wait until it appears, but computers work much faster, so functions like this are important for reining them back.

To round off the CasperJS script, let’s add to the above the steps of navigating to a Dashboard, snapshotting it (with graphs!), and then logging out.

[...]
casper.then(function(){
  this.echo('Navigating to GCBC Dashboard','INFO')
  casper.clickLabel('GCBC Dashboard');
})

casper.waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?dashboard', function() {
  casper.waitWhileVisible('div.AjaxLoadingOpacity', function() {
    casper.waitWhileVisible('div.ProgressIndicatorDiv', function() {
      this.capture('casper_screenshots/dashboard.png');
    })
  })
});

casper.then(function() {
  this.echo('Signing out','INFO')
  casper.clickLabel('Sign Out');
});

Again, there’s a couple of waitWhileVisible functions in there, necessary to get CasperJS to wait until the dashboard has rendered properly. The dashboard rendered is captured thus:

PhantomCSS

So now let’s see how we can use the above CasperJS code in conjunction with PhantomCSS to generate a viable regression test scenario for OBIEE.

The script remains pretty much the same, except that CasperJS’s capture gets replaced with a phantomcss.screenshot call that snapshots a specified element (“html” for the whole page), and there’s some extra “footer” code to include that executes the actual comparison test.
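A minimal sketch of that footer, using the standard PhantomCSS calls – the require path and the folder names passed to init are assumptions, so adjust them to match your own install:

var phantomcss = require('./phantomcss.js');

phantomcss.init({
    screenshotRoot: './screenshots',
    failedComparisonsRoot: './failures'
});

// ... casper.start / casper.then navigation steps with phantomcss.screenshot() calls go here ...

casper.then(function() {
    // compare every screenshot taken in this run against its baseline
    phantomcss.compareAll();
});

casper.run(function() {
    // exit with a non-zero code if any comparison failed
    phantomcss.getExitStatus() === 0 ? casper.exit(0) : casper.exit(1);
});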

So let’s see how the proposed test method holds up to the examples above – broken links and disappearing reports.
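The combined script is run from the command line just like the plain CasperJS examples earlier – the filename here is simply what I’ve called my copy:

$ casperjs obiee_regression_test.js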

First, we run the baseline capture, the “known good”. The console output shows that this is the first time it’s been run, because there are no existing images against which to compare:

In the screenshots folder is the ‘baseline’ image for each of the defined snapshots:

Now let’s break something! First off I’ll rename the target page for one of the links in the central pane of the dashboard, which will cause the ‘Invalid Link Path’ message to display.

Now I run the same PhantomCSS test again, and this time it tells me there’s a problem:

When an image is found to differ, a composite of the two highlighting the differences is created:

OK, so the first test passed (or rather, failed), but arguably this could have been picked up simply by parsing the page returned from the OBIEE server for known error strings. But what about a disappearing analysis – that’s more difficult to ascertain from the page source alone.

Again, PhantomCSS picks up the difference, and highlights it nice and clearly in the generated image:

The baseline image that you capture should be taken against a “gold” version of a dashboard – there’s no point including ad-hoc reports or dashboards under development. You’d also want to work with data that is unchanging, so where available use a time filter fixed at a point in the past, rather than ‘current day’, which will be changing frequently.

Belts and Braces?

So visual regression testing is a great thing, but I think a hybrid approach, of parsing the page contents for text too, is worthwhile. CasperJS provides its own test APIs (which PhantomCSS uses), and we can write simple tests such as the following:

this.test.assertTextDoesntExist('Invalid Link Path', 'Check for error text on page');
this.test.assertTextDoesntExist('View Display Error', 'Check for error text on page');
phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

So check for a couple of well-known errors, and then snapshot the page too for subsequent automatic comparison. If an assertion is failed, it shows in the console:

This means that what is already being done in Selenium (or for which Selenium is an assumed default tool) could even be brought into the same single test rig based around CasperJS/PhantomCSS.

Frame of Reference

The eagle-eyed of you will have noticed that the snapshots generated by PhantomCSS above are not the entire OBIEE webpage, whereas the ones from CasperJS earlier in this article are. That is because PhantomCSS deliberately wants to focus on an area of the page to test, identified using a CSS3 selector. So if you are testing a dashboard, then considering the toolbar is irrelevant and can only lead to false-positives.

phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

Similarly, considering the available dashboard list (to validate enforced authorisations) just needs to look at the list itself, not the rest of the page.  (and yes, that does say “Protals” – even developers have fat fingers sometimes ;-) )

phantomcss.screenshot('div.HeaderSharedProtals','Dashboard list');

Using this functionality means that the generated snapshots used for comparison can exclude things like the alerts bar (which may appear or disappear between tests).

The Devil’s in the Detail

I am in no doubt that the method described above has definitely got its place in the regression testing arsenal for OBIEE. What I am yet to be fully convinced of is quite to what extent. My beef with Selenium et al is the level of detail one has to get in to when writing tests – identifying strings to test for, their location in the DOM, and so on. Yet above in my CasperJS/PhantomCSS examples, I have DOM selectors too, so is this just the same problem? At the moment, I don’t think so. For Selenium, to build a comprehensive test, you have to dissect the DOM for every single test you want to build. Whereas with CasperJS/PhantomCSS I think there is the need to write a basic framework for OBIEE (the basics of which are provided in this post; you’re welcome), which can then be parameterised based on dashboard name and page only. Sure, additional types of tests may need new code, but it would be more reusable.

Given that OBIEE doesn’t come with an out-of-the-box test rig, whatever we build to test it is going to be bespoke, whether it’s nqcmd, Selenium, JMeter, LoadRunner, OATS, QTP, etc – the smart money is on picking the option that will be the most flexible, most scalable, easiest to maintain, and take the least effort to develop. There is no one “program to rule them all” – an accurate, comprehensive, and flexible test suite is invariably going to utilise multiple components focussing on different areas.

In the case of regression testing – what is the aim of the testing? What are you looking to validate hasn’t broken after what kind of change?  If all that’s changed in the system is the DBAs adding some indexes or partitioning to the data, I really would not be going anywhere near the front end of OBIEE. However, more complex changes affecting the Presentation Catalog and the RPD can be well covered by this technique in conjunction with nqcmd. Visual regression testing will give you a pass/fail, but then it’s up to you to decipher the images, whereas nqcmd will give you a pass/fail but also an actual set of data to show what has changed.

Don’t forget that other great tool – you! Or rather, you and your minions, who can sit at OBIEE for five minutes and spot certain regressions that would take orders of magnitude longer to build a test to locate. Things like testing for UI/UX changes between OBIEE versions are realistically handled manually. The testing of dashboards can be automated, but some checks can be done by hand faster than I can even type the requirement, let alone build a test to validate it – does clicking on the save icon bring up the save box? Well, go click for yourself – done? Next test.

Summary

I have just scratched the surface of what is possible with headless browser scripting for testing OBIEE. Being able to automate and capture the results of browser interactions as we’ve seen above is hugely powerful. You can find the CasperJS API reference here if you want to find out more about how it is possible to interact with the web page as a “user”.

I’ve put the complete PhantomCSS script online here. Let me know in the comments section or via twitter if you do try it out!

Thanks to Christian Berg and Gianni Ceresa for reading drafts of this article and providing valuable feedback. 

Categories: BI & Warehousing

Trickle-Feeding Log Data into the HBase NoSQL Database using Flume

Rittman Mead Consulting - Wed, 2014-05-21 16:02

The other day I posted an article on the blog around using Flume to transport Apache web log entries from our website into Hadoop, with the final destination for the entries being an HDFS file that essentially mirrors the contents of the webserver log file. Once you’ve set this transport mechanism up, you could create a Hive table over the HDFS files, for example, or further transform the data using Pig, Spark or some other mechanism.

When you load data into HDFS files though, there are a couple of things you need to be aware of; HDFS is optimised for large, streaming reads of files stored in very large disk blocks, with the classic use-case being MapReduce transformations that crunch large sets of incoming data and hand off the results to another process. What it’s not good at is random retrievals of single file records, something you’ll notice if you try to return a single row from a Hive table. Moreover, HDFS files are write-once, no updates or overwrites, which is why Hive only supports SELECTS and not UPDATES or DELETES. Altogether, whilst HDFS is great for landing and then processing large chunks of data, if you’re looking for more granular, database-type storage on Hadoop, you’ll need to think of something else.

And within the context of Cloudera Hadoop, that other thing is HBase, a “NoSQL” database that’s also open-source and runs on the Hadoop framework. Whilst you can work with HBase in similar ways to how you work with relational databases – you can create columns, load data into it, insert and update data and so forth – HBase and NoSQL are in lots of ways the complete opposite of relational databases like Oracle Database, as they trade off things we normally take for granted but that have performance and scalability impacts – ACID transactions, the ability to support complex table relationships, very rich query languages and application support – for extreme scalability and flexibility. If you’re scared of losing your data then HBase is one of the better NoSQL databases, with strong (rather than “eventual”) consistency, automatic sharding and lots of high-availability features, but it’s not designed for running your payroll (yet).

One reason we might want to land data in HBase or another NoSQL database, rather than in regular HDFS files, is if we then want to do fast individual record lookups within the landed data. Another reason would be HBase’s support for complex record types, making it easy to store for example nested XML datasets, and its ability – like the Endeca Server – to hold completely different sets of “columns” for each row in the database, and even version those rows giving us almost a “multi-dimensional” database. Internally, HBase stores data as key-value pairs giving it the ability to hold completely different data in each database row, and under the covers HBase data is in turn stored in indexed “StoreFiles” within HDFS, giving it HDFS’s scalability and access to the Hadoop framework, but adding fast random access to individual records.

NewImage

Where HBase (and most NoSQL databases) get complicated though is that there’s no SQL*Developer or TOAD to create tables, and no SQL or PL/SQL to load and manipulate them – it’s all done through Java and custom code. This article by Lars George, who gave the Hadoop Masterclass at last week’s BI Forum, goes into a bit more detail, along with his HBase slides and his book, “HBase: The Definitive Guide”.
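To give a flavour of that, here’s a minimal sketch of creating a row and reading it back using the standard (synchronous) HBase Java client – note that this is just an illustration, the row key and values are made up, and it’s a different client library to the asynchbase one used by the Flume sink later in this post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();      // picks up hbase-site.xml from the classpath
        HTable table = new HTable(conf, "apache_access_log");  // the table we create in the HBase shell below

        // the "insert": a Put against a row key, adding a cell in family "common", qualifier "hostname"
        Put put = new Put(Bytes.toBytes("1400628560999"));
        put.add(Bytes.toBytes("common"), Bytes.toBytes("hostname"), Bytes.toBytes("89.154.89.101"));
        table.put(put);

        // the "select": a random-access Get on the same row key
        Result result = table.get(new Get(Bytes.toBytes("1400628560999")));
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("common"), Bytes.toBytes("hostname"))));

        table.close();
    }
}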

So let’s look at a simple example of loading Apache CombinedLogFormat log file entries into HBase, using Flume to transport and ingest the data from our webserver into Hadoop – an example put together, again, by Nelio Guimaraes from the RM team. We’ll start by defining the HBase table, which like regular relational tables has rows, but which has the concept of column families and column qualifiers rather than just columns. In practice, a column family + qualifier name makes what we’d normally think of as a column, but crucially, under the covers, columns within a family are stored together on disk, like column-store relational databases, making them fast to query and randomly access. Like a spreadsheet or OLAP database, each combination of row and column family/qualifier is called a “cell”, and moreover only populated cells are stored on disk, with the added bonus of cell entries being timestamped, giving us the ability to retrieve previous versions of cell entries, like the temporal query feature in Oracle Database 12c.

NewImage

For more details on how HBase stores data, and how HBase schemas are defined, the white paper “Introduction to HBase Schema Design” by Cloudera’s Amandeep Khurana is a good reference point and introduction. So let’s go into the HBase shell and create a table to contain our log data; we’ll define it as containing three column families (“common”, “http” and “misc”), with the actual column qualifiers defined at the point we load data into the table – one of the key features of HBase, and most NoSQL databases, is that you can introduce new columns into a store at the point of loading, just by declaring them, with each row potentially containing its own unique selection of columns – which is where Endeca Server gets its ability to store “jagged datasets” with potentially different attribute sets held for groups of rows.

[root@cdh5-node1 ~]# hbase shell
14/05/21 06:00:07 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell

hbase(main):001:0> list
TABLE                                                                           
0 row(s) in 2.8030 seconds

=> []
hbase(main):002:0> create 'apache_access_log', 
hbase(main):003:0* {NAME => 'common'},
hbase(main):004:0* {NAME => 'http'},
hbase(main):005:0* {NAME => 'misc'}
0 row(s) in 0.5460 seconds

In this example, the way we’re going to populate the HBase table is to use Flume; like the Flume and HDFS example the other day, we’ll use a “sink”, in this case an HBase sink, to take the incoming Flume activity off the channel and load it into the HBase table. Flume actually has two HBase sinks; one called HBaseSink which writes synchronously (more straightforward but slower) and another called AsyncHBaseSink which writes asynchronously, potentially with higher overall throughput than synchronous writes and with full consistency even if there’s a failure (based on replaying the channel data), but with a slightly more complex serialisation approach. We’ll use the asynchronous sink in this example, and assuming you’ve already got the source configuration file set up (see the previous blog post on Flume and HDFS for an example), the target Flume conf file in our case looked like this:

## TARGET AGENT ##
## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent-hbase.conf -n collector

#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2 mc3

## Channels ##
## Source writes to 3 channels, one for each sink
collector.channels = mc1 mc2 mc3

#http://flume.apache.org/FlumeUserGuide.html#memory-channel

collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 1000

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 1000

collector.channels.mc3.type = memory
collector.channels.mc3.capacity = 1000

## Sinks ##
collector.sinks = LocalOut HadoopOut HbaseOut

## Write copy to Local Filesystem 
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /user/root/flume-channel/%{log_type}/%d%m%Y
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

###############################################################
# HBase sink config 
###############################################################
collector.sinks.HbaseOut.type = org.apache.flume.sink.hbase.AsyncHBaseSink
collector.sinks.HbaseOut.channel = mc3
collector.sinks.HbaseOut.table = apache_access_log
collector.sinks.HbaseOut.columnFamily = common
collector.sinks.HbaseOut.batchSize = 5000
collector.sinks.HbaseOut.serializer = com.hbase.log.util.AsyncHbaseLogEventSerializer
collector.sinks.HbaseOut.serializer.columns = common:rowKey,common:hostname,common:remotehost,common:remoteuser,common:eventtimestamp,http:requestmethod,http:requeststatus,http:responsebytes,misc:referrer,misc:agent

A few points to note:

  • The collector.sinks.HbaseOut.type setting determines the sink type we’ll use, in this case org.apache.flume.sink.hbase.AsyncHBaseSink
  • collector.sinks.HbaseOut.table sets the HBase table name we’ll load, “apache_access_log”
  • collector.sinks.HbaseOut.serializer.columns actually defines the column qualifiers, in this case mapping incoming serialised log file rows into a set of HBase column families and qualifiers
  • collector.sinks.HbaseOut.serializer is the most important bit – it tells the sink which Java class (the “serializer”) to use to turn the incoming Flume events into HBase loads.

And it’s this serializer, the Java program that does the actual loading of the HBase table, that’s the final piece of the jigsaw. There are standard templates to use when writing this piece of code, and in our case the serializer looked like this:

package com.hbase.log.util;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.SimpleHbaseEventSerializer.KeyType;
import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;

import com.google.common.base.Charsets;
/**
 * A serializer for the AsyncHBaseSink, which splits the event body into
 * multiple columns and inserts them into a row whose key is available in
 * the headers
 *
 * Originally from https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
 */
public class AsyncHbaseLogEventSerializer implements AsyncHbaseEventSerializer 
{
    private byte[] table;
    private byte[] colFam;
    private Event currentEvent;
    private byte[][] columnNames;
    private final List<PutRequest> puts = new ArrayList<PutRequest>();
    private final List<AtomicIncrementRequest> incs = new ArrayList<AtomicIncrementRequest>();
    private byte[] currentRowKey;
    private final byte[] eventCountCol = "eventCount".getBytes();
    // private String delim;

    @Override
    public void initialize(byte[] table, byte[] cf) 
    {
        this.table = table;
        this.colFam = cf;
    }

    @Override
    public void setEvent(Event event) 
    {
        // Set the event and verify that the rowKey is not present
        this.currentEvent = event;
        String rowKeyStr = currentEvent.getHeaders().get("rowKey");
        //if (rowKeyStr == null) {
        //  throw new FlumeException("No row key found in headers!");
        //}
        //currentRowKey = rowKeyStr.getBytes();
    }

    public String[] logTokenize(String event)
    {

        String logEntryPattern = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"";
        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(event);

        if (!matcher.matches()) 
        {
            System.err.println("Bad log entry (or problem with RE?):");
            System.err.println(event);
            return null;
        }

        String[] columns = new String[matcher.groupCount()+1];
        
        columns[0]= Long.toString(System.currentTimeMillis());
        
        for (int i = 1; i <= matcher.groupCount(); i++) 
        {
            columns[i] = matcher.group(i);
        }

        return columns;

    }

    @Override
    public List<PutRequest> getActions() 
    {
        // Split the event body and get the values for the columns
        String eventStr = new String(currentEvent.getBody());
        long unixTime = System.currentTimeMillis();
        //String[] cols = eventStr.split(",");
        //String[] cols = eventStr.split(regEx);
        //String[] cols = eventStr.split("\\s+");
        //String[] cols = eventStr.split("\\t");
        //String[] cols = eventStr.split(delim);
        String[] cols = logTokenize(eventStr);
        puts.clear();
        String[] columnFamilyName;
        byte[] bCol;
        byte[] bFam;
        for (int i = 0; i < cols.length; i++) 
        {
            //Generate a PutRequest for each column.
            columnFamilyName = new String(columnNames[i]).split(":");
            bFam = columnFamilyName[0].getBytes();
            bCol = columnFamilyName[1].getBytes();

            if (i == 0) 
            {
                currentRowKey = cols[i].getBytes();
            }
            //PutRequest req = new PutRequest(table, currentRowKey, colFam,
            //columnNames[i], cols[i].getBytes());
            PutRequest req = new PutRequest(table, currentRowKey, bFam,
            bCol, cols[i].getBytes());
            puts.add(req);
        }
        return puts;
    }

    @Override
    public List<AtomicIncrementRequest> getIncrements() 
    {
        incs.clear();
        //Increment the number of events received
        incs.add(new AtomicIncrementRequest(table, "totalEvents".getBytes(), colFam, eventCountCol));
        return incs;
    }

    @Override
    public void cleanUp() 
    {
        table = null;
        colFam = null;
        currentEvent = null;
        columnNames = null;
        currentRowKey = null;
    }

    @Override
    public void configure(Context context) 
    {
        //Get the column names from the configuration
        String cols = new String(context.getString("columns"));
        String[] names = cols.split(",");
        columnNames = new byte[names.length][];
        int i = 0;
        
        for(String name : names) 
        {
            columnNames[i++] = name.getBytes();
        }
        
        //delim = new String(context.getString("delimiter"));
    }

    @Override
    public void configure(ComponentConfiguration conf) {}
}

HBase, rather than supporting the regular SELECTs and INSERTs we’re used to with Oracle, instead uses “get” and “put” methods to retrieve, and store, data – along with “delete” and “scan”. The regular synchronous HBase sink uses these methods directly, taking data off the Flume channel and inserting it into the HBase table (or indeed, updating existing rows based on the row key), whilst the asynchronous sink uses a layer in-between the incoming data and the write, allowing data (or “events”) to continue streaming in even if all the downstream data hasn’t yet been committed. It’s this code though that maps each incoming bit of data – in this case, a parsed log file entry – to column families and qualifiers in the HBase table, and you’d need to write new code like this, or amend the existing one, if you wanted to load other HBase tables in your Hadoop cluster – a long way from the point-and-click ETL approach we get with ODI, but a lot more flexible too (if that’s what you want).

Then it’s a case of compiling the Java code, like this:

mkdir com; mkdir com/hbase; mkdir com/hbase/log; mkdir com/hbase/log/util
vi com/hbase/log/util/AsyncHbaseLogEventSerializer.java
export CLASSPATH=/usr/lib/flume-ng/lib/*
javac com/hbase/log/util/AsyncHbaseLogEventSerializer.java
jar cf LogEventUtil.jar com
jar tf LogEventUtil.jar com
chmod 775 LogEventUtil.jar
cp LogEventUtil.jar /usr/lib/flume-ng/lib

Next, we had to run the following command before enabling Flume with this setup, because of an issue we found where Zookeeper stopped Flume from working:

mv /etc/zookeeper/conf/zoo.cfg /etc/zookeeper/conf/zoo.cfg-unused

and finally, we start up the Flume target server agent, followed by the source one (again see the previous article for setting up the source Flume agent):

flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent-hbase.conf -n collector

Then, after a while, log data starts getting loaded into the HBase table. You can check on it using Hue, and the HBase Browser:

NewImage

Or you can go back into the HBase shell and run the scan command to view the data, with each row representing a cell in the overall table storage:

hbase(main):001:0> scan 'apache_access_log'
ROW                   COLUMN+CELL                                               
 1400628560331        column=common:eventtimestamp, timestamp=1400628560350, val
                      ue=20/May/2014:15:28:06 +0000                             
 1400628560331        column=common:hostname, timestamp=1400628560335, value=89.
                      154.89.101                                                
 1400628560331        column=common:remotehost, timestamp=1400628560336, value=-
 1400628560331        column=common:remoteuser, timestamp=1400628560338, value=-
 1400628560331        column=common:rowKey, timestamp=1400628560333, value=14006
                      28560331                                                  
 1400628560331        column=http:requestmethod, timestamp=1400628560352, value=
                      GET / HTTP/1.1                                            
 1400628560331        column=http:requeststatus, timestamp=1400628560356, value=
                      200                                                       
 1400628560331        column=http:responsebytes, timestamp=1400628560358, value=
                      9054                                                      
 1400628560331        column=misc:agent, timestamp=1400628560377, value=Mozilla/
                      5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.
                      14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14     
 1400628560331        column=misc:referrer, timestamp=1400628560359, value=-    
 1400628560344        column=common:eventtimestamp, timestamp=1400628560383, val
                      ue=20/May/2014:15:28:06 +0000

This is all great, and a good starting point if you plan to process your data with other Java programs as the next step. But what if you want to view the data in a more convenient way, perhaps as a regular table? To do that you can use Hive again, this time using Hive’s HBase integration features to tell it the data is stored in HBase format, and to let it know how to display the various HBase column families and qualifiers. In our case, the DDL to create the corresponding Hive table looks like this:

DROP TABLE IF EXISTS hive_apache_access_log;
CREATE EXTERNAL TABLE hive_apache_access_log
(
unixtimestamp string,
eventtimestamp string,
hostname string,
remotehost string,
remoteuser string,
requestmethod string,
requeststatus string,
responsebytes string,
agent string,
referrer string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,common:eventtimestamp,common:hostname,common:remotehost,common:remoteuser,http:requestmethod,http:requeststatus,http:responsebytes,misc:agent,misc:referrer')
TBLPROPERTIES ('hbase.table.name' = 'apache_access_log');

giving us the ability, either from the Hive shell like this, or from tools like OBIEE and ODI, to query the NoSQL database and bring its data into more regular, relational data stores.

hive> select * from hive_apache_access_log;
OK
1400628560331   20/May/2014:15:28:06 +0000  89.154.89.101   -   -   GET / HTTP/1.1  200 9054    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14 -
1400628560344   20/May/2014:15:28:06 +0000  89.154.89.101   -   -   GET /wp-content/plugins/crayon-syntax-highlighter/css/min/crayon.min.css?ver=2.5.0 HTTP/1.1 304 -   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14 http://www.rittmanmead.com/
1400628560345   20/May/2014:15:28:06 +0000  89.154.89.101   -   -   GET /wp-content/plugins/jetpack/modules/widgets/widgets.css?ver=20121003 HTTP/1.304 -   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14 http://www.rittmanmead.com/
...

We’ll be covering more on HBase, and Oracle’s NoSQL Database, in future articles on the blog.

Categories: BI & Warehousing

Photos and Presentation Downloads from the Rittman Mead BI Forum 2014

Rittman Mead Consulting - Mon, 2014-05-19 14:59

Well we’re back in Brighton, UK now after the second successful week of the Rittman Mead BI Forum 2014. In Week 1, we went to the Seattle Hotel in Brighton (read the recap here), and then straight-after we flew over to Atlanta, GA, to run the second week – and it was possibly even better than Brighton ;-)

Congratulations to Omri Traub, winner of the Atlanta Best Speaker award – the first person from Oracle to win, in fact – and to all of the other presenters who helped put together an excellent event. If you’re interested, I’ve uploaded a bunch of photos from both Brighton and Atlanta to Flickr, and you can view the BI Forum 2014 photo set here.

Biforumusa

As usual, where we’ve got permission (or the PDF) from the presenter, we’re making all of the presentations available for download to both attendees and non-attendees – not everyone can make it to the event, but we don’t want you to miss out. We’re also very grateful to Lars George and Cloudera for making their Hadoop Masterclass slides available too – thanks everyone.

Other than that – thanks again to everyone who attended, and hopefully we’ll see you all again next year!

Categories: BI & Warehousing

List of our Recent “Getting Started” Hadoop Articles

Rittman Mead Consulting - Sun, 2014-05-18 04:54

We’ve published a number of “getting started with Hadoop” articles over the past few months, but these aren’t always easy to find on the blog. I’ve therefore compiled a list of the more recent ones, which you’ll find below:

Categories: BI & Warehousing

Trickle-Feeding Log Files to HDFS using Apache Flume

Rittman Mead Consulting - Sun, 2014-05-18 03:20

In some previous articles on the blog I’ve analysed Apache webserver log files sitting on a Hadoop cluster using Hive, Pig and, most recently, Apache Spark. In all cases the log files were already sitting on the Hadoop cluster, having been SFTP’d from the webserver to my local workstation and then uploaded to HDFS, the Hadoop distributed filesystem, using Hue; the only way to add to them is to repeat the process and manually copy them across from our webserver. But what if I want these log files to be copied across automatically, in a kind of “trickle-feed” process similar to how Oracle GoldenGate trickle-feeds database transactions to a data warehouse? Enter Apache Flume, a component within Hadoop and the Cloudera CDH4/5 distribution of Hadoop, which does exactly this.

Flume is an Apache project within the overall Hadoop ecosystem that provides a reliable, distributed mechanism for collecting, aggregating and moving large amounts of log data. Similar to GoldenGate it has transaction collectors, mechanisms to reliably transmit data from source to target, and mechanisms to write those log events to a centralised data store, for example HDFS. It’s free and comes with Cloudera CDH, and coupled with something at the target end to then process and work with the incoming log entries, is a pretty powerful and flexible way to transmit log-type entries from (potentially) multiple source providers to a central Hadoop cluster.

To take our example, we’ve got a webserver that’s generating our Apache CombinedLogFormat log entries as users generate activity on the website. We then set up Flume agents on the source webserver, and on the Hadoop client node that’s going to receive the log entries and write them to HDFS just like any other file activity. The Flume agent on the webserver source “tail”s the Apache access.log file, copying across entries as they’re made (more or less), so that the target HDFS log file copies are kept up to date with individual log entries, not just whole log files as they’re closed off, with the diagram below showing the overall schematic:

Flume Topology

Down at the Flume component level, Flume consists of agents, Java processes that sit on the source, target and any intermediate servers; channels, intermediate staging points that can persist log entries to disk, database or memory; and sinks, processes that take log transactions out of a channel and write them to disk. Flume is designed to be distributed and resilient, and won’t take the source down if the target Hadoop environment isn’t available; if this type of situation occurs, transactions will slowly fill-up the channel used by the source agent until such time as it runs out of space, and then further log transactions are lost until the target comes back up and the source agent’s channel regains some spare space. The diagram below, from the Cloudera blog about the latest generation of Flume (Flume NG, for “Next Generation”) shows the Flume product topology:

NewImage

whilst the next diagram shows how Flume can collect and aggregate log entries from multiple servers, and then combine them into one log stream sent to a single target.

NewImage

In our example, that’s all there is to it; in more complex examples, perhaps where the source is sending XML log entries, you’d need a downstream processor on the target platform to decode, deserialise or parse the incoming log files – Flume is just a transport mechanism and doesn’t do any transformation itself. You can also choose how the log entries are held by each of the agents’ channels; in the example we’re going to use, channel data is just held in-memory which is fast to run and setup, but you’d lose all of your data in the process if the server went down. Other, more production-level processes would persist the channel entries to file, or even a mySQL database.

For our setup we need two agents, one on the source and one on the target server, each of which has its own configuration file. The source agent configuration file looks like this, with key entries called out underneath it:

## SOURCE AGENT ##
## Local instalation: /home/ec2-user/apache-flume
## configuration file location:  /home/ec2-user/apache-flume/conf
## bin file location: /home/ec2-user/apache-flume/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent

# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /etc/httpd/logs/access_log
source_agent.sources.apache_server.batchSize = 1
source_agent.sources.apache_server.channels = memoryChannel
source_agent.sources.apache_server.interceptors = itime ihost itype

# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
source_agent.sources.apache_server.interceptors.itime.type = timestamp

# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
source_agent.sources.apache_server.interceptors.ihost.type = host
source_agent.sources.apache_server.interceptors.ihost.useIP = false
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host

# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
source_agent.sources.apache_server.interceptors.itype.type = static
source_agent.sources.apache_server.interceptors.itype.key = log_type
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on Hadoop Node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = 81.155.163.172
source_agent.sinks.avro_sink.port = 4545

  • Source is set to “apache_server”, i.e. an Apache HTTP server
  • The capture mechanism is the Linux “tail” command
  • Log entries are held by the channel mechanism in-memory, rather than to file or database
  • Timestamp is used by the source collector to tell which entries are new
  • The agent then sends the log entries to a corresponding Flume agent on the Hadoop Cluster, in this case an IP address that corresponds to my network’s external IP address, with Flume network traffic then NATted by my router to cdh4-node1.rittmandev.com, the client node in my CDH4.6 Hadoop cluster running on VMWare.

The target server in my Hadoop cluster then has a corresponding configuration file set up, looking like this:

## TARGET AGENT ##
## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector

#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2

#http://flume.apache.org/FlumeUserGuide.html#memory-channel

collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100

## Sinks ##
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem 
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /user/root/flume-channel/%{log_type}/%y%m%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

Key entries in this configuration file are:

  • Apache AVRO is the file format we’re using to transmit the data, and Flume is working on port 4545
  • There’s two sink collector channels defined – “mc1” for writing file entries to the local server filesystem, and one to HDFS
  • The maximum number of events (log entries) Flume will store in the various channels (log entry persistence stores) is 100, meaning that if the target platform goes down and more than 100 log transactions back-up, then further ones will get lost until we can clear the channel down. Of course this limit can be increased, assuming there’s memory or disk spare.

I then SSH into the target Hadoop node and start the Flume agent, like this:

[root@cdh4-node1 ~]# flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector
Info: Including Hadoop libraries found via (/usr/bin/hadoop) for HDFS access
Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.6.1.jar from class path
...
14/05/18 18:15:29 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
14/05/18 18:15:29 INFO hdfs.BucketWriter: Creating /user/root/flume-channel/apache_access_combined/18052014/FlumeData.1400433329254.tmp

and then repeat the step for the source webserver, like this:

[ec2-user@ip-10-35-143-131 apache-flume]$ sudo bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent
Warning: JAVA_HOME is not set!
+ exec /usr/bin/java -Xmx20m -cp '/home/ec2-user/apache-flume/conf:/home/ec2-user/apache-flume/lib/*' -Djava.library.path= org.apache.flume.node.Application -f conf/flume-src-agent.conf -n source_agent

Finally, moving across to Hue I can see new log entries being written to the HDFS file system:

NewImage
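If you’d rather check from the command line than Hue, something like the following (using the date-partitioned path shown in the agent output above) will list the landed files and show the most recent entries – the exact filenames will obviously differ on your system:

[root@cdh4-node1 ~]# hadoop fs -ls /user/root/flume-channel/apache_access_combined/18052014
[root@cdh4-node1 ~]# hadoop fs -cat /user/root/flume-channel/apache_access_combined/18052014/FlumeData.* | tail -5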

So there you go – simple transport of webserver log entries from a remote server to my Hadoop cluster, via Apache Flume – thanks again to Nelio Guimaraes from the RM team for setting the example up.

Categories: BI & Warehousing

Mobile App Designer mis-configuration error

Rittman Mead Consulting - Wed, 2014-05-14 09:21

I’ve been doing some work recently with OBIEE’s new Mobile App Designer (MAD). It’s a great bit of software that I’m genuinely impressed with, but it’s got its little v1 quirks, and helpful error messages are not its forte. I hit a MADdening (sorry) problem with it that Google and My Oracle Support both drew blanks on, so I’m posting it here in case it helps out others with the same problem.

Setting up MAD is a bit of a fiddly process involving patching OBIEE (regardless of the base version you’re on – hopefully in the future it will get rolled into the patchsets) and performing other bits of setup detailed in the documentation. The problem that I hit manifested itself in two ways:

  1. Publishing an App to the Apps Library worked fine, but updating an existing App threw an error in the browser:
    Failed to publish /~weblogic/GCBC Mobile - Phone.xma:oracle.xdo.webservice.exception.AccessDeniedException: PublicReportService::executeUpdateTemplateForReport Failure: user has no access to report[/Apps Library//GCBC Mobile - Phone.xma] due to [Ljava.lang.StackTraceElement;@4e6106df
  2. Trying to subscribe to any App threw a generic error in the browser: “Error occurred while accessing server. Please contect administrator.” with the corresponding bipublisher.log showing: 
    [2014-05-13T16:49:53.449+01:00] [bi_server1] [WARNING] [] [oracle.xdo] [tid: 24] [userId: <anonymous>] [ecid: 3f3d2d8955322f32:2f756afc:145f4d10b2f:-8000-0000000000003eea,0] [APP: bimad#11.1.1] User (weblogic) with session id: q2fq8fkh66f85ghamsq164u9qs98itvnk0c826i is looking for object in biee path: /shared/Apps Library//GCBC.xma/_mreport.xma[[
    Object Error [Context: 0, code: QM3V3HLV, message: Invalid path (/shared/Apps Library//GCBC.xma/_mreport.xma) -- ]
    ]]
    [2014-05-13T16:49:53.450+01:00] [bi_server1] [WARNING] [] [oracle.xdo] [tid: 24] [userId: <anonymous>] [ecid: 3f3d2d8955322f32:2f756afc:145f4d10b2f:-8000-0000000000003eea,0] [APP: bimad#11.1.1] oracle.xdo.XDOException: Target app not found in the repository :/Apps Library//GCBC.xma[[
        at oracle.xdo.online.AppStoreIO.doPost_subscribeApp(AppStoreIO.java:311)
        at oracle.xdo.online.AppStoreIO.doPost(AppStoreIO.java:120)
    [...]

One of my esteemed Rittman Mead colleagues, Francesco Tisiot, pointed out that the path referenced in the errors has a double slash in it. On checking my configuration, I had indeed fat-fingered one of the settings. APPS_LIBRARY_FOLDER_LOCAL is defined in the <DOMAIN_HOME>/config/bipublisher/repository/Admin/Configuration/xmlp-server-config.xml file, and mine looked like this:

<property name="APPS_LIBRARY_FOLDER_LOCAL" value="/Apps Library/"/>

All I needed to do was to remove the trailing slash after Library:

<property name="APPS_LIBRARY_FOLDER_LOCAL" value="/Apps Library"/>

After restarting the bimad application deployment all was good again with the MAD world and I could republish and subscribe to Apps happily.

 

Categories: BI & Warehousing

Exploring Apache Spark on the New BigDataLite 3.0 VM

Rittman Mead Consulting - Mon, 2014-05-12 20:34

The latest version of Oracle’s BigDataLite VirtualBox VM went up on OTN last week, and amongst other things it includes the latest CDH5.0 version of Cloudera Distribution including Hadoop, as featured on Oracle Big Data Appliance. This new version comes with an update to MapReduce, moving it to MapReduce 2.0 (with 1.0 still there for backwards-compatibility) and with YARN as the replacement for the Hadoop JobTracker. If you’re developing on CDH or the BigDataLite VM you shouldn’t notice any differences with the move to YARN, but it’s a more forward-looking, modular way of tracking and allocating resources on these types of compute clusters that also opens them up to processing models other than MapReduce.

The other new Hadoop feature that you’ll notice with the BigDataLite VM and CDH5 is an updated version of Hue, the web-based development environment you use for creating Hive queries, uploading files and so on. As the version of Hue shipped is now Hue 3.5, there’s proper support for quoted CSV files in the Hue / Hive uploader (hooray), along with support for stripping the first (title) line, and an updated Pig editor that prompts you for command syntax (like the Hortonworks Pig editor).

NewImage

This new version of BigDataLite also seems to have had Cloudera Manager removed (or at least, it’s not available as usual at http://bigdatalite:7180), with instead a utility provided on the desktop that allows you to stop and start the various VM services, including the Oracle Database and Oracle NoSQL database that also come with the VM. Strictly speaking it’s actually easier to use than Cloudera Manager, but it’s a shame it’s gone as there are lots of monitoring and configuration tools in the product that I’ve found useful in the past.

NewImage

CDH5 also comes with Apache Spark, a cluster processing framework that’s being positioned as the long-term replacement for MapReduce. Spark is technically a programming model that allows developers to create scripts, or programs, that bring together operators such as filters, aggregators, joiners and group-bys using languages such as Scala and Python, but crucially this can all happen in-memory – making Spark potentially much faster than MapReduce for doing both batch and ad-hoc analysis.

This article on the Cloudera blog goes into more detail on what Apache Spark is and how it improves over MapReduce, and this article takes a look at the Spark architecture and how its approach avoids the multi-stage execution model that MapReduce uses (something you’ll see if you ever do a join in Pig or Hive). But what does some basic Spark code look like, using the default Scala language most people associate Spark with? Let’s take a look at some sample code, using the same Rittman Mead Webserver log files I used in the previous Pig and Hive/Impala articles.

You can start up Spark in interactive mode, like you do with Pig and Grunt, by opening a Terminal session and typing in “spark-shell”:

[oracle@bigdatalite ~]$ spark-shell
14/05/12 20:56:50 INFO HttpServer: Starting HTTP Server
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 0.9.0
/_/

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
14/05/12 20:56:56 INFO Slf4jLogger: Slf4jLogger started
14/05/12 20:56:56 INFO Remoting: Starting remoting
14/05/12 20:56:56 INFO Remoting: Remoting started; 
...

14/05/12 20:56:57 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140512205657-0002
14/05/12 20:56:57 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala>

Spark has the concept of RDDs, “Resilient Distributed Datasets”, which can be thought of as similar to relations in Pig and tables in Hive, but which crucially can be cached in RAM for improved performance when you need to access their dataset repeatedly. Like Pig, Spark features “lazy execution”, only processing the various Spark commands when you actually need to (for example, when outputting the results of a data-flow), so let’s run two more commands to load in one of the log files on HDFS, and then count the log file lines within it.

scala> val logfile = sc.textFile("logs/access_log")
14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) called with curMem=234759, maxMem=309225062
14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 75.5 KB, free 294.6 MB)
logfile: org.apache.spark.rdd.RDD[String] = MappedRDD[31] at textFile at <console>:15

scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
...
14/05/12 21:19:06 INFO SparkContext: Job finished: count at <console>:18, took 0.192536694 s
res7: Long = 154563

So the file contains 154563 records. Running the logfile.count() command again, though, brings back the count immediately as the RDD has been cached; we can explicitly cache RDDs directly in these commands if we like, by using the “.cache” method:

scala> val logfile = sc.textFile("logs/access_log").cache

So let’s try some filtering, retrieving just those log entries where the user is requesting our BI Apps 11g homepage (“/biapps11g/“):

scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
...
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403

Or I can create a dataset containing just those records that mention a 404 error code. Note that this simple string match will also catch “404” appearing anywhere else in the log line (in the byte count, for example), which is presumably why the count here is higher than the one the proper log file parser gives us later on:

scala> val errors = logfile.filter(_.contains("404"))
errors: org.apache.spark.rdd.RDD[String] = FilteredRDD[36] at filter at <console>:17
scala> errors.count()
...
res11: Long = 1577

Spark, using Scala as the language, has routines for filtering, joining, splitting and otherwise transforming data, but something that’s quite common in Spark is to create Java JAR files, typically from compiled Scala code, to encapsulate certain common data transformations, such as this Apache CombinedFormat log file parser available on Github from @alvinalexander, author of the Scala Cookbook. Once you’ve compiled this into a JAR file and added it to your SPARK_CLASSPATH (see his blog post for full details, which is also where the Spark examples below were taken from), you can start to work with the individual log file elements, like we did in the Hive and Pig examples where we parsed the log file using regexes.

scala> import com.alvinalexander.accesslogparser._
import com.alvinalexander.accesslogparser._

scala> val p = new AccessLogParser
p: com.alvinalexander.accesslogparser.AccessLogParser = com.alvinalexander.accesslogparser.AccessLogParser@6d32bc14

Then I can access the HTTP Status Code using its own property, like this:

def getStatusCode(line: Option[AccessLogRecord]) = {
  line match {
    case Some(l) => l.httpStatusCode
    case None => "0"
  }
}

logfile.filter(line => getStatusCode(p.parseRecord(line)) == "404").count
...
res12: Long = 1233

Then we can use a similar method to retrieve all of the “request” entries in the log file where the user got a 404 error, starting off by defining two methods that will help with the request parsing – note the use of the :paste command which allows you to block-paste a set of commands into the scala-shell:

scala> :paste
// Entering paste mode (ctrl-D to finish)
def getRequest(rawAccessLogString: String): Option[String] = {
  val accessLogRecordOption = p.parseRecord(rawAccessLogString)
  accessLogRecordOption match {
    case Some(rec) => Some(rec.request)
    case None => None
  }
}
def extractUriFromRequest(requestField: String) = requestField.split(" ")(1)
// Exiting paste mode, now interpreting.

getRequest: (rawAccessLogString: String)Option[String]
extractUriFromRequest: (requestField: String)String

Now we can run the code to output the URIs that have been generating 404 errors:

scala> :paste
// Entering paste mode (ctrl-D to finish)
logfile.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)).count
val recs = logfile.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_))
val distinctRecs = logfile.filter(line => getStatusCode(p.parseRecord(line)) == "404")
                      .map(getRequest(_))
                      .collect { case Some(requestField) => requestField }
                      .map(extractUriFromRequest(_))
                      .distinct
distinctRecs.count
distinctRecs.foreach(println)
...
/wp2/wp-content/uploads/2009/11/fin5.jpg/
wp2/wp-content/uploads/2009/08/qv10.jpg/
wp2/wp-content/uploads/2010/02/image32.png/2013/08/inside-my-home-office-development-lab-vmware-os-x-server/
wp-content/themes/optimize/thumb.php 
...
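
As an aside, if you’re more comfortable with Python than Scala, the simpler load-filter-count steps from earlier in this post translate fairly directly to PySpark, Spark’s Python API. The sketch below is my own illustration rather than output from the session above; it assumes the same logs/access_log file in HDFS and a “pyspark” shell session where the Spark context is again available as sc, and note that the compiled Scala log-parser JAR used above isn’t callable from Python, so the more detailed request parsing would need a Python equivalent.

>>> logfile = sc.textFile("logs/access_log")
>>> logfile.count()                                      # same 154563 log lines as the Scala session found
>>> errors = logfile.filter(lambda line: "404" in line)
>>> errors.count()                                       # lines containing "404" anywhere, as before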

So Spark is commonly considered the successor to MapReduce, and you can start playing around with it on the new BigDataLite VM. Unless you’re a Java (or Scala, or Python) programmer, Spark isn’t quite as easy as Pig or Hive to get into, but the potential benefits over MapReduce are impressive and it’d be worth taking a look. Hopefully we’ll have more on Spark on the blog over the next few months, as we get to grips with it properly.

Categories: BI & Warehousing

The State of the OBIEE11g World as of May 2014

Rittman Mead Consulting - Mon, 2014-05-12 03:00

I’m conscious I’ve posted a lot on this blog over the past few months about hot new topics like big data, Hadoop and Oracle Advanced Analytics, and not so much about OBIEE, which traditionally has been the core of Rittman Mead’s business and what we’ve written about most historically. Part of this is because there’s a lot of innovative stuff coming out of the big data world, but part of it is because there’s not been a big new OBIEE11g release this year, as we had last year with 11.1.1.7, before that 11.1.1.6, and so on. But there’s actually a lot of interesting work going on in the OBIEE11g world at the moment without a big headline release, and what with the Brighton RM BI Forum 2014 taking place last week and the product keynotes it gave us, I thought it’d be worth taking a look back at where we are in May 2014, where the innovation is happening and what’s coming up in the next few months for OBIEE.

Product Versions and Capabilities

As of the time of writing (May 11th 2014) we’re currently on the 11.1.1.7.x version of OBIEE, updated with a few patch sets since the original April 2013 release to include features such as Mobile App Designer. OBIEE 11.1.1.7.x saw a UI update to the new FusionFX theme, replacing the theme used from the 11.1.1.3 release, and brought in new capabilities such as Hadoop/Hive integration as well as a bunch of “fit-and-finish” improvements, such that at the time I referred to it as “almost like 11g Release 2”, in terms of usability, features and general “ready-for-deployment” quality.

NewImage

The other major new capability OBIEE 11.1.1.7 brought in was better integration with Essbase and the Hyperion-derived products that are now included in the wider Oracle BI Foundation 11g package. Earlier versions of OBIEE gave you the ability to install Essbase alongside OBIEE11g for the purposes of aggregate persistence into Essbase cubes, but the 11.1.1.7 release brought in a single combined security model for both Essbase and OBIEE, integration of EPM Workspace into the OBIEE environment and the re-introduction of Smartview as OBIEE’s (and Essbase’s) MS Office integration platform.

Outside of core OBIEE11g but complementing it, and the primary use-case for a lot of OBIEE customers, are the Oracle BI Applications; 2013 saw the release of Oracle BI Applications 11.1.1.7.1, followed just a few days ago by the latest update, OBIA 11.1.1.8.1. What these new releases brought in was the replacement of Informatica PowerCenter by Oracle Data Integrator, and a whole new platform for configuring and running BI Apps ETL jobs based around JEE applications running in WebLogic Server. Whilst at the time of OBIA 11.1.1.7.1’s release most people (including myself) advised caution in using this new release and said most new customers should still use the old 7.9.x release stream – because OBIA 11g skills would be scarce and, relatively speaking, it’d have a lot of bugs compared to the more mature 7.9.x stream – in fact I’ve only heard about 11g implementations since then, and they mostly seem to have gone well. OBIA 11.1.1.8.1 came out in early May 2014 and seems to be mostly additional app content, bug fixes and Endeca integration, and there’s still no upgrade path or 11g release for Informatica users, but the 11g release of BI Apps seems to be a known quantity now and Rittman Mead are getting a few implementations under our belt, too.

Oracle BI Cloud Service (BICS)

So that’s where we are now … but what about the future? As I said earlier, there hasn’t been a major release of OBIEE 11g this year and to my mind, where Oracle’s energy seems to have gone is the “cloud” release of OBIEE11g, previewed back at Oracle Openworld 2013 and due for release in the next few months. You can almost think of this as the “11.1.1.8 release” for this year, with the twist being that it’s cloud-only, but what’ll be interesting about this version of OBIEE11g is that it’ll probably be updated with new functionality on a much more regular basis than on-premise OBIEE, as Oracle (Cloud) will own the platform and be in a much better position to push through upgrades and control the environment than for on-premise installs.

NewImage

Headline new capabilities in this cloud release will include:

  • Rapid provisioning, with environments available “at the swipe of a credit card” and with no need to download and install the software yourself
  • Built-in storage, with Oracle’s schema-as-a-service/ApEx database environment backing the product and giving you a place to store data for reporting
  • A consumer-style experience, with wizards and other helper features aimed at getting users already familiar with on-premise OBIEE11g up and running quickly on this new cloud version
  • Access to core OBIEE11g features such as Answers, dashboards, mobile and a web-based repository builder

It’s safe to say that “cloud” is a big deal for Oracle at the moment, and it’s probably got as much focus within the OBIEE development team as Fusion Middleware / Fusion Apps integration had back at the start of OBIEE 11g. Part of this is down to technology trends going on outside of BI and OBIEE – customers are moving their IT platforms into the cloud anyway, so it makes sense for your BI to be there too, rather than being the only thing left back on-premise. But a big part of it is the benefits it gives Oracle, and the OBIEE product team: they can own and control much more of the end-to-end experience, giving them control over quality and many more customers on the latest version, and of course the recurring revenues Oracle gets from selling software-as-a-service in the cloud are valued much higher by the market than the one-off license sales they’ve relied on in the past.

But for customers, too, running BI and OBIEE in the cloud brings quite a few potential benefits – both in terms of Oracle’s official “BI in the Cloud Service”, and the wider set of options when you consider running full OBIEE in a public cloud such as Amazon AWS – see my Collaborate’14 presentation on the topic on Slideshare. There’s none of the hassle and cost of actually setting up the software on your own premises, and then doing upgrades and applying patches over time – “empty calories” that have to be spent but don’t bring any direct benefit to the business.  OBIEE in the Cloud also promises to bring a bit of independence to the business from IT, as they’ll be able to spin-up cloud BI instances without having to go through the usual procurement/provisioning cycle, and it’ll be much easier to create temporary or personal-use OBIEE environments for tactical or short-lived work particularly as you’ll only have to license OBIEE for the users and months you actually need it for, rather than buying perpetual licenses which might then sit on the shelf after the immediate need has gone.

Data Visualization, and the Competition from Tableau

It’s probably safe to say that, when OBIEE 11.1.1.3 came out back in 2010, its main competitors were other full-platform, big vendor BI products such as SAP Business Objects and IBM Cognos. Now, in 2014, going by what we’re hearing anecdotally and from our own sales activity around the product, the main competitor we hear OBIEE 11g coming up against is Tableau. Tableau’s quite a different beast to OBIEE – like QlikTech’s QlikView it’s primarily a desktop BI tool that over the years has gained some server-based capabilities, but what it does well is get users started fast and give them the ability to create compelling and beautiful data visualisations, without spending days and weeks building an enterprise metadata layer and battling with their IT department.

Of course we all know that as soon as any BI tool gets successful, it’s inevitable that IT will have to get involved at some point, and you’re going to have to think about enterprise definitions of metrics, common dimensions and so forth, and it’s this area where OBIEE does so well, primarily (in my mind) selling well to the IT department, and with Oracle focusing most of their attention recently on the integration element of the BI world, making it easy to link your ERP and CRM applications to your BI stack, and the whole lot playing well with your corporate security and overall middleware stack. But none of that stuff is important to end users, who want a degree of autonomy from the IT department and something they can use to quickly and painlessly knock together data visualisations in order to understand the data they’re working with.

So to my mind there are two aspects to what Tableau does well that OBIEE needs to have an answer for: ease of setting up and getting started, and the ability to create data visualisations beyond the standard bar charts and line charts people most associate with OBIEE. And there are a couple of initiatives already in place, and coming down the line, from Oracle that aim to address the first point; BI Publisher, for example, now gives users the option to create a report directly off data in the RPD without the intermediate requirement to create a separate data model, and presents a list of commonly-used report formats at report creation to make the process a bit more “one-stop”.

NewImage

Another initiative that’ll probably come along first as part of the BI in the Cloud Service is personal data-mashups; this is a way for users to upload, from spreadsheets or CSV files, data that they want to add to their standard corporate metrics so they can produce reports that aren’t currently possible with the standard curated RPD from corporate IT. Metrics users add in this way will have their data stored (probably) in the catalog, but marked in a way that makes it clear they’re not “gold-standard” ones, with the aim of the feature being to avoid the situation where users export their base data from OBIEE into Excel and then bring in the additional data there. It does raise a few questions in terms of where the data goes, how it all gets stored and how well it’d work on an on-premise install, but if you work on the basis that users are going to do this sort of thing anyway, it’s better they do it within the overall OBIEE environment than dump it all to Excel and do their worst there (so to speak).

Another even more intriguing new product capability that’s coming along, and is technically possible with the current 11.1.1.7 release, is the concept of “mini-apps”. Mini-apps are something Philippe Lion’s “SampleApp” team have been working on for a while now, and are extensions to core OBIEE that are enabled via Javascript and allow developers to create self-contained applications, including table creation scripts, to solve a particular user problem or requirement. This Youtube video from one of Philippe’s team goes through the basic concept, with custom Javascript used to unpack a mini-app setup archive and then create tables, and set up the analysis views, to support requirements such as linear regression and trend analysis.

NewImage

It’s likely the BI Cloud Service will take this concept further and introduce a more formalised way of packaging-up BI mini-applications and deploying them quickly to the cloud, and also maybe introduce the concept of a BI App Store or Marketplace where pre-built analytic solutions can be selected and deployed faster even than if the user tried to build the same themselves using Excel (or Tableau, even).

Of course the other aspect to Tableau is its data visualisation capabilities, and while OBIEE 11.1.1.7 improved in this area a bit – with trellis charts being introduced and a new visualisation suggestion engine – it’s probably fair to say that OBIEE 11g has dropped behind the industry state of the art in this area. What’s been interesting to see though, over the past twelve months, is the widespread adoption of technologies such as D3 and other third-party visualisation tools as additional ways to add graphs and other visuals to OBIEE, with Accenture’s Kevin McGinley showcasing the art of the possible on his blog recently (parts 1, 2 and 3) and presenting on this topic at the Atlanta Rittman Mead BI Forum later this week. Techniques such as those described by Kevin involve deploying separate third-party visualisation libraries such as D3 and Flot to the WebLogic server running OBIEE, and then calling those libraries using custom code contained within narrative views; while these aren’t as developer-friendly as built-in visualisation features in the tool, they do give you the ability to go beyond the standard graphs and tables provided by OBIEE 11g, as Tom Underhill from our team explained in a blog post on OBIEE11g and D3 back in 2013.

NewImage

The upcoming 2014 OBIEE11g SampleApp will most probably feature some more third-party and externally-produced visualisations along these lines, including new HTML5 and Javascript integration capabilities for 11.1.1.7’s mapping feature:

NewImage

and an example of integrating ADF charts – which have far more options and capabilities than the subset used in OBIEE 11g – into the OBIEE dashboard. All of this is possible with OBIEE 11.1.1.7 and standard JDeveloper/ADF, with the video previewing the SampleApp PoC Demo going through the integration process at the end.

NewImage

Community Development of OBIEE Best Practices, Techniques, Product Add-Ons

One of the advantages of OBIEE now being a mature and well-known product is that best practices are starting to emerge around deployment, development and performance optimisation. For example, our own Stewart Bryson has been putting a lot of thought into agile development and OBIEE, and topics such as automated deployment of OBIEE RPDs using Git and scripting, giving us a more industry-standard way of building and deploying RPDs now that we’ve got the ability to work with repository metadata in a more atomic format. Robin Moffatt, also from Rittman Mead, has published many articles over the past few years on monitoring, measuring and testing OBIEE performance, again giving us a more industry-standard way of regression-testing OBIEE reports and monitoring the overall OBIEE experience using open-source tools.

There’s even a third-party add-on industry for OBIEE, with Christian Screen’s / Art of BI’s “BI Teamwork” being the showcase example; OBIEE still doesn’t have any collaboration or social features included in the base product, unless you count wider integration with WebCenter as the answer for this, and Christian’s BI Teamwork product fills this gap by integrating collaboration, social and SaaS integration features into the core product including localisation into key overseas OBIEE markets.

NewImage

Hadoop and Big Data Integration

You’ll probably have guessed from the amount of coverage we’ve given the topic on the blog over the past few months, but we think Hadoop and big data, and particularly the technologies that will spin off from this movement, are quite a big deal and will revolutionise what we think of as analytics and BI over the next few years. Most of this activity has taken place outside the core world of OBIEE, using tools such as Cloudera Impala, R and Tableau as the default visualisation tool, but OBIEE will play a role too, primarily through its ability to incorporate big data insights and visualisations into the core enterprise semantic model and corporate dashboards.

What this means in practice is that OBIEE needs to be able to connect to Hadoop data sources such as Hive and Impala, and also provide a means to incorporate, visualise and explore data from non-traditional sources such as NoSQL and document databases. OBIEE 11.1.1.7 made a first step in this direction with its ability to use Apache Hive as a datasource, but this really is a minimal first step in support for big data sources, as Hive is generally considered too slow for ad-hoc query use and the HiveServer1 ODBC driver that OBIEE 11.1.1.7 ships with is no longer compatible with recent Cloudera Hadoop (CDH 4.5+) releases. What’s really needed is support for Impala – an in-memory version of Hive – as a datasource, something we hacked together with a workaround but which is most probably coming as a supported data source in a future version of OBIEE. What would be very interesting though is support for document-style databases such as MongoDB, giving OBIEE (or most probably, Endeca) the capability to create 360-degree views of customer activity, including unstructured data held in these NoSQL-style databases.

Exalytics and Engineered Systems

I’d almost forgotten Exalytics from this round-up, which is ironic given its prominence in Oracle BI product marketing over the past couple of years, but not all that surprising given the lack of real innovation around the product recently. There’s certainly been a number of Exalytics updates in terms of product certification – the graphic below shows the software evolution of Exalytics since launch, going up to autumn last year when we presented on it at Enkitec E4:

NewImage

whilst the Exalytics hardware over the same period has seen RAM double, and SSD disk added to improve TimesTen and Essbase startup times.

NewImage

What Exalytics has lacked, though, is something game-changing that’s only available as part of this platform. There’s a central dilemma for Oracle over Exalytics: do they develop something substantial for OBIEE that only works on Exalytics, and that they hold back from the on-premise version, or do they largely release the same software for both Exalytics and non-Exalytics OBIEE and rely on performance tweaks which are hard to quantify for customers, and hard for Oracle salespeople to use as differentiation for the product? So far they’ve gone for the latter option, making Exalytics – if we’re honest – a bit underwhelming for the moment, but what would be really interesting is some capability that clearly can only be supported on Exalytics – some form of in-memory analysis or processing that needs 1TB+ of RAM for enterprise datasets, possibly based on an as-yet unreleased new analytic engine, maybe based on Essbase or Oracle R technology, maybe even incorporating something from Endeca (or even – left-field – something based on Apache Spark?)

My money however is on this differentiation growing over time, and Exalytics being used extensively by Oracle to power their BI in the Cloud Service, with less emphasis over time on on-premise sales of the products and more on “powered by Exalytics” cloud services. All of that said, my line with customers when talking about Exalytics has always been – you’re spending X million $/£ on OBIEE and the BI Apps, you might as well run it on the hardware it’s designed for, which in the scheme of things is only a small proportion of the overall cost; the performance difference might not be noticeable now, but over time OBIEE will be more and more optimised for this platform, so you might as well be on it now and also take advantage of the manageability / TCO benefits.

So anyway, that’s my “state-of-the-nation” for OBIEE as I see it today – and if you’re coming along to the Atlanta RM BI Forum event later this week, there’ll be futures stuff from Oracle that we can’t really discuss on here, beyond the 3-6 month timeline, that’ll give you a greater insight into what’s coming in late 2014 and beyond.

Categories: BI & Warehousing

RM BI Forum 2014 Brighton is a Wrap – Now on to Atlanta!

Rittman Mead Consulting - Sun, 2014-05-11 17:34

I’m writing this sitting in my hotel room in Atlanta, having flown over from the UK on Saturday following the end of the Rittman Mead BI Forum 2014 in Brighton. I think it’s probably true to say that this year was our best ever – an excellent masterclass on the Wednesday followed by even more excellent sessions over the two main days, and now we’re doing it all again this week at the Renaissance Atlanta Midtown Hotel in Atlanta, GA.

Wednesday’s guest masterclass was by Cloudera’s Lars George, and covered the worlds of Hadoop, NoSQL and big data analytics over a frantic six-hour session. Lars was a trooper; despite a mistake over the agenda where I’d listed his sessions as being just an hour each when he’d planned (and been told by me) that they were an hour-and-a-half each, he managed to cover all of  the main topics and take the audience through Hadoop basics, data loading and processing, NoSQL and analytics using Hive, Impala, Pig and Spark. Roughly half the audience had some experience with Hadoop with the others just being vaguely acquainted with it, but Lars was an engaging speaker and stuck around for the rest of the day to answer any follow-up questions.

NewImage

For me, the most valuable parts of the session were Lars’ real-world experiences in setting up Hadoop clusters, and his views on which approaches work best for analysing data in a BI and ETL context – with Spark clearly being in favour now compared to Pig and basic MapReduce. Thanks again Lars, and to Justin Kestelyn from Cloudera for organising it, and I’ll get a second chance to sit through it again at the event in Atlanta this week.

The event proper kicked off in the early evening with a drinks reception in the Seattle bar, followed by the Oracle keynote and then dinner. Whilst the BI Forum is primarily a community (developer and customer)-driven event, we’re very pleased to have Oracle also take part, and we traditionally give the opening keynote over to Oracle BI Product Management to take us through the latest product roadmap. This year, Matt Bedin from Oracle came over from the States to deliver the Brighton keynote, and whilst the contents aren’t under NDA there’s an understanding we don’t blog and tweet the contents in too much detail, which then gives Oracle a bit more leeway to talk about futures and be candid about where their direction is (much like other user group events such as BIWA and ODTUG).

NewImage

I think it’s safe to say that the current focus for OBIEE over the next few months is the new BI in the Cloud Service (see my presentation from Collaborate’14 for more details on what this contains), but we were also given a preview of upcoming functionality for OBIEE around data visualisation, self-service and mobile – watch this space, as they say. Thanks again to Matt Bedin for coming over from the States to deliver the keynote, and for his other session later in the week where he demo’d BI in the Cloud and several usage scenarios.

We were also really pleased to be joined by some of the top OBIEE, Endeca and ODI developers around the US and Europe, including Michael Rainey (Rittman Mead) and Nick Hurt (IFPI), Truls Bergensen, Emiel van Bockel (CB), Robin Moffatt (Rittman Mead), Andrew Bond (Oracle) and Stewart Bryson (Rittman Mead), and none other than Christian Berg, an independent OBIEE / Essbase developer who’s well-known to the community through his blog and through his Twitter handle, @Nephentur – we’ll have all the slides from the sessions up on the blog once the US event is over, and congratulations to Robin for winning the “Best Speaker” award for Brighton for his presentation “No Silver Bullets: OBIEE Performance in the Real World”.

NewImage

We had a few special overseas guests in Brighton too; Christian Screen from Art of BI Software came across (he’ll be in Atlanta too this week, presenting this time), and we were also joined by Oracle’s Reiner Zimmerman, who some of you from the database/DW side will know from the Oracle DW Global Leaders’ Program. For me though one of the highlights was the joint session with Oracle’s Andrew Bond and our own Stewart Bryson, where they presented an update to the Oracle Information Management Reference Architecture, something we’ve been developing jointly with Andrew’s team and which now incorporates some of our thoughts around the agile deployment of this type of architecture. More on this on the blog shortly, and look out for the white paper and videos Andrew’s team are producing, which should be out on OTN soon.

NewImage

So that’s it for Brighton this year – and now we’re doing it all again in Atlanta this week at the Renaissance Atlanta Midtown Hotel. We’ve got Lars George again delivering his masterclass, and an excellent – dare I say it, even better than Brighton’s – array of sessions including ones on Endeca, the In-Memory Option for the Oracle Database, TimesTen, OBIEE, BI Apps and Essbase. There’s still a few places left so if you’re interested in coming, you can book here and we’ll see you in Atlanta later this week!

 

Categories: BI & Warehousing

Adding Geocoding Capabilities to Pig through Custom UDFs

Rittman Mead Consulting - Sun, 2014-05-04 15:28

In the previous two posts in this series, I’ve used Hive and Pig to process and analyse data from log files generated by the Rittman Mead Blog webserver. In the article on Hive, I created a relational table structure over a directory containing the log files, using a regular expression SerDe to split each log line up into its constituent parts (IP address, page requested, status code and so on). I then brought in another table of data, containing IP address ranges and countries they were assigned to, so that I could determine what parts of the world were accessing our site over time.

In the second example, I took the same log files but processed them this time using Apache Pig. I used Pig’s dataflow-style approach to loading and analysing data to progressively filter, pivot and analyse the log file dataset, and then joined it to an export of pages and authors from WordPress, the CMS that we used to run the website, so I could determine who wrote the most popular blog articles over the period covered by the logs.

But the first example, where I joined the log file data to the geocoding table, had a bit of an issue that only came up when I tested it with a larger set of data than I used at the end of that article. In the article example, I limited the number of log file rows to just five, at the time to keep the example readable on the blog, but when I tried it later on with the full dataset, the query eventually failed with an out-of-memory error from the Hadoop cluster. Now in practice, I could probably have increased the memory (Java heap space) or otherwise got the query through, but geo-tagging my data in this way – as a big table join, and using an in-memory database engine (Impala) to do it – probably isn’t the most sensible way to do a single-value lookup as part of a Hadoop transformation; instead, this is probably something better done through what’s called a “user-defined function”.

Both Hive and Pig support user-defined functions (UDFs), and a quick Google search brought up one for Hive called GeocodeIP, on Github, that looks like it might do the job. Sticking with Pig for the moment though, we thought this might be a good opportunity to see how UDFs for Pig are created, and so my colleague, Nelio Guimaraes, put together the following example to walk through how a typical one might be built. Before we start though, a bit of background.

The problem we have is that we need to match an IP address in a webserver log file with an IP address range in a lookup table. For example, the IP address in the log file might be 124.232.100.105, and the lookup database would have an IP address range from, say, 124.232.100.0 to 124.232.100.255 which maps to a particular country – Poland, for example. The lookup database itself comes from Maxmind, and there’s a formula they use to convert IP addresses to integers, like this:

NewImage

so that you can do a simple BETWEEN in an SQL join to locate the range that matches the incoming IP address.

NewImage
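
To make the formula in the images above concrete: for an address a.b.c.d, Maxmind’s integer value is a*16777216 + b*65536 + c*256 + d (that is, the address treated as a base-256 number). A quick Python illustration of the calculation, my own sketch rather than part of Nelio’s UDF that follows, would be:

def ip_to_int(ip):
    # treat a.b.c.d as a base-256 number: a*256^3 + b*256^2 + c*256 + d
    a, b, c, d = (int(octet) for octet in ip.split('.'))
    return 16777216*a + 65536*b + 256*c + d

print(ip_to_int('124.232.100.105'))    # 2095604841

With both the log file IP and the range boundaries expressed as integers like this, finding the matching range (and therefore country) becomes a simple BETWEEN, which is in effect the lookup the pygeoip library will do for us later on.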

Except that Pig, like Hive, can’t normally support non-equijoins, which leads us to UDFs and other approaches to getting the country for our IP address. Pig, again like Hive, is however extensible, and it’s relatively easy to add Pig UDFs either yourself or through UDF libraries like Pig’s Piggybank. The best language to write UDFs in is Java, as it gives access to the largest amount of Pig native functionality (such as the ability to write custom data loaders and unloaders), but for what we’re doing Python will be fine, so let’s put one together in Python to do our IP address geocoding and show how the process works.

Going back to the Pig scripts we put together yesterday, they started off by declaring a relation that loaded the raw log files in from a directory in HDFS, and then used another relation to parse the first one via a regular expression, so we had each of the log file elements in its own column, like this:

raw_logs =LOAD '/user/root/logs/' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs
 GENERATE FLATTEN
 (
 REGEX_EXTRACT_ALL
 (
 line,
 '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
 )
 )
 AS
 (
 remoteAddr: chararray, remoteLogname: chararray, user: chararray,
 time: chararray, request: chararray, status: int, bytes_string: chararray,
 eferrer: chararray, browser: chararray
 );

At this point we can run off a list of the top 5 browser types based on page accesses, for example:

grunt> by_browser_group = GROUP logs_base BY browser;                              
grunt> by_browser_group_count = FOREACH by_browser_group  
>> GENERATE group, COUNT($1);
grunt> by_browser_group_sorted = ORDER by_browser_group_count BY $1 DESC;
grunt> by_browser_group_top_5 = LIMIT by_browser_group_sorted 5;
grunt> dump by_browser_group_top_5
...
(Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36,675161)
(-,394101)
(Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com),371078)
(Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0,201861)
(Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0,167851)

See the previous post, and this other blog article, for more background on how you do grouping and aggregating in Pig, if you’re not familiar with the syntax.

But what if we want to group by country? That’s where the geocoding comes in, and the Pig UDF we’re going to create. As we’re going to create this example using Python, we’ll be using the pygeoip Python API, which you install through pip, the Python package manager, along with the GeoLite Country database (.dat) file from Maxmind, who make this basic version available for download free. Then follow these steps to set the Python UDF up:

1. On the master node of your Hadoop cluster (I’m using a three-node CDH4.6 cluster), where Pig and Hive run, install pip, and then download the pygeoip API, like this:

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py 
pip install pygeoip

2. Copy the GeoIP.dat file to somewhere on the Hadoop master node, for example /home/nelio/. Make a note of the full path to the GeoIP.dat file, and then copy the file to the same location on all of the worker nodes – there’s probably a way to cache this or otherwise automatically distribute the file (suggestions welcome), but for now this will ensure that each worker node can get access to a copy of this file.

3. Using a text editor, create the Python script that will provide the Python UDF, like this, substituting the path to your GeoIP.dat file if it’s somewhere else. Once done, save the file as python_geoip.py to the same directory (/home/nelio, in this example) – note that this file only needs to go on the main (master) node in your cluster, and Pig/MapReduce will distribute it to the worker nodes when we register it in our Pig script, later on.

#!/usr/bin/python

import sys
# make sure Jython can see the pygeoip module installed via pip
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip

# declare the return schema of the UDF to Pig
@outputSchema("country:chararray")
def getCountry(ip):
    # look the IP address up in the Maxmind GeoLite Country database
    gi = pygeoip.GeoIP('/home/nelio/GeoIP.dat')
    country = gi.country_name_by_addr(ip)
    return country

Note that the sys.path.append line in the script is there so that Jython knows where to look for the new Python module, pygeoip, when starting up.
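
One other thing worth mentioning, as a suggestion of mine rather than part of Nelio’s original script: as written, getCountry() opens the GeoIP.dat file afresh on every call, in other words once per row. A small variation like the sketch below (assuming the same /home/nelio/GeoIP.dat path) creates the pygeoip reader once when Jython loads the module and then re-uses it across calls, which should cut down on file I/O when you’re processing millions of log lines.

#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip

# open the Maxmind database once, at module load time, rather than per-row
_gi = pygeoip.GeoIP('/home/nelio/GeoIP.dat')

@outputSchema("country:chararray")
def getCountry(ip):
    return _gi.country_name_by_addr(ip)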

4. Let’s start another Pig session now and try and use this UDF. I exit back to the OS command prompt, and change directory to where I’ve stored the python file and the GeoIP.dat file, and start another Grunt shell session:

[root@cdh4-node1 ~]# cd /home/nelio
[root@cdh4-node1 nelio]# pig

Now, if I want to use this Python Pig UDF in a Grunt shell session, I need to register it as a scripting UDF either at the Grunt shell or in a Pig script I run, so let’s start with that, and then bring in the log data as before:

grunt> register 'python_geoip.py' using jython as pythonGeoIP;
...
grunt> raw_logs =LOAD '/user/root/logs/' USING TextLoader AS (line:chararray);
grunt> logs_base = FOREACH raw_logs
>>  GENERATE FLATTEN
>>         (
>>                 REGEX_EXTRACT_ALL
>>                 (
>>                         line,
>>                         '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
>>                 )
>>         )
>>         AS
>>         (
>>                 remoteAddr: chararray, remoteLogname: chararray, user: chararray,
>>                 time: chararray, request: chararray, status: int, bytes_string: chararray,
>>                 eferrer: chararray, browser: chararray
>>         );

I’ll now define a new relation (alias) projecting just the IP addresses from the combined log files, and then filter that into another alias so we only have valid IP addresses to work with:

grunt> ipaddress =  FOREACH logs_base GENERATE remoteAddr;
grunt> clean_ip  = FILTER ipaddress BY (remoteAddr matches '^.*?((?:\\d{1,3}\\.){3}\\d{1,3}).*?$');

Now we get to use the Python UDF. We’ll pass these IP addresses to it, with the UDF then returning the country that each IP address is located in, based on Maxmind’s IP address ranges.

grunt> country_by_ip = FOREACH clean_ip GENERATE pythonGeoIP.getCountry(remoteAddr);
2014-05-05 05:22:50,141 [MainThread] INFO  org.apache.pig.scripting.jython.JythonFunction - Schema 'country:chararray' defined for func getCountry
grunt> describe country_by_ip
country_by_ip: {country: chararray}

So this function will have converted all of the IP addresses in the logs to country names; let’s now group, count, order and select the top five from that list.

grunt> group_by_country = GROUP country_by_ip BY country;
grunt> count_by_country = FOREACH group_by_country GENERATE FLATTEN(group) as country, COUNT(country_by_ip) AS (hits_per_country:long);
grunt> order_by_access = ORDER count_by_country BY hits_per_country DESC;
grunt> top5_country = LIMIT order_by_access 5;

Of course, all this has done so far is set up a data-flow in Pig, telling it how to move the data through the pipeline and arrive at the final output I’m interested in; let’s now run the process by using the “dump” command:

grunt> dump top5_country
...
(United States,2012311)
(India,785585)
(United Kingdom,459422)
(Germany,231386)
(France,168319)

So that’s a simple example of a Pig UDF, in this instance written in Python. There are other ways to extend Pig beyond UDFs – Pig Streaming is the obvious alternative, where the entire relation goes through the streaming interface to be processed and then output back into Pig, and hopefully we’ll cover this at some point in the future – or then again, maybe it’s now time to take a proper look at Spark.
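
Just to give a flavour of that streaming approach before we leave it: with Pig Streaming, each row of the relation is piped to an external program as tab-separated text on stdin, and whatever the program writes to stdout comes back into Pig as a new relation. The geocoding could therefore be handled by a small Python script along the lines of the rough, untested sketch below, which re-uses the same pygeoip setup and GeoIP.dat path as above and would still need the Pig-side DEFINE and STREAM statements to wire it in.

#!/usr/bin/python
# Reads tab-separated tuples (here, just an IP address) from stdin, as streamed by Pig,
# and writes the matching country name back to stdout, one row per line.
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip

gi = pygeoip.GeoIP('/home/nelio/GeoIP.dat')

for line in sys.stdin:
    ip = line.strip().split('\t')[0]
    print gi.country_name_by_addr(ip)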

Categories: BI & Warehousing