
Rittman Mead Consulting

Delivering Oracle Business Intelligence

OBIEE Enterprise Security

Fri, 2014-12-19 05:35

The Rittman Mead Global Services team have recently been involved in a number of security architecture implementations and produced a security model which meets a diverse set of requirements. Using our experience and standards, we have been able to deliver a robust model that addresses the common questions we routinely receive around security, such as:

“What considerations do I need to make when exposing Oracle BI to the outside world?”

or

“How can I make a flexible security model which is robust enough to meet the demands of my organisation but easy to maintain?”

The first question is based on a standard enterprise security model where the Oracle BI server is exposed via a web host, with SSL enabled and access security tightened. This request can be complex to achieve but is something that we have implemented many times now.

The second question is much harder to answer, but our experience with numerous clients has led us to develop a multi-dimensional inheritance security model that has yielded excellent results.

What is a Multi-dimensional Inheritance Security Model?

The wordy title is actually a simple concept that incorporates 5 key areas:

  • Easy to set up and maintain
  • Flexible
  • Durable
  • Expandable
  • Consistent throughout the product

While there are numerous ways of implementing a security model in Oracle BI, by sticking to the key concepts above we ensure we get it right. The largest challenge we face in BI is the different types of security required, and all three need to work in harmony:

  • Application security
  • Content security
  • Data security
Understanding the organisation’s makeup

The first step is to consider the makeup of a typical organisation and build our security around it.


This diagram shows different departments (Finance, Marketing, Sales) whose data is specific to them; normally departmental users should only see the data that is relevant to them. In contrast, the IT department who are developing the system need visibility across all data, and so do the Directors.

 

What types of users do I have?

Next is to consider the types of users we have:

  1. BI Consumer: This will be the most basic and common user who needs to access the system for information.
  2. BI Analyst: As an Analyst the user will be expected to generate more bespoke queries and need ways to represent them. They will also need an area to save these reports.
  3. BI Author: The BI Author will be able to create content and publish that content for the BI Consumers and BI Analysts.
  4. BI Department Admin: The BI Department Admin will be responsible for permissions for their department, as well as acting as a focal point user.
  5. BI Developer: The BI Developer can be thought of as the person(s) who creates models in the RPD and will need additional access to the system for testing of their models. They might also be responsible for delivering Answers Requests or Dashboards in order to ‘Prove’ the model they created.
  6. BI Administrator: The Administrator will be responsible for the running of the BI system and will have access to every role. Most administrator tasks will not require SQL or data warehouse skills, and the role is generally kept separate from the BI Developer role.

The types of users here are a combination of every requirement we have seen and might not be required by every client.  The order they are in shows the implied inheritance, so the BI Analyst inherits permissions and privileges from the BI Consumer and so on.

What Types do I need?

The size of the organization determines what types of user groups are required. By default Oracle ships with:

  1. BI Consumer
  2. BI Author
  3. BI Administrator

Typically we would recommend inserting the BI Analyst into the default groups:

  1. BI Consumer
  2. BI Analyst
  3. BI Author
  4. BI Administrator

This works well when there is a central BI team who develop content for the whole organization. The structure would look like this:


For larger organizations, where dashboard development and permissions are handled across multiple BI teams, the BI Administrator group can be used. Typically we see the BI team as a central Data Warehouse team who deliver the BI model (RPD) to the multiple BI teams. In a large organization the administration of Oracle BI should be handled by someone who isn’t the BI Developer; the structure could look like this:


Permissions on groups

Each of the groups will require different permissions; at a high level they would be:

 

BI Consumer

  • View Dashboards
  • Save User Selections
  • Subscribe to iBots

BI Analyst

  • Access to Answers and a standard set of views
  • Some form of storage
  • Access to Subject Areas

BI Author

  • Access to create/modify Dashboards
  • Save predefined sections
  • Access to Action Links
  • Access to Dashboard Prompts
  • Access to BI Publisher

BI Department Admin

  • Ability to apply permissions and manage the Web Catalog

BI Developer

  • Advanced access to Answers
  • Access to all departments

BI Administrator

  • Everything

 

Understanding the basic security mechanics in 10g and 11g

In Oracle BI 10g the majority of the security is handled in the Oracle BI Server. This would normally be done through initialisation blocks, which would authenticate the user against an LDAP server and then run a query against database tables to populate the user into ‘Groups’ used in the RPD and ‘Web Groups’ used in the Presentation Server. These groups would have to match at each level: database, Oracle BI Server and Oracle BI Presentation Server.

With the addition of Enterprise Manager and WebLogic, the security elements in Oracle BI 11g changed radically. Authenticating the user in the Oracle BI Server is no longer the recommended approach and is limited on Linux. While the RPD Groups and Presentation Server Web Groups still exist, they don’t need to be used. Users are now authenticated against WebLogic. This can be done by using WebLogic’s own users and groups or by plugging it into a choice of LDAP servers. The end result will be groups and users that exist in WebLogic. The groups then need to be mapped to Application Roles in Enterprise Manager, which can be seen by the Oracle BI Presentation Services and Oracle BI Server. It is recommended to create a one-to-one mapping for each group.


What does all this look like then?

Assuming this is for an SME-sized organization where the dashboard development (BI Author) is done by the central BI team, the groups would look like this:

 


The key points are:

  • The generic BI Consumer/Analyst groups give their permissions to the department versions
  • No users should be in the generic BI Consumer/Analyst groups
  • Only users from the BI team should be in the generic BI Author/Administrator group
  • New departments can be easily added
  • The lines denote the inheritance of permissions and privileges

 

What’s next – the Web Catalog?

The setup of the web catalog is very important to ensure that it does not become unwieldy, so it needs to reflect the security model. We would recommend setting up some base folders which look like this:


Each department has its own folder and four sub-folders. The permission applied to each department’s root folder is BI Administrators, so full control is possible across the top. This is also true for every folder below; however, they will have additional explicit permissions applied to ensure that the department cannot create any more than the four sub-folders.

  • The Dashboard folder is where the dashboards go; the department’s BI Developers group will have full control and the department’s BI Consumers will have read access. This allows the department’s BI Developers to create dashboards, the department’s BI Administrators to apply permissions, and the department’s consumers and analysts to view them.
  • The same permissions are applied to the Dashboard Answers folder to the same effect.
  • The Development Answers folder has full control given to the department’s BI Developers and no access for the department’s BI Analysts or BI Consumers. This folder is mainly for the department’s BI Developers to store answers that are still in the process of development.
  • The Analyst folder is where the department’s BI Analysts can save Answers, so they will need full control of this folder.

I hope this article gives some insight into security with Oracle BI. Remember that our Global Services products offer a flexible support model where you can harness our knowledge to deliver your projects in a cost-effective manner.

Categories: BI & Warehousing

OBIEE and ODI on Hadoop : Next-Generation Initiatives To Improve Hive Performance

Thu, 2014-12-18 16:26

The other week I posted a three-part series (part 1, part 2 and part 3) on going beyond MapReduce for Hadoop-based ETL, where I looked at a typical Apache Pig dataflow-style ETL process and showed how Apache Tez and Apache Spark can potentially make these processes run faster and make better use of in-memory processing. I picked Pig as a data processing environment because its multi-step data transformations translate into lots of separate MapReduce jobs in traditional Hadoop ETL environments, but run as a single DAG (directed acyclic graph) under Tez and Spark and can potentially use memory to pass intermediate results between steps, rather than writing all those intermediate datasets to disk.

But tools such as OBIEE and ODI use Apache Hive to interface with the Hadoop world, not Pig, so it’s improvements to Hive that will have the biggest immediate impact on the tools we use today. And what’s interesting is the development work that’s going on around Hive in this area, with four different “next-generation Hive” initiatives that could end up making OBIEE and ODI on Hadoop run faster:

  • Hive-on-Tez (or “Stinger”), principally championed by Hortonworks, along with Stinger.next which will enable ACID transactions in HiveQL
  • Hive-on-Spark, a more limited port of Hive to run on Spark and backed by Cloudera amongst others
  • Spark SQL within Apache Spark, which enables SQL queries against Spark RDDs (and Hive tables), and exposes a HiveServer2-compatible Thrift Server for JDBC access
  • Vendor initiatives that build on Hive but are mainly around integration with their RDBMS engines, for example Oracle Big Data SQL

Vendor initiatives like Oracle’s Big Data SQL and Cloudera Impala have the benefit of working now (and are supported), but usually come with some sort of penalty for not working directly within the Hive framework. Oracle’s Big Data SQL, for example, can read data from Hive (very efficiently, using Exadata SmartScan-type technology) but then can’t write back to Hive, and currently pulls all the Hive data into Oracle if you try and join Oracle and Hive data together. Cloudera’s Impala, on the other hand, is lightning-fast and works directly on the Hadoop platform, but doesn’t support the same ecosystem of SerDes and storage handlers that Hive supports, taking away one of the key flexibility benefits of working with Hive.

So what about the attempts to extend and improve Hive, or include Hive interfaces and compatibility in Spark? In most cases an ETL routine written as a series of Hive statements isn’t going to be as fast or resource-efficient as a custom Spark program, but if we can make Hive run faster or have a Spark application masquerade as a Hive database, we could effectively give OBIEE and ODI on Hadoop a “free” platform performance upgrade without having to change the way they access Hadoop data. So what are these initiatives about, and how usable are they now with OBIEE and ODI?

Probably the most ambitious next-generation Hive project is the Stinger initiative, backed by Hortonworks and based on the Apache Tez framework that runs on Hadoop 2.0 and YARN. Stinger aimed first to port Hive to run on Tez (which runs MapReduce jobs but enables them to potentially run as a single DAG), and then add ACID transaction capabilities so that you can UPDATE and DELETE from a Hive table as well as INSERT and SELECT, using a transaction model that allows you to roll back uncommitted changes (diagram from the Hortonworks Stinger.next page).


Tez is more a set of developer APIs than the full data discovery / data analysis platform that Spark aims to provide, but it’s a technology that’s available now as part of the Hortonworks HDP2.2 platform, and as I showed in the blog post a few days ago, an existing Pig script run as-is in a Tez environment typically runs twice as fast as when it uses MapReduce to move data around (with the usual testing caveats applying, YMMV etc). Hive should be the same, giving us the ability to take Hive transformation scripts and run them unchanged except for specifying Tez as the execution engine at the start.
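
As a rough sketch of what that looks like from the shell (my_etl.hql here is just a placeholder for an existing transformation script, and the property name assumes Hive 0.13+ as shipped with HDP2.2):

# run an existing Hive ETL script unchanged, selecting Tez rather than MapReduce as the engine
hive --hiveconf hive.execution.engine=tez -f my_etl.hql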


Hive on Tez is probably the first of these initiatives we’ll see working with ODI and OBIEE, as ODI has just been certified for Hortonworks HDP2.1, and the new HDP2.2 release is the one that comes with Tez as an option for Pig and Hive query execution. I’m guessing ODI will need to have its Hive KMs updated to add a new option to select Tez or MapReduce as the underlying Hive execution engine, but otherwise I can see this working “out of the box” once ODI support for HDP2.2 is announced.

Going back to the last of the three blog posts I wrote on going beyond MapReduce, many in the Hadoop industry back Spark as the successor to MapReduce rather than Tez, as it’s a more mature implementation that goes beyond the developer-level APIs that Tez aims to provide to make Pig and Hive scripts run faster. As we’ll see in a moment Spark comes with its own SQL capabilities and a Hive-compatible JDBC interface, but the other “swap-out-the-execution-engine” initiative to improve Hive is Hive on Spark, a port of Hive that allows Spark to be used as Hive’s execution engine instead of just MapReduce.

Hive on Spark is at an earlier stage in development than Hive on Tez, with the first demo given at the recent Strata + Hadoop World New York, and specific builds of Spark and versions of Hive needed to get it running. Interestingly though, a post went up on the Cloudera Blog a couple of days ago announcing an Amazon AWS AMI machine image that you could use to test Hive on Spark, which, though it doesn’t come with a full CDH or HDP installation or features such as a HiveServer JDBC interface, does come with a small TPC-DS dataset and some sample queries that we can use to get a feel for how it works. I used the AMI image to create an Amazon AWS m3.large instance and gave it a go.
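
For reference, a hedged sketch of creating that instance with the AWS CLI (the AMI ID, key pair and security group below are placeholders rather than the actual values from the Cloudera post):

# launch the Hive-on-Spark demo AMI as an m3.large instance
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m3.large \
    --key-name my-keypair --security-group-ids sg-xxxxxxxx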

By default, Hive in this demo environment is configured to use Spark as the underlying execution engine. Running a couple of the TPC-DS queries first using this Spark engine, and then switching back to MapReduce by running the command “set hive.execution.engine=mr” within the Hive CLI, I generally found queries using Spark as the execution engine ran 2-3x faster than the MapReduce ones.
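
As a rough sketch of that comparison run from the shell rather than the interactive Hive CLI (store_sales is an assumption here – substitute one of the demo’s TPC-DS tables or sample queries):

# time the same query under each execution engine
time hive -e "set hive.execution.engine=spark; SELECT COUNT(*) FROM store_sales;"
time hive -e "set hive.execution.engine=mr;    SELECT COUNT(*) FROM store_sales;"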


You can’t read too much into this timing as the demo AMI is really only to show off the functional features (Hive using Spark as the execution engine) and no work on performance optimisation has been done, but it’s encouraging even at this point that it’s significantly faster than the MapReduce version.

Long-term the objective is to have both Tez and Spark available as options as execution engines under Hive, along with MapReduce, as the diagram below from a presentation by Cloudera’s Szehon Ho shows; the advantage of building on Hive like this rather than creating your own new SQL-on-Hadoop engine is that you can make use of the library of SerDes, storage handlers and so on that you’d otherwise need to recreate for any new tool.


The third major SQL-on-Hadoop initiative I’ve been looking at is Spark SQL within Apache Spark. Unlike Hive on Spark, which aims to swap out the compiler and execution engine parts of Hive but otherwise leave the rest of the product unchanged, Apache Spark as a whole is a much more freeform, flexible data query and analysis environment that’s aimed more at analysts than business users looking to query their dataset using SQL. That said, Spark has some cool SQL and Hive integration features that make it an interesting platform for doing data analysis and ETL.

In my Spark ETL example the other day, I loaded log data and some reference data into RDDs and then filtered and transformed them using a mix of Scala functions and Spark SQL queries. Running on top of the set of core Spark APIs, Spark SQL allows you to register temporary tables within Spark that map onto RDDs, and gives you the option of querying your data using either familiar SQL relational operators or the more functional programming-style Scala language.


You can also create connections to the Hive metastore though, and create Hive tables within your Spark application for when you want to persist results to a table rather than work with the temporary tables that Spark SQL usually creates against RDDs. In the code example below, I create a HiveContext as opposed to the sqlContext that I used in the example on the previous blog, and then use that to create a table in my Hive database, running on a Hortonworks HDP2.1 VM with Spark 1.0.0 pre-built for Hadoop 2.4.0:

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.hql("CREATE TABLE posts_hive (post_id int, title string, postdate string, post_type string, author string, post_name string, generated_url string) row format delimited fields terminated by '|' stored as textfile")
scala> hiveContext.hql("LOAD DATA INPATH '/user/root/posts.psv' INTO TABLE posts_hive")

If I then go into the Hive CLI, I can see this new table listed there alongside the other ones:

hive> show tables;
OK
dummy
posts
posts2
posts_hive
sample_07
sample_08
src
testtable2
Time taken: 0.536 seconds, Fetched: 8 row(s)

What’s even more interesting is that Spark also comes with a HiveServer2-compatible Thrift Server, making it possible for tools such as ODI that connect to Hive via JDBC to run Hive queries through Spark, with the Hive metastore providing the metadata but Spark running as the execution engine.


This is subtly different to Hive-on-Spark: Hive’s metastore and its support for SerDes and storage handlers run under the covers, but Spark provides you with a full programmatic environment, making it possible to just expose Hive tables through the Spark layer, or mix and match data from RDDs, Hive tables and other sources before storing and then exposing the results through the Hive SQL interface. For example, you could use Oracle SQL*Developer 4.1 with the Cloudera Hive JDBC drivers to connect to this Spark SQL Thrift Server and query the tables just like any other Hive source, but crucially the Hive execution is being done by Spark, rather than MapReduce as would normally happen.
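
As a sketch of what this looks like from the command line (assuming a Spark build that includes the Thrift Server, which arrived with Spark 1.1, and the default HiveServer2 port of 10000):

# start the HiveServer2-compatible Thrift Server that ships with Spark SQL
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn-client

# connect with any HiveServer2 JDBC client, for example beeline, and query the
# Hive tables - with Spark, not MapReduce, doing the execution
beeline -u jdbc:hive2://localhost:10000 -n hive -e "SHOW TABLES;"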


Like Hive-on-Spark, Spark SQL and Hive support within Spark SQL are at early stages, with Spark SQL not yet being supported by Cloudera whereas the core Spark API is. From the work I’ve done with it, it’s not yet possible to expose Spark SQL temporary tables through the HiveServer2 Thrift Server interface, and I can’t see a way of creating Hive tables out of RDDs unless you stage the RDD data to a file in-between. But it’s clearly a promising technology and if it becomes possible to seamlessly combine RDD data and Hive data, and expose Spark RDDs registered as tables through the HiveServer2 JDBC interface it could make Spark a very compelling platform for BI and data analyst-type applications. Oracle’s David Allen, for example, blogged about using Spark and the Spark SQL Thrift Server interface to connect ODI to Hive through Spark, and I’d imagine it’d be possible to use the Cloudera HiveServer2 ODBC drivers along with the Windows version of OBIEE 11.1.1.7 to connect to Spark in this way too – if I get some spare time over the Christmas break I’ll try and get an example working.

Categories: BI & Warehousing

Rittman Mead BI Forum 2015 Call for Papers Now Open!

Wed, 2014-12-17 03:00

I’m very pleased to announce that the Call for Papers for the Rittman Mead BI Forum 2015 is now open, with abstract submissions open to January 18th 2015. As in previous years the BI Forum will run over consecutive weeks in Brighton, UK and Atlanta, GA, with the provisional dates and venues as below:

  • Brighton, UK : Hotel Seattle, Brighton, UK : May 6th – 8th 2015
  • Atlanta, GA : Renaissance Atlanta Midtown Hotel, Atlanta, USA : May 13th-15th 2015

Now in its seventh year, the Rittman Mead BI Forum is the only conference dedicated entirely to Oracle Business Intelligence, Oracle Business Analytics and the technologies and processes that support them – data warehousing, data analysis, data visualisation, big data and OLAP analysis. We’re looking for sessions around tips & techniques, project case-studies and success stories, and sessions where you’ve taken Oracle’s BI products and used them in new and innovative ways. Each year we select around eight to ten speakers for each event, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January and editorial input from myself, Jon Mead, Charles Elliott and Jordan Meyer.


Last year we had a big focus on cloud, and a masterclass and several sessions on bringing Hadoop and big data to the world of OBIEE. This year we’re interested in project stories and experiences around cloud and Hadoop, and we’re keen to hear about any Oracle BI Apps 11g implementations or migrations from the earlier 7.9.x releases. Getting back to basics we’re always interested in sessions around OBIEE, Essbase and data warehouse data modelling, and we’d particularly like to encourage session abstracts on data visualization, BI project methodologies and the incorporation of unstructured, semi-structured and external (public) data sources into your BI dashboards. For an idea of the types of presentations that have been selected in the past, check out the BI Forum 2014, 2013 and 2012 homepages, or feel free to get in touch via email at mark.rittman@rittmanmead.com

The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues if you can speak at both. All sessions this year will be 45 minutes long, and we’ll be publishing submissions and inviting potential attendees to vote on their favourite sessions towards the end of January. Other than that – have a think about abstract ideas now, and make sure you get them in by January 18th 2015.

Categories: BI & Warehousing

Rittman Mead’s Bar Optimizer

Tue, 2014-12-16 09:08


At Rittman Mead R&D, we have the privilege of solving some of our clients’ most challenging data problems. We recently built a set of customized data products that leverage the power of Oracle and Cloudera platforms and wanted to share some of the fun we’ve had in creating unique user experiences. We’ve been thinking about how we can lean on our efforts to help make the holidays even more special for the extended Rittman Mead family. With that inspiration, we had several questions on our minds:

  • How can we throw an amazing holiday party?
  • What gifts can we give that we can be sure our coworkers, friends, and family will enjoy?
  • What gifts would we want for ourselves?

After a discussion over drinks, the answers became clear. We decided to create a tool that uses data analytics to help you create exceptional cocktails for the holidays.

Here is how we did it. First, we analyzed the cocktail recipes of three world-renowned cocktail bars: PDT, Employees Only, and Death & Co. We then turned their drink recipes into data and got to work on the Bar Optimizer, which uses analytics on top of that data to help you make the holiday season tastier than ever before.

To use the Bar Optimizer, enter the liquors and other ingredients that you have on hand to see what drinks you can make. It then recommends additional ingredients that let you create the largest variety of new drinks. You can also use this feature to give great gifts based on others’ liquor cabinets. Finally, try using one of our optimized starter kits to stock your bar for a big holiday party. We’ve crunched the numbers to find the fewest bottles that can make the largest variety of cocktails.

Click the annotated screenshot above for details, and contact us if you would like more information about how we build products that take your data beyond dashboards.

Categories: BI & Warehousing

Linux cluster sysadmin — OS metric monitoring with colmux

Mon, 2014-12-15 12:44

In this mini-series of blog posts I’m taking a look at a few very useful tools that can make your life as the sysadmin of a cluster of Linux machines easier. This may be a Hadoop cluster, or just a plain simple set of ‘normal’ machines on which you want to run the same commands and monitoring.

First we looked at using SSH keys for intra-machine authorisation, which is a pre-requisite for executing the same command across multiple machines using PDSH, as well as for what we look at in this article – monitoring OS metrics across a cluster with colmux.

Colmux is written by Mark Seger, the same person who wrote collectl. It makes use of collectl on each target machine to report back OS metrics across a cluster to a single node.


Install collectl across the cluster

Using pdsh we can easily install collectl on each node (if it’s not already installed), which is a pre-requisite for colmux:

pdsh -w root@rnmcluster02-node0[1-4] "yum install -y collectl && service collectl start && chkconfig collectl on"

NB by enabling the collectl service on each node it will capture performance data to file locally, which colmux can replay centrally.

Then install colmux itself, which you can download from Sourceforge. It only needs to be installed on a single host, but obviously we could push it out across the cluster with pdsh if we wanted to be able to invoke it on any node at will. Note that here I’m running it on a separate Linux box (outside of the cluster) rather than on my Mac:

cd /tmp
# Make sure you get the latest version of collectl-utils, from https://sourceforge.net/projects/collectl-utils/files/
# This example is hardcoded to a version and particular sourceforge mirror
curl -O http://garr.dl.sourceforge.net/project/collectl-utils/collectl-utils-4.8.2/collectl-utils-4.8.2.src.tar.gz
tar xf collectl-utils-4.8.2.src.tar.gz
cd collectl-utils-4.8.2
sudo ./INSTALL
# collectl-utils also includes colplot, so if you might want to use it restart
# apache (assuming it's installed)
sudo service httpd restart

Colmux and networking

A couple of important notes:

  1. The machine you run colmux from needs to have port 2655 open in order for each node’s collectl to send back the data to it (see the sketch after these notes).

  2. You also may encounter an issue if you have any odd networking (eg NAT on virtual machines) that causes colmux to not work because it picks the ‘wrong’ network interface of the host to tell collectl on each node to send its data to. Details and workaround here.
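
As a hedged sketch of opening that port, assuming an iptables-based firewall on the colmux host:

# open TCP port 2655 so each node's collectl can send its data back
iptables -I INPUT -p tcp --dport 2655 -j ACCEPT
# persist the rule (RHEL/CentOS 6 style)
service iptables save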

colmux in action – real-time view, point in time

Command

colmux -addr 'rnmcluster02-node0[1-4]' -username root

Output

# Mon Dec  1 22:20:40 2014  Connected: 4 of 4
#                    <--------CPU--------><----------Disks-----------><----------Network---------->
#Host                cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
rnmcluster02-node01    1   1    28     36      0      0      0      0      0      2      0       2
rnmcluster02-node04    0   0    33     28      0      0     36      8      0      1      0       1
rnmcluster02-node03    0   0    15     17      0      0      0      0      0      1      0       1
rnmcluster02-node02    0   0    18     18      0      0      0      0      0      1      0       1


Real-time view, persisted

-cols puts the hosts across the top and time as rows. Specify one or more columns from the output without -cols. In this example it is the cpu value, along with the disk read/write values (columns 1, 5 and 7 of the metrics as seen above):

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -cols 1,5,7

Output

cpu                            KBRead                         KBWrit
 node01 node02 node03 node04 |  node01 node02 node03 node04 |  node01 node02 node03 node04
      0      0      0      0 |       0      0      0      0 |      12     28      0      0
      0      0      0      0 |       0      0      0      0 |      12     28      0      0
      1      0      1      0 |       0      0      0      0 |       0      0      0      0
      0      0      0      0 |       0      0      0      0 |       0      0      0      0
      0      0      0      0 |       0      0      0      0 |       0      0      0      0
      0      0      0      0 |       0      0      0      0 |       0     20      0      0
      0      0      0      0 |       0      0      0      0 |      52      4      0      0
      0      0      0      2 |       0      0      0      0 |       0      0      0      0
      1      0      0      0 |       0      0      0      0 |       0      0      0      0
     15     16     15     15 |       0      4      4      4 |      20     40     32     48
      0      0      1      1 |       0      0      0      0 |       0      0      4      0
      1      0      0      0 |       0      0      0      0 |       0      0      0      0


To check the numbers of the columns that you want to reference, run the command with the --test argument:

colmux -addr 'rnmcluster02-node0[1-4]' -user root --test

>>> Headers <<<
#                    <--------CPU--------><----------Disks-----------><----------Network---------->
#Host                cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut

>>> Column Numbering <<<
 0 #Host   1 cpu     2 sys     3 inter   4 ctxsw   5 KBRead  6 Reads   7 KBWrit
 8 Writes  9 KBIn   10 PktIn  11 KBOut  12 PktOut

And from there you get the numbers of the columns to reference in the -cols argument.

To include the timestamp, use -oT in the -command and offset the column numbers by 1:

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -cols 2,6,8 -command '-oT'

Output

sys                            Reads                          Writes
#Time    node01 node02 node03 node04 |  node01 node02 node03 node04 |  node01 node02 node03 node04
22:24:50      0      0      0      0 |       0      0      0      0 |       0      0      0      0
22:24:51      1      0      0      0 |       0      0      0      0 |       0      0      0      0
22:24:52      0      0      0      0 |       0      0      0      0 |       0     16      0     16
22:24:53      1      0      0      0 |       0      0      0      0 |      36      0     16      0
22:24:54      0      0      0      1 |       0      0      0      0 |       0      0      0      0
22:24:55      0      0      0      0 |       0      0      0      0 |       0     20     32     20

NB There’s a bug with colmux 4.8.2 that prevents you accessing the first metric with -cols when you also enable the timestamp with -oT – details here.

Specifying additional metrics

Collectl (which is what colmux calls to get the data) can fetch metrics from multiple subsystems on a node. You can access all of these through colmux too. By default when you run colmux you get cpu, disk and network but you can specify others using the -s argument followed by the subsystem identifier.

To examine the available subsystems run collectl on one of the target nodes:

[root@rnmcluster02-node01 ~]# collectl --showsubsys
The following subsystems can be specified in any combinations with -s or
--subsys in both record and playbackmode.  [default=bcdfijmnstx]

These generate summary, which is the total of ALL data for a particular type
  b - buddy info (memory fragmentation)
  c - cpu
  d - disk
  f - nfs
  i - inodes
  j - interrupts by CPU
  l - lustre
  m - memory
  n - network
  s - sockets
  t - tcp
  x - interconnect (currently supported: OFED/Infiniband)
  y - slabs

From the above list we can see that if we want to also show memory detail alongside CPU we need to include m and c in the subsystem list:

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-scm'

Output

# Tue Dec  2 08:02:38 2014  Connected: 4 of 4
#                    <--------CPU--------><-----------Memory----------->
#Host                cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map
rnmcluster02-node02    1   0    19     18  33M  15M 345M 167M  30M  56M
rnmcluster02-node04    0   0    30     24  32M  15M 345M 167M  30M  56M
rnmcluster02-node03    0   0    30     36  32M  15M 345M 165M  30M  56M
rnmcluster02-node01    0   0    16     16  29M  15M 326M 167M  27M  81M

Changing the sample frequency

To change the sample frequency use the -i syntax in -command:

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-scm -i10 -oT' -cols 2,4

Samples every 10 seconds:

sys                            ctxsw
#Time    node01 node02 node03 node04 |  node01 node02 node03 node04
08:06:29     -1     -1     -1     -1 |      -1     -1     -1     -1
08:06:39     -1     -1     -1     -1 |      -1     -1     -1     -1
08:06:49      0      0      0      0 |      14     13     15     19
08:06:59      0      0      0      0 |      13     13     17     21
08:07:09      0      0      0      0 |      19     18     15     24
08:07:19      0      0      0      0 |      13     13     15     19
08:07:29      0      0      0      0 |      13     13     14     19
08:07:39      0      0      0      0 |      12     13     13     19

Column width

To change the column width, add the -colwidth argument:

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-scm' -cols 1 -colwidth 20

Output

cpu
  rnmcluster02-node01  rnmcluster02-node02  rnmcluster02-node03  rnmcluster02-node04
                   -1                   -1                   -1                   -1
                   -1                   -1                   -1                   -1
                    1                    0                    0                    0
                    0                    0                    0                    0
                    0                    1                    0                    0
                    0                    0                    1                    0
                    1                    0                    1                    0
                    0                    1                    0                    0

Playback

As well as running interactively, collectl can run as a service and record metric samples to disk. Using colmux you can replay these from across the cluster.

Within -command, include -p and the path to the collectl log files (this assumes the path is the same on each host). As with real-time mode, for different subsystems change the flags after -s.

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-p /var/log/collectl/*20141201* -scmd -oD'

Output

[...]
# 21:48:50  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:48:50    0   0    17     15  58M  10M 340M 162M  30M  39M      0      0      1      0
rnmcluster02-node03 20141201 21:48:50    0   0    11     13  58M  10M 340M 160M  30M  39M      0      0      0      0
rnmcluster02-node02 20141201 21:48:50    0   0    11     15  58M  10M 340M 163M  29M  39M      0      0      1      0
rnmcluster02-node01 20141201 21:48:50    0   0    12     14  33M  12M 342M 157M  27M  63M      0      0      1      0

# 21:49:00  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:49:00    0   0    17     15  58M  10M 340M 162M  30M  39M      0      0      4      0
rnmcluster02-node03 20141201 21:49:00    0   0    13     14  58M  10M 340M 160M  30M  39M      0      0      5      0
rnmcluster02-node02 20141201 21:49:00    0   0    12     14  58M  10M 340M 163M  29M  39M      0      0      1      0
rnmcluster02-node01 20141201 21:49:00    0   0    12     15  33M  12M 342M 157M  27M  63M      0      0      6      0

# 21:49:10  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:49:10    0   0    23     23  58M  10M 340M 162M  30M  39M      0      0      1      0
rnmcluster02-node03 20141201 21:49:10    0   0    19     24  58M  10M 340M 160M  30M  39M      0      0      2      0
rnmcluster02-node02 20141201 21:49:10    0   0    18     23  58M  10M 340M 163M  29M  39M      0      0      2      1
rnmcluster02-node01 20141201 21:49:10    0   0    18     24  33M  12M 342M 157M  27M  63M      0      0      1      0
[...]

Restrict the time frame by adding the --from and/or --thru arguments to -command:

[oracle@rnm-ol6-2 ~]$ colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-p /var/log/collectl/*20141201* -scmd -oD --from 21:40:00 --thru 21:40:10'
# 21:40:00  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:40:00    0   0    16     14  59M  10M 340M 162M  30M  39M      0      0      0      0
rnmcluster02-node03 20141201 21:40:00    0   0    12     14  58M  10M 340M 160M  30M  39M      0      0      8      1
rnmcluster02-node02 20141201 21:40:00    0   0    12     15  59M  10M 340M 162M  30M  39M      0      0      6      1
rnmcluster02-node01 20141201 21:40:00    0   0    13     16  56M  11M 341M 156M  27M  42M      0      0      7      1

# 21:40:10  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:40:10    0   0    26     33  59M  10M 340M 162M  30M  39M      1      0     10      2
rnmcluster02-node03 20141201 21:40:10    0   0    20     31  58M  10M 340M 160M  30M  39M      0      0      4      1
rnmcluster02-node02 20141201 21:40:10    0   0    23     35  59M  10M 340M 162M  30M  39M      3      0      9      2
rnmcluster02-node01 20141201 21:40:10    0   0    23     37  56M  11M 341M 156M  27M  42M      4      1      4      1


[oracle@rnm-ol6-2 ~]$

colmux reference

You can find more about colmux on the collectl-utils project website, as well as in the built-in man page (man colmux).

Visualising collectl data with colplot

As a little bonus to the above, colmux is part of the collectl-utils package, which also includes colplot, a gnuplot-based web tool that renders collectl data into graphs. It’s pretty easy to set up, running under Apache just fine and just needing gnuplot installed if you haven’t already. It can report metrics across a cluster if you make sure that you first make each node’s collectl data available locally to colplot.
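
As a rough sketch of that data-gathering step (assuming RHEL/CentOS-style hosts, httpd already restarted as above, and collectl writing its logs to /var/log/collectl on each node – the per-node directory layout is just one way of arranging it):

# install gnuplot on the colplot host
sudo yum install -y gnuplot
# pull each node's collectl logs into a local directory for colplot to plot from
for n in rnmcluster02-node0{1..4}; do
  rsync -avz root@${n}:/var/log/collectl/ /var/log/collectl/${n}/
done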

Navigating to the web page shows the interface from which you can trigger graph plots based on the collectl data available:

colplot’s utilitarian graphs are a refreshing contrast to every webapp that is built nowadays promising “beautiful” visualisations (which no doubt the authors are “passionate” about making “awesome”):

The graphs are functional and can be scaled as needed, but each change is a trip back to the front page to tweak options and re-render:


For me, colplot is an excellent tool for point-in-time analysis and diagnostics, but for more generalised monitoring with drilldown into detail, it is too manual to be viable and I’ll be sticking with collectl -> graphite -> grafana with its interactive and flexible graph rendering:


Do note however that colplot specifically does not drop data points, so if there is a spike in your data you will see it. Other tools (possibly including graphite, but I’ve not validated this) will, for larger timespans, average out data series so as to provide a smoother picture of a metric (eg instead of a point every second, maybe every ten seconds). If you are doing close analysis of a system’s behaviour in a particular situation this may be a problem. If you want a more generalised overview of a system’s health, with the option to drill into historical data as needed, it will be less of an issue.

Summary

When working with multiple Linux machines I would first and foremost make sure SSH keys are set up in order to ease management through password-less logins.

After SSH keys, I would recommend pdsh for parallel execution of the same SSH command across the cluster. It’s a big time saver particularly when initially setting up the cluster given the installation and configuration changes that are inevitably needed.

To monitor a cluster I would always recommend collectl as the base metric collector. colmux works excellently for viewing these metrics from across the cluster in a single place from the command line. For viewing the metrics over the longer term you can either store them in (or replay them into) Graphite/Carbon and render them in Grafana. You also have the option of colplot, since this is installed alongside colmux as part of collectl-utils.

So now your turn – what particular tools or tips do you have for working with a cluster of Linux machines? Leave your answers in the comments below, or tweet them to me at @rmoff.

Categories: BI & Warehousing

Linux cluster sysadmin — Parallel command execution with PDSH

Tue, 2014-12-09 11:40

In this series of blog posts I’m taking a look at a few very useful tools that can make your life as the sysadmin of a cluster of Linux machines easier. This may be a Hadoop cluster, or just a plain simple set of ‘normal’ machines on which you want to run the same commands and monitoring.

Previously we looked at using SSH keys for intra-machine authorisation, which is a pre-requisite for what we’ll look at here — executing the same command across multiple machines using PDSH. In the next post of the series we’ll see how we can monitor OS metrics across a cluster with colmux.

PDSH is a very smart little tool that enables you to issue the same command on multiple hosts at once, and see the output. You need to have set up SSH key authentication from the client to each host, so if you followed the steps in the first section of this article you’ll be good to go.

The syntax for using it is nice and simple:

  • -w specifies the addresses. You can use numerical ranges [1-4] and/or comma-separated lists of hosts. If you want to connect as a user other than the current user on the calling machine, you can specify it here (or as a separate -l argument)
  • After that is the command to run.

For example run against a small cluster of four machines that I have:

robin@RNMMBP $ pdsh -w root@rnmcluster02-node0[1-4] date

rnmcluster02-node01: Fri Nov 28 17:26:17 GMT 2014
rnmcluster02-node02: Fri Nov 28 17:26:18 GMT 2014
rnmcluster02-node03: Fri Nov 28 17:26:18 GMT 2014
rnmcluster02-node04: Fri Nov 28 17:26:18 GMT 2014

PDSH can be installed on the Mac under Homebrew (did I mention that Rittman Mead laptops are Macs, so I can do all of this straight from my work machine… :-) )

brew install pdsh

And if you want to run it on Linux from the EPEL yum repository (RHEL-compatible, but packages for other distros are available):

yum install pdsh

You can run it from a cluster node, or from your client machine (assuming your client machine is mac/linux).

Example – install and start collectl on all nodes

I started looking into pdsh when it came to setting up a cluster of machines from scratch. One of the must-have tools I like to have on any machine that I work with is the excellent collectl. This is an OS resource monitoring tool that I initially learnt of through Kevin Closson and Greg Rahn, and provides the kind of information you’d get from top etc – and then some! It can run interactively, log to disk, run as a service – and it also happens to integrate very nicely with graphite, making it a no-brainer choice for any server.

So, instead of logging into each box individually I could instead run this:

pdsh -w root@rnmcluster02-node0[1-4] yum install -y collectl
pdsh -w root@rnmcluster02-node0[1-4] service collectl start
pdsh -w root@rnmcluster02-node0[1-4] chkconfig collectl on

Yes, I know there are tools out there like puppet and chef that are designed for doing this kind of templated build of multiple servers, but the point I want to illustrate here is that pdsh enables you to do ad-hoc changes to a set of servers at once. Sure, once I have my cluster built and want to create an image/template for future builds, then it would be daft if I were building the whole lot through pdsh-distributed yum commands.

Example – setting up the date/timezone/NTPD

Often the accuracy of the clock on each server in a cluster is crucial, and we can easily do this with pdsh:

Install packages

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] yum install -y ntp ntpdate

Set the timezone:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] ln -sf /usr/share/zoneinfo/Europe/London /etc/localtime

Force a time refresh:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] ntpdate pool.ntp.org
rnmcluster02-node03: 30 Nov 20:46:22 ntpdate[27610]: step time server 176.58.109.199 offset -2.928585 sec
rnmcluster02-node02: 30 Nov 20:46:22 ntpdate[28527]: step time server 176.58.109.199 offset -2.946021 sec
rnmcluster02-node04: 30 Nov 20:46:22 ntpdate[27615]: step time server 129.250.35.250 offset -2.915713 sec
rnmcluster02-node01: 30 Nov 20:46:25 ntpdate[29316]: 178.79.160.57 rate limit response from server.
rnmcluster02-node01: 30 Nov 20:46:22 ntpdate[29316]: step time server 176.58.109.199 offset -2.925016 sec

Set NTPD to start automatically at boot:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] chkconfig ntpd on

Start NTPD:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] service ntpd start

Example – using a HEREDOC (here-document) and sending quotation marks in a command with PDSH

Here documents (heredocs) are a nice way to embed multi-line content in a single command, enabling the scripting of a file creation rather than the clumsy instruction to “open an editor and paste the following lines into it and save the file as /foo/bar”.

Fortunately heredocs work just fine with pdsh, so long as you remember to enclose the whole command in quotation marks. And speaking of which, if you need to include quotation marks in your actual command, you need to escape them with a backslash. Here’s an example of both, setting up the configuration file for my ever-favourite gnu screen on all the nodes of the cluster:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] "cat > ~/.screenrc <<EOF
hardstatus alwayslastline \"%{= RY}%H %{kG}%{G} Screen(s): %{c}%w %=%{kG}%c  %D, %M %d %Y  LD:%l\"
startup_message off
msgwait 1
defscrollback 100000
nethack on
EOF
"

Now when I login to each individual node and run screen, I get a nice toolbar at the bottom:

Combining commands

To combine commands together that you send to each host you can use the standard bash operator semicolon ;

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] "date;sleep 5;date"
rnmcluster02-node01: Sun Nov 30 20:57:06 GMT 2014
rnmcluster02-node03: Sun Nov 30 20:57:06 GMT 2014
rnmcluster02-node04: Sun Nov 30 20:57:06 GMT 2014
rnmcluster02-node02: Sun Nov 30 20:57:06 GMT 2014
rnmcluster02-node01: Sun Nov 30 20:57:11 GMT 2014
rnmcluster02-node03: Sun Nov 30 20:57:11 GMT 2014
rnmcluster02-node04: Sun Nov 30 20:57:11 GMT 2014
rnmcluster02-node02: Sun Nov 30 20:57:11 GMT 2014

Note the use of the quotation marks to enclose the entire command string. Without them the bash interpreter will take the ; as the delimiter of the local commands, and try to run the subsequent commands locally:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] date;sleep 5;date
rnmcluster02-node03: Sun Nov 30 20:57:53 GMT 2014
rnmcluster02-node04: Sun Nov 30 20:57:53 GMT 2014
rnmcluster02-node02: Sun Nov 30 20:57:53 GMT 2014
rnmcluster02-node01: Sun Nov 30 20:57:53 GMT 2014
Sun 30 Nov 2014 20:58:00 GMT

You can also use && and || to run subsequent commands conditionally if the previous one succeeds or fails respectively:

robin@RNMMBP $ pdsh -w root@rnmcluster02-node[01-4] "chkconfig collectl on && service collectl start"

rnmcluster02-node03: Starting collectl: [  OK  ]
rnmcluster02-node02: Starting collectl: [  OK  ]
rnmcluster02-node04: Starting collectl: [  OK  ]
rnmcluster02-node01: Starting collectl: [  OK  ]

Piping and file redirects

Similar to combining commands above, you can pipe the output of commands, and you need to use quotation marks to enclose the whole command string.

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] "chkconfig|grep collectl"
rnmcluster02-node03: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node01: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node04: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node02: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off

However, you can pipe the output from pdsh to a local process if you want:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] chkconfig|grep collectl
rnmcluster02-node02: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node04: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node03: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node01: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off

The difference is that you’ll be shifting the whole of the output across the network in order to process it locally, so if you’re just grepping etc this doesn’t make any sense. For utilities held locally and not on the remote server though, it might.

File redirects work the same way – within quotation marks and the redirect will be to a file on the remote server, outside of them it’ll be local:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] "chkconfig>/tmp/pdsh.out"
robin@RNMMBP ~ $ ls -l /tmp/pdsh.out
ls: /tmp/pdsh.out: No such file or directory

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] chkconfig>/tmp/pdsh.out
robin@RNMMBP ~ $ ls -l /tmp/pdsh.out
-rw-r--r--  1 robin  wheel  7608 30 Nov 19:23 /tmp/pdsh.out

Cancelling PDSH operations

As you can see from above, the precise syntax of pdsh calls can be hugely important. If you run a command and it appears ‘stuck’, or if you have that heart-stopping realisation that the shutdown -h now you meant to run locally has just gone out across the cluster, you can press Ctrl-C once to see the status of your commands:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] sleep 30
^Cpdsh@RNMMBP: interrupt (one more within 1 sec to abort)
pdsh@RNMMBP:  (^Z within 1 sec to cancel pending threads)
pdsh@RNMMBP: rnmcluster02-node01: command in progress
pdsh@RNMMBP: rnmcluster02-node02: command in progress
pdsh@RNMMBP: rnmcluster02-node03: command in progress
pdsh@RNMMBP: rnmcluster02-node04: command in progress

and press it twice (or within a second of the first) to cancel:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] sleep 30
^Cpdsh@RNMMBP: interrupt (one more within 1 sec to abort)
pdsh@RNMMBP:  (^Z within 1 sec to cancel pending threads)
pdsh@RNMMBP: rnmcluster02-node01: command in progress
pdsh@RNMMBP: rnmcluster02-node02: command in progress
pdsh@RNMMBP: rnmcluster02-node03: command in progress
pdsh@RNMMBP: rnmcluster02-node04: command in progress
^Csending SIGTERM to ssh rnmcluster02-node01
sending signal 15 to rnmcluster02-node01 [ssh] pid 26534
sending SIGTERM to ssh rnmcluster02-node02
sending signal 15 to rnmcluster02-node02 [ssh] pid 26535
sending SIGTERM to ssh rnmcluster02-node03
sending signal 15 to rnmcluster02-node03 [ssh] pid 26533
sending SIGTERM to ssh rnmcluster02-node04
sending signal 15 to rnmcluster02-node04 [ssh] pid 26532
pdsh@RNMMBP: interrupt, aborting.

If you’ve got threads yet to run on the remote hosts, but want to keep running whatever has already started, you can use Ctrl-C, Ctrl-Z:

robin@RNMMBP ~ $ pdsh -f 2 -w root@rnmcluster02-node[01-4] "sleep 5;date"
^Cpdsh@RNMMBP: interrupt (one more within 1 sec to abort)
pdsh@RNMMBP:  (^Z within 1 sec to cancel pending threads)
pdsh@RNMMBP: rnmcluster02-node01: command in progress
pdsh@RNMMBP: rnmcluster02-node02: command in progress
^Zpdsh@RNMMBP: Canceled 2 pending threads.
rnmcluster02-node01: Mon Dec  1 21:46:35 GMT 2014
rnmcluster02-node02: Mon Dec  1 21:46:35 GMT 2014

NB the above example illustrates the use of the -f argument to limit how many threads are run against remote hosts at once. We can see the command is left running on the first two nodes and returns the date, whilst the Ctrl-C – Ctrl-Z stops it from being executed on the remaining nodes.

PDSH_SSH_ARGS_APPEND

By default, when you ssh to new host for the first time you’ll be prompted to validate the remote host’s SSH key fingerprint.

The authenticity of host 'rnmcluster02-node02 (172.28.128.9)' can't be established.
RSA key fingerprint is 00:c0:75:a8:bc:30:cb:8e:b3:8e:e4:29:42:6a:27:1c.
Are you sure you want to continue connecting (yes/no)?

This is one of those prompts that the majority of us just accept without thinking; if that includes you then you will want to make sure that your PDSH call doesn’t fall in a heap because you’re connecting to a bunch of new servers all at once. PDSH is not an interactive tool, so if it requires input from the hosts it’s connecting to it’ll just fail. To avoid this SSH prompt, you can set up the environment variable PDSH_SSH_ARGS_APPEND as follows:

export PDSH_SSH_ARGS_APPEND="-q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"

The -q makes failures less verbose, and the -o passes in a couple of options, StrictHostKeyChecking to disable the above check, and UserKnownHostsFile to stop SSH keeping a list of host IP/hostnames and corresponding SSH fingerprints (by pointing it at /dev/null). You’ll want this if you’re working with VMs that are sharing a pool of IPs and get re-used, otherwise you get this scary failure:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
00:c0:75:a8:bc:30:cb:8e:b3:8e:e4:29:42:6a:27:1c.
Please contact your system administrator.

For both of these options, make sure you’re aware of the security implications that you’re opening yourself up to. For a sandbox environment I just ignore them; for anything where security is of importance, make sure you are aware of exactly which server you are connecting to by SSH, and protect yourself from MitM attacks.

PDSH Reference

You can find out more about PDSH at https://code.google.com/p/pdsh/wiki/UsingPDSH

Summary

When working with multiple Linux machines I would first and foremost make sure SSH keys are set up in order to ease management through password-less logins.

After SSH keys, I would recommend pdsh for parallel execution of the same SSH command across the cluster. It’s a big time saver particularly when initially setting up the cluster given the installation and configuration changes that are inevitably needed.

In the next article of this series we’ll see how the tool colmux is a powerful way to monitor OS metrics across a cluster.

So now it’s your turn – what particular tools or tips do you have for working with a cluster of Linux machines? Leave your answers in the comments below, or tweet them to me at @rmoff.

Categories: BI & Warehousing

Linux cluster sysadmin — SSH keys

Tue, 2014-12-09 05:34

In this short series of blog posts I’m going to take a look at a few very useful tools that can make your life as the sysadmin of a cluster of Linux machines easier. This may be a Hadoop cluster, or just a plain simple set of ‘normal’ machines on which you want to run the same commands and monitoring.

To start with, we’re going to use the ever-awesome ssh keys to manage security on the cluster. After that we’ll look at executing the same command across multiple machines at the same time using PDSH, and then monitoring OS metrics across a cluster with colmux.

In a nutshell, ssh keys enable us to do password-less authentication in a secure way. You can find a detailed explanation of them in a previous post that I wrote, tips and tricks for OBIEE Linux sysadmin. Beyond the obvious time-saving benefit of not having to enter a password each time we connect to a machine, having SSH keys in place enables the use of the tools we discuss later, pdsh and colmux.

Working with SSH keys involves taking the public key from a pair, and adding that to another machine in order to allow the owner of the pair’s private key to access that machine. What we’re going to do here is generate a unique key pair that will be used as the identity across the cluster. So each node will have a copy of the private key, in order to be able to authenticate to any other node, which will be holding a copy of the public key (as well as, in turn, the same private key).

In this example I’m going to use my own client machine to connect to the cluster. You could easily use any of the cluster nodes too if a local machine would not be appropriate.
As a side-note, this is another reason why I love the fact that the Rittman Mead standard-issue laptop is a MacBook: just under the covers of Mac OS is a *nix-based command line, meaning that a lot of sysadmin work can be done natively without the additional tools you would need on Windows (e.g. PuTTY, WinSCP, Pageant, etc.).

SSH key strategy

There are several ways we could implement the SSH keys. Because it’s a purely sandbox cluster, I could use the same SSH key pair that I generate for the cluster on my client machine too, so that a single public/private key pair is shared by the client and every node.

If we wanted a bit more security, a better approach might be to distribute my personal SSH key’s public key across the cluster too, and leave the cluster’s private key to identify cluster nodes alone. An additional benefit of this approach is that the client does not need to hold a copy of the cluster’s SSH private key, instead just continuing to use their own.

For completeness, the extreme version of the key strategy would be for each machine to have its own ssh key pair (i.e. its own security identity), with the corresponding public keys distributed to the other nodes in the cluster.

But anyway, here we’re using the second option – a unique keypair used across the cluster and the client’s public ssh key distributed across the cluster too.

Generating the SSH key pair

First we need to generate the key pair. I’m going to create a folder to hold it, because in a moment we’re going to push it and a couple of other files out to all the servers in the cluster, and it’s easiest to do this from a single folder.

mkdir /tmp/rnmcluster02-ssh-keys

Note that in the ssh-keygen command below I’m specifying the target path for the key with the -f argument; if you don’t then watch out that you don’t accidentally overwrite your own key pair in the default path of ~/.ssh.

The -q -N "" flags instruct the key generation to use no passphrase for the key and to not prompt for it either. This is the lowest friction approach (you don’t need to unlock the ssh key with a passphrase before use) but also the least secure. If you’re setting up access to a machine where security matters then bear in mind that without a passphrase on an ssh key anyone who obtains it can therefore access any machine to which the key has been granted access (i.e. on which its public key has been deployed).

ssh-keygen -f /tmp/rnmcluster02-ssh-keys/id_rsa -q -N ""

This generates two files in the /tmp/rnmcluster02-ssh-keys folder – the private and public (.pub) keys of the pair:

robin@RNMMBP ~ $ ls -l /tmp/rnmcluster02-ssh-keys
total 16
-rw-------  1 robin  wheel  1675 30 Nov 17:28 id_rsa
-rw-r--r--  1 robin  wheel   400 30 Nov 17:28 id_rsa.pub

Preparing the authorized_keys file

Now we’ll prepare the authorized_keys file, which is where the public SSH key of any identity permitted to access the machine is stored. Note that each user on a machine has their own authorized_keys file, in ~/.ssh/. So for example, the root user has the file in /root/.ssh/authorized_keys and any public key listed in that file will be able to connect to the server as the root user. Be aware of the American [mis-]spelling of “authorized” – spell it [correctly] as “authorised” and you’ll not get any obvious errors, but the ssh key login won’t work either.

So we’re going to copy the public key of the unique pair that we just created for the cluster into the authorized_keys file. In addition we will copy in our own personal ssh key (and any other public key that we want to give access to all the nodes in the cluster):

cp /tmp/rnmcluster02-ssh-keys/id_rsa.pub /tmp/rnmcluster02-ssh-keys/authorized_keys
# [optional] Now add any other keys (such as your own) into the authorized_keys file just created
cat ~/.ssh/id_rsa.pub >> /tmp/rnmcluster02-ssh-keys/authorized_keys
# NB make sure the previous step is a double >> not > since the double appends to the file, a single overwrites.

Distributing the SSH artefacts

Now we’re going to push this set of SSH files out to the .ssh folder of the target user on each node, which in this case is the root user. From a security point of view it’s probably better to use a non-root user for login and then sudo as required, but we’re keeping things simple (and less secure) to start with here. So the files in our folder are:

  • id_rsa – the private key of the key pair
  • id_rsa.pub – the public key of the key pair. Strictly speaking this doesn’t need distributing to all nodes, but it’s conventional and handy to hold it alongside the private key.
  • authorized_keys – this is the file that the sshd daemon on each node will look at to validate an incoming login request’s offered private key, and so needs to hold the public key of anyone who is allowed to access the machine as this user.

To copy the files we’ll use scp, but how you get them in place doesn’t really matter so much, so long as they get to the right place:

scp -r /tmp/rnmcluster02-ssh-keys root@rnmcluster02-node01:~/.ssh

At this point you’ll need to enter the password for the target user, but rejoice! This is the last time you’ll need to enter it as subsequent logins will be authenticated using the ssh keys that you’re now configuring.

Run the scp for all nodes in the cluster (or script it with a loop, as shown after the output below). If you’ve got four nodes in the cluster your output should look something like this:

$ scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node01:~/.ssh
root@rnmcluster02-node01's password:
authorized_keys                                                  100%  781     0.8KB/s   00:00
id_rsa                                                           100% 1675     1.6KB/s   00:00
id_rsa.pub                                                       100%  400     0.4KB/s   00:00
$ scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node02:~/.ssh
Warning: Permanently added the RSA host key for IP address '172.28.128.7' to the list of known hosts.
root@rnmcluster02-node02's password:
authorized_keys                                                  100%  781     0.8KB/s   00:00
id_rsa                                                           100% 1675     1.6KB/s   00:00
id_rsa.pub                                                       100%  400     0.4KB/s   00:00
$ scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node03:~/.ssh
root@rnmcluster02-node03's password:
authorized_keys                                                  100%  781     0.8KB/s   00:00
id_rsa                                                           100% 1675     1.6KB/s   00:00
id_rsa.pub                                                       100%  400     0.4KB/s   00:00
$ scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node04:~/.ssh
root@rnmcluster02-node04's password:
authorized_keys                                                  100%  781     0.8KB/s   00:00
id_rsa                                                           100% 1675     1.6KB/s   00:00
id_rsa.pub                                                       100%  400     0.4KB/s   00:00
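Rather than typing the scp once per node, a simple shell loop does the repetition for you – a sketch, assuming the four-node naming used above (you’ll still be prompted for the password once per node at this stage):

# Copy the SSH artefacts to each node in turn
for i in 01 02 03 04; do
  scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node${i}:~/.ssh
done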

Testing login authenticated through SSH keys

The moment of truth. From your client machine, try to ssh to each of the cluster nodes. If you are prompted for a password, then something is not right – see the troubleshooting section below.

If you put your own public key in authorized_keys when you created it then you don’t need to specify which key to use when connecting because it’ll use your own private key by default:

robin@RNMMBP ~ $ ssh root@rnmcluster02-node01
Last login: Fri Nov 28 17:13:23 2014 from 172.28.128.1



[root@localhost ~]#

There we go – logged in automagically with no password prompt. If you’re using the cluster’s private key (rather than your own) you need to specify it with -i when you connect:

robin@RNMMBP ~ $ ssh -i /tmp/rnmcluster02-ssh-keys/id_rsa root@rnmcluster02-node01
Last login: Fri Nov 28 17:13:23 2014 from 172.28.128.1



[root@localhost ~]#

Troubleshooting SSH key connections

SSH keys are one of the best things in a sysadmin’s toolkit, but when they don’t work they can be a bit tricky to sort out. The first thing to check is that on the target machine the authorized_keys file that does all the magic (by listing the ssh keys that are permitted to connect inbound on a host to the given user) is in place:

[root@localhost .ssh]# ls -l ~/.ssh/authorized_keys
-rw-r--r-- 1 root root 775 Nov 30 18:55 /root/.ssh/authorized_keys

If you get this:

[root@localhost .ssh]# ls -l ~/.ssh/authorized_keys
ls: cannot access /root/.ssh/authorized_keys: No such file or directory

then you have a problem.

One possible issue in this specific instance could be that the above pre-canned scp assumes that the user’s .ssh folder doesn’t already exist (since it doesn’t, on brand new servers) and so specifies it as the target name for the whole rnmcluster02-ssh-keys folder. However, if the .ssh folder does already exist, scp ends up copying the rnmcluster02-ssh-keys folder into it instead:

[root@localhost .ssh]# ls -lR
.:
total 12
-rw------- 1 root root 1675 Nov 22  2013 id_rsa
-rw-r--r-- 1 root root  394 Nov 22  2013 id_rsa.pub
drwxr-xr-x 2 root root 4096 Nov 30 18:49 rnmcluster02-ssh-keys

./rnmcluster02-ssh-keys:
total 12
-rw-r--r-- 1 root root  775 Nov 30 18:49 authorized_keys
-rw------- 1 root root 1675 Nov 30 18:49 id_rsa
-rw-r--r-- 1 root root  394 Nov 30 18:49 id_rsa.pub
[root@localhost .ssh]#

To fix this simply move the authorized_keys from rnmcluster02-ssh-keys back into .ssh:

[root@localhost .ssh]# mv ~/.ssh/rnmcluster02-ssh-keys/authorized_keys ~/.ssh/

Other frequent causes of problems are file/folder permissions that are too lax on the target user’s .ssh folder (which can be fixed with chmod -R 700 ~/.ssh) or the connecting user’s ssh private key (fix: chmod 600 id_rsa). The latter will show on connection attempts very clearly:

robin@RNMMBP ~ $ ssh -i /tmp/rnmcluster02-ssh-keys/id_rsa root@rnmcluster02-node01
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0777 for '/tmp/rnmcluster02-ssh-keys/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /tmp/rnmcluster02-ssh-keys/id_rsa
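The fix in both cases is a quick chmod – a sketch, with the first two commands run on the target server and the last on the connecting client:

# On the target server: lock down the .ssh folder and the authorized_keys file
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# On the client: protect the private key you're connecting with
chmod 600 /tmp/rnmcluster02-ssh-keys/id_rsa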

Another one that has bitten me twice over time – and that eludes the troubleshooting I’ll demonstrate in a moment – is that SELinux gets stroppy about root access using ssh keys. I always just take this as a handy reminder to disable selinux (in /etc/selinux/config, set SELINUX=disabled), having never had cause to leave it enabled. But, if you do need it enabled you’ll need to hit the interwebs to check the exact cause/solution for this problem.
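To rule SELinux in or out quickly, something like this works on a RHEL/CentOS-type node – a sketch only; setenforce 0 switches to permissive mode until the next reboot, while the config edit makes the change permanent:

# Check the current SELinux mode
getenforce
# Switch to permissive mode for the current boot
setenforce 0
# Disable it permanently (takes effect after a reboot)
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config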

So to troubleshoot ssh key problems in general, do two things. Firstly, from the client side, specify verbosity (-v for a bit of verbosity, -vvv for most):

ssh -v -i /tmp/rnmcluster02-ssh-keys/id_rsa root@rnmcluster02-node01

You should observe ssh trying to use the private key, and if the server rejects it, it’ll fall back to any other ssh private keys it can find, and then password authentication:

[...]
debug1: Offering RSA public key: /tmp/rnmcluster02-ssh-keys/id_rsa
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
debug1: Next authentication method: password

Secondly, quite often the problem will be on the server side, so assuming that you can still connect to the server (e.g. through the physical console, or using password authentication), go and check /var/log/secure, where you’ll see all logs relating to attempted connections. Here’s the log file corresponding to the above client log, where ssh key authentication is attempted but fails, and then password authentication is used to successfully connect:

Nov 30 18:15:05 localhost sshd[13156]: Authentication refused: bad ownership or modes for file /root/.ssh/authorized_keys
Nov 30 18:15:15 localhost sshd[13156]: Accepted password for root from 172.28.128.1 port 59305 ssh2
Nov 30 18:15:15 localhost sshd[13156]: pam_unix(sshd:session): session opened for user root by (uid=0)

Now we can see clearly what the problem is – “bad ownership or modes for file /root/.ssh/authorized_keys”.

The last roll of the troubleshooting dice is to get sshd (the ssh daemon that runs on the host we’re trying to connect to) to issue more verbose logs. You can either set LogLevel DEBUG1 (or DEBUG2, or DEBUG3) in /etc/ssh/sshd_config and restart the ssh daemon (service sshd restart), or you can actually run a (second) ssh daemon from the host with specific logging. This would be appropriate on a multi-user server where you can’t just go changing sshd configuration. To run a second instance of sshd you’d use:

/usr/sbin/sshd -D -d -p 2222

You have to run sshd from an absolute path (you’ll get told this if you try not to). The -D flag stops it running as a daemon and instead runs interactively, so we can see easily all the output from it. -d specifies the debug logging (-dd or -ddd for greater levels of verbosity), and -p 2222 tells sshd to listen on port 2222. Since we’re doing this on top of the existing sshd, we obviously can’t use the default ssh port (22) so pick another port that is available (and not blocked by a firewall).
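If you’d rather take the first option and just raise the log level of the existing daemon, the change might look like this – a sketch; adjust the service command for your distro and remember to put the log level back afterwards:

# Raise sshd's log level in its config file
sed -i 's/^#\?LogLevel.*/LogLevel DEBUG3/' /etc/ssh/sshd_config
# Restart the daemon to pick up the change
service sshd restart
# Watch the extra detail arrive in the secure log
tail -f /var/log/secure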

Now on the client retry the connection, but pointing to the port of the interactive sshd instance:

ssh -v -p 2222 -i /tmp/rnmcluster02-ssh-keys/id_rsa root@rnmcluster02-node01

When you run the command on the client you should see the debug output from both the client and the host machine go crackers for a second, giving you plenty of diagnostics to pore through and analyse the ssh handshake etc. to get to the root of the issue.

Hopefully you’ve now sorted your SSH keys, because in the next article we’re going to see how we can use them to run commands against multiple servers at once using pdsh.

Summary

When working with multiple Linux machines I would first and foremost make sure SSH keys are set up in order to ease management through password-less logins.

We’ll see in the next couple of articles some other tools that are useful when working on a cluster:

  • pdsh
  • colmux

I’m interested in what you think – what particular tools or tips do you have for working with a cluster of Linux machines? Leave your answers in the comments below, or tweet them to me at @rmoff.

Categories: BI & Warehousing

Going Beyond MapReduce for Hadoop ETL Pt.3 : Introducing Apache Spark

Tue, 2014-12-09 02:00

In the first two posts in this three part series on going beyond MapReduce for Hadoop ETL, we looked at why MapReduce and Hadoop 1.0 were only really suitable for batch processing, and how the new Apache Tez framework, enabled by Apache YARN on the Hadoop 2.0 platform, can be swapped-in for MapReduce to improve the performance of existing Pig and Hive scripts. Today though, in the final post, I want to take a look at Apache Spark, the next-gen compute framework that Cloudera are backing as the long-term successor to MapReduce.

Like Tez, Apache Spark supports DAGs that describe the entire dataflow process, not just individual map and reduce jobs. Like Pig, it has a concept of datasets (Pig’s aliases and relations), but crucially these datasets (RDDs, or “resilient distributed datasets”) can be cached in-memory, fall back gracefully to disk, and can be rebuilt using a lineage graph that records how to reconstruct them. With Tez, individual jobs in the DAG can now hand off their output to the next job in-memory rather than having to stage it in HDFS, but Spark uses memory for the actual datasets themselves and is a much better choice for the types of iterative, machine-learning tasks that you tend to do on Hadoop systems. Moreover, Spark arguably has a richer API, particularly when used with Scala – a functional programming-orientated language that can use Java libraries and whose collections framework maps well onto the types of operations you’d want to make use of in dataflow-type applications on a cluster.

Spark can run standalone, on YARN or on other cluster management platforms, and comes with a handy command-line interpreter that you can use to interactively load, filter, analyse and work with RDDs. Cloudera CDH5.2 comes with Spark 1.0.1 and can either be configured standalone or to run on YARN, with Spark as a service added to nodes in the cluster using Cloudera Manager. 
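For example, to experiment with RDDs interactively you might launch the Spark shell like this – a sketch, assuming the Spark launcher scripts are on the path (as they are once Spark has been added as a service through Cloudera Manager):

# Start the interactive shell locally with four worker threads
spark-shell --master local[4]
# ...or start it against the cluster, with YARN allocating the executors
spark-shell --master yarn-client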


So looking back at the Pig example, we created the dataflow using a number of aliases that we progressively filtered, transformed, joined together and then aggregated to get to the final top ten set of pages from the website logs. Translating that dataflow to Spark we end up with a similar set of RDDs that take our initial set of logs, apply transformations and join the datasets, and then store the final aggregated output back on HDFS.


Spark supports in-memory sharing of data within a single DAG (i.e. RDD to RDD), but also between DAGs running in the same Spark instance. As such, Spark becomes a great framework for doing iterative and cyclic data analysis, and can make much better use of the memory on cluster servers whilst still using disk for overflow data and persistence.

Moreover, Spark powers a number of higher-level tools built on the core Spark engine to provide features like real-time loading and analysis (Spark Streaming), SQL access and integration with Hive (Spark SQL), machine learning (MLlib) and so forth. In fact, as well as Hive and Pig being reworked to run on Tez there are also projects underway to port them both to Spark, though to be honest they’re both at early stages compared to Tez integration, and most probably you’ll be using Scala, Java or Python to work with Spark for now.


So taking the Pig script we had earlier and translating that to the same dataflow in Spark and Scala, we end up with something like this:

package com.cloudera.analyzeblog
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext
case class accessLogRow(host: String, identity: String, user: String, time: String, request: String, status: String, size: String, referer: String, agent: String)
case class pageRow(host: String, request_page: String, status: String, agent: String)
case class postRow(post_id: String, title: String, post_date: String, post_type: String, author: String, url: String, generated_url: String)
object analyzeBlog {
        def getRequestUrl(s: String): String = {
        try {
                s.split(' ')(1)
        } catch {
                case e: ArrayIndexOutOfBoundsException => { "N/A" }
        }
}
        def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("analyzeBlog"))
val sqlContext = new SQLContext(sc)
import sqlContext._
val raw_logs = "/user/mrittman/rm_logs"
//val rowRegex = """^([0-9.]+)\s([\w.-]+)\s([\w.-]+)\s(\[[^\[\]]+\])\s"((?:[^"]|\")+)"\s(\d{3})\s(\d+|-)\s"((?:[^"]|\")+)"\s"((?:[^"]|\")+)"$""".r
val rowRegex = """^([\d.]+) (\S+) (\S+) \[([\w\d:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) ([\d\-]+) "([^"]+)" "([^"]+)".*""".r

val logs_base = sc.textFile(raw_logs) flatMap {
                        case rowRegex(host, identity, user, time, request, status, size, referer, agent) =>
                                Seq(accessLogRow(host, identity, user, time, request, status, size, referer, agent))
                        case _ => Nil
                                }
val logs_base_nobots = logs_base.filter( r => ! r.request.matches(".*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*"))

val logs_base_page = logs_base_nobots.map { r =>
  val request = getRequestUrl(r.request)
  val request_formatted = if (request.charAt(request.length-1).toString == "/") request else request.concat("/")
  (r.host, request_formatted, r.status, r.agent)
}

val logs_base_page_schemaRDD = logs_base_page.map(p => pageRow(p._1, p._2, p._3, p._4))

logs_base_page_schemaRDD.registerAsTable("logs_base_page")

val page_count = sql("SELECT request_page, count(*) as hits FROM logs_base_page GROUP BY request_page").registerAsTable("page_count")

val postsLocation = "/user/mrittman/posts.psv"

val posts = sc.textFile(postsLocation).map{ line =>
        val cols=line.split('|')

        postRow(cols(0),cols(1),cols(2),cols(3),cols(4),cols(5),cols(6).concat("/"))
}

posts.registerAsTable("posts")

val pages_and_posts_details = sql("SELECT p.request_page, p.hits, ps.title, ps.author FROM page_count p JOIN posts ps ON p.request_page = ps.generated_url ORDER BY hits DESC LIMIT 10")

pages_and_posts_details.saveAsTextFile("/user/mrittman/top_10_pages_and_author4")

        }
}

I’ll do a code-walkthrough for this Spark application in a future post, but for now note the map and flatMap Scala collection functions used to transform RDDs, and the sql(“…”) function that allows us to register RDDs as tables and then manipulate the contents using SQL, including joining to other RDDs registered as tables. For now though, let’s run the application on the CDH5.2 cluster using YARN and see how long it takes to process the same set of log files (remember, the Pig script on this CDH5.2 cluster took around 5 minutes to run, and the Pig on Tez version on the Hortonworks cluster took around 2.5 minutes):

[mrittman@bdanode1 analyzeBlog]$ spark-submit --class com.cloudera.analyzeblog.analyzeBlog --master yarn target/analyzeblog-0.0.1-SNAPSHOT.jar 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/spark-assembly-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/12/06 19:18:25 INFO SecurityManager: Changing view acls to: mrittman
14/12/06 19:18:25 INFO SecurityManager: Changing modify acls to: mrittman
...
14/12/06 19:19:41 INFO DAGScheduler: Stage 0 (takeOrdered at basicOperators.scala:171) finished in 3.585 s
14/12/06 19:19:41 INFO SparkContext: Job finished: takeOrdered at basicOperators.scala:171, took 53.591560036 s
14/12/06 19:19:41 INFO SparkContext: Starting job: saveAsTextFile at analyzeBlog.scala:56
14/12/06 19:19:41 INFO DAGScheduler: Got job 1 (saveAsTextFile at analyzeBlog.scala:56) with 1 output partitions (allowLocal=false)
14/12/06 19:19:41 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile at analyzeBlog.scala:56)
14/12/06 19:19:41 INFO DAGScheduler: Parents of final stage: List()
14/12/06 19:19:41 INFO DAGScheduler: Missing parents: List()
14/12/06 19:19:41 INFO DAGScheduler: Submitting Stage 3 (MappedRDD[15] at saveAsTextFile at analyzeBlog.scala:56), which has no missing parents
14/12/06 19:19:42 INFO MemoryStore: ensureFreeSpace(64080) called with curMem=407084, maxMem=278302556
14/12/06 19:19:42 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 62.6 KB, free 265.0 MB)
14/12/06 19:19:42 INFO MemoryStore: ensureFreeSpace(22386) called with curMem=471164, maxMem=278302556
14/12/06 19:19:42 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 21.9 KB, free 264.9 MB)
14/12/06 19:19:42 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on bdanode1.rittmandev.com:44486 (size: 21.9 KB, free: 265.3 MB)
14/12/06 19:19:42 INFO BlockManagerMaster: Updated info of block broadcast_5_piece0
14/12/06 19:19:42 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 (MappedRDD[15] at saveAsTextFile at analyzeBlog.scala:56)
14/12/06 19:19:42 INFO YarnClientClusterScheduler: Adding task set 3.0 with 1 tasks
14/12/06 19:19:42 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 215, bdanode5.rittmandev.com, PROCESS_LOCAL, 3331 bytes)
14/12/06 19:19:42 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on bdanode5.rittmandev.com:13962 (size: 21.9 KB, free: 530.2 MB)
14/12/06 19:19:42 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 215) in 311 ms on bdanode5.rittmandev.com (1/1)
14/12/06 19:19:42 INFO YarnClientClusterScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool 
14/12/06 19:19:42 INFO DAGScheduler: Stage 3 (saveAsTextFile at analyzeBlog.scala:56) finished in 0.312 s
14/12/06 19:19:42 INFO SparkContext: Job finished: saveAsTextFile at analyzeBlog.scala:56, took 0.373096676 s

It ran in just over a minute in the end, and most of that was around submitting the job to YARN – not bad. We’ll be covering more of Spark on the blog over the next few weeks, including streaming and machine learning examples, and connecting it to ODI and OBIEE via Hive on Spark and Spark SQL’s own Hive-compatible Thrift server. I’ll also be taking a look at Pig on Spark (or “Spork”…) to see how well that works, and most interestingly how Pig and Hive on Spark compare to running them on Tez – watch this space, as they say.

Categories: BI & Warehousing

Six Months an ACE…”Boy, that escalated quickly”

Mon, 2014-12-08 12:34

Boy, that escalated quickly. I mean, my path to becoming an Oracle ACE since joining Rittman Mead!

When I began with Rittman Mead back in March 2012, I wasn’t planning on joining the ACE program. In fact, I really wasn’t sure what an Oracle ACE even was. If you’re not familiar, the Oracle ACE program “highlights excellence within the global Oracle community by recognizing individuals who have demonstrated both technical proficiency and strong credentials as community enthusiasts and advocates”. That’s quite a mouthful! Now let’s find out how I got to this point in my career.

I wrote my first blog post ever shortly after joining Rittman Mead, sharing an innovative way to integrate Oracle Data Integrator and GoldenGate for data warehousing. This led to submitting my first abstracts to several Oracle User Group conferences. My goal was to challenge myself to grow as a public speaker, something I had never been very good at in the past, and to share my data integration know-how, of course. To my amazement, the first abstract submitted, for UKOUG 2012, was accepted. Then, the other conferences continued to accept my abstracts! During the first couple of years, I kept blogging and speaking, wrote an article for RMOUG SQL>UPDATE Magazine, and joined in on a couple of ODTUG ODI expert webcasts and OTN ArchBeat podcasts. Before I knew it, I had taken advantage of many knowledge sharing opportunities and built a decent list for my Oracle ACE nomination. I’ve found that networking and sharing your experiences are both wonderful ways to get noticed by other Oracle experts.

BI Forum 2014

So now what is an Oracle ACE to me? I think it is someone knowledgeable about an Oracle product (or products) who enjoys sharing their knowledge and loves contributing to the greater Oracle community. It’s that simple. One thing about the Rittman Mead organization: sharing what we know is part of the ethos of the company. If you’re an avid reader of this blog, you’re well aware that Mark Rittman and others love to share their technical knowledge. We also have an internal system that allows us to share our know-how amongst each other. That’s one of the great benefits of working at Rittman Mead. If I don’t know the answer to a question, I can easily reach out to the entire company and get a response from experts like Mark, James Coyle, Andy Rocha or others within minutes. I love that about this organization!

Well, around March or April timeframe, it was mentioned in a company meeting that if anyone has the goal of joining the ACE program, reach out to one of our current ACEs or ACE Directors and work on a plan to get there (again, a testament to how the RM organization works and thinks). I talked with Stewart Bryson, Oracle ACE Director. After reviewing my contribution to the Oracle community he agreed to submit the nomination for me. Again, to my amazement, and very much my delight, my nomination was accepted as an Oracle ACE! What began as several small goals of improving my writing and speaking skills, sharing technical content, and having an article published, led to the overall end goal (though not known to me at the time) of an Oracle ACE program invitation. I worked hard to earn the nomination, but I also know I was very fortunate to have been surrounded by, and mentored by, some great individuals (and I still am!).


Since joining the ACE program, I’ve gained some additional Twitter followers, a great new ribbon on my conference badges, and some sweet ACE gear – but really it feels like not too much else has changed. Well, I do have an additional group of experts available to help me when I’m in a bind, because that’s what ACEs tend to do. I can now send a question via a tweet or email to someone in the ACE program and I’ll typically get an expert answer. It’s a community of like-minded individuals that love their specific Oracle technology – and love to share knowledge and help others. If this sounds like you, it’s time to work towards that ACE nomination!

Here at Rittman Mead we have several ACEs: myself, Venkat Janakiraman, Edelweiss Kammermann, and our newest ACE, Robin Moffatt, as well as an ACE Director, Mark Rittman. And our ACEs don’t just sit back and blog or speak at conferences (not that blogging and speaking isn’t hard work – it is!). We also provide all of the Rittman Mead services offered as consultants: training, consulting services, managed services support – all of it! Which means that when you hire Rittman Mead, the 5 folks in the ACE program are included, along with the other 100+ BI experts throughout the world. If you need some help on your Business Intelligence, Data Integration, or Advanced Analytics project, drop us a line at info@rittmanmead.com and we’ll be happy to have a chat.

Now that I’ve met the goal of Oracle ACE, what’s ahead for me? Well, I’m constantly working to keep up on the latest and greatest from the Oracle Data Integration team, while also consulting full-time. I plan to continue speaking and writing blog posts, but also work towards some additional published content in newsletters and magazines for the Oracle community. And I’m going to keep learning, because education is a never-ending journey. “They’ve done studies, you know. 60 percent of the time, it works every time.” — Brian Fantana 

By the way, if you think Rittman Mead is the type of company you’d like to work for, give us a shout at careers@rittmanmead.com.

Categories: BI & Warehousing

Going Beyond MapReduce for Hadoop ETL Pt.2 : Introducing Apache YARN and Apache Tez

Mon, 2014-12-08 02:00

In the first post in this three part series on going beyond MapReduce for Hadoop ETL, I looked at how a typical Apache Pig script gets compiled into a series of MapReduce jobs, and those MapReduce jobs pass data between themselves by writing intermediate resultsets to disk (HDFS, the Hadoop cluster file system). As a reminder, here’s the Pig script we’re working with:

register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar
raw_logs = LOAD '/user/mrittman/rm_logs' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs
GENERATE FLATTEN
  (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
)AS
  (remoteAddr: chararray, remoteLogname: chararray, user: chararray,time: chararray, request: chararray, status: chararray, bytes_string: chararray,referrer:chararray,browser: chararray);
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*');
logs_base_page = FOREACH logs_base_nobots GENERATE SUBSTRING(time,0,2) as day, SUBSTRING(time,3,6) as month, SUBSTRING(time,7,11) as year, FLATTEN(STRSPLIT(request,' ',5)) AS (method:chararray, request_page:chararray, protocol:chararray), remoteAddr, status;
logs_base_page_cleaned = FILTER logs_base_page BY NOT (SUBSTRING(request_page,0,3) == '/wp' or request_page == '/' or SUBSTRING(request_page,0,7) == '/files/' or SUBSTRING(request_page,0,12) == '/favicon.ico');
logs_base_page_cleaned_by_page = GROUP logs_base_page_cleaned BY request_page;
page_count = FOREACH logs_base_page_cleaned_by_page GENERATE FLATTEN(group) as request_page, COUNT(logs_base_page_cleaned) as hits;
page_count_sorted = ORDER page_count BY hits DESC;
page_count_top_10 = LIMIT page_count_sorted 10;
posts = LOAD '/user/mrittman/posts.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() as (post_id:int,title:chararray,post_date:chararray,post_type:chararray,author:chararray,url:chararray,generated_url:chararray);
posts_cleaned = FOREACH posts GENERATE CONCAT(generated_url,'/') as page_url,author as author, title as title;
pages_and_post_details = JOIN page_count by request_page, posts_cleaned by page_url;
pages_and_posts_trim = FOREACH pages_and_post_details GENERATE page_count::request_page as request_page, posts_cleaned::author as author, posts_cleaned::title as title, page_count::hits as hits;
pages_and_posts_sorted = ORDER pages_and_posts_trim BY hits DESC;
pages_and_post_top_10 = LIMIT pages_and_posts_sorted 10;
store pages_and_post_top_10 into 'top_10s/pages';

When you submit the script for execution using the Pig client, it parses the Pig Latin script, logically optimizes it and then compiles it into MapReduce programs; these programs are then sent in turn to the Hadoop 1.0 JobTracker, which sends them for execution on the Hadoop cluster – in the case of our Pig script, there are five MapReduce programs generated in total.


Each one of these MapReduce programs is what’s called a “directed acyclic graph”, or DAG. A directed acyclic graph is a programming style within distributed computing where processing is broken down into functions that can be run independently of one another, as long as one is not an ancestor of another. In MapReduce terms, this means that all mappers can run independently of each other (and therefore on different nodes in a cluster) and it’s only the reducers that have to wait for their ancestors to finish before they can also start their work independently of the other reducers. It’s a great programming model for processing large amounts of data with fault tolerance across a cluster, and it was the key insight that made MapReduce – and the “big data” systems that we work with today – possible.


A Pig script that generates five MapReduce jobs then effectively creates five DAGs, with each one using files (HDFS) to persist and hand-off data between them and JVMs being spun-up for the individual map and reduce jobs. As such, there’s no way for this version of MapReduce and the Hadoop framework it runs on to consider the overall dataflow, and each DAG has to run in isolation.


To address this issue, version 2.0 of Hadoop introduced a new feature called Apache YARN, or “yet another resource negotiator”. YARN took on the resource management and job scheduling/monitoring parts of Hadoop, effectively becoming an “operating system” for Hadoop on which other frameworks can run: initially MapReduce2 (reworked to run on YARN), and since then others such as Apache Tez and Apache Spark.


Crucially, YARN also supported frameworks that use DAGs to describe the entire dataflow, not just individual MapReduce jobs, and a new framework that came out of this capability was Apache Tez. Tez is a generalisation of the MapReduce distributed compute framework that supports these dataflow-style DAGs; it either runs MapReduce code unaltered, or offers its own API for describing DAGs using vertices (logic and resources) and edges (connections). Both Hive and Pig have been ported to run on Tez, and in Pig’s case this means another compiler is added alongside the existing MapReduce one, so that a Pig script executed on Tez can be submitted as a single DAG that encompasses all stages in the dataflow, with the steps linked together via in-memory passing of intermediate results rather than having to write those intermediate steps to disk.


In practice, what this means is that if your version of Pig or Hive has been updated to also run on Tez, you can run your code unaltered on Tez and typically see a 2-3x performance improvement without any changes to your application logic. Hortonworks have been the main Hadoop vendor backing Tez, and their new Hortonworks Data Platform 2.2 comes with Tez support, so I took my Pig script and ran it on a five-node cluster to get an initial timing: it took 4 min 17 secs to run, again generating the same five MapReduce jobs that I saw on the Cloudera CDH5 cluster. Then, running it with Tez as the execution engine (pig -x tez analyzeblog.pig), the same script took 2 mins 25 secs, around twice as fast as when we ran it using regular MapReduce.
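Switching engines really is that small a change at the command line – a sketch, using the script from this series:

# Run the script on the default MapReduce execution engine
pig analyzeblog.pig
# Run exactly the same script on the Tez execution engine
pig -x tez analyzeblog.pig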


 

And Tez is great for getting existing Pig and Hive scripts to run faster – if your platform supports it, you should use it instead of MapReduce as your execution engine; MapReduce code submitted “as is” will benefit from better YARN container re-use, and Hive and Pig scripts that run on the Tez execution engine can run as a single DAG and use memory, rather than disk, to pass data between jobs in the DAG. For ETL and analytic jobs that you’re creating from new though, Apache Spark is arguably the framework you should look to use instead of Tez, and tomorrow we’ll find out why.

Categories: BI & Warehousing

Going Beyond MapReduce for Hadoop ETL Pt.1 : Why MapReduce Is Only for Batch Processing

Sun, 2014-12-07 08:28

Over the previous few months I’ve been looking at the various ways you can load data into Hadoop, process it and then report on it using Oracle tools. We’ve looked at Apache Hive and how it provides a SQL layer over Hadoop, making it possible for tools like ODI and OBIEE to use their usual SQL set-based process approach to access Hadoop data; later on, we looked at another Hadoop tool, Apache Pig, which provides a more dataflow-type language over Hadoop for when you want to create step-by-step data pipelines for processing data. Under the covers, both Hive and Pig generate Java MapReduce code to actually move data around, with MapReduce then working hand-in-hand with the Hadoop framework to run your jobs in parallel across the cluster.

But MapReduce can be slow; it’s designed for very large datasets and batch processing, with overall analysis tasks broken-down into individual map and reduce tasks that start by reading data off disk, do their thing and then write the intermediate results back to disk again.


Whilst this approach means the system is extremely fault-tolerant and effectively infinitely scalable, the writing of each step in the process to disk means that MapReduce jobs typically take a long time to run and don’t really take advantage of the RAM that’s available in today’s commodity servers. Whilst this is a limitation most early adopters of Hadoop were happy to live with (in exchange for being able to cheaply analyse data on a scale previously unheard of), over the past few years as Hadoop adoption has broadened there have been a number of initiatives to move Hadoop past its batch-processing roots and into something more real-time that does more of its processing in-memory. Whilst there are a whole bunch of projects and products out there that claim to improve the speed of Hadoop processing and bring in-memory capabilities – Apache Drill, Cloudera Impala and Oracle’s Big Data SQL are just some examples – the two that are probably of most interest to Hadoop customers working in an Oracle environment are Apache Spark and Apache Tez. But before we get into the details of Spark, Tez and how they improve over MapReduce, let’s take a look at why MapReduce can be slow.

MapReduce and Hadoop 1.0 – Scalable, Fault-Tolerant, but Aimed at Batch Processing

Going back to MapReduce and what’s now termed “Hadoop 1.0”, MapReduce works on the principle of breaking larger jobs down into lots of smaller ones, with each one running independently and persisting its results back to disk at the end to ensure data doesn’t get lost if a server node breaks down. To take an example, the Apache Pig script below reads in some webserver log files, parses and filters them, aggregates the data and then joins it to another Hadoop dataset before outputting the results to a directory in the HDFS storage layer:

register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar
raw_logs = LOAD '/user/mrittman/rm_logs' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs
GENERATE FLATTEN
  (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
)AS
  (remoteAddr: chararray, remoteLogname: chararray, user: chararray,time: chararray, request: chararray, status: chararray, bytes_string: chararray,referrer:chararray,browser: chararray);
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*');
logs_base_page = FOREACH logs_base_nobots GENERATE SUBSTRING(time,0,2) as day, SUBSTRING(time,3,6) as month, SUBSTRING(time,7,11) as year, FLATTEN(STRSPLIT(request,' ',5)) AS (method:chararray, request_page:chararray, protocol:chararray), remoteAddr, status;
logs_base_page_cleaned = FILTER logs_base_page BY NOT (SUBSTRING(request_page,0,3) == '/wp' or request_page == '/' or SUBSTRING(request_page,0,7) == '/files/' or SUBSTRING(request_page,0,12) == '/favicon.ico');
logs_base_page_cleaned_by_page = GROUP logs_base_page_cleaned BY request_page;
page_count = FOREACH logs_base_page_cleaned_by_page GENERATE FLATTEN(group) as request_page, COUNT(logs_base_page_cleaned) as hits;
page_count_sorted = ORDER page_count BY hits DESC;
page_count_top_10 = LIMIT page_count_sorted 10;
posts = LOAD '/user/mrittman/posts.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() as (post_id:int,title:chararray,post_date:chararray,post_type:chararray,author:chararray,url:chararray,generated_url:chararray);
posts_cleaned = FOREACH posts GENERATE CONCAT(generated_url,'/') as page_url,author as author, title as title;
pages_and_post_details = JOIN page_count by request_page, posts_cleaned by page_url;
pages_and_posts_trim = FOREACH pages_and_post_details GENERATE page_count::request_page as request_page, posts_cleaned::author as author, posts_cleaned::title as title, page_count::hits as hits;
pages_and_posts_sorted = ORDER pages_and_posts_trim BY hits DESC;
pages_and_post_top_10 = LIMIT pages_and_posts_sorted 10;
store pages_and_post_top_10 into 'top_10s/pages';

Pig works by defining what are called “relations” or “aliases”, similar to tables in SQL, that contain data or pointers to data. You start by loading data into a relation from a file or other source, and then progressively define further relations that take that initial dataset and apply filters, use transformations, re-orientate the data or join it to other relations, until you’ve arrived at the final set of data you’re looking for. In this example we start with raw log data, parse it, filter out bot and spider activity, project just the columns we’re interested in, then remove further “noise” from the logs, join it to reference data and finally return the top ten pages over that period based on total hits.


Pig uses something called “lazy evaluation”, where relations you define don’t necessarily get created at the point they’re defined in the script; instead they’re used as a pointer to data and instructions on how to produce it if needed, with the Pig interpreter only materializing a dataset when it absolutely has to (for example, when you ask it to store a dataset on disk or output it to the console). Moreover, all the steps leading up to the final dataset you’ve requested are considered as a whole, giving Pig the ability to merge steps, miss out steps completely if they’re not actually needed to produce the final output, and otherwise optimize the flow of data through the process.

Running the Pig script and then looking at the console output from the Grunt command-line interpreter, you can see that five separate MapReduce jobs were generated to load in the data, filter, join and transform it, and then produce the output we requested at the end.

JobId                  Maps Reduces Alias                                                                                                               Feature Outputs
job_1417127396023_0145 12   2       logs_base,logs_base_nobots,logs_base_page,logs_base_page_cleaned,logs_base_page_cleaned_by_page,page_count,raw_logs GROUP_BY,COMBINER 
job_1417127396023_0146 2    1       pages_and_post_details,pages_and_posts_trim,posts,posts_cleaned                                                     HASH_JOIN 
job_1417127396023_0147 1    1       pages_and_posts_sorted                                                                                              SAMPLER 
job_1417127396023_0148 1    1       pages_and_posts_sorted                                                                                              ORDER_BY,COMBINER 
job_1417127396023_0149 1    1       pages_and_posts_sorted                                                                                              hdfs://bdanode1....pages2,

Pig generated five separate MapReduce jobs that loaded, parsed, filtered, aggregated and joined the datasets as part of an overall data “pipeline”, with the intermediate results staged to disk before the next MapReduce job took over. On my six-node CDH5.2 VM cluster it took just over five minutes to load, process and aggregate 5m records from our site’s webserver.


Now the advantage of this approach is that it’s more or less infinitely scalable and certainly resilient, but whilst Pig can look at your overall dataflow “graph” and come up with an optimal, efficient way to get to your end result, MapReduce treats every step as atomic and separate and insists on writing every intermediate step to disk before moving on.

What this means in practice is that ETL routines that use Pig, Hive and MapReduce, whilst scaling well, never really get to the point where you can run them as micro-batches or in real-time. For that type of scenario we need to look at moving away from MapReduce and breaking the link between Hadoop (the platform, the cluster management and resource handling part) and the processing that runs on it, so that we can run alternative execution engines on the Hadoop platform such as Apache Tez, which we’ll cover in tomorrow’s post.

Categories: BI & Warehousing

All You Need, and Ever Wanted to Know About the Dynamic Rolling Year

Thu, 2014-12-04 09:50

DRY_1

Among the many reporting requirements that have comprised my last few engagements has been what, from now on, I’ll be referring to as the ‘Dynamic Rolling Year’. That is, through the utilization of a dashboard prompt, the user is able to pick a year – month combination and the chosen view, in this case a line graph, will dynamically display a year’s worth of figures with the chosen year – month as the most current starting point. The picture above better demonstrates this feature. You can see that we’ve chosen the ‘201212’ year – month combination and as a result, our view displays revenue data from 201112 through to 201212. Let’s choose another combination to make sure things work properly and to demonstrate the functionality.

DRY_2

From the results on the picture above, it seems we’ve gotten our view to do as expected. We’ve chosen ‘201110’ as our starting year – month combination and the view has changed to display revenue trend back to ‘201010’. Now how do we do this? A quick web search will offer up myriad complex and time-consuming solutions which, while completely feasible, can be onerous to implement. While it was tempting to reap the benefits of someone else’s perhaps hard fought solution, I just knew there had to be an easier way. As a note of caution, I’ll preface the next bit by saying that this only works on a dashboard through using a dashboard prompt in tandem with the target analysis. This solution was tested via a report prompt, however the desired results were not obtained, even when ‘hard coding’ default values for our presentation variables. As well, there are a few assumptions I’m making around what kinds of columns you have in your subject area’s time dimension, namely that you have some sort of Year and Month column within your hierarchy.

1. Establish Your Analysis and Prompt

DRY_3

I’ve brought in the Month column (renamed as Year – Month) from Sample App’s Sample Sales subject area along with the Revenue fact. For the sake of this exercise, we’re going to keep things fairly simple and straightforward. Our Month column exists in the ‘YYYY / MM’ format so we’re going to have to dice it up a little bit to get it to where we need it to be. To get our dynamic rolling year to work, all we’re going to do is a little math. In a nutshell, we’re going to turn our month into an integer as ‘YYYYMM’ and simply subtract a hundred in a filter that uses the presentation variables corresponding to our new month column. To make the magic happen, bring in a dummy column from your time dimension, concatenate the year and month together, cast them as a char and then the whole statement as an integer (as seen in the following formula: CAST(substring("Time"."T02 Per Name Month", 1, 4)||substring("Time"."T02 Per Name Month", -2, 2) AS INT) ). Then, apply a ‘between’ filter to this column using two presentation variables.

DRY_4

Don’t worry that we’ve yet to establish them on our dashboard prompt as that will happen later. In the picture above, I’ve set the filter to be between two presentation variables entitled ‘YEARMONTH_VAR’. As you can see, I’ve subtracted a hundred as, in this case, my year – month concatenation exists in the ‘YYYYMM’ format. You may alter this number as needed. For example, a client already had this column established in their Time dimension, yet it was in the ‘YYYYMMM’ format with a leading zero before the month. In that case, we had to subtract 1000 in order to establish the rolling year. Moving on…. After you’ve set up the analysis, save it off and let’s build the dashboard prompt. The results tab will error out. This is normal. We’re going to be essentially doing the same thing here.

We need to get our analysis and prompt to work together (like they should) and so we’ll need to, once again, format the year – month column on the prompt the same way we did it on our analysis through the ‘prompt for column’ link. You can simply copy and paste the formula you used in the ‘Edit Formula’ column on your analysis into this screen. Don’t forget to assign a presentation variable to this column with whatever name you used on the filter in your analysis (in this case it was YEARMONTH_VAR).

DRY_6

DRY_7

DRY_8

2. Build the Dashboard

2014-12-04_10-04-56

Starting a new dashboard, bring in your prompt and analysis from the Shared Folder. In this exercise, I’ve placed the prompt below the analysis and have done some custom formatting. If you’d like to create this example exactly, then in addition to the default table view, you’ll need to create a line graph view that groups by our year – month column and uses the revenue column as the measure. And voila. Save and run the dashboard. Note that if you didn’t adopt a default value for your dashboard prompt, it will error out before you actually pick and apply a value from the prompt drop-down. Admittedly, the closer to a concatenated year-month (YYYYMM) column you have in your time dimension, the easier the implementation of this method will be. You could even set your default value to an appropriate server/repository variable so that your dashboard opens to the most current and / or a desired year-month combination. However, now that you’ve seen this, hopefully it’s sparked some inspiration into how you might apply this technique to other facets of your work and future reporting.

DRY_1

Categories: BI & Warehousing

Private or Public Training? (and a Discount Code)

Thu, 2014-12-04 08:22

Rittman Mead offers world class training that covers various Business Intelligence, Data Warehousing and Data Integration tools.

Rittman Mead currently offers two types of training, Public and Private – so what is the best option for your organization?

Both Private and Public options are staffed by the same highly qualified, experienced instructors. One thing to note is our instructors aren’t merely technology instructors, but they are real world consultants.

What does that mean for you? It means questions aren’t answered by the instructor flipping to a slide in a deck and reading off the best practice. You’ll probably hear an answer more like “I was on site with a company a few weeks ago who had the same issue. Let me tell you how we helped that client!”

Let’s start with Public courses. Our offerings include an OBIEE Bootcamp, an ODI Bootcamp, Exalytics for System Administrators, RPD Modeling and our all new OBIEE Front End Development, Design and Best Practices.

Why Public? If you have four or fewer attendees, the public classes are typically the most cost effective choice. You’ll attend the course along with other individuals from various public and private sector organizations. There’s plenty of time to learn the concepts, get your hands dirty by completing labs as well as getting all of your questions answered by our highly skilled instructors.

Now, our Private courses. There are primarily two reasons a company would choose private training: cost, and the ability to customize.

Cost : Our Private Training is billed based on the instructor’s time not the number of attendees. If you have five or more individuals, it is most likely going to be a more cost effective option for us to come to you. Also, you’ll be covering travel costs for one instructor rather than your entire team.

Customization : Once you select the course that is best for your team, we’ll set up a call with a trainer to review your team’s experience level and your main goals. Based on that conversation, we can tweak the syllabus to make sure the areas you need extra focus are thoroughly covered.

One thing to note is we do offer a few courses (OWB 11gR2 & ODI 11g) privately that aren’t offered publicly.

We also offer a Custom Course Builder (http://www.rittmanmead.com/training/customobiee11gtraining/) which allows you to select modules out of various training classes and build a customized course for your team – this flexibility is a great reason to select Private Training.

Some of our clients love having their team get away for the week and focus strictly on training while others prefer to have training on site at their offices with the flexibility to cut away for important meetings or calls. Whichever program works best for your team, we have got you covered!

For more information, reach out to us at info@rittmanmead.com. If you’ve made it all the way to the end of this blog post, congratulations! As a way of saying thanks, you can mention the code “RMTEN” to save 10% on your next private or public course booking!

Categories: BI & Warehousing

Rittman Mead Founder Member of the Red Expert Alliance

Thu, 2014-11-20 05:24

At Rittman Mead we’re always keen to collaborate with our peers and partners in the Oracle industry, running events such as the BI Forum in Brighton and Atlanta each year and speaking at user groups in the UK, Europe, USA and around the world. We were pleased therefore to be contacted by our friends over at Amis in the Netherlands just before this year’s Openworld with an idea they had about forming an expert-services Oracle partner network, and to see if we’d be interested in getting involved.

Fast-forward a few weeks and thanks to the organising of Amis’ Robbrecht van Amerongen we had our first meeting at San Francisco’s Thirsty Bear, where Amis and Rittman Mead were joined by the inaugural member partners Avio Consulting, E-VIta AS, Link Consulting, Opitz Consulting and Rubicon Red.

Gathering of the expert alliance partners at #oow14 @Oracle open world. pic.twitter.com/PNK8fpt7iU

— Red Expert Alliance (@rexpertalliance) November 3, 2014

The Red Expert Alliance now has a website, news page and partner listing, and as the “about” page details, it’s all about collaboration between first-class Oracle consulting companies:


“The Red Expert Alliance is an international  network of first-class Oracle consulting companies. We are  working together to deliver the maximum return on our customers investment in Oracle technology. We do this by collaborating, sharing and challenging each other to improve ourselves and our customers.

Collaborating with other companies is a powerful way to overcome challenges of today’s fast-paced world and improve competitive advantage. Collaboration provides participants mutual benefits such as shared resources, shared expertise and enhanced creativity; it gives companies an opportunity to improve their performance and operations, achieving more flexibility thanks to shared expertise and higher capacity. Collaboration also fuels innovation by providing more diversity to the workplace which can result in better-suited solutions for customers.”

There’s also a press release, and a partner directory listing out our details along with the other partners in the alliance. We’re looking forward to collaborating with the other partners in the Red Expert Alliance, increasing our shared knowledge and collaborating together on Oracle customer projects in the future.

Categories: BI & Warehousing

OBIEE SampleApp v406 Amazon EC2 AMI – available for public use

Thu, 2014-11-13 08:34

I wrote a while ago about converting Oracle’s superb OBIEE SampleApp from a VirtualBox image into an EC2-hosted instance. I’m pleased to announce that Oracle have agreed for us to make the image (AMI) on Amazon available publicly. This means that anyone who wants to run their own SampleApp v406 server on Amazon’s EC2 cloud service can do so.


Important caveats

Before getting to the juicy stuff there are some important points to note about access to the AMI, which you are implicitly bound by if you use it:

  1. In accessing it you’re bound by the same terms and conditions that govern the original SampleApp
  2. SampleApp is only ever for use in your own development/testing/prototyping/demonstrating with OBIEE. It must not be used as the basis for any kind of Productionisation.
  3. Neither Oracle nor Rittman Mead provide any support for SampleApp or the AMI, nor any warranty for issues caused through their use.
  4. Once launched, the server will be accessible to the public and it’s your responsibility to secure it as such.
How does it work?
  1. Create yourself an AWS account, if you haven’t already. You’ll need your credit card for this. Read more about getting started with AWS here.
  2. Request access to the AMI (below)
  3. Launch the AMI on your AWS account
  4. Everything starts up automagically. After 15-20 minutes, enjoy your fully functioning SampleApp v406 instance, running in the cloud!
How much does it cost?

You can get an estimate of the cost involved using the Amazon Calculator.

As a rough guide, as of November 2014 an “m3.large” instance costs around $4 a day  – but it’s your responsibility to check pricing and commitments.

Be aware that once a server is created you’ll incur costs on it right through until you “terminate” it. You can “stop” it (in effect, power it off) which reduces the running costs but you’ll still pay for the ‘disk’ (EBS volume) that holds it. The benefit of this though is that you can then power it back up and it’ll be as you left it (just with a different IP).
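If you prefer the command line to the AWS console, the same lifecycle operations can be scripted with the AWS CLI. This is only an illustrative sketch – the instance ID below is a placeholder for your own:

# Stop the instance: you keep paying for the EBS volume, but not the compute
aws ec2 stop-instances --instance-ids i-0abc1234
# Start it back up again - note it will come back with a different public IP
aws ec2 start-instances --instance-ids i-0abc1234
# Terminate it when you're completely done, to stop all charges
aws ec2 terminate-instances --instance-ids i-0abc1234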

You can track your AWS usage through the AWS page here.

Security
  • Access to the instance’s command line is through SSH as the oracle user using SSH keys only (provided by you when you launch the server) – no password access. A connection and firewall sketch is shown after this list.
    • You cannot ssh to the server as root; instead connect as oracle and use sudo as required.
    • The ssh key does not get set up until the very end of the first boot sequence, which can be 20 minutes. Be patient!
  • All the OBIEE/WebLogic usernames and passwords are per the stock SampleApp v406 image, so you are well advised to change them. Otherwise if someone finds your instance running, they’ll be able to access it
  • There is no firewall (iptables) running on the server. Since this is a public server you’d be wise to make use of Amazon’s Security Group functionality (in effect, a firewall at the virtual hardware level) to block access on all ports except those necessary.
    For example, you could block all traffic except 7780, and then enable access on port 22 (SSH) and 7001 (Admin Server) just when you need to access it for admin.
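As an illustrative sketch only (the key file name, IP address, CIDR range and security group ID below are all placeholders, and the AWS CLI calls are simply a scripted alternative to managing the Security Group in the console):

# Connect over SSH as the oracle user, using the key pair you chose at launch
chmod 400 sampleapp.pem
ssh -i sampleapp.pem oracle@42.42.42.42

# Open analytics (7780) to the world, and SSH (22) / Admin Server (7001) to your own network range only
aws ec2 authorize-security-group-ingress --group-id sg-0123abcd --protocol tcp --port 7780 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123abcd --protocol tcp --port 22 --cidr 203.0.113.0/24
aws ec2 authorize-security-group-ingress --group-id sg-0123abcd --protocol tcp --port 7001 --cidr 203.0.113.0/24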
Using the AMI
  1. You first need to get access to the AMI, through the form below. You also need an active AWS account.
  2. Launch the server:
    1. From the AWS AMI page locate the SampleApp AMI using the details provided when you request access through the form below. Make sure you are on the Ireland/eu-west-1 region. Click Launch.
    2. Select an Instance Type. An “m3.large” size is a good starting point (this site is useful to see the spec of all instances).
    3. Click through the Configure Instance Details, Add Storage, and Tag Instance screens without making changes unless you need to.
    4. On the Security Group page select either a dedicated security group if you have already configured one, or create a new one.
      A security group is a firewall that controls traffic to the server regardless of any software firewall configured or not on the instance. By default only port 22 (SSH) is open, so you’ll need to open at least 7780 for analytics, and 7001 too if you want to access WLS/EM as well
      Note that you can amend a security group’s rules once the instance is created, but you cannot change which security group it is bound to. For ad-hoc purposes I’d always use a dedicated security group per instance so that you can change rules just for your server without impacting others on your account.
    5. Click on Review and Launch, check what you’ve specified, and then click Launch. You’ll now need to either specify an existing SSH key pair, or generate a new one. It’s vital that you get this bit right, otherwise you’ll not be able to access the server. If you generate a new key pair, make sure you download it (it’ll be a .pem file).
    6. Click Launch Instances
      You’ll get a hyperlinked Instance ID; click on that and it’ll take you to the Instances page filtered for your new server.
      Shortly you’ll see the server’s public IP address shown.
  3. OBIEE is configured to start automagically at boot time along with the database. This means that in theory you don’t need to actually access the server directly. It does take 15-20 minutes on first boot to all fire up though, so be patient. A simple way to check when it’s up is sketched after the list of URLs below.
  4. The managed server is listening on port 7780, and admin server on 7001. If your server IP is 42.42.42.42 the URLs would be:
    • Analytics: http://42.42.42.42:7780/analytics
    • WLS: http://42.42.42.42:7001/console
    • EM: http://42.42.42.42:7001/em
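Rather than repeatedly refreshing your browser during that first boot, you could poll the analytics URL from the command line until it starts responding. A rough sketch only, using the example IP from above:

# Poll the analytics URL every 30 seconds until OBIEE responds
until curl --silent --output /dev/null --fail http://42.42.42.42:7780/analytics; do
    echo "$(date) OBIEE not up yet, waiting ..."
    sleep 30
done
echo "OBIEE is up"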
On the server

The server is a stock SampleApp v406 image, with a few extras:

  • obiee and dbora services configured and set to run at bootup. Control obiee using:
    sudo service obiee status
    sudo service obiee stop
    sudo service obiee start
    sudo service obiee restart
  • screen installed with a .screenrc setup
Accessing the AMI

To get access to the AMI, please complete this short form and we will send you the AMI details by email.

By completing the form and requesting access to the AMI, you are acknowledging that you have read and understood the terms and conditions set out by Oracle here.

Categories: BI & Warehousing

Rittman Mead announces its course catalogue in Spanish and Portuguese.

Wed, 2014-11-12 07:55


We are pleased to announce that Rittman Mead now offers its full course catalogue in Spanish and Portuguese. This is great news for those living in Latin America, Spain and Portugal, who can now receive the very best training, including the theory and hands-on material, in their local language.

Courses are delivered both remotely, as live virtual classes through our web platform, and on site in person. Either way, each student receives the course material electronically and has access throughout the course to a dedicated virtual machine for the hands-on labs.

We offer a wide variety of Oracle Business Intelligence and Data Warehousing courses: from intensive five-day courses (bootcamps) to sets of specific courses organised by job role. Browse the full catalogue in Spanish, Portuguese or English.

Interested in our courses, or want to become a training partner? Don’t hesitate to contact us at training@rittmanmead.com.

Categories: BI & Warehousing

Auditing OBIEE Presentation Catalog Activity with Custom Log Filters

Mon, 2014-11-10 01:49

A question that I’ve noticed coming up a few times on the OBIEE OTN forums goes along the lines of “How can I find out who deleted a report from the Presentation Catalog?”. And whilst the BI Server’s Usage Tracking is superb for auditing who ran what report, we don’t by default have a way of seeing who deleted a report.

The Presentation Catalog (or “Web Catalog” as it was called in 10g) records who created an object and when it was last modified, accessible through both OBIEE’s Catalog view, and the dedicated Catalog Manager tool itself:


But if we want to find out who deleted an object, or maybe who modified it before the most recent person (that is, build up an audit trail of who modified an object) we have to dig a bit deeper.

Presentation Services Log Sources

Perusing the OBIEE product manuals, one will find additional “Logging in Oracle BI Presentation Services” options documented. This is more than just turning up the log level en masse, because it also includes additional log writers and filters. What this means is that you can have your standard Presentation Services logging, but then configure a separate file to capture more detailed information about just specific goings-on within Presentation Services.

Looking at a normal Presentation Services log (in $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/) you’ll see various messages by default – greater or fewer depending on the health of your system – but they all include the Location stack trace, such as this one here:

[2014-11-10T06:33:19.000-00:00] [OBIPS] [WARNING:16] [] [saw.soap.soaphelpers.writeiteminfocontents] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000001ede,0:1] [tid: 2569512704] Resolving and writing full ACL for path /shared/Important stuff/Sales by brand[[
File:soaphelpers.cpp
Line:609
Location:
        saw.soap.soaphelpers.writeiteminfocontents
        saw.soap.catalogservice
        saw.SOAP
        saw.httpserver.request.soaprequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales by brand
AuthProps: AuthSchema=UidPwd-soap|PWD=******|UID=weblogic|User=weblogic
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000001ede,0:1
ThreadID: 2569512704

And it is the Location that is of interest to us here, because it’s what gives hints about the types of log messages that can be emitted and that we may want to filter. For example, the one quoted above is evidently something to do with the Presentation Catalog and SOAP, which I’d guess is a result of Catalog Manager (which uses web services/SOAP to access OBIEE).

To get a full listing of all the possible log sources, first set up the BI command line environment with bi-init:

source $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

and then run:

sawserver -logsources

(If you get an error, almost certainly you didn’t set up the command line environment properly with bi-init). You’ll get a list of over a thousand lines (which gives you an idea of quite how powerful this granular logging is). Assuming you’ll want to peruse it at your leisure, it makes sense to write it to disk, which on *nix you can simply do thus:

sawserver -logsources > sawserver.logsources.txt

To find what you want on the list, you can just search through it. Looking for anything related to “catalog” and narrowing it down further, I came up with these interesting sources:

[oracle@demo ~]$ sawserver -logsources|grep catalog|grep local
saw.catalog.item.getlocalized
saw.catalog.local
saw.catalog.local.checkforcatalogupgrade
saw.catalog.local.copyItem
saw.catalog.local.createFolder
saw.catalog.local.createLink
saw.catalog.local.deleteItem
saw.catalog.local.getItemACL
saw.catalog.local.getItemInfo
saw.catalog.local.loadCatalog
saw.catalog.local.moveItem
saw.catalog.local.openObject
saw.catalog.local.readObject
saw.catalog.local.search
saw.catalog.local.setItemACL
saw.catalog.local.setItemInfo
saw.catalog.local.setMaintenanceMode
saw.catalog.local.setOwnership
saw.catalog.local.writeObject

Configuring granular Presentation Services logging

Let us see how to go and set up this additional logging. Remember, this is not the same as just going to Enterprise Manager and bumping the log level to 11 globally – we’re going to retain the default logging level, but for just specific actions that occur within the tool, capture greater information. The documentation for this is here.

The configuration is found in the instanceconfig.xml file, so like all good sysadmins let’s take a backup first:

cd $FMW_HOME/instances/instance1/config/OracleBIPresentationServicesComponent/coreapplication_obips1/
cp instanceconfig.xml instanceconfig.xml.20141110

Now depending on your poison, open the instanceconfig.xml directly in a text editor from the command line, or copy it to a desktop environment where you can open it in your favourite text editor there. Either way, these are the changes we’re going to make:

  1. Locate the <Logging> section. Note that within it there are three child entities – <Writers>, <WriterClassGroups> and <Filters>. We’re going to add an entry to each.

  2. Under <Writers>, add:

    <Writer implementation="FileLogWriter" name="RM Presentation Catalog Audit" disableCentralControl="true" writerClassId="6" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="rm_pres_cat_audit" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>

    This defines a new writer that will write logs to disk (FileLogWriter), in 10MB (10240KB) files of which it’ll keep ten. If you’re defining additional Writers, make sure they have a unique writerClassId. See the documentation for detailed syntax.

  3. Under <WriterClassGroups> add:

    <WriterClassGroup name="RMLog">6</WriterClassGroup>

    This defines the RMLog class group as being associated with writerClassId 6 (as defined above), and is used in the Filters section to direct logs. If you wanted you could log entries to multiple logs (eg both file and console) this way.

  4. Under <Filters> add:

    <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.moveItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
    <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.deleteItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>

    Here we’re defining two event filters, with levels turned up to max (32), directing the capture of any occurrences to the RMLog writerClassGroup.

After making the changes to instanceconfig.xml, restart Presentation Services:

$FMW_HOME/instances/instance1/bin/opmnctl restartproc ias-component=coreapplication_obips1

Here’s the completed instanceconfig.xml from the top of the file through to the end of the <Logging> section, with my changes overlaid on the defaults:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Oracle Business Intelligence Presentation Services Configuration File -->
<WebConfig xmlns="oracle.bi.presentation.services/config/v1.1">
   <ServerInstance>

      <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><CatalogPath>/app/oracle/biee/instances/instance1/SampleAppWebcat</CatalogPath>

      <DSN>AnalyticsWeb</DSN>

      <Logging>

         <Writers>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="FileLogWriter" name="Global File Logger" writerClassId="1" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="sawlog" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="CoutWriter" name="Console Logger" writerClassId="2" maxFileSizeKb="10240"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="EventLogWriter" name="System Event Logger" writerClassId="3" maxFileSizeKb="10240"/>
            <!--  The following writer is not centrally controlled -->
            <Writer implementation="FileLogWriter" name="Webcat Upgrade Logger" disableCentralControl="true" writerClassId="5" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="webcatupgrade" maxFileSizeKb="2147483647" filesN="1" fmtName="ODL-Text"/>
            <Writer implementation="FileLogWriter" name="RM Presentation Catalog Audit" disableCentralControl="true" writerClassId="6" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="rm_pres_cat_audit" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>
         </Writers>

         <WriterClassGroups>
            <WriterClassGroup name="All">1,2,3,5,6</WriterClassGroup>
            <WriterClassGroup name="File">1</WriterClassGroup>
            <WriterClassGroup name="Console">2</WriterClassGroup>
            <WriterClassGroup name="EventLog">3</WriterClassGroup>
            <WriterClassGroup name="UpgradeLogFile">5</WriterClassGroup>
            <WriterClassGroup name="RMLog">6</WriterClassGroup>
         </WriterClassGroups>

         <Filters>
            <!--  These FilterRecords are updated by centrally controlled configuration -->
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><FilterRecord writerClassGroup="File" path="saw" information="1" warning="31" error="31" trace="0" incident_error="1"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><FilterRecord writerClassGroup="File" path="saw.mktgsqlsubsystem.joblog" information="1" warning="31" error="31" trace="0" incident_error="1"/>

            <!--  The following FilterRecords are not centrally controlled -->
            <FilterRecord writerClassGroup="UpgradeLogFile" disableCentralControl="true" path="saw.subsystem.catalog.initialize.upgrade" information="1" warning="32" error="32" trace="1" incident_error="32"/>
            <FilterRecord writerClassGroup="UpgradeLogFile" disableCentralControl="true" path="saw.subsystem.catalog.upgrade" information="1" warning="32" error="32" trace="1" incident_error="32"/>
            <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.moveItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
            <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.deleteItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
         </Filters>

      </Logging>

[...]

Granular logging in action

Having restarted Presentation Services after making the above change, I can see in my new log file whenever an item from the Presentation Catalog is deleted, by whom, and from what IP address:

[2014-11-10T07:13:36.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.deleteItem] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002cf1,0:1] [tid: 2458068736] Succeeded with '/shared/Important stuff/Sales by brand 2'[[
File:localwebcatalog.cpp
Line:626
Location:
        saw.catalog.local.deleteItem
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales by brand 2
SessionID: p8n6ojs0vkh7tou0mkstmlc9me381hadm9o1fui
AuthProps: AuthSchema=UidPwd|PWD=******|UID=r.mellie|User=r.mellie
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002cf1,0:1
ThreadID: 2458068736
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: action='rm',_scid='QR5zMdHIL3JsW1b67P9p',icharset='utf-8',urlGenerator='qualified',paths='["/shared/Important stuff/Sales by brand 2"]'
]]

And the same for when a file is moved/renamed:

[2014-11-10T07:28:17.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.moveItem] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000003265,0:1] [tid: 637863680] Source '/shared/Important stuff/copy of Sales by brand', Destination '/shared/Important stuff/Sales by brand 2': Succeeded with '/shared/Important stuff/copy of Sales by brand'[[
File:localwebcatalog.cpp
Line:1186
Location:
        saw.catalog.local.moveItem
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/copy of Sales by brand
SessionID: ddt6eo7llcm0ohs5e2oivddj7rtrhn8i41a7f32
AuthProps: AuthSchema=UidPwd|PWD=******|UID=f.saunders|User=f.saunders
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000003265,0:1
ThreadID: 637863680
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: path='/shared/Important stuff/copy of Sales by brand',action='ren',_scid='84mO8SRViXlwJ*180HV7',name='Sales by brand 2',keepLink='f',icharset='utf-8',urlGenerator='qualified'
]]
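Because the new log file is plain ODL-Text, you can also pull a quick audit trail out of it from the command line. This is just a rough sketch, assuming the audit files follow the usual <filePrefix>N.log naming of the other Presentation Services log files:

cd $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/

# Who, what, and from where, for every audited delete/move event
grep -h -E '^(Path|AuthProps|RemoteIP|HttpArgs):' rm_pres_cat_audit*.log

# Count of audited events per user
grep -h '^AuthProps:' rm_pres_cat_audit*.log | sed 's/.*User=//' | sort | uniq -c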

Be careful with your logging

Just because you can log everything, don’t be tempted to actually log everything. Bear in mind that we’re crossing over from simple end-user logging here into the very depths of the sawserver (Presentation Services) code, accessing logging that is extremely diagnostic in nature. Which for our specific purpose of tracking when someone deletes an object from the Presentation Catalog is handy. But as an example, if you enable saw.catalog.local.writeObject event logging, you may think that it will record who changed a report when, and that might be useful. But – look at what gets logged every time someone saves a report:

[2014-11-10T07:19:32.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.writeObject] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002efb,0:1] [tid: 2454759168] Succeeded with '/shared/Important stuff/Sales 01'[[
File:localwebcatalog.cpp
Line:1476
Location:
        saw.catalog.local.writeObject
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales 01
SessionID: p8n6ojs0vkh7tou0mkstmlc9me381hadm9o1fui
AuthProps: AuthSchema=UidPwd|PWD=******|UID=r.mellie|User=r.mellie
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002efb,0:1
ThreadID: 2454759168
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: path='/shared/Important stuff/Sales 01',action='wr',_scid='QR5zMdHIL3JsW1b67P9p',repl='t',followLinks='t',icharset='utf-8',modifiedTime='1415600931000',data='<saw:report xmlns:saw="com.siebel.analytics.web/report/v1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sawx="com.siebel.analytics.web/expression/v1.1" xmlVersion="201201160"><saw:criteria xsi:type="saw:simpleCriteria" subjectArea="&quot;A - Sample Sales&quot;" withinHierarchy="true"><saw:columns><saw:column xsi:type="saw:regularColumn" columnID="c1dff1637cbc77948"><saw:columnFormula><sawx:expr xsi:type="sawx:sqlExpression">"Time"."T05 Per Name Year"</sawx:expr></saw:columnFormula></saw:column></saw:columns></saw:criteria><saw:views currentView="0"><saw:view xsi:type="saw:compoundView" name="compoundView!1"><saw:cvTable><saw:cvRow><saw:cvCell viewName="titleView!1"><saw:displayFormat><saw:formatSpec/></saw:displayFormat></saw:cvCell></saw:cvRow><saw:cvRow><saw:cvCell viewName="tableView!1"><saw:displayFormat><saw:formatSpec/></saw:displayFormat></saw:cvCell></saw:cvRow></saw:cvTable></saw:view><saw:view xsi:type="saw:titleView" name="titleView!1"/><saw:view xsi:type="saw:tableView" name="tableView!1" scrollingEnabled="false"><saw:edges><saw:edge axis="page" showColumnHeader="true"/><saw:edge axis="section"/><saw:edge axis="row" showColumnHeader="true"><saw:edgeLayers><saw:edgeLayer type="column" columnID="c1dff1637cbc77948"/></saw:edgeLayers></saw:edge><saw:edge axis="column" showColumnHeader="rollover"/></saw:edges></saw:view></saw:views></saw:report>',sig='queryitem1'
]]

It’s the whole report definition! And this is a very very small report – real life reports can be page after page of XML. That is not a good level at which to be recording this information. If you want to retain this kind of control over who is saving what report, you should maybe be looking at authorisation groups for your users in terms of where they can save reports, and have trusted ‘gatekeepers’ for important areas.

As well as the verbose report capture with the writeObject event, you also get this background chatter:

[2014-11-10T07:20:27.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.writeObject] [ecid: 0051rj7FmC3Fw000jzwkno0007PK000000,0:200] [tid: 3034580736] Succeeded with '/users/r.mellie/_prefs/volatileuserdata'[[
File:localwebcatalog.cpp
Line:1476
Location:
        saw.catalog.local.writeObject
        saw.subsystem.security.cleanup
        saw.Sessions.cache.cleanup
        saw.taskScheduler.processJob
        taskscheduler
        saw.threads
Path: /users/r.mellie/_prefs/volatileuserdata
ecid: 0051rj7FmC3Fw000jzwkno0007PK000000,0:200
ThreadID: 3034580736
task: Cache/Sessions
]]

volatileuserdata is presumably just that (user data that is volatile, constantly changing) and not something anyone would be interested in logging – but you can’t capture actual report writes without capturing this too. On a busy system you’re going to be unnecessarily thrashing the log files if you capture this event as a matter of routine – so don’t!

Summary

The detailed information is there for the taking in Presentation Services’ excellent granular log sources – just be careful what you capture lest you bite off more than you can chew.

Categories: BI & Warehousing

Analytics with Kibana and Elasticsearch through Hadoop – part 3 – Visualising the data in Kibana

Tue, 2014-11-04 16:02

In this post we will see how Kibana can be used to create visualisations over the various sets of data that we have combined. Kibana is a graphical front end for data held in Elasticsearch, which also provides the analytic capabilities. Previously we looked at where the data came from and how to expose it through Hive, and then how to load it into Elasticsearch. Here’s what we’ve built so far, the borders denoting what was covered in the previous two blog articles and what we’ll cover here:


Now that we’ve got all the data into Elasticsearch, via Hive, we can start putting some pictures around it. Kibana works by directly querying Elasticsearch, generating the same kind of queries that you can run yourself through the Elasticsearch REST API (similar to what we saw when defining the mappings in the previous article). In this sense there is a loose parallel between OBIEE’s Presentation Services and the BI Server – one does the fancy front end stuff, generating queries to the hard-working backend.
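To make that parallel concrete, here’s a rough, hand-written approximation (not captured from Kibana itself, and the exact queries it generates will differ) of the kind of terms aggregation that sits behind a Term panel, run directly against the all_blog index built in the previous article:

# Top ten blog post authors by document count, straight from the Elasticsearch REST API
curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/posts/_search?search_type=count' -d '
{
  "aggs": {
    "top_authors": {
      "terms": { "field": "post_author", "size": 10 }
    }
  }
}' | jq '.aggregations.top_authors.buckets'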

I’ve been looking at both the current release version of Kibana (3.x), and also the beta of Kibana 4 which brings with it a very smart visualiser that we’ll look at in detail. It looks like Kibana 4 is a ground-up rewrite rather than modifications to Kibana 3, which means that at the moment it is a long way from parity of functionality – which is why I’m flitting between the two. For a primer in Kibana 3 and its interface see my article on using it to monitor OBIEE.

Installing Kibana is pretty easy in Kibana 3, involving a simple config change to a web server of your choice that you need to provide (details in my previous blog), and has been made even easier in Kibana 4 which actually ships with its own web server so you literally just download it, unarchive it and run it.

So the starting point is the assumption we have all the data in a single Elasticsearch index all_blog, with three different mappings which Kibana refers to accurately as “types”: blog posts, blog visits, and blog tweets.

Kibana 3

Starting with a simple example first, and to illustrate the “analysed” vs “non-analysed” mapping configuration that I mentioned previously, let’s look at the Term visualisation in Kibana 3. This displays the results of an Elasticsearch analysis against a given field. If the field has been marked as “not analysed” we get a listing of the literal values, ranking by the number of times they repeat. This is useful, for example, to show who has blogged the most:

But less useful if we want to analyse the use of words in blog titles, since non-analysed we just get a listing of blog titles:

(there are indeed two blog posts entitled “Odds and Ends” from quite a while ago)

Building the Term visualisation against the post title field that has been analysed gives us a more interesting, although hardly surprising, result:

Here I’ve weeded out the obvious words that will appear all the time (‘the’, ‘a’, etc), using the Exclude Term(s) option.

Term visualisations are really useful for displaying any kind of top/bottom ranked values, and also because they are interactive – if you click on the value it is applied as a filter to the data on the page. What that means is that we can take a simple dashboard using the two Term objects above, plus a histogram of posts made over time:

And by clicking on one of the terms (for example, my name in the authors list) it shows that I only started posting on the Rittman Mead blog three years ago, and that I write about OBIEE, performance, and exalytics.

Taking another tack, we can search for any term and add it in to the histogram. Here we can see when interest in 11g (the green line), as well as big data (red), started:

Note here we’re just analyzing post titles not content so it’s not 100% representative. Maybe loading in our post contents to Elasticsearch will be my next blog post. But that does then start to get a little bit meta…

Adding in a Table view gives us the ability to show the actual posts and links to them.

Let’s explore the data a bit. Clicking on an entry in the table gives us the option to filter down further

Here we can see for a selected blog post, what its traffic was and when (if at all) it was tweeted:

An interesting feature in the profile of blog hits is a second peak that looks like it might correlate with tweets. Let’s drill further by drag-clicking (brushing) on the graph to select the range we want, and bring in details of those tweets:

So this is all pretty interesting, and importantly, very rapid in terms of both the user experience and the response time.

Kibana 4

Now let’s take a look at what Kibana 4 offers us. As well as a snazzier interface (think hipster data explorer vs hairy ops guy parsing logs), its new Visualiser builder is great. Kibana 3 dumped you on a dashboard in which you have to build rows and panels and so on. Kibana 4 has a nice big “Visualize” button. Let’s see what this does for us. To start with it’s a nice “guided” build process:

By default we get a single bar, counting all the ‘documents’ for the time period. We can use the Search option at the top to filter just the ‘type’ of document we want, which in this case is going to be tweets about our blog articles.

Obviously, a single bar on its own isn’t that interesting, so let’s improve it. We’ll click the “Add Aggregation” button (even though to my pedantic mind the data is already aggregated to total), and add an X-Axis of date:

The bucket size in the histogram defaults to automatic, and the axis label tells us it’s per three hours. At the volume of tweets we’re analysing, we’d see patterns better at a higher grain such as daily (the penultimate bar to the right of the graph shows a busy day of tweets that’s lost in the graph at 3-hour intervals):

NB at the moment in Kibana 4 intervals are fixed (in Kibana 3 they were freeform).

Let’s dig into the tweets a bit deeper. Adding a “Sub Aggregation” to split the bars based on top two tweet authors per day gives us this:

You can hover over the legend to highlight the relevant bar block too:

Now with a nifty function in the Visualizer we can change the order of this question. So instead of, “by day, who were the top two tweeters”, we can ask “who were the top two tweeters over the time period, and what was their tweet count by day” – all just by rearranging the buckets/aggregation with a single click:

Let’s take another angle on the data, looking not at time but at which blog links were most tweeted, and by whom. Turns out I’m a self-publicist, tweeting four times about my OOW article. Note that I’ve also included some filtering on my data to exclude automated tweets:

Broadening out the tweets to all those from accounts we were capturing during the sample we can see the most active tweeters, and also what proportion are original content vs retweets:

Turning our attention to the blog hits, it’s easy to break it down by top five articles in a period, accesses by day:

Having combined (dare I say, mashed up) post metadata with apache logs, we can overlay information about which author gets the most hits. Unsurprisingly Mark Rittman gets the lion’s share, but interestingly Venkat, who has not blogged for quite a while, is still in the top three authors (based on blog page hits) in the time period analysed:

It’s the current lack of a table visualisation that limits Kibana 4 (although one is planned), because this analysis (for the top three authors, what were their respective two most popular posts) just makes no sense as a graph:

but would be nice and easy to read off a table. You can access a table view of sorts from the arrow at the bottom of the screen, but this feels more like a debug option than an equal method for presenting the data.

Whilst you can access the table on a dashboard, it doesn’t persist as the default option of the view, always showing the graph initially. As noted above, a table visualisation is planned and under development for Kibana 4.

Speaking of dashboards, Kibana 4 has a very nice dashboard builder with interactive resizing of objects both within rows and columns – quite a departure from Kibana 3 which has a rigid system of rows and panels:

Summary

Kibana 3 is great for properly analysing data and trends as you find them in the data, if you don’t mind working your way through the slightly rough interface. In contrast, Kibana 4 has a pretty slick UI but being an early beta is missing features like Term and Table from Kibana 3 that would enable tables of data as well as the pretty graphs. It’ll be great to see how it develops.

Putting the data in Elasticsearch makes it very fast to query. I’m doing this on the Big Data Lite VM, which admittedly is not very representative of a real-world Hadoop cluster, but the relative speeds are interesting – dozens of seconds for any kind of Hive query, sub-second for any kind of Kibana/Elasticsearch query. The latter is of course very appealing from a data exploration point of view, because you not only have the speed but also the visualisations, and the interactions with those visuals to dig and drill further into the data.

Whilst Elasticsearch is extremely fast to query, I’ve not compared it to other options that are designed for speed (eg Impala) and which support a more standard interface, such as ODBC or JDBC so you can bring your own data visualisation tool (eg T-who-shall-not-be-named). In addition, there is the architectural consideration of Elasticsearch’s fit with the rest of the Hadoop stack. Whilst the elasticsearch-hadoop connector is two-way, I’m not sure if you would necessarily site your data in Elasticsearch alone, opting instead to duplicate all or part of it from somewhere like HDFS.

What would be interesting is to look at a similar analysis exercise using the updated Hue Search in CDH 5.2, which uses Apache Solr and is therefore based on the same project as Elasticsearch (Apache Lucene). Another angle on this is Oracle’s forthcoming Big Data Discovery tool, which also looks like it covers a similar purpose.

Categories: BI & Warehousing

Analytics with Kibana and Elasticsearch through Hadoop – part 2 – Getting data into Elasticsearch

Tue, 2014-11-04 15:17
Introduction

In the first part of this series I described how I made several sets of data relating to the Rittman Mead blog from various sources available through Hive. This included blog hits from the Apache webserver log, tweets, and metadata from WordPress. Having got it into Hive, I now need to get it into Elasticsearch as a pre-requisite for using Kibana to see how it holds up as an analysis tool or as a “data discovery” option. Here’s a reminder of the high-level architecture, with the parts that I’ve divided it into across the three blog posts indicated:


In this article we will see how to go about doing that load into ElasticSearch, before getting into some hands-on with Kibana in the final article of this series.

Loading data from Hive to Elasticsearch

We need to get the data into Elasticsearch itself since that is where Kibana requires it to be for generating the visualisations. Elasticsearch holds the data and provides the analytics engine, and Kibana provides the visualisation rendering and the generation of queries into Elasticsearch. Kibana and Elasticsearch are the ‘E’ and ‘K’ of the ELK stack, which I have written about previously (the ‘L’ being Logstash but we’re not using that here).

Using the elasticsearch-hadoop connector we can load data exposed through Hive into Elasticsearch. It’s possible to load data directly from origin into Elasticsearch (using, for example, Logstash) but here we’re wanting to bring together several sets of data using Hadoop/Hive as the common point of integration.

Elasticsearch has a concept of an ‘index’ within which data is stored, held under a schema known as a ‘mapping’. Each index can have multiple mappings. It’s dead easy to run Elasticsearch – simply download it, unpack the archive, and then run it – it really is as easy as that:

[oracle@bigdatalite ~]$ /opt/elasticsearch-1.4.0.Beta1/bin/elasticsearch
[2014-10-30 16:59:39,078][INFO ][node                     ] [Master] version[1.4.0.Beta1], pid[13467], build[1f25669/2014-10-01T14:58:15Z]
[2014-10-30 16:59:39,080][INFO ][node                     ] [Master] initializing ...
[2014-10-30 16:59:39,094][INFO ][plugins                  ] [Master] loaded [], sites [kopf, gui]
[2014-10-30 16:59:43,184][INFO ][node                     ] [Master] initialized
[2014-10-30 16:59:43,184][INFO ][node                     ] [Master] starting ...
[2014-10-30 16:59:43,419][INFO ][transport                ] [Master] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.57.3:9300]}
[2014-10-30 16:59:43,446][INFO ][discovery                ] [Master] elasticsearch/mkQYgr4bSiG-FqEVRkB_iw
[2014-10-30 16:59:46,501][INFO ][cluster.service          ] [Master] new_master [Master][mkQYgr4bSiG-FqEVRkB_iw][bigdatalite.localdomain][inet[/192.168.57.3:9300]], reason: zen-disco-join (elected_as_master)
[2014-10-30 16:59:46,552][INFO ][http                     ] [Master] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.57.3:9200]}
[2014-10-30 16:59:46,552][INFO ][node                     ] [Master] started

You can load data directly across into Elasticsearch from Hive without having to prepare anything on Elasticsearch – it will create the index and mapping for you. But, for it to work how we want, we do need to specify the mapping in advance because we want to tell Elasticsearch two important things:

  • To treat the date field as a date – crucial for Kibana to do its time series-based magic
  • Not to “analyze” certain fields. By default Elasticsearch will analyze each string field so that you can display most common terms within it etc. However if we want to report things like blog title, breaking it down into individual words doesn’t make sense.

This means that the process is as follows:

  1. Define the Elasticsearch table in Hive
  2. Load a small sample of data into Elasticsearch from Hive
  3. Extract the mapping and amend the date field and mark required fields as non-analysed
  4. Load the new mapping definition to Elasticsearch
  5. Do a full load from Hive into Elasticsearch

Steps 2 and 3 can be sidestepped by crafting the mapping by hand from the outset but it’s typically quicker not to.

Before we can do anything in terms of shifting data around, we need to make elasticsearch-hadoop available to Hadoop. Download it from the GitHub site, copy the jar file to /usr/lib/hadoop, and add it to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive-env.sh.
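In practice that setup boils down to something like the following – the jar version number here is illustrative rather than prescriptive, and you’ll need to restart HiveServer2 (or at least your Hive session) for the change to take effect:

# Make the elasticsearch-hadoop connector available to Hive (jar version is an example)
sudo cp elasticsearch-hadoop-2.0.2.jar /usr/lib/hadoop/
echo 'export HIVE_AUX_JARS_PATH=/usr/lib/hadoop/elasticsearch-hadoop-2.0.2.jar' | sudo tee -a /usr/lib/hive/conf/hive-env.sh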

Defining the Hive table over Elasticsearch

The Hive definition for a table stored in Elasticsearch is pretty simple. Here’s a basic example of a table that’s going to hold a list of all blog posts made. Note the _es suffix, a convention I’m using to differentiate the Hive table from others with the same data and denoting that it’s in Elasticsearch (es). Also note the use of EXTERNAL as previously discussed, to stop Hive trashing the underlying data if you drop the Hive table:

CREATE EXTERNAL TABLE all_blog_posts_es (
ts_epoch bigint ,
post_title string ,
post_title_a string ,
post_author string ,
url string ,
post_type string )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
'es.nodes'='bigdatalite.localdomain',
'es.resource'='all_blog/posts'
) ;

The ROW FORMAT and STORED BY are standard, but the TBLPROPERTIES values should be explained (you’ll find full details in the manual):

  1. es.nodes – this is the hostname of the Elasticsearch server. If you have multiple nodes it will discover the others from this.
  2. es.resource – this is the index and mapping where the data should be stored. We’ll see more about these later, because they’re important.
Time for a tangent …

The biggest issue I had getting data from Hive into Elasticsearch was timestamps. To cut a very long story (involving lots of random jiggling, hi Christian!) short, I found it was easiest to convert timestamps into Unix epoch (number of seconds since Jan 1st 1970), rather than prat about with format strings (and prat about I did). For timestamps already matching the ISO8601 standard such as those in my WordPress data, I could leverage the Hive function UNIX_TIMESTAMP which returns exactly that

0: jdbc:hive2://bigdatalite:10000> select post_date, unix_timestamp(post_date) as post_date_epoch from posts limit 1;
post_date        2007-03-07 17:45:07
post_date_epoch  1173289507

For others though that included the month name as text such as Wed, 17 Sep 2014 08:31:20 +0000 I had to write a very kludgy CASE statement to first switch the month names for numbers and then concatenate the whole lot into a ISO8601 that could be converted to unix epoch. This is why I also split the apache log SerDe so that it would bring in the timestamp components (time_dayDD, time_monthMMM, etc) individually, making the epoch conversion a little bit neater:

unix_timestamp(concat(concat(concat(concat(concat(concat(
a.time_yearyyyy,'-')
,case a.time_monthmmm when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
,a.time_daydd,' ')
,a.time_hourhh,':')
,a.time_minmm,':')
,a.time_secss,'')
)

Because if you thought this was bad, check out what I had to do to the twitter timestamp:

unix_timestamp(
    concat(concat(concat(concat(regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-')
    ,case regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} ',''),' .*$','') 
    when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
    ,regexp_replace(regexp_replace(created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))
)

As with a few things here, this was all more for experimentation than streamlined production usage, so it probably could be rewritten more efficiently or solved in a better way – suggestions welcome!

So the nett result of all of these is the timestamp as epoch in seconds – but note that Elasticsearch works with millisecond epoch, so they all need multiplying by 1000.

As I’ve noted above, this feels more complex than it needed to have been, and maybe with a bit more perseverance I could have got it to work without resorting to epoch. The issue I continued to hit with passing timestamps across as non-epoch values (i.e. as strings using the format option of the Elasticsearch mapping definition, or Hive Timestamp, and even specifying es.mapping.timestamp) was org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: TimestampParsingException, regardless of the careful format masks that I applied.

Back on track – loading a sample row into Elasticsearch

We want to send a sample row of data to Elasticsearch now for two reasons:

  1. As a canary to prove the “plumbing” – no point chucking thousands of rows across through MapReduce if it’s going to fall over for a simple problem (I learnt my lesson during the timestamp fiddling above).
  2. Automagically generate the Elasticsearch mapping, which we subsequently need to modify by hand and is easier if it’s been created for us first.

Since the table is defined in Hive, we can just run a straightforward INSERT to send some data across, making use of the LIMIT clause of HiveQL to just send a couple of rows:

INSERT INTO TABLE all_blog_posts_es 
SELECT UNIX_TIMESTAMP(post_date) * 1000 AS post_date_epoch, 
       title, 
       title, 
       author, 
       REGEXP_EXTRACT(generated_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) , 
       post_type 
FROM   posts 
WHERE  post_date IS NOT NULL
LIMIT 2
;

Hive will generate a MapReduce job that pushes the resulting data over to Elasticsearch. You can see the log for the job – essential for troubleshooting – at /var/log/hive/hive-server2.log (by default). In this snippet you can see a successful completion:

2014-10-30 22:35:14,977 INFO  exec.Task (SessionState.java:printInfo(417)) - Starting Job = job_1414451727442_0011, Tracking URL = http://bigdatalite.localdomain:8088/proxy/application_1414451727442_0011/
2014-10-30 22:35:14,977 INFO  exec.Task (SessionState.java:printInfo(417)) - Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414451727442_0011
2014-10-30 22:35:22,244 INFO  exec.Task (SessionState.java:printInfo(417)) - Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 1
2014-10-30 22:35:22,275 WARN  mapreduce.Counters (AbstractCounters.java:getGroup(234)) - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2014-10-30 22:35:22,276 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:22,276 Stage-0 map = 0%,  reduce = 0%
2014-10-30 22:35:30,757 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:30,757 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.51 sec
2014-10-30 22:35:40,098 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:40,098 Stage-0 map = 100%,  reduce = 100%, Cumulative CPU 4.44 sec
2014-10-30 22:35:40,100 INFO  exec.Task (SessionState.java:printInfo(417)) - MapReduce Total cumulative CPU time: 4 seconds 440 msec
2014-10-30 22:35:40,132 INFO  exec.Task (SessionState.java:printInfo(417)) - Ended Job = job_1414451727442_0011
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - MapReduce Jobs Launched:
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.44 sec   HDFS Read: 4313 HDFS Write: 0 SUCCESS
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - Total MapReduce CPU Time Spent: 4 seconds 440 msec
2014-10-30 22:35:40,159 INFO  ql.Driver (SessionState.java:printInfo(417)) - OK

But if you’ve a problem with your setup you’ll most likely see this generic error instead passed back to beeline prompt:

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

Meaning that you need to go to the Hive log file for the full diagnostics.
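For example, using the default log location mentioned above:

# Watch HiveServer2's log for the underlying MapReduce error detail
tail -f /var/log/hive/hive-server2.log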

Amending the Elasticsearch mapping

So assuming the previous step worked (if you got the innocuous No rows affected from beeline then it did) you now have an index and mapping (and a couple of “documents” of data) in Elasticsearch. You can inspect the mapping in several ways, including with the GUI for Elasticsearch admin kopf.

You can also interrogate Elasticsearch directly with its REST API, which is what we’re going to use to update the mapping, so let’s use it to view it too. I’m going to use curl to do the HTTP call, and then pipe it | straight to jq to prettify the resulting JSON that Elasticsearch sends back.

[oracle@bigdatalite ~]$ curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/posts/_mapping' | jq '.'
{
  "all_blog": {
    "mappings": {
      "posts": {
        "properties": {
          "url": {
            "type": "string"
          },
          "ts_epoch": {
            "type": "long"
          },
          "post_type": {
            "type": "string"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string"
          },
          "post_author": {
            "type": "string"
          }
        }
      }
    }
  }
}

We can see from this that Elasticsearch has generated the mapping to match the data that we’ve sent across from Hive (note how it’s picked up the ts_epoch type as being numeric not string, per our Hive table DDL). But, as mentioned previously, there are two things we need to rectify here:

  1. ts_epoch needs to be a date type, not long. Without the correct type, Kibana won’t recognise it as a date field.
  2. Fields that we don’t want broken down for analysis need marking as such. We’ll see the real difference that this makes when we get on to Kibana later.

To amend the mapping we just take the JSON document, make the changes, and then push it back with curl again. You can use any editor with the JSON (I’ve found Atom on the Mac to be great for its syntax highlighting, brace matching, etc). To change the type of the date field just change long to date. To mark a field not for analysis add "index": "not_analyzed" to the column definition. After these changes, the amended fields in my mapping JSON look like this:

[...]
          "url": {
            "type": "string","index": "not_analyzed"
          },
          "ts_epoch": {
            "type": "date"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string","index": "not_analyzed"
          },
          "post_author": {
            "type": "string","index": "not_analyzed"
            [...]

The particularly eagle-eyed of you will notice that I am loading post_title in twice. This is because I want to use the field both as a label but also to analyse it as a field itself, looking at which terms get used most. So in the updated mapping, only post_title is set to not_analyzed; the post_title_a is left alone.

To remove the existing mapping, use this API call:

curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/posts'

and then the amended mapping put back. Note that the "all_blog" / "mappings" outer levels of the JSON have been removed from the JSON that we send back to Elasticsearch:

curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts' -d '
{
      "posts": {
        "properties": {
          "url": {
            "type": "string","index": "not_analyzed"
          },
          "ts_epoch": {
            "type": "date"
          },
          "post_type": {
            "type": "string"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string","index": "not_analyzed"
          },
          "post_author": {
            "type": "string","index": "not_analyzed"
          }
        }
      }
    }
'

Full load into Elasticsearch

Now we can go ahead and run a full INSERT from Hive, and this time the existing mapping will be used. Depending on how much data you’re loading, it might take a while but you can always tail the hive-server2.log file to monitor progress. So that we don’t duplicate the ‘canary’ data that we sent across, use the INSERT OVERWRITE statement:

INSERT OVERWRITE table all_blog_posts_es 
SELECT UNIX_TIMESTAMP(post_date) * 1000 AS post_date_epoch, 
title, 
title, 
author, 
REGEXP_EXTRACT(generated_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) , 
post_type 
FROM   posts 
WHERE  post_date IS NOT NULL
;

To check the data’s made it across we can do a count from Hive:

0: jdbc:hive2://bigdatalite:10000> select count(*) from all_blog_posts_es;
+------+
| _c0  |
+------+
| 2257 |
+------+
1 row selected (27.005 seconds)

But this requires a MapReduce job to run and is fairly slow. Much faster is direct from the horse’s mouth – from Elasticsearch itself where the data is. Just as we called a REST API to get and set the mapping, Elasticsearch can also give us statistics back this way too:

[oracle@bigdatalite ~]$ curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/_stats/docs' | jq '.indices[].total.docs'
{
  "deleted": 0,
  "count": 2257
}

Here I’ve used a bit more jq to parse down the stats in JSON that Elasticsearch sends back. If you want to explore more of what jq can do, you’ll find https://jqplay.org/ useful.

Code

For reference, here is the set of three curl/DDL/DML that I used:

  • Elasticsearch index mappings

    # For reruns, remove and recreate index
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog' && curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog'
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts'
    # Create posts mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts' -d '
    {
    "posts" : {
    "properties": {
    "ts_epoch": {"type": "date"},
    "post_author": {"type": "string", "index" : "not_analyzed"},
    "post_title": {"type": "string", "index" : "not_analyzed"},
    "post_title_a": {"type": "string", "index" : "analyzed"},
    "post_type": {"type": "string", "index" : "not_analyzed"},
    "url": {"type": "string", "index" : "not_analyzed"}
    }}}
    '
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/tweets'
    # Create tweets mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/tweets' -d '
    {"tweets": {
    "properties": {
    "tweet_url": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_type": {
    "type": "string"
    },
    "ts_epoch": {
    "type": "date"
    },
    "tweet_author": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_author_followers": {
    "type": "string"
    },
    "tweet_author_friends": {
    "type": "string"
    },
    "tweet_author_handle": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_analysed": {     "type": "string"      }
    ,"post_author": {       "index": "not_analyzed","type": "string"      }
    ,"post_title": {       "index": "not_analyzed", "type": "string"      }
    ,"post_title_a": {    "type": "string"    }
    
    }
    }
    }'
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/apache'
    # Create apachelog mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/apache' -d '
    
    {"apache": {
    "properties": {
    "user": {
    "type": "string"
    },
    "url": {
    "index": "not_analyzed",
    "type": "string"
    },
    "status": {
    "type": "string"
    },
    "agent": {
    "index": "not_analyzed",
    "type": "string"
    },
    "host": {
    "type": "string"
    },
    "http_call": {
    "type": "string"
    },
    "http_status": {
    "type": "string"
    },
    "identity": {
    "type": "string"
    },
    "referer": {
    "index": "not_analyzed",
    "type": "string"
    },
    "ts_epoch": {
    "type": "date"
    },
    "size": {
    "type": "string"
    },"post_author": {      "index": "not_analyzed","type": "string"      }
    ,"post_title": {       "index": "not_analyzed", "type": "string"      }
    ,"post_title_a": {    "type": "string"    }
    
    }}}'

  • Hive table DDL

    drop table all_blog_posts_es;
    CREATE external TABLE all_blog_posts_es(
    ts_epoch bigint ,
    post_title string ,
    post_title_a string ,
    post_author string ,
    url string ,
    post_type string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/posts')
    ;
    
    drop table all_blog_tweets_es;
    CREATE EXTERNAL TABLE all_blog_tweets_es(
    tweet_type string ,
    tweet_url string ,
    tweet_author string ,
    tweet string ,
    tweet_analysed string ,
    ts_epoch bigint ,
    tweet_author_handle string ,
    tweet_author_followers string ,
    tweet_author_friends string ,url string ,post_author string ,
    post_title string ,post_title_a string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/tweets')
    ;
    
    drop table all_blog_apache_es;
    CREATE EXTERNAL TABLE all_blog_apache_es(
    host string ,
    identity string ,
    user string ,
    ts_epoch bigint ,
    http_call string ,
    url string ,
    http_status string ,
    status string ,
    size string ,
    referer string ,
    agent string, post_author string ,
    post_title string ,post_title_a string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/apache');

  • Hive DML – load data to Elasticsearch

    insert into table all_blog_posts_es
    select unix_timestamp(post_date) * 1000 as post_date_epoch,title,title,author,
    regexp_extract(generated_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)
    ,post_type
    from posts
    where post_date is not null
    ;
    
    insert overwrite table all_blog_tweets_es
    select x.*,p.author,p.title,p.title
    from (
    select 'tweets'
    ,t.url as tweet_url
    ,t.author
    ,t.content as tweet
    ,t.content as tweet_analyzed
    ,unix_timestamp(concat(concat(concat(concat(regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-'),case regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} ',''),' .*$','') when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-'),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))) * 1000 as ts_epoch
    ,t.author_handle
    ,t.author_followers
    ,t.author_friends
    ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
    from tweets t lateral view explode (referenced_urls) refs as ref_url
    where t.author_followers is not null
    and ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*'
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
    
    insert overwrite table all_blog_tweets_es
    select x.*,p.author,p.title,p.title
    from (
    select 'retweets'
    ,t.url as tweet_url
    ,t.author
    ,t.content as tweet
    ,t.content as tweet_analyzed
    ,unix_timestamp(concat(concat(concat(concat(regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-'),case regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} ',''),' .*$','') when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-'),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))) * 1000 as ts_epoch
    ,t.author_handle
    ,t.author_followers
    ,t.author_friends
    ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
    from retweets t lateral view explode (referenced_urls) refs as ref_url
    where t.author_followers is not null
    and ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*'
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
    
    insert into table all_blog_apache_es
    select x.*,p.author,p.title,p.title
    from (
    select
    a.host,a.identity,a.user
    ,unix_timestamp(concat(concat(concat(concat(concat(concat(
    a.time_yearyyyy,'-')
    ,case a.time_monthmmm when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
    ,a.time_daydd,' ')
    ,a.time_hourhh,':')
    ,a.time_minmm,':')
    ,a.time_secss,'')
    ) * 1000 as ts_epoch
    ,a.http_call ,regexp_extract(a.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url,a.http_status ,a.status ,a.size ,a.referer ,a.agent
    from apachelog a
    where a.url regexp "^\\/\\d{4}\\/\\d{2}\\/.*"
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
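
If you end up re-running the whole load repeatedly, the steps above chain together nicely in a small shell script. The sketch below assumes you have saved the three mapping documents as posts.json, tweets.json and apache.json, and the Hive DML as load_es.hql; those filenames are mine for illustration, not anything the tools require:

#!/bin/bash
# Rebuild the all_blog index, reapply the type mappings, then reload from Hive
ES=http://bigdatalite.localdomain:9200

# drop and recreate the index
curl -XDELETE "$ES/all_blog" && curl -XPUT "$ES/all_blog"

# reapply each type mapping from its saved JSON document
for type in posts tweets apache; do
    curl -XPUT "$ES/all_blog/_mapping/$type" -d @"${type}.json"
done

# rerun the Hive INSERTs that load the data across to Elasticsearch
beeline -u jdbc:hive2://bigdatalite:10000 -f load_es.hql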

Summary

With the data loaded into Elasticsearch we’re now ready to start our analysis against it. Stay tuned for the final part in this short blog series to see how we use Kibana to do this.

Categories: BI & Warehousing

Analytics with Kibana and Elasticsearch through Hadoop – part 1 – Introduction

Mon, 2014-11-03 15:21
Introduction

I’ve recently started learning more about the tools and technologies that fall under the loose umbrella term of Big Data, following a lot of the blogs that Mark Rittman has written, including getting Apache log data into Hadoop, and bringing Twitter data into Hadoop via Mongodb.

What I wanted to do was visualise the data I’d brought in, looking for patterns and correlations. Obviously the de facto choice at our shop would be Oracle BI, which Mark previously demonstrated reporting on data in Hadoop through Hive and Impala. But this was more at the “Data Discovery” phase that is discussed in the new Information Management and Big Data Reference Architecture that Rittman Mead helped write with Oracle. I basically wanted a quick and dirty way to start chucking around columns of data without yet being ready to impose the structure of the OBIEE metadata model on it. One of the tools I’ve worked with recently is a visualisation tool called Kibana which is part of the ELK stack (that I wrote about previously for use in building a monitoring solution for OBIEE). In this article we’ll take a look at making data available to Kibana and then the kind of analytics and visualisations you can do with it. In addition, we’ll see how loading the data into Elasticsearch has the benefit of extremely fast query times compared to querying it through Hive alone.

The Data

I’ve got three sources of data I’m going to work with, all related to the Rittman Mead website:

  • Website logs, from Apache webserver
  • Tweets about Rittman Mead blog articles, via Datasift
  • Metadata about blog posts, extracted from the WordPress MySQL database

At the moment I’ve focussed on just getting the data in, so it’s mostly coming from static files, with the exception of the tweets which are held in a noSQL database (MongoDB).

The Tools

This is where ‘big data’ gets fun, because instead of “Acme DI” and “Acme Database” and “Acme BI”, we have the much more interesting – if somewhat silly – naming convention of “the whackier the better”. Here I’m using:

  • Kibana – data visualisation tool for Elasticsearch
  • Elasticsearch – data store & analytics / search engine
  • HDFS – Hadoop’s distributed file system
  • MongoDB – NoSQL database
  • Hive – enables querying data held in various places including HDFS (and Elasticsearch, and MongoDB) with a SQL-like query language
  • Beeline – Hive command line interface
  • Datasift – online service that streams tweets matching a given pattern to a nominated datastore (such as MongoDB)
  • mongo-hadoop – a connector for MongoDB to Hadoop including Hive
  • elasticsearch-hadoop – a connector for Elasticsearch to Hadoop including Hive

Kibana only queries data held in Elasticsearch, which acts as both the data store and the analytics engine. There are various ways to get data into Elasticsearch directly from source but I’ve opted not to do that here, instead bringing it all in via HDFS and Hive. I’ve done that because my – albeit fairly limited – experience is that Elasticsearch is great once you’ve settled on your data and schema, but in the same way that I’m not building a full OBIEE metadata model (RPD) yet, nor did I want to design my Elasticsearch schema up front and have to reload from source if it changed. Options for reprocessing and wrangling data once it is in Elasticsearch seem limited and complex, and by making all my data available through Hive first I could supplement it and mash it up as I wanted, loading it into Elasticsearch only when I had a chunk of data to explore. Another approach that I haven’t tried, but which could be useful if the requirement fits, would be to load the individual data elements directly into their own Elasticsearch area and then use the elasticsearch-hadoop connector to run the required mashups with other data through Hive, loading the results back into Elasticsearch. It all depends on where you’re coming from with the data.

Overview

Here’s a diagram of what I’m building:

I’ll explain it in steps as follows:

  1. Loading the data and making it accessible through Hive
  2. Loading data from Hive to Elasticsearch
  3. Visualising and analysing data in Kibana

Getting the data into Hive

Strictly speaking we’re not getting the data into Hive, so much as making it available through Hive. Hive simply enables you to define and query tables sitting on top of data held in places including HDFS. The beauty of the Hadoop ecosystem is that you can physicalise data in a bunch of tools and the components will most often support interoperability with each other. It’s only when you get started playing with it that you realise how powerful this is.

The Apache log files and WordPress metadata suit themselves fairly well to a traditional RDBMS format of [de]normalised tables, so we can store them in HDFS with simple RDBMS tables defined on top through Hive. But the twitter data comes in JSON format (like this), and if we were going to store the Twitter data in a traditional RDBMS we’d have to work out how to explode the document into a normalised schema, catering for varying structures depending on the type of tweet and data payload within it. At the moment we just want to collect all the data that looks useful, and then look at different ways to analyse it afterwards. Instead of having to compromise one way (force a structure over the variable JSON) or another (not put a relational schema over obviously relational data) we can do both, and decide at run-time how to best use it. From there, we can identify important bits of data and refactor our design as necessary. This “schema on read” approach is one of the real essences of Hadoop and ‘big data’ in general.

So with that said, let’s see how we get the data in. This bit is the easy part of the article to write, because a lot of it is pretty much what Mark Rittman has already written up in his articles, so I’ll refer to those rather than duplicate here.

Apache log data

I’ve used a variation on the standard Apache log SerDe that the interwebs offers, because I’m going to need to work with the timestamp quite closely (we’ll see why later) so I’ve burst it out into individual fields.

The DDL is:

CREATE EXTERNAL TABLE apachelog (
host STRING,    identity STRING,    user STRING,
time_dayDD STRING,  time_monthMMM STRING,   time_yearYYYY STRING,
time_hourHH STRING, time_minmm STRING,  time_secss STRING,  time_tzZ STRING,
http_call STRING,   url STRING, http_status STRING, status STRING,  size STRING,    referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(\\d{2})\\/(\\w{3})\\/(\\d{4}):(\\d{2}):(\\d{2}):(\\d{2}) (.*?)\\] \\\"(\\w*) ([^ ]*?)(?:\\/)? ([^ \\\"]*)\\\" (\\d*) (\\d*) \\\"(.*?)\\\" \\\"(.*?)\\\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s")
STORED AS TEXTFILE LOCATION '/user/oracle/apache_logs';

The EXTERNAL is important on the table definition as it stops Hive moving the HDFS files into its own area on HDFS. If Hive does move the files it is annoying if you want to also access them through another program (or Hive table), and downright destructive if you DROP the table since it’ll delete the HDFS files too – unless it’s EXTERNAL. Note the LOCATION must be an HDFS folder, even if it just holds one file.
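
A quick way to reassure yourself that dropping an EXTERNAL table really has left the data alone is to list the LOCATION folder on HDFS afterwards and check the files are still there:

hadoop fs -ls /user/oracle/apache_logs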

For building and testing the SerDe regex, Rubular is most excellent, but note that it’s Java regex you’re specifying in the SerDe, which has its differences from the Python or Ruby regex that Rubular (and most other online regex testers) support. For the final validation of Java regex I use the slightly ugly but still useful regexplanet, which also gives you the fully escaped version of your regex that you’ll need to use for the actual Hive DDL/DML.

A sample row from the apache log on disk looks like this:

74.208.161.70 - - [12/Oct/2014:03:47:43 +0000] "GET /2014/09/sunday-times-tech-track-100/ HTTP/1.0" 301 247 "-" "-"

and now in Hive:

0: jdbc:hive2://bigdatalite:10000> !outputformat vertical
0: jdbc:hive2://bigdatalite:10000> select * from apachelog limit 1;
host           74.208.161.70
identity       -
user           -
time_daydd     12
time_monthmmm  Oct
time_yearyyyy  2014
time_hourhh    03
time_minmm     47
time_secss     43
time_tzz       +0000
http_call      GET
url            /2014/09/sunday-times-tech-track-100/
http_status    HTTP/1.0
status         301
size           247
referer        -
agent          -

Twitter data

The twitter data we’ve got includes the Hive ARRAY datatype for the collections of hashtag(s) and referenced url(s) from within a tweet. A point to note here is that the author_followers data appears in different locations of the JSON document depending on whether it’s a retweet or not. I ended up with two variations of this table and a UNION on top.

The table is mapped on data held in MongoDB and as with the HDFS data above the EXTERNAL is crucial to ensure you don’t trash your data when you drop your table.

CREATE EXTERNAL TABLE tweets
(
id string,
url string,
author string,
content string,
created_at string,
hashtags ARRAY<string>,
referenced_urls ARRAY<string>,
sentiment STRING,
author_handle string,
author_id string,
author_followers string,
author_friends string
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","url":"interaction.interaction.link","author":"interaction.interaction.author.name","content":"interaction.interaction.content","created_at":"interaction.interaction.created_at","hashtags":"interaction.interaction.hashtags","referenced_urls":"interaction.links.url","sentiment":"interaction.salience.content.sentiment","author_handle":"interaction.interaction.author.username","author_id":"interaction.interaction.author.id","author_followers":"interaction.twitter.user.followers_count","author_friends":"interaction.twitter.user.friends_count"}')
TBLPROPERTIES('mongo.uri'='mongodb://bigdatalite.localdomain:27017/rm_tweets.rm_tweets')
;

The other point to note is that we’re now using mongo-hadoop for Hive to connect to MongoDB. I found that I had to first build the full set of jar files by running ./gradlew jar -PclusterVersion='cdh5', and also download the MongoDB java driver, before copying the whole lot into /usr/lib/hadoop/lib. This is what I had by the end of it:

[oracle@bigdatalite mongo-hadoop-r1.3.0]$ ls -l /usr/lib/hadoop/lib/mongo-*
-rw-r--r--. 1 root root 105446 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-core-1.3.0.jar
-rw-r--r--. 1 root root  21259 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-hive-1.3.0.jar
-rw-r--r--. 1 root root 723219 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-pig-1.3.0.jar
-rw-r--r--. 1 root root    261 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-r1.3.0.jar
-rw-r--r--. 1 root root 697644 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-streaming-1.3.0.jar
-rw-r--r--. 1 root root 591189 Oct 24 00:44 /usr/lib/hadoop/lib/mongo-java-driver-2.12.4.jar
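
For reference, the build-and-copy steps look roughly like this. The clone URL is the standard mongo-hadoop GitHub repository, but the paths to the built jars and to the downloaded driver are assumptions, so check where gradle actually leaves them on your system:

git clone https://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
./gradlew jar -PclusterVersion='cdh5'
# copy the built connector jars, plus the separately downloaded MongoDB java driver, into Hadoop's lib folder
sudo cp $(find . -name 'mongo-hadoop-*.jar') /usr/lib/hadoop/lib/
sudo cp /tmp/mongo-java-driver-2.12.4.jar /usr/lib/hadoop/lib/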

After all that, the data as it appears in Hive looks like this:

id                5441097d591f90cf2c8b45a1
url               https://twitter.com/rmoff/status/523085961681317889
author            Robin Moffatt
content           Blogged: Using #rlwrap with Apache #Hive #beeline for improved readline functionality http://t.co/IoMML2UDxp
created_at        Fri, 17 Oct 2014 12:19:46 +0000
hashtags          ["rlwrap","Hive","beeline"]
referenced_urls   ["http://www.rittmanmead.com/2014/10/using-rlwrap-with-apache-hive-beeline-for-improved-readline-functionality/"]
sentiment         4
author_handle     rmoff
author_id         82564066
author_followers  790
author_friends    375

For reference, without the mongo-hadoop connectors I was getting the error

Error in loading storage handler.com.mongodb.hadoop.hive.MongoStorageHandler

and with them installed but without the MongoDB java driver I got:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON (state=08S01,code=1)
Caused by: java.lang.ClassNotFoundException: com.mongodb.util.JSON

WordPress metadata

WordPress holds its metadata in a MySQL database, so it’s easy to extract out:

  1. Run a query in MySQL to generate the CSV export files, such as:

    SELECT p.ID, p.POST_TITLE,p.POST_DATE_GMT,
           p.POST_TYPE,a.DISPLAY_NAME,p.POST_NAME,
           CONCAT('/', DATE_FORMAT(POST_DATE_GMT, '%Y'), '/', LPAD(
           DATE_FORMAT(POST_DATE_GMT, '%c'), 2, '0'), '/', p.POST_NAME) AS
           generated_url
    FROM   posts p
           INNER JOIN users a
                   ON p.POST_AUTHOR = a.ID
    WHERE  p.POST_TYPE IN ( 'page', 'post' )
           AND p.POST_STATUS = 'publish' 
    into outfile '/tmp/posts.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';

  2. Copy the CSV file to your Hadoop machine, and copy it onto HDFS. Make sure each type of data goes in its own HDFS folder:

    hadoop fs -mkdir posts
    hadoop fs -copyFromLocal /tmp/posts.csv posts

  3. Define the Hive table on top of it:

    CREATE EXTERNAL TABLE posts 
    ( post_id string,title string,post_date string,post_type string,author string,url string ,generated_url string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
    "input.regex" = "^(\\d*),\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\"",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s")
    location '/user/oracle/posts'
    ;

Rinse & repeat for the category data, and post->category relationships.
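
The HDFS side of “rinse & repeat” is just the same pattern with a different folder; for example, for the category export (the CSV filename here is an assumption):

hadoop fs -mkdir categories
hadoop fs -copyFromLocal /tmp/categories.csv categories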

The data once modelled in Hive looks like this:

0: jdbc:hive2://bigdatalite:10000> select * from posts limit 1;
post_id        788
title          Blog
post_date      2007-03-07 17:45:07
post_type      page
author         Mark Rittman
url            blog
generated_url  /2007/03/blog

0: jdbc:hive2://bigdatalite:10000> select * from categories limit 1;
category_id    5
cat2_id        5
category_name  category
category_code  BI (General)
catslug        bi

0: jdbc:hive2://bigdatalite:10000> select * from post_cats limit 5;
post_id      8046
category_id  1

The WordPress metadata quite obviously joins together, as it is already from the relational schema in which it was held on MySQL. Here is an example of where “schema on read” comes into play, because you could look at the above three tables (posts / post_cats / categories) and conclude it was redundant to export all three from WordPress, and that a single query listing posts and their respective categories would be sufficient. But some posts have more than one category, which then leads to a design/requirements decision. Either we retain one row per post – and collapse down the categories, but in doing so lose the ability to easily treat categories as individual data – or we have one row per post/category, and end up with multiple rows per post, which complicates matters if we’re doing a simple count of posts. So we bring it all in raw from source, and then decide how we’re going to use it afterwards.

Bringing the data together

At this point I have six tables in Hive that I can query (albeit slowly) with HiveQL, a close relation to SQL with a few interesting differences, running through the Hive client Beeline. The data is tweets, website visits, and details about the blog posts themselves.

0: jdbc:hive2://bigdatalite:10000> show tables;
+------------------------+
|        tab_name        |
+------------------------+
| apachelog              |
| categories             |
| post_cats              |
| posts                  |
| retweets               |
| tweets                 |
+------------------------+

As well as time, the other common element running throughout all the data is the blog article URL, whether it is a post, a visit to the website, or a tweet about it. But to join on it is not quite as simple as you’d hope, because all the following are examples of recorded instances of the data for the same blog post:

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So whether it’s querying the data within Hive, or loading it joined together to another platform, we need to be able to unify the values of this field.

Tangent: RegEx

And now it’s time, if you’d not already for your SerDe against the Apache file, to really immerse yourself in Regular Expressions (RegEx). Part of the “schema on read” approach is that it can get messy. You need to juggle and wrangle and munge data in ways that it really might not want to, and RegEx is an essential tool with which to do this. Regex isn’t specific to Hadoop – it’s used throughout the computing world.

My journey with regex over quite a few years in computing has gone in stages something like this:

  1. To be a fully rounded geek, I should learn regex. Looks up regex. Hmm, looks complicated….Squirrel!
    1. To be a fully round (geddit?!) geek, I should keep eating these big breakfasts
  2. I’ve got a problem, I’ve got a feeling regex will help me. But my word it looks complicated … I’ll just do it by hand.
  3. I’ve got another problem, I need to find this text in a file but with certain patterns around it. Here’s a regex I found on google. Neat!
  4. Hmmm another text matching problem, maybe I should really learn regex instead of googling it to death each time
  5. Mastered the basic concepts of regex
  6. Still a long way to go…

If you think you’ll nail RegEx overnight, you won’t (or at least, you’re a better geek than me). It’s one of those techniques, maybe a bit like SQL, that to fully grok takes a period of exposure and gradually increasing usage, before you have an “ah hah!” moment. There’s a great site explaining regex here: www.regular-expressions.info. My best advice is to take a real example of text that you want to work with (match on, replace bits of, etc), and experiment with it in one of the online parsers mentioned above, such as Rubular or regexplanet.

Oh and finally, watch out for variations in regex – what works in a Java-based program (most of the Hadoop world) may not in Python, and vice versa. Same goes for PHP, Ruby, and so on – they all have different regex engines that may or may not behave as you’d expect.

Back on track : joining data on non-matching columns

So to recap, we want to be able to analyse our blog data across tweets, site hits and postings, using the common field of the post URL, which from the various sources can look like any of the following (and more):

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So out comes the RegEx. First off, we’ll do the easy one – strip the http:// and server bit. Using the Hive function REGEXP_REPLACE we can use this in the query:

regexp_replace(ref_url,'http:\\/\\/www.rittmanmead.com','')

This means, take the ref_url column and if you find http://www.rittmanmead.com then replace it with nothing, i.e. delete it. The two backslashes before each forward slash simply escape them, since a forward slash on its own has a special meaning in regex. Just to keep you on your toes – Java regex requires double backslash escaping, but all other regex (including the online parser I link to below) uses a single one.

So now our list of possible join candidates has shrunk by one to look like this:

/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

The variation, as you can see, is whether there is a trailing forward slash (/) after the post ‘slug’, and whether there is additional cruft after that too (feed, foobar+foobar, etc). So let’s build it up a piece at a time. On each one, I’ve linked to an online parser that you can use to see it in action.

  1. We’ll match on the year and month (/2014/01/) because they’re fixed pattern, so using \d to match on digits and {x} to match x repetitions: (see example on Rubular.com)

    \/\d{4}\/\d{2}\/

    This will match /2014/01/.

  2. Now we need to match the slug, but we’re going to ditch the forward slash suffix if there is one. This is done with two steps.

    First, we define a “match anything except x” group, which is what the square brackets (group) and the caret ^ (negate) do, and in this case x is the forward slash character, escaped.

    Secondly, the plus symbol + tells regex to match at least one repetition of the preceding group – i.e. any character that is not a forward slash. (example)

    [^\/]+

    Combined with the above regex from the first step we will now match /2014/01/automated-regression-testing-for-obiee.

  3. The final step is to turn the previous REGEXP_REPLACE on its head: instead of replacing content from the string that we don’t want, we’ll extract the content that we do want, using a regex capture group, which is defined by round brackets (parentheses, just like these). We’ve now brought in a couple of extra bits to make it hang together, seen in the completed regex here:

    \S*(\/\d{4}\/\d{2}\/[^\/]+).*$

    1. The \S* at the beginning means match any number of non-whitespace characters, which replaces the separate regex replace we were doing earlier to strip out the http://www.rittmanmead.com
    2. After the capture group, which is the content from steps one and two above, surrounded by parentheses (\/\d{4}\/\d{2}\/[^\/]+), there is a final .* to match anything else that might be present (eg trailing forward slash, feed, foobar, etc)

    Now all we need to do is escape it for Java regex, and stick it in the Hive REGEXP_EXTRACT function, specifying 1 as the capture group number to extract: (example)

    regexp_extract(url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)
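
If you want to sanity-check the pattern outside of Hive first, you can approximate it on the command line. Note this is only an illustration of the idea: sed uses POSIX extended regex rather than Java regex, so the escaping differs from what goes into the Hive function:

echo 'http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/feed' | \
  sed -E 's#.*(/[0-9]{4}/[0-9]{2}/[^/]+).*#\1#'
# returns: /2014/01/automated-regression-testing-for-obiee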

So now all our URLs will look like this, regardless of whether they’re from tweet data, website hits, or wordpress:

/2014/01/automated-regression-testing-for-obiee

Which is nice, because it means we can use it as the common join in our queries. For example, to look up the title of the blog post that someone has tweeted about, and who wrote the post:

SELECT
x.author AS tweet_author, 
x.tweet ,
x.tweet_url, 
x.created_at, 
p.author as post_author, 
p.title as post_title
FROM            ( 
SELECT 'tweets' , 
t.url AS tweet_url , 
t.author , 
t.content AS tweet , 
t.created_at ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
FROM   tweets t 
LATERAL VIEW EXPLODE (referenced_urls) refs as ref_url 
WHERE  t.author_followers IS NOT NULL 
AND    ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*' ) x 
INNER JOIN posts p 
ON regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url ;

[...]
tweet_author  Dain Hansen
tweet         Like a Big Data kid in a Hadoop candy store: Presos on #bigdata for BI, DW, Data Integration http://t.co/06DLnvxINx via @markrittman
tweet_url     https://twitter.com/dainsworld/status/520463199447961600
created_at    Fri, 10 Oct 2014 06:37:51 +0000
post_author   Mark Rittman
post_title    Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations

tweet_author  Robin Moffatt
tweet         Analyzing Twitter Data using Datasift, MongoDB and Pig http://t.co/h67cd4kJo2 via @rittmanmead
tweet_url     https://twitter.com/rmoff/status/524197131276406785
created_at    Mon, 20 Oct 2014 13:55:09 +0000
post_author   Mark Rittman
post_title    Analyzing Twitter Data using Datasift, MongoDB and Pig
[...]

Note here also the use of LATERAL VIEW EXPLODE () as a way of denormalising out the Hive ARRAY of referenced url(s) in the tweet so there is one row returned per value.
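
If you want to see that explode behaviour in isolation, a cut-down query run from the shell does the job; a minimal sketch (the LIMIT is just to keep the output small):

beeline -u jdbc:hive2://bigdatalite:10000 --silent=true \
  -e "SELECT ref_url FROM tweets LATERAL VIEW EXPLODE (referenced_urls) refs AS ref_url LIMIT 5"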

Summary

We’ve got our three sources of data available to us in Hive, and can query across them. Next we’ll take a look at loading the data into Elasticsearch, taking advantage of our conformed url column to join data that we load. Stay tuned!

Categories: BI & Warehousing