Skip navigation.

BI & Warehousing

Simple Hadoop Dataflows using Apache Pig and CDH4.6

Rittman Mead Consulting - Fri, 2014-05-02 00:36

The other day I took some logs from the Apache webserver that runs the Rittman Mead website, and analysed them using Hadoop CDH5, Apache Hive and Impala to get some basic metrics on number of hits per month, where the hits came from and so on. Hive and Impala are great for analysing data sitting on HDFS on a Hadoop cluster, but like SQL compared to PL/SQL or C++, everything you do is declarative and set-based whereas sometimes, you want to build up your dataset using a dataflow-type approach, particularly if you’ve come from a programming vs. a data warehousing background.

If you’ve been looking at Hadoop for a while, you’ll probably therefore know there’s another basic high-level-language approach to querying Hadoop data to accompany Hive, and it’s called “Pig”. Pig, like Hive, is an Apache project and provides an engine for creating and executing data flows, in parallel, on Hadoop. Like Hive, jobs you create in Pig eventually translate into MapReduce jobs (with the advantages and disadvantages that this brings), and has concepts that are similar – but just that little bit different – to relational flows such as filters, joins and sorts.

It’s often called a “procedural” language (as opposed to Hive’s declarative language), but really it’s not – it’s a “data flow language” that has you specifically set out the data flow as the main part of a Pig program, rather than it being a by-product of the if/then/elses and control structures of a procedural language. For people like me that comes from an Oracle data warehousing background, in most cases we’d feel more comfortable using Hive’s set-based transformations to do our data loading and transformation on Hadoop, but in some cases – particularly when you’re querying data interactively, building up a data pipeline and working with nested data sets – it can be more appropriate.

Connecting to the Pig Console, and Pig Execution Options

Iteratively examining and analysing data from webserver log files is a great example of where Pig could be useful, as you naturally hone-down and pivot the data as you’re looking at it, and in-effect you’re looking to create a data pipeline from the raw logs through to whatever summary tables or files you’re looking to create. So let’s go back to the same input log files I used in the previous post on Hive and Impala, and this time bring them into Pig. 

Within CDH (Cloudera Distribution including Hadoop) you can run Pig scripts either interactively from the Pig command-line shell, called “Grunt”, or you can submit them as workflow jobs using the Hue web interface and the Oozie workflow scheduler; the advantage when you’re starting to working with the interactive Grunt shell is that you can run your commands one-by-one and examine the metadata structures that you create along the way, so let’s use that approach first and move onto batch scheduling later on.

I’ll start by SSH’ing into one of the CDH4.6 nodes and starting the Grunt shell:

officeimac:~ markrittman$ ssh root@cdh4-node1
root@cdh4-node1's password: 
Last login: Sat May  3 06:38:18 2014 from 192.168.2.200
[root@cdh4-node1 ~]# pig
2014-05-03 06:44:39,257 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.0-cdh4.6.0 (rexported) compiled Feb 26 2014, 03:01:22
2014-05-03 06:44:39,258 [main] INFO  org.apache.pig.Main - Logging error messages to: /root/pig_1399095879254.log
2014-05-03 06:44:39,301 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-05-03 06:44:39,663 [main] WARN  org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-05-03 06:44:39,663 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cdh4-node1.rittmandev.com:8020
2014-05-03 06:44:40,392 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: cdh4-node1.rittmandev.com:8021
2014-05-03 06:44:40,394 [main] WARN  org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS

Even from within the Grunt shell, there’s two ways I can then run Pig. The default way is to have Grunt run your Pig commands as you’d expect, converting them in the end to MapReduce jobs which then run on your Hadoop cluster. Or, you can run in “local mode”, which again uses MapReduce but only runs on the machine you’re logged in to and only single-threaded, but can often be faster for when you’re just playing around with a local dataset and you want to see results fast (you can turn on local mode by adding an ‘-x local’ flag when starting Grunt). In my example, I’m going to run Grunt in regular MapReduce mode though anyway.

Loading and Parsing the Weblog Files

I then define my first pig relation, analogous to a relational table and technically, a named Pig “bag”, like this:

grunt> raw_logs =LOAD '/user/root/logs/' USING TextLoader AS (line:chararray);

Compared to the Pig table DDL script in the previous article example I posted, we declare the incoming dataset much more programmatically – the first row of the script creates a relation called “raw_logs”, analogous to a table in Hive, and declares it as having a single column (“line:array”) that maps onto a directory of files in HDFS (“/user/root/logs”). You can ask Pig (through the Pig command-line client, which I’m using now) to list-out the structure of this relation using the “describe” command:

grunt> describe raw_logs;
raw_logs: {line: chararray}

In this form the logs aren’t too useful though as each row contains all the data we want, as a single field. To take a look at what we’re working with currently, let’s create another relation that limits down the dataset to just five rows, and use the DUMP command to display the relation’s data on the screen:

grunt> raw_logs_limit_5 = LIMIT raw_logs 5;
grunt> DUMP raw_logs_limit_5
 
2014-05-03 16:15:13,260 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(38.127.201.3 - - [03/Apr/2014:20:56:34 +0000] "GET /wp-content/uploads/2012/01/Partial-Hybrid-Model-e1327470743307.png HTTP/1.1" 200 8432 "http://obiee.nl/?tag=dimensional-modelling&paged=2" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)")
(84.39.213.18 - - [08/Apr/2014:20:13:45 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 - "http://www.rittmanmead.com/2009/05/troubleshooting-obiee-connectivity-and-server-issues/" "Mozilla/5.0 (Windows NT 5.1; rv:26.0) Gecko/20100101 Firefox/26.0")
(137.151.212.38 - - [11/Apr/2014:06:08:10 +0000] "GET /wp-content/plugins/featured-content-gallery/scripts/jd.gallery.js.php HTTP/1.1" 200 6075 "http://www.rittmanmead.com/training/customobiee11gtraining/" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)")
(137.151.212.38 - - [11/Apr/2014:06:08:10 +0000] "GET /wp-content/plugins/featured-content-gallery/scripts/jd.gallery.transitions.js HTTP/1.1" 200 492 "http://www.rittmanmead.com/training/customobiee11gtraining/" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)")
(199.208.239.141 - - [18/Mar/2014:14:11:52 +0000] "GET /wp-content/uploads/2013/08/NewImage14.png HTTP/1.1" 200 217190 "http://www.rittmanmead.com/2013/08/inside-my-home-office-development-lab-vmware-os-x-server/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)")

What I’ve omitted for clarity in the above output is the MapReduce console output – what you’ll see if you run this in MapReduce mode is the process starting up, and then running, to retrieve 5 rows effectively at random from the whole set of log files, process them through the Map > Shuffle > Reduce process and then return them to the Grunt shell.

What would be really good though, of course, is if we could split these single log row columns into multiple ones, one for each part of the log entry. In the Hive example I posted the other day, I did this through a Hive “SerDe” that used a regular expression to split the file, and I can do something similar in Pig; Pig has a function called REGEX_EXTRACT_ALL that takes a regular expression and creates a column for each part of the expression, and so I can use it in conjunction with another relational operator, GENERATE FLATTEN, to take the first set of data, run it through the regular expression and come out with another set of data that’s been split as I want it:

logs_base = FOREACH raw_logs
 GENERATE FLATTEN 
    ( 
        REGEX_EXTRACT_ALL
        ( 
            line,
            '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
        ) 
    ) 
    AS 
    (
        remoteAddr: chararray, remoteLogname: chararray, user: chararray,  
        time: chararray, request: chararray, status: chararray, bytes_string: chararray, 
        referrer: chararray, browser: chararray
    );

grunt> describe logs_base
logs_base: {remoteAddr: chararray,remoteLogname: chararray,user: chararray,time: chararray,request: chararray,status: chararray,bytes_string: chararray,referrer: chararray,browser: chararray}

GENERATE in Pig tells it to create (or “project”( some columns out of an incoming dataset; FLATTEN eliminates any nesting the resulting output (we’ll see more of FLATTEN and nesting in a moment). Notice how the DESCRIBE command afterwards now shows individual columns for the log elements, rather than just one single “line:chararray” column.

Using Pig to Interactively Filter the Dataset

So now we’ve got a more useful set of rows and columns in the Pig relation, and like an Oracle table, unless we do something to order them later, they’re effectively held in random order. Something we can do now is filter the dataset, for example creating another relation containing just those log entries where the request 404’d, and the further filter that dataset to those 404’d requests that were made by users using IE6:

grunt> logs_404 = FILTER logs_base BY status == ‘404';
grunt> logs_404_ie6 = FILTER logs_404 BY browser == 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)';

So how many of our website users are on IE6 and getting page not available errors? To find out, I create another relation that groups the entries up in a single row, and then generates a count of those rows that were aggregated:

grunt> logs_404_ie6_count = FOREACH (GROUP logs_404_ie6 ALL) GENERATE COUNT(logs_404_ie6);
grunt> DUMP logs_404_ie6_count
...
(95)

and I can do a similar thing all of the 404’s:

grunt> logs_404_count = FOREACH (GROUP logs_404 ALL) GENERATE COUNT(logs_404);
grunt> dump logs_404_count
...
(31998)

You can see these Pig scripts running in CDH’s Cloudera Manager web application, with the screenshot below showing one of them at the point where 92% of the Mapper parts have completed, waiting to hand-off to the Reducers; the console output in Grunt will show you the status too, the output of which I removed from the above two statements for clarity.

NewImage

Grouping, Subsetting and Aggregating Data using Pig

How we generate counts and other aggregates is interesting in Pig. Pig has a relational operator called GROUP as we’ve seen before, and when you GROUP a relation by a column, or a group of columns, it creates a new relation that contains two columns; one called “group” that has the same datatype as whatever you grouped on (or a “tuple” made up of multiple columns, if you grouped-on more than one column), and a second column that’s named after whatever you grouped, i.e. the original relation. To take an example, if we grouped the logs_base relation on status code, you’d see the following if you then describe the resulting relation:

grunt> by_status = GROUP logs_base BY status;
grunt> describe by_status
by_status: {group: chararray,logs_base: {(remoteAddr: chararray,remoteLogname: chararray,user: chararray,time: chararray,request: chararray,status: chararray,bytes_string: chararray,referrer: chararray,browser: chararray)}}

What’s interesting though about a pig GROUP, and conceptually different to SQL (and therefore Hive)’s GROUP BY, is that this second column is actually in Pig terms a “bag”, a bag of rows (or “tuples”) that are unaltered compared to the original relation, i.e. they’ve not been aggregated up by the grouping, but are still in their same detail-level. So Pig gives you, apart from its step-by-step data flow method of working with data, this ability to group data whilst still preserving the detail of the individual grouped rows, leaving any summation or other aggregation step to something you do afterwards. So for example, if I wanted to see how many 200s, 404’s and so on my log file dataset contained in total, I then tell Pig to iterate through these bags, project out the columns I’m interested in (in this case, just the status) and also perform aggregation over the grouping buckets specified in the GROUP relational operator:

grunt> by_status_count = FOREACH by_status GENERATE FLATTEN (group) as status, COUNT(logs_base);
 
grunt> dump by_status_count
(302,23680)
(304,512771)
(403,32)
(405,250)
(414,13)
(416,8)
(500,2155)
(,0)
(200,4447087)
(206,8875)
(301,304901)
(400,59)
(404,31998)
(408,9601)
(501,12)
(503,2)

So in that example, we told Pig to list out all of the groupings (i.e. the distinct list of status codes), and then run a count of rows against each of those groupings, giving us the output we’re interested in. We could, however, not aggregate those rows at this point though and instead treat each “bucket” formed by the grouping as a sub-selection, allowing us to, for example, investigate in more detail when and why the 301 errors – “Moved Permanently” – were caused. Let’s use that now to find out what the top 10 requests were that led to HTTP 301 errors, starting by creating another relation that just contains the ‘301’ group:

grunt> by_status_301 = FILTER by_status BY group == '301';
grunt> describe by_status_301

by_status_301: {group: chararray,logs_base: {(remoteAddr: chararray,remoteLogname: chararray,user: chararray,time: chararray,request: chararray,status: chararray,bytes_string: chararray,referrer: chararray,browser: chararray)}}

Looking at the structure of the relation this has created though, you can see that the rows we’ve grouped are all contained within a single tuple called “logs_base”, and to do anything interesting with that data we’ll need to flatten it, which takes that tuple and un-nests it:

grunt> by_status_301_flattened = FOREACH by_status_301
>> GENERATE $0, FLATTEN($1);
 
grunt> describe by_status_301_flattened
by_status_301_flattened: {group: chararray,logs_base::remoteAddr: chararray,logs_base::remoteLogname: chararray,logs_base::user: chararray,logs_base::time: chararray,logs_base::request: chararray,logs_base::status: chararray,logs_base::bytes_string: chararray,logs_base::referrer: chararray,logs_base::browser: chararray}

Notice also how I referenced the two columns in the by_status_301 relation by positional notation ($0 and $1)? This is handy when either you’ve not got a proper schema defined for your data (all part of the “pigs eat anything” approach for Pig, in that it even handles data you don’t yet have a formal schema for), or when it’s just easier to refer to a column by position than work out it’s formal name.

So now we’ve got our list of log entries that have recorded HTTP 301 “permanently moved” error messages, let’s use another relation to project just the columns we want – the date and the requests – and also use some Pig string functions to extract the day, month and year along, and also split the request field up into its constituent method, URI and protocol fields:

grunt> by_status_301_date_and_urls = FOREACH by_status_301_flattened                                                                   
>> GENERATE SUBSTRING(time,3,6) as month,                                                                                          
>> SUBSTRING(time,7,11) as year, FLATTEN(STRSPLIT(request,' ',5)) AS (method:chararray, request_page:chararray, protocol:chararray);
grunt> describe by_status_301_date_and_urls
by_status_301_date_and_urls: {month: chararray,year: chararray,method: chararray,request_page: chararray,protocol: chararray}
 
grunt> by_status_date_and_urls_group = GROUP by_status_301_date_and_urls BY (year,month);                                              

grunt> describe by_status_date_and_urls_group
 
by_status_date_and_urls_group: {group: (year: chararray,month: chararray),by_status_301_date_and_urls: {(month: chararray,year: chararray,method: chararray,request_page: chararray,protocol: chararray)}}
 
grunt> by_status_date_and_urls_group_count = FOREACH by_status_date_and_urls_group
>> GENERATE FLATTEN(group) as (year,month),                                   
 
>> COUNT(by_status_301_date_and_urls);

All of these statements just set-up the data flow, and no actual processing takes place until we choose to dump, or store, the results of the data flow – which again makes Pig great for iteratively building-up a data flow, or in BI terms maybe an ETL flow, before finally pulling the trigger at the end and generating the end result. Let’s do that now using the dump command:

grunt> dump by_status_date_and_urls_group_count     
...
Success!
 
Job Stats (time in seconds):
JobIdMapsReducesMaxMapTimeMinMapTImeAvgMapTimeMedianMapTimeMaxReduceTimeMinReduceTimeAvgReduceTimeMedianReducetimeAliasFeatureOutputs
job_201404280738_022212296889577by_status,by_status_301,by_status_301_date_and_urls,by_status_301_flattened,logs_base,raw_logsGROUP_BY
job_201404280738_02231166665555by_status_date_and_urls_group,by_status_date_and_urls_group_countGROUP_BY,COMBINERhdfs://cdh4-node1.rittmandev.com:8020/tmp/temp-1401083448/tmp827611833,
 
Input(s):
Successfully read 5341613 records (1448814961 bytes) from: "/user/root/logs"
 
Output(s):
Successfully stored 3 records (63 bytes) in: "hdfs://cdh4-node1.rittmandev.com:8020/tmp/temp-1401083448/tmp827611833"
 
Counters:
Total records written : 3
Total bytes written : 63
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 1
Total records proactively spilled: 221601
 
Job DAG:
job_201404280738_0222->job_201404280738_0223,
job_201404280738_0223
 
 
2014-05-04 00:33:50,193 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 169 time(s).
2014-05-04 00:33:50,193 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-05-04 00:33:50,194 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-05-04 00:33:50,199 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-05-04 00:33:50,199 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(2014,Apr,85048)
(2014,Feb,47)
(2014,Mar,219806)

So we had around 85k “page permanently moved” errors in April, only a few in February, and a much larger amount in March 2014. So which web page requests in March 2014 were the biggest cause of this error? Let’s focus on just that month and list out the top ten page requests that hit this error:

grunt> mar_2014_urls = FILTER by_status_301_date_and_urls BY (year == '2014' AND month == 'Mar');
 
grunt> mar_2014_301_url_group = GROUP mar_2014_urls BY request_page;                             
grunt> describe mar_2014_301_url_group
mar_2014_301_url_group: {group: chararray,mar_2014_urls: {(month: chararray,year: chararray,method: chararray,request_page: chararray,protocol: chararray)}}
 
grunt> mar_2014_301_url_group_count = FOREACH mar_2014_301_url_group
>> GENERATE FLATTEN(group) as request,                          
>> COUNT(mar_2014_urls) as num;
 
grunt> mar_2014_301_url_group_count_sorted = ORDER mar_2014_301_url_group_count by num DESC;
grunt> mar_2014_301_url_group_count_limited = LIMIT mar_2014_301_url_group_count_sorted 10;
grunt> dump mar_2014_301_url_group_count_limited
...
(/wp-login.php,124124)
(/,72374)
(/feed,1623)
(/2013/12/creating-a-custom-analytics-dashboard-from-scratch-the-blue-peter-way/feed/index.php,896)
(/biforum2014/feed/index.php,661)
(/training/trn-205-oracle-bi-ee-11g-create-reports-dashboards-alerts-and-scorecards/feed/index.php,659)
(/2009/01/rittman-mead-bi-forum-call-for-papers/feed/index.php,519)
(/biforum2013/feed/index.php,430)
(/2010/05/oracle-bi-ee-10-1-3-4-1-do-we-need-measures-in-a-fact-table/feed/feed/,376)
(/2010/05/31/oracle-bi-ee-10-1-3-4-1-do-we-need-measures-in-a-fact-table/feed/,376)

Joining Datasets in Pig 

So far we’ve worked with just a single set of data – the Apache weblog files that we’ve then filtered, subsetted, parsed, analysed and so forth. But what would be really interesting though, would be if we can bring in some additional, reference or other lookup data to help us make more sense of the log activity on our website. One of the motivators for the people behind Pig, right at the start, was to give Hadoop the ability to join datasets, which up until then was really hard to do with just Java and MapReduce; as we’ll see later on there are still a lot of restrictions on how these joins take place, but Pig gives you the ability to join two or more datasets together, which we’ll do now in another example where we’ll look at the most popular blog posts, and blog authors, over the period covered by our logs.

Let’s start by taking the full set of logs, parsed into the separate elements of the log file entry, and add in additional columns for month and the request elements:

grunt> page_requests = FOREACH logs_base                      
>> GENERATE SUBSTRING(time,3,6) as month,                   
>> FLATTEN(STRSPLIT(request,' ',5)) AS (method:chararray, request_page:chararray, protocol:chararray);
grunt> describe page_requests;
page_requests: {month: chararray,method: chararray,request_page: chararray,protocol: chararray}

One thing you’re taught with Pig is “project early, and often”, so let’s remove the method and protocol columns from that dataset and then filter the remaining page requests to remove those that are blank or aren’t blog post requests:

grunt> page_requests_short = FOREACH page_requests
>> GENERATE $0,$2;
grunt> page_requests_short_filtered = FILTER page_requests_short BY (request_page is not null AND SUBSTRING(request_page,0,3) == '/20');
grunt> page_requests_count = FOREACH (GROUP page_requests_short_filtered ALL) GENERATE COUNT (page_requests_short_filtered);
grunt> dump page_requests_count
...
(310695)

Let’s now reduce that list down to the top ten page requests, the way we did before with pages causing 301 errors:

grunt> page_request_group = GROUP page_requests_short_filtered BY request_page;
grunt> page_request_group_count = FOREACH page_request_group GENERATE $0, COUNT(page_requests_short_filtered) as total_hits;
grunt> page_request_group_count_sorted = ORDER page_request_group_count BY $1 DESC;
grunt> page_request_group_count_limited = LIMIT page_request_group_count_sorted 10;
grunt> dump page_request_group_count_limited
...
(/2014/03/obiee-dashboard-prompt-at-least-one-mandatory/,4610)
(/2012/03/obiee-11g-security-week-connecting-to-active-directory-and-obtaining-group-membership-from-database-tables/,3528)
(/2013/04/upgrading-obiee-to-11-1-1-7/,2963)
(/2014/04/bi-forum-2014-preview-no-silver-bullets-obiee-performance-in-the-real-world/,2605)
(/2014/03/the-secret-life-of-conditional-formatting-in-obiee/,2579)
(/2012/03/obiee-11g-security-week-subject-area-catalog-and-functional-area-security-2/,2410)
(/2014/03/introducing-obi-metrics-agent-an-open-source-obiee-metrics-collector/,2321)
(/2014/03/using-oracle-r-enterprise-to-analyze-large-in-database-datasets/,2309)
(/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/,2240)
(/2012/03/obiee-11-1-1-6-new-dashboard-analysis-and-reporting-features/,2160)

Not bad. What would be even better though, would be if I could retrieve the full names of these posts in WordPress, on which our website runs, and also the author name. I’ve got text file export file of post names, URLs and authors that’s been previously exported from our WordPress install, so let’s declare another relation to hold initially the raw rows from that file, like this:

grunt> raw_posts = LOAD '/user/root/posts/' USING TextLoader AS (line:chararray);

Then split that file by the semicolon that delimits each of the entries (author, post name etc):

grunt> posts_line = FOREACH raw_posts 
>> GENERATE FLATTEN 
>> (
>> STRSPLIT(line,';',10)
>> )
>> AS 
>> (
>> post_id: chararray, title: chararray, post_date: chararray,
>> type: chararray, author: chararray, post_name: chararray,
>> url_generated: chararray
>> );
grunt> describe posts_line
posts_line: {post_id: chararray,title: chararray,post_date: chararray,type: chararray,author: chararray,post_name: chararray,url_generated: chararray}

I’ll now take that relation and project just the columns I’m interested in:

grunt> posts_and_authors = FOREACH posts_line
>> GENERATE title,author,post_name,CONCAT(REPLACE(url_generated,'"',''),'/') AS (url_generated:chararray);
grunt> describe posts_and_authors
posts_and_authors: {title: chararray,author: chararray,post_name: chararray,url_generated: chararray}

Now I’ll do the join, and then take that join and use it to generate a combined list of pages and who wrote them:

grunt> pages_and_authors_join = JOIN posts_and_authors BY url_generated, page_request_group_count_limited BY group;
grunt> pages_and_authors = FOREACH pages_and_authors_join GENERATE url_generated, post_name, author, total_hits;
grunt> top_pages_and_authors = ORDER pages_and_authors BY total_hits DESC;

and then finally, output the joined set of data to a comma-separated file in HDFS:

grunt> STORE top_pages_and_authors into '/user/root/top-pages-and-authors.csv' USING PigStorage(‘,');

Once that’s run, I can use Grunt’s “cat” command to output the contents of the file I just created:

grunt> cat /user/root/top-pages-and-authors.csv                                                      
/2014/03/obiee-dashboard-prompt-at-least-one-mandatory/,"obiee-dashboard-prompt-at-least-one-mandatory","Gianni Ceresa",4610
/2012/03/obiee-11g-security-week-connecting-to-active-directory-and-obtaining-group-membership-from-database-tables/,"obiee-11g-security-week-connecting-to-active-directory-and-obtaining-group-membership-from-database-tables","Mark Rittman",3528
/2013/04/upgrading-obiee-to-11-1-1-7/,"upgrading-obiee-to-11-1-1-7","Robin Moffatt",2963
/2014/04/bi-forum-2014-preview-no-silver-bullets-obiee-performance-in-the-real-world/,"bi-forum-2014-preview-no-silver-bullets-obiee-performance-in-the-real-world","Robin Moffatt",2605
/2014/03/the-secret-life-of-conditional-formatting-in-obiee/,"the-secret-life-of-conditional-formatting-in-obiee","Gianni Ceresa",2579
/2012/03/obiee-11g-security-week-subject-area-catalog-and-functional-area-security-2/,"obiee-11g-security-week-subject-area-catalog-and-functional-area-security-2","Mark Rittman",2410
/2014/03/introducing-obi-metrics-agent-an-open-source-obiee-metrics-collector/,"introducing-obi-metrics-agent-an-open-source-obiee-metrics-collector","Robin Moffatt",2321
/2014/03/using-oracle-r-enterprise-to-analyze-large-in-database-datasets/,"using-oracle-r-enterprise-to-analyze-large-in-database-datasets","Mark Rittman",2309
/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/,"using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm","Mark Rittman",2240
/2012/03/obiee-11-1-1-6-new-dashboard-analysis-and-reporting-features/,"obiee-11-1-1-6-new-dashboard-analysis-and-reporting-features","Mark Rittman",2160

But What About More Complex Joins and Transformations … Enter, Pig Extensibility

This is of course great, but going back to my previous Hive example I also managed to geo-code the log file entries, converting the IP addresses into country names via a lookup to a geocoding database. What made that example “interesting” though was the need to join the Hive table of log posts to the geocode table via a BETWEEN, or > and < than operators, so that I could locate each IP address within the ranges given by the geocoding database – and the reason it got interesting was that Hive can only do equi-joins, not non-equijoins or joins involving greater than, BETWEEN and so on. Impala *could* do it, and on a small set of input rows – five in my example – it worked fine. Try and scale the Impala query up to the full dataset though, and the query fails, because it runs out of memory; and that’s potentially the issue with Impala, and set-based queries, as Impala does everything in-memory, and most Hadoop systems are designed for fast I/O, not lots of memory. 

So can Pig help here? Well, it’s actually got the same limitation – non-equijoins are actually quite difficult to do in Hadoop because of the way MapReduce works, but where Pig could perhaps help is through its extensibility – you can stream Pig data, for example IP addresses, through Perl and Python scripts to return the relevant country, or you can write Pig UDFs – User-Defined Functions – to return the information we need in a similar way to how PL/SQL functions in Oracle let you call-out to arbitrary functions to return the results of a more complex look-up. But this is also where things get a bit more complicated, so we’ll save this to the next post in this series, where I’ll also be joined by my colleague Nelio who’s spent the best part of this last week VPNd into my VMWare-based Hadoop cluster getting this last example working.

Categories: BI & Warehousing

New ODI12c Article, and Details of our Inaugural ODI12c Course in Brighton, May 12th-14th 2014

Rittman Mead Consulting - Thu, 2014-05-01 09:11

NewImage

Oracle have just published the May/June 2014 edition of Oracle Magazine, and my business analytics column this time round is on the new 12c release of Oracle Data Integrator.

In “Go with the Flow” I look at how this new editor supports OWB-style multi-step mappings, and how new features like “deployment specifications” allow you to choose different load strategies depending on whether you’re doing a full, or an incremental load. On the same topic, you might also want to take a look at my colleague Stewart Bryson’s recent article on the Oracle Technology Network, “Making the Move from Oracle Warehouse Builder to Oracle Data Integrator 12c”, where he takes an in-depth look at what’s involved in migrating from, and interoperating with, Oracle Warehouse Builder, and what’s in-store for OWB developers when they upgrade to the new 12c release of ODI.

This is actually excellent timing, as we’re just about to launch our new ODI12c training, with our initial course being a three-day ODI12c bootcamp that’s running for the first time in Brighton, UK, from May 12th -14th 2014. Based on this latest release of Oracle Data Integrator, this three-day course assumes no prior knowledge and takes you through everything you need to know to get started with ODI12c.

From setting up the topology through to creating mapping, packages and load plans, this course features modules and labs covering many aspects of ODI 12c functionality. As with all of our courses, we bring our trainer to you and teach all of your team, together, how to make the most of Oracle’s premier data integration tool, with one of our experienced consultants leading the sessions and sharing their project experience.

This first run of the course will be taught by the course author, Oracle ACE Edel Kammermann, accompanied by Jerome Francoisse, our lead beta-tester for ODI12c and speaker at events such as Oracle Openworld and RMOUG training days. If you’ve been looking to get trained-up on the new 12c release of Oracle Data Integrator, this is an excellent opportunity to learn the basics in just three days, down in sunny Brighton in May! Course details are as follows:

  • Duration : 3 Days
  • Course Delivery : Instructor-led with labs, on-site at customer location
  • Who Should Attend : Developers, consultants, project managers, technical leads, DBAs
  • Prerequisites : None

Detailed Course Agenda :

  • Getting Started with ODI 12c
  • ODI Topology
  • ODI Projects
  • Models and Datastores
  • Data Quality in a Model
  • Introduction to ODI Mappings
  • ODI Procedures, Variables, Sequences, and User Functions
  • Advanced ODI Mappings
  • ODI Packages
  • Scenarios in ODI
  • The ODI Debugger
  • ODI Load Plans

To book a place on the course, running in Brighton, UK on May 12th-14th 2014, just click on this link - we’ll be running the course in the US shortly afterwards. Finally, if you’ve got any questions about this course or any of our other OBIEE, ODI, Oracle BI Apps or OWB courses, just drop us a line at training@rittmanmead.com.

Categories: BI & Warehousing

Previewing TimesTen, Endeca and Oracle DW Sessions at the Brighton BI Forum 2014

Rittman Mead Consulting - Thu, 2014-05-01 02:00

It’s under a week now to the first of the two Rittman Mead BI Forum 2014 events, with Brighton running next week at the Hotel Seattle and then Atlanta the week after, at the Renaissance Atlanta Midtown Hotel. Earlier in the week I went through a more detailed agenda for the Lars George Cloudera Hadoop Masterclass, and the week before Stewart covered-off some of the Oracle sessions at the Atlanta event, but as a final preview of this series I just wanted to talk about three session running at next week’s Brighton event.

NewImage

Someone I’ve got to know pretty well over the last year is Oracle’s Chris Jenkins, who’s the face of TimesTen development in the UK. I first ran into Chris, and his colleague Susan Cheung, late last year when I posted a series of articles on TimesTen vs. Essbase ASO, and then Chris presented alongside myself and Peak Indicators’ Tony Heljula at last year’s Oracle Openworld, on TimesTen on Exalytics Best Practices. Chris kindly agreed to come along to the Brighton BI Forum and share some of his tips and techniques on TimesTen development, and also answer some of the questions from members of the audience implementing TimesTen as part of their OBIEE setup. Over to Chris:

“Since the launch of Exalytics TimesTen has been at it’s heart delivering high performance access to relational data to support the ‘speed of thought’ experience. But it hasn’t all been plain sailing; each use case has its own specific challenges and correct configuration, adopting best operational practice and properly tuning the TimesTen database to support the workload are essential to getting the best results. When working with customers I often come across situations where things are not setup quite as well as they might be or where a less than optimal approach has been adopted, and this can negatively affect performance or manageability.

In my session I will highlight the most common pitfalls and show how to avoid them. I will also discuss best practices for operation and data loading and look at how to optimise the TimesTen database for your workload. And of course there is the opportunity to ask questions! By the end of the session I hope that you will have a good understanding of how to get the best out of TimesTen for your particular use case.”

NewImage

Another speaker speaking for the first time at the BI Forum is Truls Bergersen, but Truls will of course be well-known to the European user group community through his work with the Norwegian Oracle User Group, who run the excellent OUGN conference cruise each year around April. Truls has been working with Oracle’s BI and data warehousing tools for many years, but more recently has been taking a close look at Endeca Information Discovery, the search and data discovery tool Oracle added to their BI portfolio a couple of years ago. According to Truls …

“It’s been almost two and a half years now, since Oracle acquired Endeca,and in that period the tool has been given a few enhancements. E.g.improvements have been done to the look-and-feel of the UI, it has beenadded support for loading JSON and OBI presentation tables, and the toolcan now be installed on Weblogic. My two favorite things, however, are theself service provisioning and eBS extensions.

The goal of my presentation is to give the audience a good overview of thetool from a data architect’s point of view, and how the tool fits in withand extends your existing BI platform. I will not go into details aboutinstallation and other too technical aspects, but rather look at thetool’s capabilities from a data point of view – how can Endeca utilizeOBIEE and visa versa, what can be done in terms of self-service, etc.”

NewImage

Finally, we’re really pleased to be joined by none other than Reiner Zimmerman, who heads-up Oracle’s Data Warehouse Global Leaders’ Program. Rittman Mead are one of the European partner sponsors of the DW Global Leaders Forum, which brings together the top customers and partners working with Oracle’s data warehousing, engineered systems and big data products several times a year in EMEA, the Americas and APAC.  Reiner’s also the person most likely to take the “last man standing” award from our own Borkur and Ragnar, so before that happens, over to Reiner:

“The DW & Big Data Global Leaders program is an Oracle development driven strategic customer program to establish a platform for Oracle DW and Big Data customers to discuss best practices and experience with Oracle Product Management and Product Development and our Associate Partners like Rittman Mead.

Our current focus is Big Data and Advanced Analytics and we seek to create best practices around Big Data architectures in terms of Manageability, High Availability and Monitoring Big Data Systems. Learn what the program is, what is can bring to you, how you can participate and what other customers are doing.”

The Rittman Mead Brighton BI Forum 2014 runs next week (May 7th-9th 2014) at the Hotel Seattle, Brighton, and there’s still a few places left if you register now. Straight-after, we’re going over to Atlanta to do it all again at the Renaissance Midtown Atlanta Hotel, with full details of the event agendas here, and the event homepage including registration instructions, here. Hopefully see some of you in Brighton or Atlanta later in May!

Categories: BI & Warehousing

Extended Visualisation of OBIEE Performance Data with Grafana

Rittman Mead Consulting - Tue, 2014-04-29 12:00

Recently I wrote about the new obi-metrics-agent tool and how it enables easy collection of DMS data from OBIEE into whisper, the time-series based database behind graphite. In this post I’m going to show two things that take this idea further:

  1. How easy it is to add other data into Graphite
  2. How to install and use Grafana, a most excellent replacement for the graphite front-end.
Collecting data in Graphite

One of the questions I have been asked about using Graphite for collecting and rendering OBIEE DMS metrics is a very valid one : given that OBIEE is a data visualisation tool, and that it usually sits alongside a database, where is the value in introducing another tool that apparently duplicates both data storage and visualisation.

My answer is that it is horses for courses. Graphite has a fairly narrow use-case but what it does it does superbly. It lets you throw any data values at it (as we’re about to see) over time, and rapidly graph these out alongside any other metric in the same time frame.

You could do this with OBIEE and a traditional RDBMS, but you’d need to design the database table, write a load script, handle duplicates, handle date-time arithmetic, build and RPD, build graphs – and even then, you wouldn’t have some of the advanced flexibility that I am going to demonstrate with Grafana below.

Storing nqquery.log response times in Graphite

As part of my Rittman Mead BI Forum presentation “No Silver Bullets – OBIEE Performance in the Real World”, I have been doing a lot of work examining some of the internal metrics that OBIEE exposes through DMS and how these correlate with the timings that are recorded in the BI Server log, nqquery.log, for example:

[2014-04-21T22:36:36.000+01:00] [OracleBIServerComponent] [TRACE:2] [USER-33] [] [ecid: 11d1def534ea1be0:6faf73dc:14586304e07:-8000-00000000000006ca,0:1:9:6:102] [tid: e4c53700] [requestid: c44b002c] [sessionid: c44b0000] [username: weblogic] -------------------- 
Logical Query Summary Stats: Elapsed time 5, Response time 2, Compilation time 0 (seconds) [[
]]

Now, flicking back and forth between the query log is tedious with a single-user system, and as soon as you have multiple reports running it is pretty much impossible to track the timings from the log with data points in DMS. The astute of you at this point will be wondering about Usage Tracking data, but for reasons that you can find out if you attend the Rittman Mead BI Forum I am deliberately using nqquery.log instead.

Getting data in to Graphite is ridiculously easy. Simply chuck a metric name, value, and timestamp, at the Graphite data collector Carbon, and that’s it. You can use whatever method you want for sending it, here I am just using the Linux commandline tool NetCat (nc):

echo "example.foo.bar 3 `date +%s`"|nc localhost 2003

This will log the value of 3 for a metric example.foo.bar for the current timestamp (date +%s). Timestamps are in Unix Time, which is the number of seconds since 1st Jan 1970. You can specify historical values for your metric too:

echo "foo.bar 3 1386806400"|nc localhost 2003

Looking in Graphite we can see the handle of test values I just sent through appear:

Tip: if you don’t see your data coming through, check out the logs in ~/graphite/storage/log/carbon-cache/carbon-cache-a/ (assuming you have Graphite installed in ~/graphite)

So, we know what data we want (nqquery.log timings), and how to get data into Graphite (send the data value to Carbon via nc). How do we bring the two together? We do this in the same way that many Linux things work, and that it using pipes to join different commands together, each doing one thing and one thing well. The above example demonstrates this – the output from echo is redirected to nc.

To extract the data I want from nqquery.log I am using grep to isolate the lines of data that I want, and then gawk to parse the relevant data value out of each line. The output from gawk is then piped to nc just like above. The resulting command looks pretty grim, but is mostly a result of the timestamp conversion into Unix time:

grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$23);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.elapsed",$23,e}'|nc localhost 2003

An example of the output of the above is:

nqquery.logical.response 29 1395766983
nqquery.logical.response 22 1395766983
nqquery.logical.response 22 1395766983
nqquery.logical.response 24 1395766984
nqquery.logical.response 86 1395767047
nqquery.logical.response 10 1395767233
nqquery.logical.response 9 1395767233

which we can then send straight to Carbon.

I’ve created additional versions for other available metrics, which in total gives us:

# This will parse nqquery.log and send the following metrics to Graphite/Carbon, running on localhost port 2003
#       nqquery.logical.compilation
#       nqquery.logical.elapsed
#       nqquery.logical.response
#       nqquery.logical.rows_returned_to_client
#       nqquery.physical.bytes
#       nqquery.physical.response
#       nqquery.physical.rows
# NB it parses the whole file each time and sends all values to carbon. 
# Carbon will ignore duplicates, but if you're working with high volumes 
# it would be prudent to ensure the nqquery.log file is rotated 
# appropriately.

grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$23);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.elapsed",$23,e}'|nc localhost 2003
grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$26);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.response",$26,e}'|nc localhost 2003
grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.compilation",$29,e}'|nc localhost 2003
grep "Physical query response time" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.response",$(NF-4),e}'|nc localhost 2003
grep "Rows returned to Client" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.rows_returned_to_client",$(NF-1),e}'|nc localhost 2003
grep "retrieved from database" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$(NF-9));split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.rows",$(NF-9),e}'|nc localhost 2003
grep "retrieved from database" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.bytes",$(NF-7),e}'|nc localhost 2003

Now I run this script, it scrapes the data out of nqquery.log and sends it to Carbon, from where I can render it in Graphite:

or even better, Grafana:

Grafana

Grafana is an replacement for the default Graphite front-end, written by Torkel Ödegaard and available through the very active github repository.

It’s a great way to very rapidly develop and explore dashbaords of data sourced from Graphite. It’s easy to install too. Using SampleApp as an example, setup per the obi-metrics-agent example, do the following:

# Create a folder for Grafana
mkdir /home/oracle/grafana
cd /home/oracle/grafana
# Download the zip from http://grafana.org/download/
wget http://grafanarel.s3.amazonaws.com/grafana-1.5.3.zip
# Unzip it and rearrange the files
unzip grafana-1.5.3.zip
mv grafana-1.5.3/* .
# Create & update the config file
cp config.sample.js config.js
sed -i -e 's/8080/80/g' config.js
# Add grafana to apache config
sudo sed -i'.bak' -e '/Alias \/content/i Alias \/grafana \/home\/oracle\/grafana' /etc/httpd/conf.d/graphite-vhost.conf 
sudo service httpd restart 

# Download ElasticSearch from http://www.elasticsearch.org/overview/elkdownloads/
cd /home/oracle/grafana
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.1.zip
unzip elasticsearch-1.1.1.zip
# Run elasticsearch
nohup /home/oracle/grafana/elasticsearch-1.1.1/bin/elasticsearch &
# NB if you get an out of memory error, it could be a problem with the JDK available. Try installing java-1.6.0-openjdk.x86_64 and adding it to the path.

At this point you should be able to go to the URL on your sampleapp machine http://localhost/grafana/ and see the Grafana homepage.

One of the reasons I like working with Grafana so much is how easy it is to create very smart, interactive dashboards. Here’s a simple walkthrough.

  1. Click on the Open icon and then New to create a new dashboard
  2. On the new dashboard, click Add a panel to this row, set the Panel Type to Graphite, click on Add Panel and then Close.

  3. Click on the title of the new graph and select Edit from the menu that appears. In the edit screen click on Add query and from the select metric dropdown list define the metric that you want to display


    From here you can add additional metrics to the graph, or add graphite functions to the existing metric. I described the use of functions in my previous post about OBIEE and Graphite
  4. Click on Back to Dashboard at the top of the screen, to see your new graph in place. You can add rows to the dashboard, resize graphs, and add new ones. One of the really nice things you can do with Grafana is drag to zoom a graph, updating the time window shown for the whole page:


    You can set dashboards to autorefresh too, from the time menu at the top of the screen, from where you can also select pre-defined windows.
  5. When it comes to interacting with the data being able to click on a legend entry to temporarily ‘mute’ that metric is really handy.

This really is just scratching the surface of what Grafana can do. You can see more at the website, and a smart YouTube video.

Summary

Here I’ve shown how we can easily put additional, arbitrary data into Graphite’s datastore, called Whisper. In this instance it was nqquery.log data that I wanted to correlate with OBIEE’s DMS data, but I’ve also used it very successfully in the past to overlay the number of executing JMeter load test users with other data in graphite.

I’ve also demonstrated Grafana, a tool I strongly recommend if you do any work with Graphite. As a front-end it is an excellent replacement for the default Graphite web front end, and it’s very easy to use too.

Categories: BI & Warehousing

Data Warehouse for Big Data: Scale-Up vs. Scale-Out

Dylan Wan - Thu, 2014-01-02 15:33

Found a very good paper: http://research.microsoft.com/pubs/204499/a20-appuswamy.pdf


This paper discuss if it is a right approach of using Hadoop as the analytics infrastructure.


It is hard to argue with the industry trend.  However, Hadoop is not
new any more.  It is time for people to calm down and rethink about the
real benefits.

Categories: BI & Warehousing