
Rittman Mead Consulting

Delivering Oracle Business Intelligence

Rittman Mead Founder Member of the Red Expert Alliance

Thu, 2014-11-20 05:24

At Rittman Mead we’re always keen to collaborate with our peers and partners in the Oracle industry, running events such as the BI Forum in Brighton and Atlanta each year and speaking at user groups in the UK, Europe, the USA and around the world. We were therefore pleased to be contacted by our friends over at Amis in the Netherlands just before this year’s OpenWorld, with an idea they had about forming an expert-services Oracle partner network and to see if we’d be interested in getting involved.

Fast-forward a few weeks and, thanks to the organising efforts of Amis’ Robbrecht van Amerongen, we had our first meeting at San Francisco’s Thirsty Bear, where Amis and Rittman Mead were joined by the inaugural member partners Avio Consulting, E-Vita AS, Link Consulting, Opitz Consulting and Rubicon Red.

Gathering of the expert alliance partners at #oow14 @Oracle open world. pic.twitter.com/PNK8fpt7iU

— Red Expert Alliance (@rexpertalliance) November 3, 2014

The Red Expert Alliance now has a website, news page and partner listing, and as the “about” page details, it’s all about collaboration between first-class Oracle consulting companies:


“The Red Expert Alliance is an international network of first-class Oracle consulting companies. We are working together to deliver the maximum return on our customers’ investment in Oracle technology. We do this by collaborating, sharing and challenging each other to improve ourselves and our customers.

Collaborating with other companies is a powerful way to overcome challenges of today’s fast-paced world and improve competitive advantage. Collaboration provides participants mutual benefits such as shared resources, shared expertise and enhanced creativity; it gives companies an opportunity to improve their performance and operations, achieving more flexibility thanks to shared expertise and higher capacity. Collaboration also fuels innovation by providing more diversity to the workplace which can result in better-suited solutions for customers.”

There’s also a press release, and a partner directory listing our details along with those of the other partners in the alliance. We’re looking forward to working with the other partners in the Red Expert Alliance, increasing our shared knowledge and collaborating on Oracle customer projects in the future.

Categories: BI & Warehousing

OBIEE SampleApp v406 Amazon EC2 AMI – available for public use

Thu, 2014-11-13 08:34

I wrote a while ago about converting Oracle’s superb OBIEE SampleApp from a VirtualBox image into an EC2-hosted instance. I’m pleased to announce that Oracle have agreed for us to make the image (AMI) on Amazon available publicly. This means that anyone who wants to run their own SampleApp v406 server on Amazon’s EC2 cloud service can do so.


Important caveats

Before getting to the juicy stuff there are some important points to note about access to the AMI, which you are implicitly bound by if you use it:

  1. In accessing it you’re bound by the same terms and conditions that govern the original SampleApp
  2. SampleApp is only ever for use in your own development/testing/prototyping/demonstrating with OBIEE. It must not be used as the basis for any kind of Productionisation.
  3. Neither Oracle nor Rittman Mead provide any support for SampleApp or the AMI, nor any warranty against issues caused through their use.
  4. Once launched, the server will be accessible to the public, and it’s your responsibility to secure it as such.
How does it work?
  1. Create yourself an AWS account, if you haven’t already. You’ll need your credit card for this. Read more about getting started with AWS here.
  2. Request access to the AMI (below)
  3. Launch the AMI on your AWS account
  4. Everything starts up automagically. After 15-20 minutes, enjoy your fully functioning SampleApp v406 instance, running in the cloud!
How much does it cost?

You can get an estimate of the cost involved using the Amazon Calculator.

As a rough guide, as of November 2014 an “m3.large” instance costs around $4 a day  – but it’s your responsibility to check pricing and commitments.

Be aware that once a server is created you’ll incur costs on it right through until you “terminate” it. You can “stop” it (in effect, power it off) which reduces the running costs but you’ll still pay for the ‘disk’ (EBS volume) that holds it. The benefit of this though is that you can then power it back up and it’ll be as you left it (just with a different IP).
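
If you’d rather script the stop/start cycle than click through the AWS console, the AWS CLI can do it. This is just a rough sketch, assuming you have the CLI installed and configured for your account; the instance ID shown is a placeholder for your own:

# Stop the instance; you keep paying for the EBS volume but not the compute
aws ec2 stop-instances --instance-ids i-12345678

# Start it back up later; note it will come back with a different public IP
aws ec2 start-instances --instance-ids i-12345678

# Check its state and current public IP
aws ec2 describe-instances --instance-ids i-12345678 --query 'Reservations[].Instances[].[State.Name,PublicIpAddress]' --output text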

You can track your AWS usage through the AWS page here.

Security
  • Access to the instance’s command line is through SSH as the oracle user using SSH keys only (provided by you when you launch the server) – no password access
    • You cannot ssh to the server as root; instead connect as oracle and use sudo as required.
    • The ssh key does not get set up until the very end of the first boot sequence, which can be 20 minutes. Be patient!
  • All the OBIEE/WebLogic usernames and passwords are per the stock SampleApp v406 image, so you are well advised to change them. Otherwise if someone finds your instance running, they’ll be able to access it
  • There is no firewall (iptables) running on the server. Since this is a public server you’d be wise to make use of Amazon’s Security Group functionality (in effect, a firewall at the virtual hardware level) to block access on all ports except those necessary.
    For example, you could block all traffic except 7780, and then enable access on port 22 (SSH) and 7001 (Admin Server) just when you need them for admin; a rough AWS CLI sketch of this follows below.
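
Here’s a rough sketch of setting that up with the AWS CLI rather than the console; the group name is arbitrary and the CIDR ranges are purely illustrative (203.0.113.4/32 standing in for your own IP address), so adjust to suit:

# Create a dedicated security group for the SampleApp instance
aws ec2 create-security-group --group-name sampleapp-sg --description "OBIEE SampleApp v406"

# Open analytics (7780) to the world
aws ec2 authorize-security-group-ingress --group-name sampleapp-sg --protocol tcp --port 7780 --cidr 0.0.0.0/0

# Restrict SSH (22) and Admin Server (7001) to your own IP only
aws ec2 authorize-security-group-ingress --group-name sampleapp-sg --protocol tcp --port 22 --cidr 203.0.113.4/32
aws ec2 authorize-security-group-ingress --group-name sampleapp-sg --protocol tcp --port 7001 --cidr 203.0.113.4/32
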
Using the AMI
  1. You first need to get access to the AMI, through the form below. You also need an active AWS account.
  2. Launch the server:
    1. From the AWS AMI page locate the SampleApp AMI using the details provided when you request access through the form below. Make sure you are on the Ireland/eu-west-1 region. Click Launch.
    2. Select an Instance Type. An “m3.large” size is a good starting point (this site is useful to see the spec of all instances).
    3. Click through the Configure Instance Details, Add Storage, and Tag Instance screens without making changes unless you need to.
    4. On the Security Group page select either a dedicated security group if you have already configured one, or create a new one.
      A security group is a firewall that controls traffic to the server regardless of any software firewall configured or not on the instance. By default only port 22 (SSH) is open, so you’ll need to open at least 7780 for analytics, and 7001 too if you want to access WLS/EM as well
      Note that you can amend a security group’s rules once the instance is created, but you cannot change which security group it is bound to. For ad-hoc purposes I’d always use a dedicated security group per instance so that you can change rules just for your server without impacting others on your account.
    5. Click on Review and Launch, check what you’ve specified, and then click Launch. You’ll now need to either specify an existing SSH key pair, or generate a new one. It’s vital that you get this bit right, otherwise you’ll not be able to access the server. If you generate a new key pair, make sure you download it (it’ll be a .pem file).
    6. Click Launch Instances
      You’ll get a hyperlinked Instance ID; click on that and it’ll take you to the Instances page filtered for your new server.
      Shortly you’ll see the server’s public IP address shown.
  3. OBIEE is configured to start automagically at boot time along with the database. This means that in theory you don’t need to actually access the server directly. It does take 15-20 minutes on first boot to all fire up though, so be patient (a quick curl check you can run from your own machine is sketched after this list).
  4. The managed server is listening on port 7780, and admin server on 7001. If your server IP is 42.42.42.42 the URLs would be:
    • Analytics: http://42.42.42.42:7780/analytics
    • WLS: http://42.42.42.42:7001/console
    • EM: http://42.42.42.42:7001/em
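
Since that first boot takes a while, one simple way to check readiness from your own machine is to poll the analytics URL until Presentation Services starts responding. A rough sketch, substituting your instance’s public IP:

# Keep polling until the login page answers (HTTP 200, or a 302 redirect to it)
until curl -s -o /dev/null -w '%{http_code}' http://42.42.42.42:7780/analytics | grep -qE '200|302'; do
  echo "Not up yet, waiting..."
  sleep 60
done
echo "OBIEE is up"
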
On the server

The server is a stock SampleApp v406 image, with a few extras:

  • obiee and dbora services configured and set to run at bootup. Control obiee using:
    sudo service obiee status
    sudo service obiee stop
    sudo service obiee start
    sudo service obiee restart
  • screen installed with a .screenrc setup
Accessing the AMI

To get access to the AMI, please complete this short form and we will send you the AMI details by email.

By completing the form and requesting access to the AMI, you are acknowledging that you have read and understood the terms and conditions set out by Oracle here.

Categories: BI & Warehousing

Rittman Mead announces its training catalogue in Spanish and Portuguese

Wed, 2014-11-12 07:55


We are pleased to announce that Rittman Mead now offers its full training catalogue in Spanish and Portuguese. This is great news for those based in Latin America, Spain and Portugal, who can now receive first-class training, including the theory and hands-on course material, in their local language.

Courses are delivered both remotely, as live virtual classes through our web platform, and on site. Either way, each student receives the course material electronically and has access throughout the course to a dedicated virtual machine for the hands-on labs.

We offer a wide range of courses on Oracle Business Intelligence and Data Warehousing technologies, from intensive five-day bootcamps to sets of role-specific courses. The full catalogue is available in Spanish, Portuguese or English.

Interested in our courses, or in becoming a training partner? Don’t hesitate to contact us at training@rittmanmead.com.

Categories: BI & Warehousing

Auditing OBIEE Presentation Catalog Activity with Custom Log Filters

Mon, 2014-11-10 01:49

A question that I’ve noticed coming up a few times on the OBIEE OTN forums goes along the lines of “How can I find out who deleted a report from the Presentation Catalog?”. And whilst the BI Server’s Usage Tracking is superb for auditing who ran what report, we don’t by default have a way of seeing who deleted a report.

The Presentation Catalog (or “Web Catalog” as it was called in 10g) records who created an object and when it was last modified, accessible through both OBIEE’s Catalog view, and the dedicated Catalog Manager tool itself:


But if we want to find out who deleted an object, or who modified it before the most recent person (that is, to build up an audit trail of who modified an object), we have to dig a bit deeper.

Presentation Services Log Sources

Perusing the OBIEE product manuals, one will find additional Logging in Oracle BI Presentation Services options documented. This is more than just turning up the log level en masse, because it also includes additional log writers and filters. What this means is that you can have your standard Presentation Services logging, but then configure a separate file to capture more detailed information about just specific goings on within Presentation Services.

Looking at a normal Presentation Services log (in $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/) you’ll see various messages by default – greater or fewer depending on the health of your system – but they all include the Location stack trace, such as this one here:

[2014-11-10T06:33:19.000-00:00] [OBIPS] [WARNING:16] [] [saw.soap.soaphelpers.writeiteminfocontents] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000001ede,0:1] [tid: 2569512704] Resolving and writing full ACL for path /shared/Important stuff/Sales by brand[[
File:soaphelpers.cpp
Line:609
Location:
        saw.soap.soaphelpers.writeiteminfocontents
        saw.soap.catalogservice
        saw.SOAP
        saw.httpserver.request.soaprequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales by brand
AuthProps: AuthSchema=UidPwd-soap|PWD=******|UID=weblogic|User=weblogic
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000001ede,0:1
ThreadID: 2569512704

And it is the Location that is of interest to us here, because it’s what gives hints about the types of log messages that can be emitted and that we may want to filter. For example, the one quoted above is evidently something to do with the Presentation Catalog and SOAP, which I’d guess is a result of Catalog Manager (which uses web services/SOAP to access OBIEE).

To get a full listing of all the possible log sources, first set up the BI command line environment with bi-init:

source $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

and then run:

sawserver -logsources

(If you get an error, almost certainly you didn’t set up the command line environment properly with bi-init.) You’ll get a list of over a thousand lines (which gives you an idea of quite how powerful this granular logging is). Assuming you’ll want to peruse it at your leisure, it makes sense to write it to disk, which on *nix you can simply do thus:

sawserver -logsources > sawserver.logsources.txt

To find what you want on the list, you can just search through it. Looking for anything related to “catalog” and narrowing it down further, I came up with these interesting sources:

[oracle@demo ~]$ sawserver -logsources|grep catalog|grep local
saw.catalog.item.getlocalized
saw.catalog.local
saw.catalog.local.checkforcatalogupgrade
saw.catalog.local.copyItem
saw.catalog.local.createFolder
saw.catalog.local.createLink
saw.catalog.local.deleteItem
saw.catalog.local.getItemACL
saw.catalog.local.getItemInfo
saw.catalog.local.loadCatalog
saw.catalog.local.moveItem
saw.catalog.local.openObject
saw.catalog.local.readObject
saw.catalog.local.search
saw.catalog.local.setItemACL
saw.catalog.local.setItemInfo
saw.catalog.local.setMaintenanceMode
saw.catalog.local.setOwnership
saw.catalog.local.writeObject

Configuring granular Presentation Services logging

Let us see how to go and set up this additional logging. Remember, this is not the same as just going to Enterprise Manager and bumping the log level to 11 globally – we’re going to retain the default logging level, but for just specific actions that occur within the tool, capture greater information. The documentation for this is here.

The configuration is found in the instanceconfig.xml file, so like all good sysadmins let’s take a backup first:

cd $FMW_HOME/instances/instance1/config/OracleBIPresentationServicesComponent/coreapplication_obips1/
cp instanceconfig.xml instanceconfig.xml.20141110

Now depending on your poison, open the instanceconfig.xml directly in a text editor from the command line, or copy it to a desktop environment where you can open it in your favourite text editor there. Either way, these are the changes we’re going to make:

  1. Locate the <Logging> section. Note that within it there are three child entities – <Writers>, <WriterClassGroups> and <Filters>. We’re going to add an entry to each.

  2. Under <Writers>, add:

    <Writer implementation="FileLogWriter" name="RM Presentation Catalog Audit" disableCentralControl="true" writerClassId="6" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="rm_pres_cat_audit" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>

    This defines a new writer that will write logs to disk (FileLogWriter), in 10MB files of which it’ll keep ten. If you’re defining additional Writers, make sure they have a unique writerClassId. See the docs for detailed syntax.

  3. Under <WriterClassGroups> add:

    <WriterClassGroup name="RMLog">6</WriterClassGroup>

    This defines the RMLog class group as being associated with writerClassId 6 (as defined above), and is used in the Filters section to direct logs. If you wanted you could log entries to multiple logs (eg both file and console) this way.

  4. Under <Filters> add:

    <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.moveItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
    <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.deleteItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>

    Here we’re defining two event filters, with levels turned up to the maximum (32), directing the capture of any occurrences to the RMLog writerClassGroup.

After making the changes to instanceconfig.xml, restart Presentation Services:

$FMW_HOME/instances/instance1/bin/opmnctl restartproc ias-component=coreapplication_obips1

Here’s the completed instanceconfig.xml from the top of the file through to the end of the <Logging> section, with my changes overlayed the defaults:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Oracle Business Intelligence Presentation Services Configuration File -->
<WebConfig xmlns="oracle.bi.presentation.services/config/v1.1">
   <ServerInstance>

      <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><CatalogPath>/app/oracle/biee/instances/instance1/SampleAppWebcat</CatalogPath>

      <DSN>AnalyticsWeb</DSN>

      <Logging>

         <Writers>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="FileLogWriter" name="Global File Logger" writerClassId="1" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="sawlog" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="CoutWriter" name="Console Logger" writerClassId="2" maxFileSizeKb="10240"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="EventLogWriter" name="System Event Logger" writerClassId="3" maxFileSizeKb="10240"/>
            <!--  The following writer is not centrally controlled -->
            <Writer implementation="FileLogWriter" name="Webcat Upgrade Logger" disableCentralControl="true" writerClassId="5" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="webcatupgrade" maxFileSizeKb="2147483647" filesN="1" fmtName="ODL-Text"/>
            <Writer implementation="FileLogWriter" name="RM Presentation Catalog Audit" disableCentralControl="true" writerClassId="6" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="rm_pres_cat_audit" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>
         </Writers>

         <WriterClassGroups>
            <WriterClassGroup name="All">1,2,3,5,6</WriterClassGroup>
            <WriterClassGroup name="File">1</WriterClassGroup>
            <WriterClassGroup name="Console">2</WriterClassGroup>
            <WriterClassGroup name="EventLog">3</WriterClassGroup>
            <WriterClassGroup name="UpgradeLogFile">5</WriterClassGroup>
            <WriterClassGroup name="RMLog">6</WriterClassGroup>
         </WriterClassGroups>

         <Filters>
            <!--  These FilterRecords are updated by centrally controlled configuration -->
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><FilterRecord writerClassGroup="File" path="saw" information="1" warning="31" error="31" trace="0" incident_error="1"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><FilterRecord writerClassGroup="File" path="saw.mktgsqlsubsystem.joblog" information="1" warning="31" error="31" trace="0" incident_error="1"/>

            <!--  The following FilterRecords are not centrally controlled -->
            <FilterRecord writerClassGroup="UpgradeLogFile" disableCentralControl="true" path="saw.subsystem.catalog.initialize.upgrade" information="1" warning="32" error="32" trace="1" incident_error="32"/>
            <FilterRecord writerClassGroup="UpgradeLogFile" disableCentralControl="true" path="saw.subsystem.catalog.upgrade" information="1" warning="32" error="32" trace="1" incident_error="32"/>
            <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.moveItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
            <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.deleteItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
         </Filters>

      </Logging>

[...]

Granular logging in action

Having restarted Presentation Services after making the above change, I can see in my new log file whenever an item from the Presentation Catalog is deleted, by whom, and from what IP address:

[2014-11-10T07:13:36.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.deleteItem] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002cf1,0:1] [tid: 2458068736] Succeeded with '/shared/Important stuff/Sales by brand 2'[[
File:localwebcatalog.cpp
Line:626
Location:
        saw.catalog.local.deleteItem
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales by brand 2
SessionID: p8n6ojs0vkh7tou0mkstmlc9me381hadm9o1fui
AuthProps: AuthSchema=UidPwd|PWD=******|UID=r.mellie|User=r.mellie
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002cf1,0:1
ThreadID: 2458068736
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: action='rm',_scid='QR5zMdHIL3JsW1b67P9p',icharset='utf-8',urlGenerator='qualified',paths='["/shared/Important stuff/Sales by brand 2"]'
]]

And the same for when a file is moved/renamed:

[2014-11-10T07:28:17.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.moveItem] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000003265,0:1] [tid: 637863680] Source '/shared/Important stuff/copy of Sales by brand', Destination '/shared/Important stuff/Sales by brand 2': Succeeded with '/shared/Important stuff/copy of Sales by brand'[[
File:localwebcatalog.cpp
Line:1186
Location:
        saw.catalog.local.moveItem
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/copy of Sales by brand
SessionID: ddt6eo7llcm0ohs5e2oivddj7rtrhn8i41a7f32
AuthProps: AuthSchema=UidPwd|PWD=******|UID=f.saunders|User=f.saunders
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000003265,0:1
ThreadID: 637863680
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: path='/shared/Important stuff/copy of Sales by brand',action='ren',_scid='84mO8SRViXlwJ*180HV7',name='Sales by brand 2',keepLink='f',icharset='utf-8',urlGenerator='qualified'
]]
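
Since the only events being routed to this file are the deleteItem and moveItem ones, pulling a quick “who did what, from where” summary out of it needs nothing more than grep. A rough sketch, assuming the standard log directory and the rm_pres_cat_audit file prefix configured above:

cd $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/

# Show the event type, catalog path, user and source IP for each audited action
grep -hE 'saw\.catalog\.local\.(deleteItem|moveItem)|^(Path|AuthProps|RemoteIP):' rm_pres_cat_audit*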

Be careful with your logging

Just because you can log everything, don’t be tempted to actually log everything. Bear in mind that we’re crossing over from simple end-user logging here into the very depths of the sawserver (Presentation Services) code, accessing logging that is extremely diagnostic in nature. Which for our specific purpose of tracking when someone deletes an object from the Presentation Catalog is handy. But as an example, if you enable saw.catalog.local.writeObject event logging, you may think that it will record who changed a report when, and that might be useful. But – look at what gets logged every time someone saves a report:

[2014-11-10T07:19:32.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.writeObject] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002efb,0:1] [tid: 2454759168] Succeeded with '/shared/Important stuff/Sales 01'[[
File:localwebcatalog.cpp
Line:1476
Location:
        saw.catalog.local.writeObject
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales 01
SessionID: p8n6ojs0vkh7tou0mkstmlc9me381hadm9o1fui
AuthProps: AuthSchema=UidPwd|PWD=******|UID=r.mellie|User=r.mellie
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002efb,0:1
ThreadID: 2454759168
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: path='/shared/Important stuff/Sales 01',action='wr',_scid='QR5zMdHIL3JsW1b67P9p',repl='t',followLinks='t',icharset='utf-8',modifiedTime='1415600931000',data='<saw:report xmlns:saw="com.siebel.analytics.web/report/v1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sawx="com.siebel.analytics.web/expression/v1.1" xmlVersion="201201160"><saw:criteria xsi:type="saw:simpleCriteria" subjectArea="&quot;A - Sample Sales&quot;" withinHierarchy="true"><saw:columns><saw:column xsi:type="saw:regularColumn" columnID="c1dff1637cbc77948"><saw:columnFormula><sawx:expr xsi:type="sawx:sqlExpression">"Time"."T05 Per Name Year"</sawx:expr></saw:columnFormula></saw:column></saw:columns></saw:criteria><saw:views currentView="0"><saw:view xsi:type="saw:compoundView" name="compoundView!1"><saw:cvTable><saw:cvRow><saw:cvCell viewName="titleView!1"><saw:displayFormat><saw:formatSpec/></saw:displayFormat></saw:cvCell></saw:cvRow><saw:cvRow><saw:cvCell viewName="tableView!1"><saw:displayFormat><saw:formatSpec/></saw:displayFormat></saw:cvCell></saw:cvRow></saw:cvTable></saw:view><saw:view xsi:type="saw:titleView" name="titleView!1"/><saw:view xsi:type="saw:tableView" name="tableView!1" scrollingEnabled="false"><saw:edges><saw:edge axis="page" showColumnHeader="true"/><saw:edge axis="section"/><saw:edge axis="row" showColumnHeader="true"><saw:edgeLayers><saw:edgeLayer type="column" columnID="c1dff1637cbc77948"/></saw:edgeLayers></saw:edge><saw:edge axis="column" showColumnHeader="rollover"/></saw:edges></saw:view></saw:views></saw:report>',sig='queryitem1'
]]

It’s the whole report definition! And this is a very very small report – real life reports can be page after page of XML. That is not a good level at which to be recording this information. If you want to retain this kind of control over who is saving what report, you should maybe be looking at authorisation groups for your users in terms of where they can save reports, and have trusted ‘gatekeepers’ for important areas.

As well as the verbose report capture with the writeObject event, you also get this background chatter:

[2014-11-10T07:20:27.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.writeObject] [ecid: 0051rj7FmC3Fw000jzwkno0007PK000000,0:200] [tid: 3034580736] Succeeded with '/users/r.mellie/_prefs/volatileuserdata'[[
File:localwebcatalog.cpp
Line:1476
Location:
        saw.catalog.local.writeObject
        saw.subsystem.security.cleanup
        saw.Sessions.cache.cleanup
        saw.taskScheduler.processJob
        taskscheduler
        saw.threads
Path: /users/r.mellie/_prefs/volatileuserdata
ecid: 0051rj7FmC3Fw000jzwkno0007PK000000,0:200
ThreadID: 3034580736
task: Cache/Sessions
]]

volatileuserdata is presumably just that (user data that is volatile, constantly changing) and not something that would be of interest to anyone to log – but you can’t capture actual report writes without capturing this too. On a busy system you’re going to be unnecessarily thrashing the log files if you capture this event as a matter of routine – so don’t!

Summary

The detailed information is there for the taking in Presentation Services’ excellent granular log sources – just be careful what you capture lest you bite off more than you can chew.

Categories: BI & Warehousing

Analytics with Kibana and Elasticsearch through Hadoop – part 3 – Visualising the data in Kibana

Tue, 2014-11-04 16:02

In this post we will see how Kibana can be used to create visualisations over various sets of data that we have combined together. Kibana is a graphical front end for data held in ElasticSearch, which also provides the analytic capabilities. Previously we looked at where the data came from and exposing it through Hive, and then loading it into ElasticSearch. Here’s what we’ve built so far, the borders denoting what was covered in the previous two blog articles and what we’ll cover here:


Now that we’ve got all the data into Elasticsearch, via Hive, we can start putting some pictures around it. Kibana works by directly querying Elasticsearch, generating the same kind of queries that you can run yourself through the Elasticsearch REST API (similar to what we saw when defining the mappings in the previous article). In this sense there is a loose parallel between OBIEE’s Presentation Services and the BI Server – one does the fancy front end stuff, generating queries to the hard-working backend.

I’ve been looking at both the current release version of Kibana (3.x), and also the beta of Kibana 4 which brings with it a very smart visualiser that we’ll look at in detail. It looks like Kibana 4 is a ground-up rewrite rather than modifications to Kibana 3, which means that at the moment it is a long way from parity of functionality – which is why I’m flitting between the two. For a primer in Kibana 3 and its interface see my article on using it to monitor OBIEE.

Installing Kibana is pretty easy in Kibana 3, involving a simple config change to a web server of your choice that you need to provide (details in my previous blog), and has been made even easier in Kibana 4 which actually ships with its own web server so you literally just download it, unarchive it and run it.

So the starting point is the assumption we have all the data in a single Elasticsearch index all_blog, with three different mappings which Kibana refers to accurately as “types”: blog posts, blog visits, and blog tweets.

Kibana 3

Starting with a simple example first, and to illustrate the “analysed” vs “non-analysed” mapping configuration that I mentioned previously, let’s look at the Term visualisation in Kibana 3. This displays the results of an Elasticsearch analysis against a given field. If the field has been marked as “not analysed” we get a listing of the literal values, ranking by the number of times they repeat. This is useful, for example, to show who has blogged the most:

But less useful if we want to analyse the use of words in blog titles, since non-analysed we just get a listing of blog titles:

(there are indeed two blog posts entitled “Odds and Ends” from quite a while ago)

Building the Term visualisation against the post title field that has been analysed gives us a more interesting, although hardly surprising, result:

Here I’ve weeded out the obvious words that will appear all the time (‘the’, ‘a’, etc), using the Exclude Term(s) option.
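
As an aside, a Term panel is really just a friendly front end to Elasticsearch’s term counting (Kibana 3 uses the older facets API; the newer aggregations API does the same job), and you can run an equivalent query yourself. Here’s a rough sketch against the index used in this series, counting posts per author (post_author being one of the not_analyzed fields from the previous article):

curl -s -XGET 'http://bigdatalite.localdomain:9200/all_blog/posts/_search?search_type=count' -d '
{
  "aggs": {
    "top_authors": {
      "terms": { "field": "post_author", "size": 10 }
    }
  }
}' | jq '.aggregations.top_authors.buckets'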

Term visualisations are really useful for displaying any kind of top/bottom ranked values, and also because they are interactive – if you click on the value it is applied as a filter to the data on the page. What that means is that we can take a simple dashboard using the two Term objects above, plus a histogram of posts made over time:

And by clicking on one of the terms (for example, my name in the authors list) it shows that I only started posting on the Rittman Mead blog three years ago, and that I write about OBIEE, performance, and exalytics.

Taking another tack, we can search for any term and add it to the histogram. Here we can see when interest in 11g (the green line), as well as big data (red), started:

Note that here we’re just analysing post titles, not content, so it’s not 100% representative. Maybe loading our post contents into Elasticsearch will be my next blog post. But that does then start to get a little bit meta…

Adding in a Table view gives us the ability to show the actual posts and links to them.

Let’s explore the data a bit. Clicking on an entry in the table gives us the option to filter down further

Here we can see for a selected blog post, what its traffic was and when (if at all) it was tweeted:

An interesting feature in the profile of blog hits is a second peak that looks like it might correlate with tweets. Let’s drill further by drag-clicking (brushing) on the graph to select the range we want, and bring in details of those tweets:

So this is all pretty interesting, and importantly, very rapid in terms of both the user experience and the response time.

Kibana 4

Now let’s take a look at what Kibana 4 offers us. As well as a snazzier interface (think hipster data explorer vs hairy ops guy parsing logs), its new Visualiser builder is great. Kibana 3 dumped you on a dashboard in which you have to build rows and panels and so on. Kibana 4 has a nice big “Visualize” button. Let’s see what this does for us. To start with it’s a nice “guided” build process:

By default we get a single bar, counting all the ‘documents’ for the time period. We can use the Search option at the top to filter just the ‘type’ of document we want, which in this case is going to be tweets about our blog articles.

Obviously, a single bar on its own isn’t that interesting, so let’s improve it. We’ll click the “Add Aggregation” button (even though to my pedantic mind the data is already aggregated to total), and add an X-Axis of date:

The bucket size in the histogram defaults to automatic, and the axis label tells us it’s per three hours. At the volume of tweets we’re analysing, we’d see patterns better at a higher grain such as daily (the penultimate bar to the right of the graph shows a busy day of tweets that’s lost in the graph at 3-hour intervals):

NB at the moment in Kibana 4 intervals are fixed (in Kibana 3 they were freeform).

Let’s dig into the tweets a bit deeper. Adding a “Sub Aggregation” to split the bars based on top two tweet authors per day gives us this:

You can hover over the legend to highlight the relevant bar block too:

Now with a nifty function in the Visualizer we can change the order of this question. So instead of, “by day, who were the top two tweeters”, we can ask “who were the top two tweeters over the time period, and what was their tweet count by day” – all just by rearranging the buckets/aggregation with a single click:

Let’s take another angle on the data, looking not at time but at which blog links were most tweeted, and by whom. Turns out I’m a self-publicist, tweeting four times about my OOW article. Note that I’ve also included some filtering on my data to exclude automated tweets:

Broadening out the tweets to all those from the accounts we were capturing during the sample period, we can see the most active tweeters, and also what proportion are original content vs retweets:

Turning our attention to the blog hits, it’s easy to break it down by top five articles in a period, accesses by day:

Having combined (dare I say, mashed up) post metadata with apache logs, we can overlay information about which author gets the most hits. Unsurprisingly Mark Rittman gets the lion’s share, but interestingly Venkat, who has not blogged for quite a while, is still in the top three authors (based on blog page hits) in the time period analysed:

It’s in the lack of a table visualisation that Kibana 4 is currently limited (although one is planned), because this analysis (for the top three authors, their respective two most popular posts) just makes no sense as a graph:

but would be nice and easy to read off a table. You can access a table view of sorts from the arrow at the bottom of the screen, but this feels more like a debug option than an equal method for presenting the data.

Whilst you can access the table on a dashboard, it doesn’t persist as the default option of the view, always showing the graph initially. As noted above, a table visualisation is planned and under development for Kibana 4.

Speaking of dashboards, Kibana 4 has a very nice dashboard builder with interactive resizing of objects both within rows and columns – quite a departure from Kibana 3 which has a rigid system of rows and panels:

Summary

Kibana 3 is great for properly analysing data and trends as you find them in the data, if you don’t mind working your way through the slightly rough interface. In contrast, Kibana 4 has a pretty slick UI but being an early beta is missing features like Term and Table from Kibana 3 that would enable tables of data as well as the pretty graphs. It’ll be great to see how it develops.

Putting the data in Elasticsearch makes it very fast to query. I’m doing this on the Big Data Lite VM, which admittedly is not very representative of a real-world Hadoop cluster, but the relative speeds are interesting – dozens of seconds for any kind of Hive query, subsecond for any kind of Kibana/Elasticsearch query. The advantage of the latter is of course significant from a data exploration point of view, because you have not only the speed but also the visualisations, and the interactions with those visuals, to dig and drill further into the data.

Whilst Elasticsearch is extremely fast to query, I’ve not compared it to other options that are designed for speed (eg Impala) and which support a more standard interface, such as ODBC or JDBC so you can bring your own data visualisation tool (eg T-who-shall-not-be-named). In addition, there is the architectural consideration of Elasticsearch’s fit with the rest of the Hadoop stack. Whilst the elasticsearch-hadoop connector is two-way, I’m not sure if you would necessarily site your data in Elasticsearch alone, opting instead to duplicate all or part of it from somewhere like HDFS.

What would be interesting is to look at a similar analysis exercise using the updated Hue Search in CDH 5.2, which uses Apache Solr and is therefore based on the same project as Elasticsearch (Apache Lucene). Another angle on this is Oracle’s forthcoming Big Data Discovery tool, which also looks like it covers a similar purpose.

Categories: BI & Warehousing

Analytics with Kibana and Elasticsearch through Hadoop – part 2 – Getting data into Elasticsearch

Tue, 2014-11-04 15:17
Introduction

In the first part of this series I described how I made several sets of data relating to the Rittman Mead blog from various sources available through Hive. This included blog hits from the Apache webserver log, tweets, and metadata from WordPress. Having got it into Hive I now need to get it into Elasticsearch as a pre-requisite for using Kibana, to see how it holds up as an analysis tool or as a “data discovery” option. Here’s a reminder of the high-level architecture, with the parts I’ve divided it into across the three blog posts indicated:


In this article we will see how to go about doing that load into ElasticSearch, before getting into some hands-on with Kibana in the final article of this series.

Loading data from Hive to Elasticsearch

We need to get the data into Elasticsearch itself since that is where Kibana requires it to be for generating the visualisations. Elasticsearch holds the data and provides the analytics engine, and Kibana provides the visualisation rendering and the generation of queries into Elasticsearch. Kibana and Elasticsearch are the ‘E’ and ‘K’ of the ELK stack, which I have written about previously (the ‘L’ being Logstash but we’re not using that here).

Using the elasticsearch-hadoop connector we can load data exposed through Hive into Elasticsearch. It’s possible to load data directly from origin into Elasticsearch (using, for example, Logstash) but here we’re wanting to bring together several sets of data using Hadoop/Hive as the common point of integration.

Elasticsearch has a concept of an ‘index’ within which data is stored, held under a schema known as a ‘mapping’. Each index can have multiple mappings. It’s dead easy to run Elasticsearch – simply download it, unpack the archive, and then run it – it really is as easy as that:

[oracle@bigdatalite ~]$ /opt/elasticsearch-1.4.0.Beta1/bin/elasticsearch
[2014-10-30 16:59:39,078][INFO ][node                     ] [Master] version[1.4.0.Beta1], pid[13467], build[1f25669/2014-10-01T14:58:15Z]
[2014-10-30 16:59:39,080][INFO ][node                     ] [Master] initializing ...
[2014-10-30 16:59:39,094][INFO ][plugins                  ] [Master] loaded [], sites [kopf, gui]
[2014-10-30 16:59:43,184][INFO ][node                     ] [Master] initialized
[2014-10-30 16:59:43,184][INFO ][node                     ] [Master] starting ...
[2014-10-30 16:59:43,419][INFO ][transport                ] [Master] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.57.3:9300]}
[2014-10-30 16:59:43,446][INFO ][discovery                ] [Master] elasticsearch/mkQYgr4bSiG-FqEVRkB_iw
[2014-10-30 16:59:46,501][INFO ][cluster.service          ] [Master] new_master [Master][mkQYgr4bSiG-FqEVRkB_iw][bigdatalite.localdomain][inet[/192.168.57.3:9300]], reason: zen-disco-join (elected_as_master)
[2014-10-30 16:59:46,552][INFO ][http                     ] [Master] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.57.3:9200]}
[2014-10-30 16:59:46,552][INFO ][node                     ] [Master] started

You can load data directly across into Elasticsearch from Hive without having to prepare anything on Elasticsearch – it will create the index and mapping for you. But, for it to work how we want, we do need to specify the mapping in advance because we want to tell Elasticsearch two important things:

  • To treat the date field as a date – crucial for Kibana to do its time series-based magic
  • Not to “analyze” certain fields. By default Elasticsearch will analyze each string field so that you can display most common terms within it etc. However if we want to report things like blog title, breaking it down into individual words doesn’t make sense.

This means that the process is as follows:

  1. Define the Elasticsearch table in Hive
  2. Load a small sample of data into Elasticsearch from Hive
  3. Extract the mapping and amend the date field and mark required fields as non-analysed
  4. Load the new mapping definition to Elasticsearch
  5. Do a full load from Hive into Elasticsearch

Steps 2 and 3 can be sidestepped by crafting the mapping by hand from the outset but it’s typically quicker not to.

Before we can do anything in terms of shifting data around, we need to make elasticsearch-hadoop available to Hadoop. Download it from the github site, and copy the jar file to /usr/lib/hadoop and add it to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive-env.sh.
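
A rough sketch of those steps; the jar version and download path below are placeholders for whatever you actually pull down from GitHub, and the name of the HiveServer2 service to restart may differ on your distribution:

# Copy the connector jar somewhere central (version number is illustrative)
sudo cp elasticsearch-hadoop-2.0.2/dist/elasticsearch-hadoop-2.0.2.jar /usr/lib/hadoop/

# Make Hive pick it up by appending it to HIVE_AUX_JARS_PATH
echo 'export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:/usr/lib/hadoop/elasticsearch-hadoop-2.0.2.jar' | sudo tee -a /usr/lib/hive/conf/hive-env.sh

# Restart HiveServer2 so the new classpath takes effect
sudo service hive-server2 restart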

Defining the Hive table over Elasticsearch

The Hive definition for a table stored in Elasticsearch is pretty simple. Here’s a basic example of a table that’s going to hold a list of all blog posts made. Note the _es suffix, a convention I’m using to differentiate the Hive table from others with the same data and denoting that it’s in Elasticsearch (es). Also note the use of EXTERNAL as previously discussed, to stop Hive trashing the underlying data if you drop the Hive table:

CREATE EXTERNAL TABLE all_blog_posts_es (
ts_epoch bigint ,
post_title string ,
post_title_a string ,
post_author string ,
url string ,
post_type string )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
'es.nodes'='bigdatalite.localdomain',
'es.resource'='all_blog/posts'
) ;

The ROW FORMAT and STORED BY are standard, but the TBLPROPERTIES values should be explained (you’ll find full details in the manual):

  1. es.nodes – this is the hostname of the Elasticsearch server. If you have multiple nodes it will discover the others from this.
  2. es.resource – this is the index and mapping where the data should be stored. We’ll see more about these later, because they’re important.
Time for a tangent …

The biggest issue I had getting data from Hive into Elasticsearch was timestamps. To cut a very long story (involving lots of random jiggling, hi Christian!) short, I found it was easiest to convert timestamps into Unix epoch (number of seconds since Jan 1st 1970), rather than prat about with format strings (and prat about I did). For timestamps already matching the ISO8601 standard such as those in my WordPress data, I could leverage the Hive function UNIX_TIMESTAMP which returns exactly that

0: jdbc:hive2://bigdatalite:10000> select post_date, unix_timestamp(post_date) as post_date_epoch from posts limit 1;
post_date        2007-03-07 17:45:07
post_date_epoch  1173289507

For others though that included the month name as text, such as Wed, 17 Sep 2014 08:31:20 +0000, I had to write a very kludgy CASE statement to first switch the month names for numbers and then concatenate the whole lot into an ISO8601 format that could be converted to unix epoch. This is why I also split the apache log SerDe so that it would bring in the timestamp components (time_dayDD, time_monthMMM, etc) individually, making the epoch conversion a little bit neater:

unix_timestamp(concat(concat(concat(concat(concat(concat(
a.time_yearyyyy,'-')
,case a.time_monthmmm when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
,a.time_daydd,' ')
,a.time_hourhh,':')
,a.time_minmm,':')
,a.time_secss,'')
)

And if you thought this was bad, check out what I had to do to the twitter timestamp:

unix_timestamp(
    concat(concat(concat(concat(regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-')
    ,case regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} ',''),' .*$','') 
    when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
    ,regexp_replace(regexp_replace(created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))
)

As with a few things here, this was all more for experimentation than streamlined production usage, so it could probably be rewritten more efficiently or solved in a better way – suggestions welcome!

So the net result of all of this is the timestamp as epoch in seconds – but note that Elasticsearch works with millisecond epoch, so they all need multiplying by 1000.

As I’ve noted above, this feels more complex than it needed to be, and maybe with a bit more perseverance I could have got it to work without resorting to epoch. The issue I continued to hit with passing timestamps across as non-epoch values (i.e. as strings using the format option of the Elasticsearch mapping definition, or Hive Timestamp, and even specifying es.mapping.timestamp) was org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: TimestampParsingException, regardless of the careful format masks that I applied.

Back on track – loading a sample row into Elasticsearch

We want to send a sample row of data to Elasticsearch now for two reasons:

  1. As a canary to prove the “plumbing” – no point chucking thousands of rows across through MapReduce if it’s going to fall over for a simple problem (I learnt my lesson during the timestamp fiddling above).
  2. Automagically generate the Elasticsearch mapping, which we subsequently need to modify by hand and is easier if it’s been created for us first.

Since the table is defined in Hive, we can just run a straightforward INSERT to send some data across, making use of the LIMIT clause of HiveQL to just send a couple of rows:

INSERT INTO TABLE all_blog_posts_es 
SELECT UNIX_TIMESTAMP(post_date) * 1000 AS post_date_epoch, 
       title, 
       title, 
       author, 
       REGEXP_EXTRACT(generated_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) , 
       post_type 
FROM   posts 
WHERE  post_date IS NOT NULL
LIMIT 2
;

Hive will generate a MapReduce job that pushes the resulting data over to Elasticsearch. You can see the log for the job – essential for troubleshooting – at /var/log/hive/hive-server2.log (by default). In this snippet you can see a successful completion:

2014-10-30 22:35:14,977 INFO  exec.Task (SessionState.java:printInfo(417)) - Starting Job = job_1414451727442_0011, Tracking URL = http://bigdatalite.localdomain:8088/proxy/application_1414451727442_0011/
2014-10-30 22:35:14,977 INFO  exec.Task (SessionState.java:printInfo(417)) - Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414451727442_0011
2014-10-30 22:35:22,244 INFO  exec.Task (SessionState.java:printInfo(417)) - Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 1
2014-10-30 22:35:22,275 WARN  mapreduce.Counters (AbstractCounters.java:getGroup(234)) - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2014-10-30 22:35:22,276 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:22,276 Stage-0 map = 0%,  reduce = 0%
2014-10-30 22:35:30,757 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:30,757 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.51 sec
2014-10-30 22:35:40,098 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:40,098 Stage-0 map = 100%,  reduce = 100%, Cumulative CPU 4.44 sec
2014-10-30 22:35:40,100 INFO  exec.Task (SessionState.java:printInfo(417)) - MapReduce Total cumulative CPU time: 4 seconds 440 msec
2014-10-30 22:35:40,132 INFO  exec.Task (SessionState.java:printInfo(417)) - Ended Job = job_1414451727442_0011
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - MapReduce Jobs Launched:
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.44 sec   HDFS Read: 4313 HDFS Write: 0 SUCCESS
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - Total MapReduce CPU Time Spent: 4 seconds 440 msec
2014-10-30 22:35:40,159 INFO  ql.Driver (SessionState.java:printInfo(417)) - OK

But if you’ve a problem with your setup you’ll most likely see this generic error instead passed back to beeline prompt:

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

Meaning that you need to go to the Hive log file for the full diagnostics.

Amending the Elasticsearch mapping

So assuming the previous step worked (if you got the innocuous No rows affected from beeline then it did) you now have an index and mapping (and a couple of “documents” of data) in Elasticsearch. You can inspect the mapping in several ways, including with the GUI for Elasticsearch admin kopf.

You can also interrogate Elasticsearch directly with its REST API, which is what we’re going to use to update the mapping, so let’s use it first to view it. I’m going to use curl to do the HTTP call, and then pipe it | straight to jq to prettify the resulting JSON that Elasticsearch sends back.

[oracle@bigdatalite ~]$ curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/posts/_mapping' | jq '.'
{
  "all_blog": {
    "mappings": {
      "posts": {
        "properties": {
          "url": {
            "type": "string"
          },
          "ts_epoch": {
            "type": "long"
          },
          "post_type": {
            "type": "string"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string"
          },
          "post_author": {
            "type": "string"
          }
        }
      }
    }
  }
}

We can see from this that Elasticsearch has generated the mapping to match the data that we’ve sent across from Hive (note how it’s picked up the ts_epoch type as being numeric not string, per our Hive table DDL). But, as mentioned previously, there are two things we need to rectify here:

  1. ts_epoch needs to be a date type, not long. Without the correct type, Kibana won’t recognise it as a date field.
  2. Fields that we don’t want broken down for analysis need marking as such. We’ll see the real difference that this makes when we get on to Kibana later.

To amend the mapping we just take the JSON document, make the changes, and then push it back with curl again. You can use any editor with the JSON (I’ve found Atom on the Mac to be great for its syntax highlighting, brace matching, etc). To change the type of the date field just change long to date. To mark a field not for analysis add "index": "not_analyzed" to the column definition. After these changes, the amended fields in my mapping JSON look like this:

[...]
          "url": {
            "type": "string","index": "not_analyzed"
          },
          "ts_epoch": {
            "type": "date"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string","index": "not_analyzed"
          },
          "post_author": {
            "type": "string","index": "not_analyzed"
            [...]

The particularly eagle-eyed of you will notice that I am loading post_title in twice. This is because I want to use the field both as a label but also to analyse it as a field itself, looking at which terms get used most. So in the updated mapping, only post_title is set to not_analyzed; the post_title_a is left alone.

To remove the existing mapping, use this API call:

curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/posts'

and then the amended mapping put back. Note that the "all_blog" / "mappings" outer levels of the JSON have been removed from the JSON that we send back to Elasticsearch:

curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts' -d '
{
      "posts": {
        "properties": {
          "url": {
            "type": "string","index": "not_analyzed"
          },
          "ts_epoch": {
            "type": "date"
          },
          "post_type": {
            "type": "string"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string","index": "not_analyzed"
          },
          "post_author": {
            "type": "string","index": "not_analyzed"
          }
        }
      }
    }
'

Full load into Elasticsearch

Now we can go ahead and run a full INSERT from Hive, and this time the existing mapping will be used. Depending on how much data you’re loading, it might take a while but you can always tail the hive-server2.log file to monitor progress. So that we don’t duplicate the ‘canary’ data that we sent across, use the INSERT OVERWRITE statement:

INSERT OVERWRITE table all_blog_posts_es 
SELECT UNIX_TIMESTAMP(post_date) * 1000 AS post_date_epoch, 
title, 
title, 
author, 
REGEXP_EXTRACT(generated_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) , 
post_type 
FROM   posts 
WHERE  post_date IS NOT NULL
;

To check the data’s made it across we can do a count from Hive:

0: jdbc:hive2://bigdatalite:10000> select count(*) from all_blog_posts_es;
+------+
| _c0  |
+------+
| 2257 |
+------+
1 row selected (27.005 seconds)

But this requires a MapReduce job to run and is fairly slow. Much faster is direct from the horse’s mouth – from Elasticsearch itself where the data is. Just as we called a REST API to get and set the mapping, Elasticsearch can also give us statistics back this way too:

[oracle@bigdatalite ~]$ curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/_stats/docs' | jq '.indices[].total.docs'
{
  "deleted": 0,
  "count": 2257
}

Here I’ve used a bit more jq to parse down the stats in JSON that Elasticsearch sends back. If you want to explore more of what jq can do, you’ll find https://jqplay.org/ useful.

Code

For reference, here is the set of three curl/DDL/DML that I used:

  • Elasticsearch index mappings

    # For reruns, remove and recreate index
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog' && curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog'
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts'
    # Create posts mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts' -d '
    {
    "posts" : {
    "properties": {
    "ts_epoch": {"type": "date"},
    "post_author": {"type": "string", "index" : "not_analyzed"},
    "post_title": {"type": "string", "index" : "not_analyzed"},
    "post_title_a": {"type": "string", "index" : "analyzed"},
    "post_type": {"type": "string", "index" : "not_analyzed"},
    "url": {"type": "string", "index" : "not_analyzed"}
    }}}
    '
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/tweets'
    # Create tweets mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/tweets' -d '
    {"tweets": {
    "properties": {
    "tweet_url": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_type": {
    "type": "string"
    },
    "ts_epoch": {
    "type": "date"
    },
    "tweet_author": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_author_followers": {
    "type": "string"
    },
    "tweet_author_friends": {
    "type": "string"
    },
    "tweet_author_handle": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_analysed": {     "type": "string"      }
    ,"post_author": {       "index": "not_analyzed","type": "string"      }
    ,"post_title": {       "index": "not_analyzed", "type": "string"      }
    ,"post_title_a": {    "type": "string"    }
    
    }
    }
    }'
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/apache'
    # Create apachelog mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/apache' -d '
    
    {"apache": {
    "properties": {
    "user": {
    "type": "string"
    },
    "url": {
    "index": "not_analyzed",
    "type": "string"
    },
    "status": {
    "type": "string"
    },
    "agent": {
    "index": "not_analyzed",
    "type": "string"
    },
    "host": {
    "type": "string"
    },
    "http_call": {
    "type": "string"
    },
    "http_status": {
    "type": "string"
    },
    "identity": {
    "type": "string"
    },
    "referer": {
    "index": "not_analyzed",
    "type": "string"
    },
    "ts_epoch": {
    "type": "date"
    },
    "size": {
    "type": "string"
    },"post_author": {      "index": "not_analyzed","type": "string"      }
    ,"post_title": {       "index": "not_analyzed", "type": "string"      }
    ,"post_title_a": {    "type": "string"    }
    
    }}}'

  • Hive table DDL

    drop table all_blog_posts_es;
    CREATE external TABLE all_blog_posts_es(
    ts_epoch bigint ,
    post_title string ,
    post_title_a string ,
    post_author string ,
    url string ,
    post_type string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/posts')
    ;
    
    drop table all_blog_tweets_es;
    CREATE EXTERNAL TABLE all_blog_tweets_es(
    tweet_type string ,
    tweet_url string ,
    tweet_author string ,
    tweet string ,
    tweet_analysed string ,
    ts_epoch bigint ,
    tweet_author_handle string ,
    tweet_author_followers string ,
    tweet_author_friends string ,url string ,post_author string ,
    post_title string ,post_title_a string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/tweets')
    ;
    
    drop table all_blog_apache_es;
    CREATE EXTERNAL TABLE all_blog_apache_es(
    host string ,
    identity string ,
    user string ,
    ts_epoch bigint ,
    http_call string ,
    url string ,
    http_status string ,
    status string ,
    size string ,
    referer string ,
    agent string, post_author string ,
    post_title string ,post_title_a string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/apache');

  • Hive DML – load data to Elasticsearch

    insert into table all_blog_posts_es
    select unix_timestamp(post_date) * 1000 as post_date_epoch,title,title,author,
    regexp_extract(generated_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)
    ,post_type
    from posts
    where post_date is not null
    ;
    
    insert overwrite table all_blog_tweets_es
    select x.*,p.author,p.title
    from (
    select 'tweets'
    ,t.url as tweet_url
    ,t.author
    ,t.content as tweet
    ,t.content as tweet_analyzed
    ,unix_timestamp(concat(concat(concat(concat(regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-'),case regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} ',''),' .*$','') when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-'),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))) * 1000 as ts_epoch
    ,t.author_handle
    ,t.author_followers
    ,t.author_friends
    ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
    from tweets t lateral view explode (referenced_urls) refs as ref_url
    where t.author_followers is not null
    and ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*'
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
    
    insert overwrite table all_blog_tweets_es
    select x.*,p.author,p.title
    from (
    select 'retweets'
    ,t.url as tweet_url
    ,t.author
    ,t.content as tweet
    ,t.content as tweet_analyzed
    ,unix_timestamp(concat(concat(concat(concat(regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-'),case regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} ',''),' .*$','') when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-'),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))) * 1000 as ts_epoch
    ,t.author_handle
    ,t.author_followers
    ,t.author_friends
    ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
    from retweets t lateral view explode (referenced_urls) refs as ref_url
    where t.author_followers is not null
    and ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*'
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
    
    insert into table all_blog_apache_es
    select x.*,p.author,p.title,p.title
    from (
    select
    a.host,a.identity,a.user
    ,unix_timestamp(concat(concat(concat(concat(concat(concat(
    a.time_yearyyyy,'-')
    ,case a.time_monthmmm when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
    ,a.time_daydd,' ')
    ,a.time_hourhh,':')
    ,a.time_minmm,':')
    ,a.time_secss,'')
    ) * 1000 as ts_epoch
    ,a.http_call ,regexp_extract(a.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url,a.http_status ,a.status ,a.size ,a.referer ,a.agent
    from apachelog a
    where a.url regexp "^\\/\\d{4}\\/\\d{2}\\/.*"
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;

Summary

With the data loaded into Elasticsearch we’re now ready to start our analysis against it. Stay tuned for the final part in this short blog series to see how we use Kibana to do this.

Categories: BI & Warehousing

Analytics with Kibana and Elasticsearch through Hadoop – part 1 – Introduction

Mon, 2014-11-03 15:21
Introduction

I’ve recently started learning more about the tools and technologies that fall under the loose umbrella term of Big Data, following a lot of the blogs that Mark Rittman has written, including getting Apache log data into Hadoop, and bringing Twitter data into Hadoop via Mongodb.

What I wanted to do was visualise the data I’d brought in, looking for patterns and correlations. Obviously the de facto choice at our shop would be Oracle BI, and Mark has previously demonstrated reporting on data in Hadoop through Hive and Impala. But this was more at the "Data Discovery" phase that is discussed in the new Information Management and Big Data Reference Architecture that Rittman Mead helped write with Oracle. I basically wanted a quick and dirty way to start chucking around columns of data without yet being ready to impose the structure of the OBIEE metadata model on it. One of the tools I’ve worked with recently is a visualisation tool called Kibana, which is part of the ELK stack (that I wrote about previously for use in building a monitoring solution for OBIEE). In this article we’ll take a look at making data available to Kibana and then the kind of analytics and visualisations you can do with it. In addition, we’ll see how loading the data into ElasticSearch has the benefit of extremely fast query times compared to querying it through Hive alone.

The Data

I’ve got three sources of data I’m going to work with, all related to the Rittman Mead website:

  • Website logs, from Apache webserver
  • Tweets about Rittman Mead blog articles, via Datasift
  • Metadata about blog posts, extracted from the WordPress MySQL database

At the moment I’ve focussed on just getting the data in, so it’s mostly coming from static files, with the exception of the tweets which are held in a noSQL database (MongoDB).

The Tools

This is where ‘big data’ gets fun, because instead of "Acme DI" and "Acme Database" and "Acme BI", we have much more interesting – if somewhat silly – product names, seemingly on the basis of the whackier the better. Here I’m using:

  • Kibana – data visualisation tool for Elasticsearch
  • Elasticsearch – data store & analytics / search engine
  • HDFS – Hadoop’s distributed file system
  • MongoDB – NoSQL database
  • Hive – enables querying data held in various places including HDFS (and Elasticsearch, and MongoDB) with a SQL-like query language
  • Beeline – Hive command line interface
  • Datasift – online service that streams tweets matching a given pattern to a nominated datastore (such as MongoDB)
  • mongo-hadoop – a connector for MongoDB to Hadoop including Hive
  • elasticsearch-hadoop – a connector for Elasticsearch to Hadoop including Hive

Kibana only queries data held in Elasticsearch, which acts as both the data store and the analytics engine. There are various ways to get data into Elasticsearch directly from source but I’ve opted not to do that here, instead bringing it all in via HDFS and Hive. I’ve done that because my – albeit fairly limited – experience is that Elasticsearch is great once you’ve settled on your data and schema, but in the same way I’m not building a full OBIEE metadata model (RPD) yet, nor did I want to design my Elasticsearch schema up front and have to reload from source if it changed. Options for reprocessing and wrangling data once in Elasticsearch seem limited and complex, and by making all my data available through Hive first I could supplement it and mash it up as I wanted, loading it into Elasticsearch only when I had a chunk of data to explore. Another approach that I haven’t tried but could be useful if the requirement fits it would be to load the individual data elements directly into their own Elasticsearch area and then using the elasticsearch-hadoop connector run the required mashups with other data through Hive, loading the results back into Elasticsearch. It all depends on where you’re coming from with the data.

Overview

Here’s a diagram of what I’m building:

I’ll explain it in steps as follows:

  1. Loading the data and making it accessible through Hive
  2. Loading data from Hive to Elasticsearch
  3. Visualising and analysing data in Kibana

Getting the data into Hive

Strictly speaking we’re not getting the data into Hive, so much as making it available through Hive. Hive simply enables you to define and query tables sitting on top of data held in places including HDFS. The beauty of the Hadoop ecosystem is that you can physicalise data in a bunch of tools and the components will most often support interoperability with each other. It’s only when you get started playing with it that you realise how powerful this is.

The Apache log files and WordPress metadata suit themselves fairly well to a traditional RDBMS format of [de]normalised tables, so we can store them in HDFS with simple RDBMS tables defined on top through Hive. But the twitter data comes in JSON format (like this), and if we were going to store the Twitter data in a traditional RDBMS we’d have to work out how to explode the document into a normalised schema, catering for varying structures depending on the type of tweet and data payload within it. At the moment we just want to collect all the data that looks useful, and then look at different ways to analyse it afterwards. Instead of having to compromise one way (force a structure over the variable JSON) or another (not put a relational schema over obviously relational data) we can do both, and decide at run-time how to best use it. From there, we can identify important bits of data and refactor our design as necessary. This “schema on read” approach is one of the real essences of Hadoop and ‘big data’ in general.

So with that said, let’s see how we get the data in. This bit is the easy part of the article to write, because a lot of it is pretty much what Mark Rittman has already written up in his articles, so I’ll refer to those rather than duplicate here.

Apache log data

References:

I’ve used a variation on the standard Apache log SerDe that the interwebs offers, because I’m going to need to work with the timestamp quite closely (we’ll see why later) so I’ve burst it out into individual fields.

The DDL is:

CREATE EXTERNAL TABLE apachelog (
host STRING,    identity STRING,    user STRING,
time_dayDD STRING,  time_monthMMM STRING,   time_yearYYYY STRING,
time_hourHH STRING, time_minmm STRING,  time_secss STRING,  time_tzZ STRING,
http_call STRING,   url STRING, http_status STRING, status STRING,  size STRING,    referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(\\d{2})\\/(\\w{3})\\/(\\d{4}):(\\d{2}):(\\d{2}):(\\d{2}) (.*?)\\] \\\"(\\w*) ([^ ]*?)(?:\\/)? ([^ \\\"]*)\\\" (\\d*) (\\d*) \\\"(.*?)\\\" \\\"(.*?)\\\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s")
STORED AS TEXTFILE LOCATION '/user/oracle/apache_logs';

The EXTERNAL is important on the table definition as it stops Hive moving the HDFS files into its own area on HDFS. If Hive does move the files it is annoying if you want to also access them through another program (or Hive table), and downright destructive if you DROP the table since it’ll delete the HDFS files too – unless it’s EXTERNAL. Note the LOCATION must be an HDFS folder, even if it just holds one file.
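
To illustrate the point, here’s the kind of quick check you can do – drop the table from Hive, and the files on HDFS are untouched precisely because the table was declared EXTERNAL:

hadoop fs -ls /user/oracle/apache_logs
# now run DROP TABLE apachelog; in Hive (e.g. through beeline), then list the folder again
hadoop fs -ls /user/oracle/apache_logs
# the log files are still there - only the table definition was removed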

For building and testing the SerDe regex, Rubular is most excellent, but note that it’s Java regex you’re specifying in the SerDe, which has its differences from the Python or Ruby regex that Rubular (and most other online regex testers) support. For the final validation of Java regex I use the slightly ugly but still useful regexplanet, which also gives you the fully escaped version of your regex that you’ll need to use for the actual Hive DDL/DML.

A sample row from the apache log on disk looks like this:

74.208.161.70 - - [12/Oct/2014:03:47:43 +0000] "GET /2014/09/sunday-times-tech-track-100/ HTTP/1.0" 301 247 "-" "-"

and now in Hive:

0: jdbc:hive2://bigdatalite:10000> !outputformat vertical
0: jdbc:hive2://bigdatalite:10000> select * from apachelog limit 1;
host           74.208.161.70
identity       -
user           -
time_daydd     12
time_monthmmm  Oct
time_yearyyyy  2014
time_hourhh    03
time_minmm     47
time_secss     43
time_tzz       +0000
http_call      GET
url            /2014/09/sunday-times-tech-track-100/
http_status    HTTP/1.0
status         301
size           247
referer        -
agent          -

Twitter data

Reference:

The twitter data we’ve got includes the Hive ARRAY datatype for the collections of hashtag(s) and referenced url(s) from within a tweet. A point to note here is that the author_followers data appears in different locations of the JSON document depending on whether it’s a retweet or not. I ended up with two variations of this table and a UNION on top.

The table is mapped on data held in MongoDB and as with the HDFS data above the EXTERNAL is crucial to ensure you don’t trash your data when you drop your table.

CREATE EXTERNAL TABLE tweets
(
id string,
url string,
author string,
content string,
created_at string,
hashtags ARRAY<string>,
referenced_urls ARRAY<string>,
sentiment STRING,
author_handle string,
author_id string,
author_followers string,
author_friends string
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","url":"interaction.interaction.link","author":"interaction.interaction.author.name","content":"interaction.interaction.content","created_at":"interaction.interaction.created_at","hashtags":"interaction.interaction.hashtags","referenced_urls":"interaction.links.url","sentiment":"interaction.salience.content.sentiment","author_handle":"interaction.interaction.author.username","author_id":"interaction.interaction.author.id","author_followers":"interaction.twitter.user.followers_count","author_friends":"interaction.twitter.user.friends_count"}')
TBLPROPERTIES('mongo.uri'='mongodb://bigdatalite.localdomain:27017/rm_tweets.rm_tweets')
;

The other point to note is that we’re now using mongo-hadoop for Hive to connect to MongoDB. I found that I had to first build the full set of jar files by running ./gradlew jar -PclusterVersion='cdh5', and also download the MongoDB java driver, before copying the whole lot into /usr/lib/hadoop/lib (a sketch of those steps follows the listing below). This is what I had by the end of it:

[oracle@bigdatalite mongo-hadoop-r1.3.0]$ ls -l /usr/lib/hadoop/lib/mongo-*
-rw-r--r--. 1 root root 105446 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-core-1.3.0.jar
-rw-r--r--. 1 root root  21259 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-hive-1.3.0.jar
-rw-r--r--. 1 root root 723219 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-pig-1.3.0.jar
-rw-r--r--. 1 root root    261 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-r1.3.0.jar
-rw-r--r--. 1 root root 697644 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-streaming-1.3.0.jar
-rw-r--r--. 1 root root 591189 Oct 24 00:44 /usr/lib/hadoop/lib/mongo-java-driver-2.12.4.jar
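
For reference, the steps that produced that set of files look roughly like this. Treat it as a sketch rather than a recipe: the location gradle writes the jars to, and where you’ve downloaded the MongoDB java driver, will vary, so the paths in the copy commands are illustrative only.

cd mongo-hadoop-r1.3.0
./gradlew jar -PclusterVersion='cdh5'
# copy the built mongo-hadoop jars (wherever gradle has written them) into the Hadoop lib folder
sudo find . -name 'mongo-hadoop*.jar' -exec cp {} /usr/lib/hadoop/lib/ \;
# plus the separately-downloaded MongoDB java driver (illustrative path)
sudo cp ~/Downloads/mongo-java-driver-2.12.4.jar /usr/lib/hadoop/lib/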

After all that, the data as it appears in Hive looks like this:

id                5441097d591f90cf2c8b45a1
url               https://twitter.com/rmoff/status/523085961681317889
author            Robin Moffatt
content           Blogged: Using #rlwrap with Apache #Hive #beeline for improved readline functionality http://t.co/IoMML2UDxp
created_at        Fri, 17 Oct 2014 12:19:46 +0000
hashtags          ["rlwrap","Hive","beeline"]
referenced_urls   ["http://www.rittmanmead.com/2014/10/using-rlwrap-with-apache-hive-beeline-for-improved-readline-functionality/"]
sentiment         4
author_handle     rmoff
author_id         82564066
author_followers  790
author_friends    375

For reference, without the mongo-hadoop connectors I was getting the error

Error in loading storage handler.com.mongodb.hadoop.hive.MongoStorageHandler

and with them installed but without the MongoDB java driver I got:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON (state=08S01,code=1)
Caused by: java.lang.ClassNotFoundException: com.mongodb.util.JSON

WordPress metadata

WordPress holds its metadata in a MySQL database, so it’s easy to extract out:

  1. Run a query in MySQL to generate the CSV export files, such as:

    SELECT p.ID, p.POST_TITLE,p.POST_DATE_GMT,
           p.POST_TYPE,a.DISPLAY_NAME,p.POST_NAME,
           CONCAT('/', DATE_FORMAT(POST_DATE_GMT, '%Y'), '/', LPAD(
           DATE_FORMAT(POST_DATE_GMT, '%c'), 2, '0'), '/', p.POST_NAME) AS
           generated_url
    FROM   posts p
           INNER JOIN users a
                   ON p.POST_AUTHOR = a.ID
    WHERE  p.POST_TYPE IN ( 'page', 'post' )
           AND p.POST_STATUS = 'publish' 
    into outfile '/tmp/posts.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';

  2. Copy the CSV file to your Hadoop machine, and copy it onto HDFS. Make sure each type of data goes in its own HDFS folder:

    hadoop fs -mkdir posts
    hadoop fs -copyFromLocal /tmp/posts.csv posts

  3. Define the Hive table on top of it:

    CREATE EXTERNAL TABLE posts 
    ( post_id string,title string,post_date string,post_type string,author string,url string ,generated_url string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
    "input.regex" = "^(\\d*),\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\"",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s")
    location '/user/oracle/posts'
    ;

Rinse & repeat for the category data, and post->category relationships.

The data once modelled in Hive looks like this:

0: jdbc:hive2://bigdatalite:10000> select * from posts limit 1;
post_id        788
title          Blog
post_date      2007-03-07 17:45:07
post_type      page
author         Mark Rittman
url            blog
generated_url  /2007/03/blog

0: jdbc:hive2://bigdatalite:10000> select * from categories limit 1;
category_id    5
cat2_id        5
category_name  category
category_code  BI (General)
catslug        bi

0: jdbc:hive2://bigdatalite:10000> select * from post_cats limit 5;
post_id      8046
category_id  1

The WordPress metadata quite obviously joins together, as it is already from the relational schema in which it was held on MySQL. Here is an example of where "schema on read" comes into play, because you could look at the above three tables (posts / post_cats / categories) and conclude it was redundant to export all three from WordPress, and that instead a single query listing posts and their respective category would be sufficient. But some posts have more than one category, which then leads to a design/requirements decision. Either we retain one row per post – and collapse down the categories, but in doing so lose the ability to easily treat categories as individual data – or have one row per post/category, and end up with multiple rows per post, which complicates matters if we’re doing a simple count of posts. So we bring it all in raw from source, and then decide how we’re going to use it afterwards.

Bringing the data together

At this point I have six tables in Hive that I can query (albeit slowly) with HiveQL, a close relation to SQL with a few interesting differences, running the queries through the Hive client Beeline. The data is tweets, website visits, and details about the blog posts themselves.

0: jdbc:hive2://bigdatalite:10000> show tables;
+------------------------+
|        tab_name        |
+------------------------+
| apachelog              |
| categories             |
| post_cats              |
| posts                  |
| retweets               |
| tweets                 |
+------------------------+

As well as time, the other common element running throughout all the data is the blog article URL, whether it is a post, a visit to the website, or a tweet about it. But to join on it is not quite as simple as you’d hope, because all the following are examples of recorded instances of the data for the same blog post:

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So whether it’s querying the data within Hive, or loading it joined together to another platform, we need to be able to unify the values of this field.

Tangent: RegEx

And now it’s time, if you hadn’t already when building your SerDe against the Apache log file, to really immerse yourself in Regular Expressions (RegEx). Part of the "schema on read" approach is that it can get messy. You need to juggle and wrangle and munge data in ways that it really might not want to, and RegEx is an essential tool with which to do this. Regex isn’t specific to Hadoop – it’s used throughout the computing world.

My journey with regex over quite a few years in computing has gone in stages something like this:

  1. To be a fully rounded geek, I should learn regex. Looks up regex. Hmm, looks complicated….Squirrel!
    1. To be a fully round (geddit?!) geek, I should keep eating these big breakfasts
  2. I’ve got a problem, I’ve got a feeling regex will help me. But my word it looks complicated … I’ll just do it by hand.
  3. I’ve got another problem, I need to find this text in a file but with certain patterns around it. Here’s a regex I found on google. Neat!
  4. Hmmm another text matching problem, maybe I should really learn regex instead of googling it to death each time
  5. Mastered the basic concepts of regex
  6. Still a long way to go…

If you think you’ll nail RegEx overnight, you won’t (or at least, you’re a better geek than me). It’s one of those techniques, maybe a bit like SQL, that to fully grok takes a period of exposure and gradually increasing usage, before you have an "ah hah!" moment. There’s a great site explaining regex here: www.regular-expressions.info. My best advice is to take a real example text that you want to work with (match on, replace bits of, etc), and stick it in one of the online parsers mentioned above (Rubular, regexplanet, and the like) and experiment with the code.

Oh and finally, watch out for variations in regex – what works in a Java-based program (most of the Hadoop world) may not work in Python, and vice versa. The same goes for PHP, Ruby, and so on – they all have different regex engines that may or may not behave as you’d expect.

Back on track : joining data on non-matching columns

So to recap, we want to be able to analyse our blog data across tweets, site hits and postings, using the common field of the post URL, which from the various sources can look like any of the following (and more):

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So out comes the RegEx. First off, we’ll do the easy one – strip the http:// and server bit. Using the Hive function REGEXP_REPLACE we can use this in the query:

regexp_replace(ref_url,'http:\\/\\/www.rittmanmead.com','')

This means, take the ref_url column and if you find http://www.rittmanmead.com then replace it with nothing, i.e. delete it. The two backslashes before each forward slash simply escape them, since a forward slash on its own has a special meaning in regex. Just to keep you on your toes – Java regex requires double backslash escaping, but all other regex (including the online parser I link to below) uses a single one.

So now our list possible join candidates has shrunk by one to look like this:

/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

The variation as you can see is whether there is a trailing forward slash (/) after the post ‘slug’, and whether there is additional cruft after that too (feed, foobar+foobar, etc). So let’s build it up a piece at a time. On each one, I’ve linked to an online parser that you can use to see it in action.

  1. We’ll match on the year and month (/2014/01/) because they’re fixed pattern, so using \d to match on digits and {x} to match x repetitions: (see example on Rubular.com)

    \/\d{4}\/\d{2}\/

    This will match /2014/01/.

  2. Now we need to match the slug, but we’re going to ditch the forward slash suffix if there is one. This is done with two steps.

    First, we define a “match anything except x” group, which is what the square brackets (group) and the caret ^ (negate) do, and in this case x is the forward slash character, escaped.

    Secondly, the plus symbol + tells regex to match at least one repetition of the preceding group – i.e. any character that is not a forward slash. (example)

    [^\/]+

    Combined with the above regex from the first step we will now match /2014/01/automated-regression-testing-for-obiee.

  3. The final step is to turn the previous REGEXP_REPLACE on its head: instead of replacing the content from the string that we don’t want, we’ll extract the content that we do want, using a regex capture group, which is defined by regular brackets (parentheses, just like these). We’ve now brought in a couple of extra bits to make it hang together, seen in the completed regex here:

    \S*(\/\d{4}\/\d{2}\/[^\/]+).*$

    1. The \S* at the beginning means match any number of non-whitespace characters, which takes the place of the regex replace we were previously doing to strip out the http://www.rittmanmead.com
    2. After the capture group, which is the content from steps one and two above surrounded by parentheses (\/\d{4}\/\d{2}\/[^\/]+), there is a final .* to match anything else that might be present (eg trailing forward slash, feed, foobar, etc)

    Now all we need to do is escape it for Java regex, and stick it in the Hive REGEXP_EXTRACT function, specifying 1 as the capture group number to extract (a quick command-line check of the pattern follows after this list): (example)

    regexp_extract(url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)
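
Before committing the pattern to the Hive DDL it’s worth running it over the sample URLs from above. This is just a quick sketch using perl, whose regex engine is close enough to Java’s for this particular pattern (note the single rather than double backslash escaping outside of Hive):

echo 'http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar' | \
perl -ne 'print "$1\n" if /\S*(\/\d{4}\/\d{2}\/[^\/]+).*/'
# every line comes out as /2014/01/automated-regression-testing-for-obiee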

So now all our URLs will look like this, regardless of whether they’re from tweet data, website hits, or wordpress:

/2014/01/automated-regression-testing-for-obiee

Which is nice, because it means we can use it as the common join in our queries. For example, to look up the title of the blog post that someone has tweeted about, and who wrote the post:

SELECT
x.author AS tweet_author, 
x.tweet ,
x.tweet_url, 
x.created_at, 
p.author as post_author, 
p.title as post_title
FROM            ( 
SELECT 'tweets' , 
t.url AS tweet_url , 
t.author , 
t.content AS tweet , 
t.created_at ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
FROM   tweets t 
LATERAL VIEW EXPLODE (referenced_urls) refs as ref_url 
WHERE  t.author_followers IS NOT NULL 
AND    ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*' ) x 
INNER JOIN posts p 
ON regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url ;

[...]
tweet_author  Dain Hansen
tweet         Like a Big Data kid in a Hadoop candy store: Presos on #bigdata for BI, DW, Data Integration http://t.co/06DLnvxINx via @markrittman
tweet_url     https://twitter.com/dainsworld/status/520463199447961600
created_at    Fri, 10 Oct 2014 06:37:51 +0000
post_author   Mark Rittman
post_title    Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations

tweet_author  Robin Moffatt
tweet         Analyzing Twitter Data using Datasift, MongoDB and Pig http://t.co/h67cd4kJo2 via @rittmanmead
tweet_url     https://twitter.com/rmoff/status/524197131276406785
created_at    Mon, 20 Oct 2014 13:55:09 +0000
post_author   Mark Rittman
post_title    Analyzing Twitter Data using Datasift, MongoDB and Pig
[...]

Note here also the use of LATERAL VIEW EXPLODE () as a way of denormalising out the Hive ARRAY of referenced url(s) in the tweet so there is one row returned per value.
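
If you want to see what the EXPLODE is doing in isolation, a quick query against the tweets table shows the fan-out of one row per referenced URL. A sketch, run from the shell via beeline:

beeline -u jdbc:hive2://bigdatalite:10000 -e "select t.url, ref_url from tweets t lateral view explode(referenced_urls) refs as ref_url limit 3;"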

Summary

We’ve got our three sources of data available to us in Hive, and can query across them. Next we’ll take a look at loading the data into Elasticsearch, taking advantage of our conformed url column to join data that we load. Stay tuned!

Categories: BI & Warehousing

UKOUG Partner of the Year Awards

Mon, 2014-11-03 10:23

A few days ago Rittman Mead won 5 awards at the UKOUG Partner of the Year Awards.

  • Business Intelligence Partner of the Year (Silver)
  • Training Partner of the Year (Silver)
  • Managed Services (Outsourcing and Operations) Partner of the Year (Silver)
  • Emerging (New Products) Partner of the Year (Silver)
  • Operating Systems, Storage and Hardware Partner of the Year (Gold)

We have consistently done well at this event and feel the awards reflect the effort we put into both our clients and the user community.

The number and diversity of the awards demonstrates how much Rittman Mead has grown over the years. Enterprise Business Intelligence is still at the heart of the work we do; however, awards for Operating Systems, Storage and Hardware Partner of the Year and Emerging (New Products) Partner of the Year show the investment we have made in the engineered systems, big data and cloud markets.

Likewise the Training Partner of the Year and Managed Services (Outsourcing and Operations) Partner of the Year awards show the end to end services we now offer.

I would like to say thank you very much to everyone who voted for us and most of all thanks to all our staff who have put in the effort to make this happen.

Categories: BI & Warehousing

Oracle BI Cloud Service for SaaS Application Reporting Part 1: Integrating BICS to Salesforce.com using REST APIs

Mon, 2014-11-03 01:30

Last month Mark Rittman covered the Oracle BI Cloud Service (BICS) in a series of posts, aimed at departmental users who want the power of OBIEE 11g without the need to stand up their own infrastructure. If you’re coming in late, here’s the link to the series.

Before the GA, Rittman Mead participated in the beta program for release one of Oracle’s Business Intelligence Cloud Service (BICS); the overall aim of the beta was both to understand the capabilities and to identify potential use cases of the new BICS platform. As Mark wrote, an excellent use case for BICS is to report on top of any SaaS application that exposes its data (stored in the cloud) using REST APIs, by taking advantage of the ApEx capabilities hosted in the Database Schema Service that comes with BICS. SaaS applications like Oracle’s Fusion CRM, Taleo or Salesforce.com – the latter being the one used during the beta program – can be easily integrated and queried with BICS.

The technical goal of our beta program has been to check the features, options and limitations of Oracle BICS by connecting it to a Salesforce.com instance, accessing the data exposed through REST APIs by using the ApEx native functions,  storing the data in the Database Schema Service, creating the Repository using the new Model Editor and showing the data in dashboard by keeping the same data security settings configured in the source platform.

Salesforce.com to Oracle BI Cloud Service

This series of posts is going to explain all the details of our successful PoC. Over the next few days we’ll be covering the following topics, and we’ll update the list with hyperlinks once the articles are published:

  • Oracle BI Cloud Service for SaaS Application Reporting Part 1: Integrating BICS to Salesforce.com using REST APIs
  • Oracle BI Cloud Service for SaaS Application Reporting Part 2: BICS ApEx components
  • Oracle BI Cloud Service for SaaS Application Reporting Part 3: BICS Repository and Front-end configurations

In the first post of the series we’re going to define the steps required to set up a Salesforce.com demo environment that can be accessed with REST APIs. Later in the post we’ll analyse the most interesting Salesforce.com REST APIs that are used to extract the data from the platform. Before digging into the BICS-Salesforce.com integration, some definitions may be needed:

  • RESTful APIs: A Web service API can be defined as RESTful if it conforms to the REST architectural constraints. Some more details about RESTful APIs can be found here.
  • Salesforce.com: Salesforce.com is a company specialised in software as a service (SaaS) and one of the major providers of CRM in the cloud. In addition to CRM, Salesforce.com offers Force.com, a cloud platform as a service (PaaS) that developers can use in order to build multitenant applications hosted on Salesforce.com servers. All the Force.com applications can be accessed by using RESTful APIs.

Environments Setup

In a new BI project it is never safe to directly extract data from a live (production) environment. Thus the first step needed in order to test the BICS-Salesforce.com connection is to obtain a Salesforce.com environment that we can use without the risk of slowing it down, or even worse crashing it and impacting users. A free demo Salesforce.com platform can be obtained by subscribing to the developers program and can be customised as needed. After requesting the Salesforce.com demo environment, a security token needs to be created; the token is associated with the user invoking the RESTful API. To generate the token: log in to Salesforce.com with the specific user -> click on the username -> My Settings -> Reset My Security Token. The new token will then be sent to the user’s email.

Salesforce Reset token Procedure

The third step in order to access Salesforce.com data with REST APIs is to define a Connected App. The creation of a Connected App generates a Consumer Key and a Consumer Secret that are going to be used later during the REST login calls. Once the Connected App is defined, the last bit missing is to populate the Salesforce.com instance, since it’s empty by default; it can be populated manually or by writing population scripts based on the create REST API statements explained here. An Oracle BICS instance is needed in order to analyse the Salesforce.com data; all the info regarding Oracle BICS and how to activate an instance can be found at this link. Having covered all the basics, it’s time to start analysing how the Salesforce.com data can be extracted.

Salesforce.com REST APIs

In the following sections we are going to analyse some of the REST APIs used in our process. The Salesforce.com REST APIs will be called with cURL commands; cURL is a command line tool and library for transferring data with URL syntax.

Salesforce.com Authentication

The first step in order to download the data from any Salesforce.com instance is to perform the authentication and retrieve the access token; the access token will then be used in all the following REST calls. There are various authentication mechanisms that can be used against Salesforce.com; for the aim of the beta program we used the one called "Username-Password OAuth Authentication Flow", which is described in detail here and in the image below.

Salesforce.com User Password OAuth

The REST API authentication command is

curl -v https://login.salesforce.com/services/oauth2/token -d "grant_type=password" \
    -d "client_id=CLIENT_ID" -d "client_secret=CLIENT_SECRET" \
    -d "username=USERNAME" -d "password=PASSWORD_TOKEN"

where the following parameters must be assigned:

  • CLIENT_ID:  The client ID generated during the creation of the Connected App (step described in Basic Setup)
  • CLIENT_SECRET: The client Secret generated during the creation of the Connected App (step described in Basic Setup)
  • USERNAME: The username you want to use in order to access Salesforce.com data
  • PASSWORD_TOKEN: The concatenation of password and user security token (e.g. if the password is ABC and the security token is 123 then PASSWORD_TOKEN is ABC123)

The response (if all the parameters are correct) should be like the following

{  
   "id":"https://login.salesforce.com/id/00Dx0000000BV7z/005x00000012Q9P",
   "issued_at":"1278448832702",
   "instance_url":"https://na1.salesforce.com",
   "signature":"0CmxinZir53Yex7nE0TD+zMpvIWYGb/bdJh6XfOH6EQ=",
   "access_token":"00Dx0000000BV7z!AR8AQAxo9UfVkh8AlV0Gomt9Czx9LjHnSSpwBMmbRcgKFmxOtvxjTrKW19ye6PE3Ds1eQz3z8jr3W7_VbWmEu4Q8TVGSTHxs"
}

where the most interesting parameters, which we’ll want to hold on to for the subsequent calls (as sketched below), are:

  • instance_url: defines the salesforce instance to use.
  • access_token: defines the time-limited security token to use for all the following calls.
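
Since both values are needed by every call that follows, it’s convenient to capture them straight into shell variables. A minimal sketch, assuming jq is installed and that CLIENT_ID, CLIENT_SECRET, USERNAME and PASSWORD_TOKEN have been set as described above:

RESPONSE=$(curl --silent https://login.salesforce.com/services/oauth2/token -d "grant_type=password" \
    -d "client_id=$CLIENT_ID" -d "client_secret=$CLIENT_SECRET" \
    -d "username=$USERNAME" -d "password=$PASSWORD_TOKEN")
INSTANCE=$(echo "$RESPONSE" | jq -r '.instance_url')
TOKEN=$(echo "$RESPONSE" | jq -r '.access_token')
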
Salesforce.com List of Objects and Retrieve Metadata for an Object

Two of the Salesforce.com REST APIs were very useful when trying to build a general ETL that could be applied to any customised Salesforce.com instance:

  • Get list of Objects: retrieves the list of all objects available in Salesforce.com (custom or not) with some additional information for each object.
  • Retrieve Metadata for an Object: retrieves the list of columns for the selected object with related data types and additional information.

The Get list of Objects call is the following

curl INSTANCE/services/data/VERSION/sobjects/ -H "Authorization: Bearer TOKEN"

where

  • INSTANCE is the instance_url parameter retrieved from the authentication call.
  • TOKEN is the access_token parameter retrieved from the authentication call.
  • VERSION is the REST APIs version used, in our beta test we used v29.0.

The response should be similar to the following, in which you can see, for each object (identified by the "name" field), all the metadata available; a jq one-liner to pull out just the object names follows after the example.

{  
   "encoding":"UTF-8",
   "maxBatchSize":200,
   "sobjects":[  
      {  
         "name":"Account",
         "label":"Account",
         "keyPrefix":"001",
         "labelPlural":"Accounts",
         "custom":false,
         "layoutable":true,
         "activateable":false,
         "urls":{  
            "sobject":"/services/data/v26.0/sobjects/Account",
            "describe":"/services/data/v26.0/sobjects/Account/describe",
            "rowTemplate":"/services/data/v26.0/sobjects/Account/{ID}"
         },
         "searchable":true,
         "updateable":true,
         "createable":true,
         "deprecatedAndHidden":false,
         "customSetting":false,
         "deletable":true,
         "feedEnabled":true,
         "mergeable":true,
         "queryable":true,
         "replicateable":true,
         "retrieveable":true,
         "undeletable":true,
         "triggerable":true
      },
      ...
   ]
}
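
Since this JSON gets large on a real instance, a small jq filter (an assumption on my part, rather than something the integration strictly needs) is handy for listing just the object names, which can then be fed one at a time into the metadata call that follows:

curl --silent INSTANCE/services/data/VERSION/sobjects/ -H "Authorization: Bearer TOKEN" | jq -r '.sobjects[].name'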

The Retrieve Metadata of an Object call can be made for each object listed in the Get list of Objects response; the code is the following:

curl INSTANCE/services/data/VERSION/sobjects/OBJECT/ \
    -H "Authorization: Bearer TOKEN"

where:

  • INSTANCE, TOKEN and VERSION are the same parameters defined for the Get list of Objects call
  • OBJECT is the Salesforce.com object to retrieve

The response for the Account object is similar to the following

{  
   "objectDescribe":{  
      "name":"Account",
      "updateable":true,
      "label":"Account",
      "keyPrefix":"001",
      ...  
          
        "replicateable":true,
      "retrieveable":true,
      "undeletable":true,
      "triggerable":true
   },
   "recentItems":[  
      {  
         "attributes":{  
            "type":"Account",
            "url":"/services/data/v20.0/sobjects/Account/001D000000INjVeIAL"
         },
         "Id":"001D000000INjVeIAL",
         "Name":"asdasdasd"
      },
      ...
   ]
}

Interesting parameters for each column are:

  • name: the column’s name.
  • retrievable: if true it means that the particular column can be part of a query
  • soapType and length or byteLength: these fields provide the field type information and the length of the field.

With the three REST APIs analysed we could recreate all the objects (custom or not) in any Salesforce.com (or Force.com) instance as tables in the Oracle Database Schema Service. In the next section we will see how to extract the data from the list of objects by using the Salesforce.com query capabilities.

Salesforce.com query

Salesforce provides two methods of querying the objects: including or excluding deleted objects. The difference in the code is minimal: the first uses the /queryAll suffix while the second uses /query. A Salesforce.com REST API query call is:

curl INSTANCE/services/data/VERSION/query/?q=SELECT+LIST_OF_COLUMNS+from+OBJECT \
    -H "Authorization: Bearer TOKEN"

where:

  • INSTANCE and TOKEN parameters are the ones retrieved from the Authentication call
  • VERSION is the Salesforce.com REST API version used
  • OBJECT is the Salesforce.com object to query
  • LIST_OF_COLUMNS is the comma delimited list of columns to retrieve

The response of a query “select+Name+from+Account” is:

{  
   "done":true,
   "totalSize":14,
   "records":[  
      {  
         "attributes":{  
            "type":"Account",
            "url":"/services/data/v20.0/sobjects/Account/001D000000IRFmaIAH"
         },
         "Name":"Test 1"
      },
      {  
         "attributes":{  
            "type":"Account",
            "url":"/services/data/v20.0/sobjects/Account/001D000000IomazIAB"
         },
         "Name":"Test 2"
      },
      ...
   ]
}

In the JSON result you can find:

  • totalSize: the number of records retrieved
  • Name: for each record the name of the Account

If the resultSet is too big, Salesforce.com will start paging results. If the initial query returns only part of the results, the end of the response will contain a field called nextRecordsUrl. For example:

"nextRecordsUrl" : "/services/data/v20.0/query/01gD0000002HU6KIAW-2000"

In order to get the next page of data a call like the following is needed, passing the access token as a parameter (a sketch of a small loop that follows the paging automatically is shown after the call).

curl INSTANCE/services/data/v20.0/query/01gD0000002HU6KIAW-2000 \
    -H "Authorization: Bearer TOKEN"

Salesforce.com limits

It’s important to be aware of the limit set by Salesforce.com on the number of REST API calls per day. In order to check at any time the remaining number of calls available execute the following code.

curl INSTANCE/services/data/VERSION/limits/ -H "Authorization: Bearer TOKEN" -H "X-PrettyPrint:1"

The DailyApiRequests -> Remaining field contained in the response shows the number of calls still available (it can also be pulled out directly with jq, as sketched after the example).

{
    "DailyApiRequests":
    {
        "Remaining":"4980",
        "Max":"5000"
    },
    "DailyAsyncApexExecutions":
    {
        "Remaining":"250000",
        "Max":"250000"
    },
    ...
}
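
If you want just the number, for example to guard an extraction script against hitting the limit, a quick jq filter (again an assumption of mine, not something the beta required) does the job:

curl --silent INSTANCE/services/data/VERSION/limits/ -H "Authorization: Bearer TOKEN" -H "X-PrettyPrint:1" | jq -r '.DailyApiRequests.Remaining'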

Resultset output format

Salesforce.com’s default output format is JSON, which is a common format for most cloud applications. However, there is also the option to retrieve the result in XML format by setting the HTTP ACCEPT header to "application/xml". This first release of the tool is based on Oracle DB 11g, which supports XML natively but not JSON. For this reason, during our beta program with Oracle BI Cloud Service we decided to use the XML output. Once BICS is bundled with Oracle DB 12c, which supports both JSON and XML natively, the resultset output format could be kept as the default JSON.

In this post we defined the steps required in order to set up a Salesforce.com demo environment that can be accessed with REST APIs, and looked at the most interesting Salesforce.com REST APIs that we used to extract the data in order to analyse it with the Oracle BI Cloud Service platform. In the next post we will look in more detail at the BICS ApEx part of it, analysing how the Salesforce.com REST APIs can be called from ApEx.

Categories: BI & Warehousing

OBIEE How-To: A View Selector for your Dashboard

Wed, 2014-10-29 20:00

A common problem report developers face is user groups having different needs and preferences, and as a consequence these user groups want to see their data presented in different ways. A classic example is some users preferring to see a graph while others want a table. So, how do we do this? It’s a no brainer… we use a view selector. View selectors give us a great amount of flexibility by allowing us to swap out one analysis view for another. You might even take it a step further and use a view selector to swap out an entire compound layout for another one, giving the user an entirely different set of views to look at. Truly powerful stuff, right?

But view selectors do have one limitation… they’re only available at the analysis level. What if you wanted this selector functionality at the dashboard level so that you could swap out an analysis from one subject area for one from a different subject area? Or what if you wanted to be able to switch one dashboard prompt for another one? You’re out of luck, it’s just not possible…

Just kidding… of course it’s possible. As it turns out, it’s fairly straightforward to build your own dashboard level view selector using other objects already provided by OBIEE out-of-the-box.

Create a dashboard variable prompt to drive the content. We need a way for the users to select the view they want to see. View selectors have a built in dropdown prompt to accomplish this at the analysis level. To do this at the dashboard level we’re going to use a dashboard prompt.

So, the first step is to create a new dashboard prompt object and add a variable prompt. You can name the variable whatever you wish, for this example we’re just going to call it P_SECTION. You can set the User Input to whatever you want, but it’s important that only one option is selected at a time… multiple values should not be allowed. Let’s set the user input to “Choice List” and add some custom values.

What you name these custom values isn’t important but the labels should be descriptive enough so that the users understand the different options. Just keep in mind, the values you use here will need to exactly match the analysis we create in the next step. For this example, let’s use ‘Section1′, ‘Section2′, and ‘Section3′ to keep things simple.

JFB - View Selector - P_SECTION Prompt

 Create an analysis to drive the conditional logic. We need to create an analysis that will return a set number of rows for each of the options in the prompt we just created. The number of rows returned then drives which section we see on the dashboard.

Ultimately, the logic of this analysis doesn’t matter, and there are a dozen ways to accomplish this. To keep things simple, we’re just going to use CASE statements. While not an elegant solution, it’ll work just fine for our basic example.

Add three columns to the criteria, we’ll use a Time dimension and modify the column formula with the following CASE statements. Make sure that the text strings match the Custom Values used in the prompt.

CASE WHEN "Time"."T05 Per Name Year" IN ('2006') THEN 'Section1' END

CASE WHEN "Time"."T05 Per Name Year" IN ('2006', '2007') THEN 'Section2' END

CASE WHEN "Time"."T05 Per Name Year" IN ('2006', '2007', '2008') THEN 'Section3' END

JFB - View Selector - Table

Now we need to update the filter so that the appropriate rows are shown based upon what the user selects. Basically, we need the request to return 1, 2, or 3 rows based upon our P_SECTION presentation variable.

For our example we’re going to create a filter for each of the options and set them equal to the presentation variable we created earlier in our dashboard prompt. Only one filter will be true at a time so the operator between these filters has been set to OR. Also you’ll notice that the default value for the presentation variable has been set to ‘Section1′, across the board. If, for whatever reason, the P_SECTION variable isn’t set we want the dashboard to default to the first section.

JFB - View Selector - Filter

CASE WHEN "Time"."T05 Per Name Year" IN ('2006') THEN 'Section1' END is equal to / is in @{P_SECTION}{Section1}
OR CASE WHEN "Time"."T05 Per Name Year" IN ('2006', '2007') THEN 'Section2' END is equal to / is in @{P_SECTION}{Section1}
OR CASE WHEN "Time"."T05 Per Name Year" IN ('2006', '2007', '2008') THEN 'Section3' END is equal to / is in @{P_SECTION}{Section1}

So, let’s quickly walk through how this works. The end user selects ’Section1’ from the dashboard prompt. That selection is stored in our P_SECTION presentation variable, which is then passed to and used by our filter. With ‘Section1’ selected only the 1st line of the filter will hold true which will result in a single row returned. When ‘Section2’ is chosen, the second row of the filter is true which returns two rows, and so on.

We’re almost done, in the next step we’ll create some conditions on the individual dashboard sections and put it all together.

Create sections and set some conditions. We just need to create our sections and set some conditions so that they are shown/hidden appropriately. Create a new dashboard page. Edit the dashboard page and drag three empty sections on to the page. Create a condition on the first section using the Analysis created in the last step. The first condition we need to create should be True If Row Count is equal to 1.

JFB - View Selector - Condition

Are you beginning to see how this is going to work? The only time we’ll get a single row back is when the presentation variable is set to ‘Section1’. When P_SECTION is set to ‘Section2’ we’ll get two rows back from our analysis. Go ahead and create a second condition that is True If Row Count is equal to 2 for section 2. For section 3 create a condition that’s True If Row Count is equal to 3.

JFB - View Selector - Dashboard Editor

Since we aren’t adding content to these sections, you’ll want to make sure to enable the option to “Show Section Title” or add a couple text fields so that you can easily identify which section is rendered on the page. Lastly, drag the dashboard prompt onto the page. Save the dashboard page and let’s take a look.

When the page first renders, you should see something similar to the following screenshot. The prompt is set to ‘Section1’ and sure enough, Section 1 appears below it. If you change the selection to ‘Section2’ or ‘Section3’ and hit apply, Section 1 will be hidden and the corresponding content will appear. All that’s left now would be to go back and add content to the sections.

JFB - View Selector - Result

So, using only out-of-the-box features, we were able to create an extremely versatile and dynamic bit of functionality… and all it took was a dashboard prompt, an analysis to hold our conditional logic, and some sections and conditions.

This approach is just another tool that you can use to help deliver the dynamic content your users are looking for. It provides flexibility within the context of a single dashboard page and also limits the need to navigate (and maintain) multiple pages. Admittedly, the example we just walked through isn’t all that exciting, but hopefully you can see the potential.

Some of your users want a minimalist view allowing them to filter on just the basics, while others want to slice and dice by everything under the sun? Create two prompts, a basic and an advanced, and allow the users to switch between the two.

JFB - View Selector - BasicAdv

Want to pack a large number of charts into a page while still minimizing scrolling for those poor souls working with 1024×768? No problem, have a low-res option of the dashboard.

JFB - View Selector - LowRes

The finance department wants to see a dashboard full of bar charts, but the payroll department is being totally unreasonable and only wants to see line graphs? Well, you get the idea…

Categories: BI & Warehousing

Monitoring OBIEE with the ELK stack

Tue, 2014-10-21 09:37

Monitoring the health of an OBIEE system and diagnosing problems that may occur is a vital task for the system’s administrator and support staff. It’s one that at Rittman Mead we help customers implement themselves, and also provide as a managed service. In this article I am going to discuss the ELK stack, which fills a specific gap between the high-level monitoring and configuration functionality of Enterprise Manager 11g Fusion Middleware Control, and the Enterprise-grade monitoring, alerting and configuration management of Enterprise Manager 12c Cloud Control.

The ELK stack enables you to rapidly access both summary and detail information across the stack, supporting swift identification and diagnosis of any issues that may occur. The responsive interface lets you drill into time periods or any ad-hoc field or filter as you wish, to analyse and diagnose problems. Data can be summarised and grouped arbitrarily, displaying relative error rates and ensuring that genuine problems are not lost in the ‘noise’ of usual operation.

Out of the box, OBIEE ships with Enterprise Manager 11g Fusion Middleware Control (FMC), which as the name says is part of the Enterprise Manager line of tools from Oracle for managing systems. It is more of a configuration and deployment tool than it is really a monitoring and diagnostics one. The next step up is FMC’s (very) big brother, Enterprise Manager 12c Cloud Control (EM12c). This is very much its own product, requiring its own infrastructure and geared up to monitoring an organisation’s entire fleet of [Oracle] hardware and software. With this greatly enhanced functionality, however, also comes a license cost. The ELK stack conceptually fits perfectly alongside your existing EM FMC, providing a most excellent OBIEE monitoring dashboard and analysis tool, and allowing you to explore the kind of diagnostics and historical data that you could have access to in EM 12c.

In ELK we can see at a glance what kind of relative activity there has been on the system over the past few days:

There have been some errors, and the top three nQS and ORA error codes and messages are shown. This is an important differentiator from EM, where you can search for errors but cannot see straightaway whether an error is a one-off or a recurring occurrence. By grouping by error message it’s possible to quickly see what the biggest problem on a system may currently be:

At this point we might want to drill down into what was being run when the errors were being thrown. For example, from the error summary alone we can see the biggest problem was a locked database account – but which database was being queried? Lower down the dashboard page is a list of log details, and by clicking on the search icon against an error message we can filter the results shown:

We can use the search icon again to restrict results by ECID

And from there see all the related log entries, including which connection pool the request was against (and thus which database account is locked)

Another way of diagnosing a sudden rash of errors would be to instead drill down on time alone to take a more holistic view of the logs (useful also given that ECIDs don’t always give the full picture). Using the system activity timeline along with the events log view it is a piece of cake to do this – simply click and drag a time window on the chart to instantly zoom into it.

Taking a step back up, we can see at a glance which areas of the OBIEE metadata model (RPD) are being used, as well as where we are pulling logs from – and all of these are clickable in order to filter the results further. So it’s easy to see, for a given subject area, what’s the current error rate? Or to quickly access all the log files for a specific set of components alone (for example, BI Server and OPMN). Any field that is displayed, whether in a chart or a detailed log view, can be clicked and used as the input for an ad-hoc filter.

It’s not just errors and logs we can monitor – the current and trending performance of the system (or a part of it; note the filtering by subject area and database described above) can be observed and of course, drilled into:

The ELK stack

The ELK stack is a suite of free software made up of three tools, the first letter of each giving it its name:

  • ElasticSearch
  • Logstash
  • Kibana

At a very high level, we collect and enrich diagnostic data from log files using logstash, store it in ElasticSearch, and present and analyse it through Kibana.

  • ElasticSearch is a document store, in which data with no predefined structure can be stored. Its origins and core strength are in full text search of any of the data held within it, and it is this that differentiates it from pure document stores such as MongoDB that Mark Rittman wrote about recently. Data is loaded and retrieved from ElasticSearch through messages sent over the HTTP protocol (there’s a short curl example of this just after the list), and one of the applications that can send data this way and works extremely well is Logstash.
  • Logstash is an innocuous looking tool that at first glance one could mistakenly write off as “just” a log parser. It does a lot more than that and a healthy ecosystem of input, filter, codec and output plugins means that it can interface between a great variety of applications, shifting data from one to another and optionally processing and enriching it along the way.
  • The final piece of the stack is Kibana, a web application that enables one to build very flexible and interactive time-based dashboards, sourcing data from ElasticSearch. Interestingly, another of my favourite tools that I have written about before – and will write about again in this article – is Grafana, which is forked from Kibana (and modified to source its data from time-series databases like graphite/carbon/whisper or InfluxDB) – thus if you’re at home with one you will be with the other.
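
To give a flavour of that HTTP interface, here’s a quick sketch using curl – the index and type names are arbitrary examples of my own, and nothing in the setup below depends on them:

# index a document
curl -XPOST 'http://localhost:9200/scratch/test' -d '{"message":"hello from curl"}'

# search for it
curl 'http://localhost:9200/scratch/_search?q=message:hello&pretty'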

In this article I’m going to show how to set up your own ELK stack to monitor OBIEE, based on SampleApp v406.

Who is this for?

As you will see below, setting up and configuring the ELK stack does involve rolling up one’s sleeves and diving right in. If you’re looking for an off-the-shelf monitoring solution then you should look elsewhere (such as EM12c). But if you want to have a crack at it I think you’ll be pleasantly surprised at what is possible once you get past the initial (bumpy) learning curve. The capabilities are great, and there’s an active support community as is the case with lots of open-source tools. With a bit of work it is possible to create a monitoring environment tailored pretty much entirely to your design.

Installing the stack

ELK runs on all common Linux distributions (including Oracle Linux), as well as Mac OS. The only prerequisite is a JDK for ElasticSearch and Logstash, and a web server for Kibana; here I am using Apache.

First up, let’s install JDK 1.7 (SampleApp has 1.6, which isn’t enough):

sudo yum install -y java-1.7.0-openjdk.x86_64

Apache is already installed on SampleApp, which we can verify thus:

[oracle@demo ~]$ sudo yum install -y httpd
Loaded plugins: refresh-packagekit
Setting up Install Process
Package httpd-2.2.15-29.0.1.el6_4.x86_64 already installed and latest version
Nothing to do
[oracle@demo ~]$ sudo service httpd status
httpd is stopped

It’s shut down by default, and that’s fine because we need to update its configuration anyway.

ElasticSearch

The easiest way to install ElasticSearch is using the yum repository:

sudo rpm --import http://packages.elasticsearch.org/GPG-KEY-elasticsearch

cat > /tmp/elasticsearch.repo<<EOF
[elasticsearch-1.3]
name=Elasticsearch repository for 1.3.x packages
baseurl=http://packages.elasticsearch.org/elasticsearch/1.3/centos
gpgcheck=1
gpgkey=http://packages.elasticsearch.org/GPG-KEY-elasticsearch
enabled=1
EOF

sudo mv /tmp/elasticsearch.repo /etc/yum.repos.d/
sudo yum install -y elasticsearch

I’d then set it to start at boot automagically:

sudo chkconfig elasticsearch on

and then start it up:

sudo service elasticsearch start

One final, optional, step in the installation is a plugin called kopf which gives a nice web dashboard for looking at the status of ElasticSearch:

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/
cd /usr/share/elasticsearch/bin
sudo ./plugin -install lmenezes/elasticsearch-kopf

logstash

There’s no repository for logstash, but it’s no biggie because there’s no install as such, just a download and unpack. Grab the download archive for logstash from the ELK download page, and then unpack it:

cd ~/Downloads
wget https://download.elasticsearch.org/logstash/logstash/logstash-1.4.2.tar.gz
# Need to use sudo because /opt is owned by root
sudo tar -xf logstash-1.4.2.tar.gz --directory /opt/
sudo mv /opt/logstash-1.4.2/ /opt/logstash/
sudo chown -R oracle. /opt/logstash/

Kibana

As with logstash, Kibana just needs downloading and unpacking. There’s also a wee bit of configuration to do, so that the web server (Apache, in our case) knows to talk to it, and so that Kibana knows how to find ElasticSearch.

cd ~/Downloads/
wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.0.tar.gz
sudo tar -xf kibana-3.1.0.tar.gz --directory /opt
sudo mv /opt/kibana-3.1.0/ /opt/kibana/
sudo chown -R oracle. /opt/kibana/

Now to configure Apache, telling it where to find Kibana. If you have existing sites configured, you’ll need to sort this bit out yourself, but on a vanilla SampleApp v406 you can use the following sed command to set up the needful:

sudo sed -i'.bak' -e 's/DocumentRoot.*$/DocumentRoot "\/opt\/kibana\/"/g' /etc/httpd/conf/httpd.conf

Lastly, Kibana needs to know where to find ElasticSearch, which is where it is going to pull its data from. An important point here is that the URL of ElasticSearch must be resolvable and accessible from the web browser you run Kibana on, so if you are using a DNS name it must resolve etc. You can update the configuration file config.js by hand (it’s the elasticsearch: definition that needs updating), or use this sed command:

sed -i'.bak' -e 's/^\s*elasticsearch:.*$/elasticsearch: "http:\/\/demo.us.oracle.com:9200",/g' /opt/kibana/config.js

Finally, [re]start Apache so that it uses the new configuration:

sudo service httpd restart

You should now be able to point your web browser at the server and see the default Kibana dashboard. So for SampleApp, if you’re running Firefox locally on it, the URL would simply be http://localhost/ (port 80, so no need to specify it in the URL). Note that if you’re doing anything funky with networking, your local web browser needs to be able to hit both Apache (port 80 by default) and ElasticSearch (port 9200 by default).

Configuring ELK end-to-end

Now that we’ve got the software installed, let’s see how it hangs together and create our first end-to-end example. There’s a good logstash tutorial here that covers a lot of the functionality. Here, I’ll just look at some of the very basics, creating a very simple logstash configuration which will prompt for input (i.e. stdin) and send it straight to ElasticSearch. The kopf plugin that we installed above can show that the data made it to ElasticSearch, and finally we will create a very simple Kibana dashboard to demonstrate its use.

Logstash works by reading a configuration file and then running continually waiting for the configured input. As well as the input we configure an output, and optionally in between we can have a set of filters. For now we will keep it simple with just an input and output. Create the following file in /opt/logstash and call it logstash-basic.conf:

input {
        stdin {}
        }
output {
        elasticsearch {}
        }

It’s pretty obvious what it’s saying – for the input, use stdin, and send it as output to elasticsearch (which will default to the localhost).
Run this with logstash:

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/
cd /opt/logstash
bin/logstash -f logstash-basic.conf

After a few moments you should get a prompt. Enter some text, and nothing happens… apparently. What’s actually happened is that the text, plus some information such as the current timestamp, has been sent to ElasticSearch.

Let’s see where it went. In a web browser, go to http://localhost:9200/_plugin/kopf/. 9200 is the port on which ElasticSearch listens by default, and kopf is a plugin we can use to inspect ElasticSearch’s state and data. ElasticSearch has a concept of an index, in which documents, maybe of the same repeating structure but not necessarily, can be stored. Crudely put, this can be seen as roughly analogous to tables and rows of data respectively. When logstash sends data to ElasticSearch it creates one index per day, and in kopf now you should see an index with the current date, with a “document” for each line you entered after running logstash:

ElasticSearch can be queried using HTTP requests, and kopf gives a nice way to construct these and see the results which are in JSON format. Click on the rest (as in, REST API) menu option, leave the request as default http://localhost:9200/_search, and click send. You’ll see in the response pane a chunk of JSON in amongst which are the strings that you’ve entered to logstash:
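
If you’d rather stay at the command line than use kopf, the same REST API can be hit directly with curl – for example (the exact index name you see will reflect the current date):

# list the indices that logstash has created (one per day)
curl 'http://localhost:9200/_cat/indices?v'

# run the same default search, with pretty-printed JSON output
curl 'http://localhost:9200/_search?pretty'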

Enter a few more lines into the logstash prompt, and then head over to http://localhost/ where you should find the default dashboard, and click on the Logstash Dashboard option:

It’s fairly bare, because there’s very little data. Notice how you have a histogram of event rates over the past day at the top, and then details of each event at the bottom. There are two things to explore here. First up, go and enter a bit more data into logstash, so that the events you create are spread out over time. Click the refresh icon on the Kibana dashboard, and then click-drag to select just the period on the chart that has data. This will zoom in on it and you’ll see in greater definition when the events were created. Go and click on one of the event messages in the lower pane and see how it expands, showing the value of each field – including message, which is what logstash sent through from its input to output.

Now let’s get some proper data in, by pointing logstash at the BI Server log (nqserver.log). Create a new configuration file, logstash-obi.conf, and build it up as follows. First we’ll use the file input to get data from …wait for it….a file! The syntax is fairly obvious:

input {
        file {
                path => "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log"
                }
        }

Now we need to tell Logstash how to interpret the file. By default it’ll chuck every line of the log to ElasticSearch, with the current timestamp – rather than the timestamp of the actual event.

Now it’s time to introduce the wonderful world of the grok. A grok is one of the most important of the numerous filter plugins that are available in logstash. It defines expected patterns of content in the input, and maps it to fields in the output. So everything in a log message, such as the timestamp, user, ECID, and so on – all can be extracted from the input and stored as distinct items. They can also be used for further processing – such as amending the timestamp output from the logstash event to that of the log file line, rather than the system time at which it was processed.

So, let us see how to extract the timestamp from the log line. An important part of grok’ing is patterns. Grok statements are written as regular expressions, or regex (obXKCD), so to avoid continual wheel-reinventing of regex statements for common objects (times, IP addresses, etc.) logstash ships with a bunch of these predefined, and you can build your own too. Taking a line from nqserver.log we can see the timestamp matches the ISO 8601 standard:

So our grok will use the pre-defined pattern TIMESTAMP_ISO8601, and then map everything else after the timestamp (“GREEDYDATA”) to the log message field. The timestamp is in square brackets, which I’ve escaped with the backslash character. To indicate that it’s a grok pattern we want to match, it’s enclosed in %{ } markers.

\[%{TIMESTAMP_ISO8601:timestamp}\] %{GREEDYDATA:log_message}

This can be broken down as follows:

\[                                  The opening square bracket, escaped by \
%{TIMESTAMP_ISO8601:timestamp}      Capture an ISO 8601 timestamp, store it in a field called 'timestamp'
\]                                  The closing square bracket, escaped by \
%{GREEDYDATA:log_message}           Capture everything else ('GREEDYDATA' is also a grok pattern) and store it in the 'log_message' field

A grok operator in logstash is part of the filter processing, so we need a new stanza in the configuration file, after input and before output. Note that the grok operator is matching the pattern we built above against the message field, which is pre-populated by default by the input stream. You can grok against any field though.

input {
        file {
                path => "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log"
                }
        }
filter {
        grok {
                match => ["message","\[%{TIMESTAMP_ISO8601:timestamp}\] %{GREEDYDATA:log_message}"]
            }
        }
output {
        elasticsearch {}
        }

Now we can see in the resulting capture that we’ve extracted the timestamp to a field called “timestamp”, with the remainder of the line in “log_message”.

But the special @timestamp field attached to the event stored in ElasticSearch still reflects the time at which logstash read the logfile entry (30th September), rather than when the logfile entry was created (11th June). To fix this, we use a second filter plugin (grok being the first), the date filter:

date {
        match => ["timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSZZ"]
}

This matches the timestamp field that we captured with the grok, and converts it into the special @timestamp field of the event, using the mask specified. Now if we go back to kopf, the ElasticSearch admin tool, we can see that a new index has been created, with the date of the log entry that we just parsed – confirming that the actual date has been correctly extracted:

And over in Kibana if you set the timeframe long enough in the filter at the top you’ll be able to find the log entries showing up, now with the correct timestamp. Note that ElasticSearch accounts for the timezone of the message, storing it in UTC:

A large part of setting up your own ELK stack is this configuration of the filters in order to extract as much information from the logs as you want. Grok filters are just the beginning – you can use conditional blocks to grok certain fields or logs for certain strings (think error messages and ORA codes – see the sketch below), and you can mutate the output to add in your own fields and tags based on what has been read. The point of doing this is that you enrich what is in the logs by giving the different pieces of data meaning, which you can then use to drive the Kibana dashboard and its filtering.
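
As a rough sketch of the kind of thing I mean – the field, tag and pattern names here are my own illustrations rather than anything prescribed – you could pick out ORA error codes into their own field and tag the event accordingly:

filter {
        # only attempt the ORA-code grok on lines that actually contain one
        if [log_message] =~ /ORA-/ {
                grok {
                        match => ["log_message","ORA-%{INT:ora_error_code}"]
                        tag_on_failure => ["_grokparsefailure","grok_ora"]
                }
                mutate {
                        add_tag => ["ora_error"]
                }
        }
}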

Tips for using logstash

To think of the logstash .conf file as merely “configuration” in the same way a simple .ini is would be to underestimate it. In effect you are writing a bit of data transformation code, so brace yourself – but the silver lining is that whatever you want to do, you probably can. Logstash is a most flexible and powerful piece of software, and one in which the model of input, filter, codec and output works very well.

In the spirit of code development, here are some tips you may find useful:

  • Grok patterns can inherit. Study the provided patterns (in the logstash patterns folder), and build your own. Look for commonality across files and look to reuse as much as possible. If you have multiple unrelated patterns for every type of log file in FMW then you’ve done it wrong.
  • When you’re building grok statements and patterns, use the superb http://grokdebug.herokuapp.com. The advantage of this over a standard regex debugger is that it supports the grok syntax of capture groups and even custom patterns.
  • For building up and testing regex from scratch, there are some useful sites:

    On the Mac there is a useful tool called Oyster which does the same as the above sites but doesn’t require an Internet connection.

  • By default a failed grok will add the tag _grokparsefailure to the event. If you have multiple grok clauses, give each its own unique tag_on_failure value so that you can see from the output message which grok statement failed.

    grok {
        match => [  "message", "%{WLSLOG}"
        ]
        tag_on_failure => ["_grokparsefailure","grok03"]
    }

  • You can use # as a comment character, so comment your code liberally, particularly whilst you’re finding your way with the configuration language so you can find your way back from the gingerbread house.
  • grok’ing takes resources, so target your grok statements by using conditionals. For example, to only parse nqquery logs for an expected string, you could write:

    if [path] =~ /nqquery/ {
        # Do your grok here
        }

    NB the =~ denotes a regex match, and the / delimit the regex string.

  • Use a conditional to split the output, writing any grok failures to stdout and/or a file, so that it is easier to see what’s failing and when. Note the use of the rubydebug codec here to prettify output sent to stdout.

    output {
        if "_grokparsefailure" not in [tags] {
            elasticsearch { }
        } else {
            file { path => "unprocessed.out" }
            stdout { codec => rubydebug }
        }
    }

  • Use the multiline codec to work with log messages that span multiple lines, and watch out for newline characters – grok does not like them, so use a mutate gsub to strip them out (see the sketch below).
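
    A rough sketch of both of these together, assuming the same nqserver.log input as earlier (the multiline pattern is illustrative only – it treats any line not starting with a [ timestamp as a continuation of the previous line):

    input {
        file {
            path => "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log"
            codec => multiline {
                pattern => "^\["
                negate => true
                what => "previous"
            }
        }
    }
    filter {
        mutate {
            # replace embedded newlines with spaces so that the grok patterns behave
            gsub => ["message", "\n", " "]
        }
    }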

Building a Kibana dashboard

Now that we’ve got the data streaming through nicely from logstash into ElasticSearch, let us take a look at building dashboards against it in Kibana. Kibana is a web application that runs within an existing web server such as Apache, and it builds dashboards from data stored in ElasticSearch.

The building blocks of a Kibana dashboard are rows, which contain panels of a given width (span), up to a total span of twelve per row. On a panel goes one of the types of object, such as a graph. We’ll see now how to build up a dashboard and the interactions that we can use for displaying and analysing data.

When you first load Kibana you get a default dashboard that links to a few other start dashboards. Here we’re going to properly start from scratch so as to build up a picture of how a dashboard is created. Click on Blank Dashboard, and then in the top-right corner click on Configure Dashboard. Click on Index and from the Timestamping option select day. What this does is tell Kibana which ElasticSearch indices it is to pull data from, in this case using the standard Logstash index naming pattern – which we observed in kopf earlier – logstash-YYYY.MM.DD. Click Save, and then select Add a row. Give the row a title such as System Activity, click on Create Row and then Save. A new row appears on the dashboard.

Now we’ll add a graph to the row. Click on Add panel to empty row, and select Histogram as the Panel Type. Note that by default the width is 4 – change this to 12. You’ll note that there are plenty of options to explore, but to start with we’ll just keep it simple, so go ahead and click Save. By default the chart will show all data, so use the Time filter dropdown option at the top of the screen to select a recent time period. Assuming your data has loaded from logstash into ElasticSearch you should see a graph similar to this:

This is a graph of the number of events (log file entries) per time period. The graph will amend the resolution according to the zoom so that a reasonable resolution of data is shown, or you can force it through the Resolution option in the graph properties. In the legend of the chart you can see the resolution currently used.

Assuming you’ve selected a broad time interval, such as the last week, you’ll presumably want to drill into the data shown. This is very intuitive in Kibana – simply click and drag horizontally over the time period you want to examine.

There are two important concepts for selecting and grouping data in Kibana, called filters and queries. A query groups data based on conditions, and we’ll explore those later. Filtering is a predicate applied to all data returned. Think of it just like a WHERE clause on a SQL query. You can see the current filter(s) applied at the top of the dashboard.

You may well want to hide these, and they can be collapsed – as can the query row and all of the dashboard rows – by clicking on the little triangle

From the Filter area you can also add, amend, disable and remove filters.

So far all we’ve got is a graph showing system activity over time, based on events recorded in a log file. But, we’ve no way of seeing what those logs are, and this is where the Table panel comes in. Add a new row, give it a title of Log messages and add the Table panel to it specifying the span as 12. You should now see a list of messages with timestamps corresponding to the time period shown in the graph. You can customise the Table panel, for example specifying which fields to show; by default it shows _source which is the raw row returned by ElasticSearch. More useful to us is the log_message field that we parsed out using the grok in logstash earlier. You can do this by selecting the relevant field from the Fields list on the left (which can be collapsed for convenience), or editing the Table panel and specifying it in the Columns area.

From the Table panel it is possible to select the data shown even more precisely by adding additional filters based on data in the table. Clicking on a particular row will expand it and show all of the associated fields, and each field has a set of Action options. You can filter only for that value, or specifically exclude it, and you can also add each field into the table shown (just like we did above for log_message). So here I can opt to only display messages that I’ve tagged in logstash as coming from the BI Server component itself:

You’ll note in the second screenshot, once the filter has been applied, that the graph has changed and is showing less data. That is because a filter is global to a dashboard. But what if we want to show on the graph counts for all logs, but in the data table just those for BI Server? Here is where queries come into play. A query also looks like a predicate, but rather than restricting the data returned it just identifies a set of data within what is returned. To illustrate this I’m first going to remove all existing filters except the time period one:

And now in the Query area click on the + (to the right of the line). Now there are two queries, both with a wildcard as their value meaning they’ll each match everything. In the second query box I add the query Component: OracleBIServerComponent – note for this to work your logstash must be sending messages to ElasticSearch with the necessary Component field. Once updated, the second query’s impact can be seen in the graph, which is now showing both the “all” query and the BI Server component query. Use the View > option in the top left of the graph as a quick way of getting to the graph settings, including disabling cumulative/stack view:

Each panel in Kibana can be configured to show all or some of the query groups that have been defined. This is most useful for creating breakdowns of data, including ones that split it in different ways, where you wouldn’t want every option displayed in its entirety on every panel. You might want to group out the components and the types of error, and then show a breakdown of system activity by one or the other – but not necessarily both. To configure which queries a panel is to show, use the Configure option in the top-right of a panel and go to the Queries tab. If it’s set to all then each and every query set will be shown individually on the panel.

If it’s set to selected then you can choose one or more of the defined query sets to display.

There are about a dozen types of panel in Kibana, and I’m not going to cover them all here. The other ones particularly of interest for building this kind of OBIEE monitoring dashboard include:

  • Terms is basically a SELECT FIELD, COUNT(*) ... GROUP BY FIELD. It shows the top x number of terms for a given field, and how frequently they occurred. Results can be shown as a pie or bar chart, or just a table. From a Terms panel you can add filters by clicking on a term – in the example above, clicking on the pie segment, or table row icon, for ERROR would add a filter to show just ERROR log entries
  • Trends shows the trend of event occurrences in a given time frame. Combined with an appropriate query you can show things like error rates
  • Stats shows a set of statistics, so you can identify mean response times, maximum users logged on, and so on – assuming you have this data coming through from the logstash parsing.

Once you’ve built your dashboard save it (Kibana stores it in ElasticSearch). The dashboard is defined in JSON and you can opt to download this too.

The complete OBIEE monitoring view

Parsing logs is a great way to get valuable information out of the text stored and build visualisations and metrics on top of it. However, for pure metrics alone (such as machine CPU, OBIEE DMS metrics, and so on) a close relation of Kibana, Grafana, is better suited to the task. Thus we have the text-based data going into ElasticSearch and reported through Kibana, and the pure metrics going into a time-based store such as whisper (graphite’s database) and reported through Grafana. Because Grafana is a fork of Kibana, the look and feel is very similar.

Using obi-metrics-agent the DMS metrics from OBIEE can be collected and stored in whisper, and so also graphed out in Grafana alongside the system metrics. This gives us an overall architecture like this:

Obviously, it would be nice if we could integrate the fundamental time-based nature of both Kibana and Grafana together, so that drilling into a particular time period of interest maybe from an error rate point of view in the logs would also show the system and DMS metrics for the same period. There has been discussion about this (1, 2, 3) but I don’t get the impression that it will happen soon, if ever. One other item of interest here is Marvel, which is a commercial offering for monitoring ElasticSearch – through Kibana. It makes use of stock Kibana panel types, along with some new ones including the Nodes panel type, which suits the requirements we have of monitoring OBIEE/system metrics within a Kibana view, but unfortunately it looks like currently it is going to remain within Marvel only.

One other path to consider is trying to get the metrics currently sent to graphite/whisper into ElasticSearch instead, so that Kibana can then report on them. The problem with this is twofold. Firstly, ElasticSearch is fundamentally a text-based store, whereas whisper fits much better for time/metric data (as would another DB such as InfluxDB). So trying to crow-bar the two together may not be the best solution, and it may be better for this to be resolved at the front end as discussed above. Secondly, Kibana’s graphing capabilities do not conceptually extend to multiple metrics in the same graph – only multiple queries – which means that graphing something that would be simple in Grafana (such as CPU wait/user/sys) would be overly complex in Kibana.

Architecting an ELK deployment

So far I’ve shown how to configure ELK on a single server, reading logs from that same server. But there are two extra things we should consider. First, Logstash in particular can be quite a ‘heavy’ process depending on how much work you’re doing with it. If you are processing all the logs that FMW writes, and have lots of grok filters (which isn’t a bad thing; it means you’re extracting lots of good information), then you will see logstash using a lot of CPU and lots of IO, possibly to the detriment of other processes on the system – a tad ironic if the purpose of using logstash is to monitor for any system problems that occur. Secondly, ELK works very well with multiple servers. You might have a scaled-out OBIEE stack, or want to monitor multiple environments. Rather than replicating the ELK stack on each server, it’s better for each server to push its log messages to a central ELK server for processing. And since the processing takes place on the ELK server and not the server being monitored we reduce the local resource footprint too.

To implement this kind of deployment, you need something like logstash-forwarder on the OBI server, which is a light-touch program that sends the messages to logstash itself on the ELK server over a custom protocol called lumberjack. Logstash then processes the messages as before, except it is reading the input from the logstash-forwarder rather than from a file. An alternative approach to this is using redis as a message broker, with logstash running on both the source (sending output to redis) and the ELK server (using redis as the input). This approach is documented very well here / here, and the former approach of using logstash-forwarder here. Logstash-forwarder worked very well for me in my tests, and seems to fit the purpose nicely.
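
To give an idea of the logstash side of this, the input stanza on the central ELK server might look something like the following – a sketch only, with an arbitrary port number and assumed certificate paths (logstash-forwarder requires SSL):

input {
        lumberjack {
                # arbitrary port - logstash-forwarder on each monitored server must be configured to send to it
                port => 5043
                # certificate/key paths are assumptions - use whichever pair you generate
                ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
                ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
        }
}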

Conclusion

Responsive monitoring tools are crucial for successful and timely support of an OBIEE system, and the ELK stack provides an excellent basis on which to build beyond the capabilities of Enterprise Manager Fusion Middleware Control. The learning curve is a bit steep at first, and you have to be comfortable with installing unpackaged tools, but the payoff makes it worth it! If you are interested in finding out about how Rittman Mead can help with your OBIEE implementation or other areas, please contact us.

Categories: BI & Warehousing

Using rlwrap with Apache Hive beeline for improved readline functionality

Fri, 2014-10-17 06:18

rlwrap is a nice little wrapper with which you can invoke command-line utilities and get them to behave with full readline functionality, just like you’d get at the bash prompt. For example, up/down arrow keys to move between commands, but also home/end to go to the start/finish of a line, and even ctrl-R to search through command history to rapidly find a command. It’s one of the standard config changes I’ll make to any system with Oracle’s sqlplus on, and it works just as nicely with Apache Hive’s command-line interface, beeline.

beeline comes with some of this functionality (up/down arrow) but not all (for me, it was ‘home’ and ‘end’ not working – printing 1~ and 5~ respectively instead – that prompted me to set up rlwrap with it).

Installing rlwrap

To install rlwrap simply add the EPEL yum packages to your repository configuration:

sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/`uname -p`/epel-release-6-8.noarch.rpm

and then install rlwrap from yum:

sudo yum install -y rlwrap

Use

Once rlwrap is installed you can invoke beeline through it manually, specifying all the standard beeline options as you would normally (I’ve used the \ line continuation character here just to make the example nice and clear):

rlwrap -a beeline \
-u jdbc:hive2://bdanode1:10000 \
-n rmoffatt -p password \
-d org.apache.hive.jdbc.HiveDriver

Now I can connect to beeline, and as before I press up arrow to access commands from when I previously used the tool, but I can also hit ctrl-R to start typing part of a command to recall it, just like I would in bash. Some other useful shortcuts:

  • Ctrl-l clears the screen but with the current line still shown
  • Ctrl-k deletes to the end of the line from the current cursor position
  • Ctrl-u deletes to the beginning of the line from the current cursor position
  • Esc-f moves forward one word
  • Esc-b moves backward one word
    (more here)

And most importantly, Home and End work just fine! (or, ctrl-a/ctrl-e if you prefer).

NB the -a argument for rlwrap is necessary because beeline already does some readline-esque functions, and we want rlwrap to forcibly override them (otherwise neither works very well). Or more formally (from man rlwrap):

Always remain in “readline mode”, regardless of command’s terminal settings. Use this option if you want to use rlwrap with commands that already use readline.

Alias

A useful thing to do is to add an alias directly in your profile so that it is always available to launch beeline under rlwrap, in this case as the rlbeeline command:

# this sets up "rlbeeline" as the command to run beeline
# under rlwrap, you can call it what you want though. 
cat >> ~/.bashrc<<EOF
alias rlbeeline='rlwrap -a beeline'
EOF
# example usage:
# rlbeeline \
# -u jdbc:hive2://bdanode1:10000 \
# -n rmoffatt -p password \
# -d org.apache.hive.jdbc.HiveDriver

If you want this alias available for all users on a machine, create the above as a standalone .sh file in /etc/profile.d/.
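
For example, something along these lines would do it (the filename itself is arbitrary):

# make the rlbeeline alias available to all users at login
cat <<EOF | sudo tee /etc/profile.d/rlbeeline.sh > /dev/null
alias rlbeeline='rlwrap -a beeline'
EOF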

Autocomplete

One possible downside of using rlwrap with beeline is that you lose the native auto-complete option within beeline for the HiveQL statements. But never fear – we can have the best of both worlds, with the -f argument for rlwrap, specifying a list of custom auto-completes. So this is even a level-up for beeline, because we could populate it with our own schema objects and so on that we want auto-completed.

As a quick-start, run beeline without rlwrap, hit tab twice and then ‘y’ to show all options and paste the resulting list into a text file (eg beeline_autocomplete.txt). Now call beeline, via rlwrap, passing that file as an argument to rlwrap:

rlwrap -a -f beeline_autocomplete.txt beeline

Once connected, use auto-complete just as you would normally (hit tab after typing a character or two of the word you’re going to match):

Connecting to jdbc:hive2://bdanode1:10000
Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
[...]
Beeline version 0.12.0-cdh5.0.1 by Apache Hive
0: jdbc:hive2://bdanode1:10000> SE
SECOND        SECTION       SELECT        SERIALIZABLE  SERVER_NAME   SESSION       SESSION_USER  SET
0: jdbc:hive2://bdanode1:10000> SELECT

Conclusion

rlwrap is the tool that keeps on giving; just as I was writing this article, I noticed that it also auto-highlights opening parentheses when typing the closing one. Nice!

Categories: BI & Warehousing

First-timer tips for Oracle Open World

Wed, 2014-10-08 07:16

Last week I had the great pleasure to attend Oracle Open World (OOW) for the first time, presenting No Silver Bullets – OBIEE Performance in the Real World at one of the ODTUG user group sessions on the Sunday. It was a blast, as the saying goes, but the week before OOW I was more nervous about the event itself than my presentation. Despite having been to smaller conferences before, OOW is vast in its scale and I felt like I did the week before going to university for the first time, full of uncertainty about what lay ahead and worrying that everyone would know everyone else except me! So during the week I jotted down a few things that I’d have found useful to know ahead of going and hopefully will help others going to OOW take it all in their stride from the very beginning.

Coming and going

I arrived on the Friday at midday SF time, and it worked perfectly for me. I was jetlagged so walked around like a zombie for the remainder of the day. Saturday I had the chance to walk around SF and get my bearings geographically, culturally and climate-wise. Sunday is “day zero” when all the user group sessions are held, along with the opening OOW keynote in the evening. I think if I’d arrived Saturday afternoon instead I’d have felt a bit thrust into it all straight away on the Sunday.

In terms of leaving, the last formal day is Thursday and it’s a full day of sessions too. I left straight after breakfast on Thursday and I felt I was leaving too early. But, OOW is a long few days & nights so chances are by Thursday you’ll be beat anyway, so check the schedule and plan your escape around it.

Accommodation

Book in advance! Like, at least two months in advance. There are 60,000 people descending on San Francisco, all wanting some place to stay.

Get an Airbnb – you get a lot more for your money than a hotel. Wifi is generally going to be a lot better, and having a living space in which to exist is nicer than just a hotel room. Don’t fret about the “perfect” location – anywhere walkable to Moscone (where OOW is held) is good because it means you can drop your rucksack off at the end of the day etc, but other than that the events are spread around so you’ll end up walking further to at least some of them. Or, get an Uber like the locals do!

Sessions

Go to Oak Table World (OTW), it’s great, and free. Non-marketing presentations from some of the most respected speakers in the industry. Cuts through the BS. It’s also basically on the same site as the rest of OOW, so easy to switch back and forth between OOW/OTW sessions.

Go and say hi to the speakers. In general they’re going to want to know that you liked it. Ask questions — hopefully they like what they talk about so they’ll love to speak some more about it. You’ll get more out of a five minute chat than two hours of keynote. And on that subject, don’t fret about dropping sessions — people tweet them, the slides are usually available, and in fact you could be sat at your desk instead of OOW and have missed the whole lot so just be grateful for what you do see. Chance encounters and chats aren’t available for download afterwards; most presentations are. Be strict in your selection of “must see” sessions, lest you drop one you really really did want to see.

Use the schedule builder in advance, but download it to your calendar (watch out for line-breaks in the exported file that will break the import) and sync it to your mobile phone so you can see rapidly where you need to head next. Conference mobile apps are rarely that useful and frequently bloated and/or unstable.

Don’t feel you need to book every waking moment of every day to sessions. It’s not slacking if you go to half as many but are twice as effective from not being worn out!

Dress

Dress-wise, jeans and a polo are fine, with a company polo or a shirt for delivering presentations. Day wear is fine for evenings too, no need to dress up. Some people do wear shorts but they’re in the great minority. There are lots of suits around, given it is a customer/sales conference too.

Socialising

The sessions and random conversations with people during the day are only part of OOW — the geek chat over a beer (or soda) is a big part too. Look out for the Pythian blogger meetup, meetups from your country’s user groups, companies you work with, and so on.

Register for the evening events that you get invited to (ODTUG, Pythian, etc) because often if you haven’t pre-registered you can’t get in if you change your mind, whereas if you do register but then don’t go that’s fine as they’ll bank on no-shows. The evening events are great for getting to chat to people (dare I say, networking), as are the other events that are organised like the swim in the bay, run across the bridge, etc.

Sign up for stuff like the swim in the bay – it’s good fun, and I can’t even swim really. The run and the bike ride across the bridge are two other events also organised. Hang around on Twitter for details; people like Yury Velikanov and Jeff Smith are usually in the know if not doing the actual organising.

General

When the busy days and long evenings start to take their toll don’t be afraid to duck out and go and decompress. Grab a shower, get a coffee, do some sight seeing. Don’t forget to drink water as well as the copious quantities of coffee and soda.

Get a data package for your mobile phone in advance of going, e.g. £5 per day for unlimited data. Conference wifi is just about OK at best, often flaky. Trying to organise short-notice meetups with other people by IM/twitter/email gets frustrating if you only get online half an hour after the time they suggested to meet!

Don’t pack extra clothes ‘just in case’. Pack minimally because (1) you are just around the corner from Market Street with Gap, Old Navy etc so can pick up more clothes cheaply if you need to and (2) you’ll get t-shirts from exhibitors, events (eg swim in the bay) and you’ll need the suitcase space to bring them all home. Bring a suitcase with space in or that expands, don’t arrive with a suitcase that’s already at capacity.

Food

So much good food and beer. Watch out for some of the American beers; they seem to start at about 5% ABV and go upwards, compared to around 3.6% ABV here in the UK. Knocking back the former at the same rate as the latter will get messy.

In terms of food you really are spoilt, some of my favourites were:

  • Lori’s diner (map) : As a Brit, I loved this American diner, and the food is great – yum yum. 5-10 minutes walk from Moscone.
  • Mel’s drive-in (map) : Just round the corner from Moscone, very busy but lots of seats. Great American breakfast experience! yum
  • Grove (map) : Good place for breakfast if you want somewhere a bit less greasy than a diner (WAT!)

 

Categories: BI & Warehousing

Adding Oracle Big Data SQL to ODI12c to Enhance Hive Data Transformations

Sun, 2014-10-05 15:29

An updated version of the Oracle BigDataLite VM came out a couple of weeks ago, and as well as updating the core Cloudera CDH software to the latest release it also included Oracle Big Data SQL, the SQL access layer over Hadoop that I covered on the blog a few months ago (here and here). Big Data SQL takes the SmartScan technology from Exadata and extends it to Hadoop, presenting Hive tables and HDFS files as Oracle external tables and pushing down the filtering and column-selection of data to individual Hadoop nodes. Any table registered in the Hive metastore can be exposed as an external table in Oracle, and a BigDataSQL agent installed on each Hadoop node gives them the ability to understand full Oracle SQL syntax rather than the cut-down SQL dialect that you get with Hive.

NewImage

There are two immediate use-cases that come to mind when you think about Big Data SQL in the context of BI and data warehousing; you can use Big Data SQL to include Hive tables in regular Oracle set-based ETL transformations, giving you the ability to reference Hive data during part of your data load; and you can also use Big Data SQL as a way to access Hive tables from OBIEE, rather than having to go through Hive or Impala ODBC drivers. Let’s start off in this post by looking at the ETL scenario using ODI12c as the data integration environment, and I’ll come back to the BI example later in the week.

You may recall that in a couple of posts earlier in the year on ETL and data integration on Hadoop, I looked at a scenario where I wanted to geo-code web server log transactions using an IP address range lookup file from a company called MaxMind. To determine the country for a given IP address you need to locate the IP address of interest within the ranges listed in the lookup file, something that’s easy to do with a full SQL dialect such as that provided by Oracle:

NewImage

In my case, I’d want to join my Hive table of server log entries with a Hive table containing the IP address ranges, using the BETWEEN operator – except that Hive doesn’t support any type of join other than an equi-join. You can use Impala and a BETWEEN clause there, but in my testing anything other than a relatively small log file Hive table took massive amounts of memory to do the join, as Impala works in-memory, which effectively ruled out doing the geo-lookup set-based. I then went on to do the lookup using Pig and a Python API into the geocoding database, but then you’ve got to learn Pig, and I finally came up with my best solution using Hive streaming and a Python script that called that same API – but each of these is fairly involved and requires a bit of skill and experience from the developer.

But this of course is where Big Data SQL could be useful. If I could expose the Hive table containing my log file entries as an Oracle external table and then join that within Oracle to an Oracle-native lookup table, I could do my join using the BETWEEN operator and then output the join results to a temporary Oracle table; once that’s done I could then use ODI12c’s sqoop functionality to copy the results back down to Hive for the rest of the ETL process. Looking at my Hive database using SQL*Developer 4.0.3’s new ability to work with Hive tables I can see the table I’m interested in listed there:

NewImage

and I can also see it listed in the DBA_HIVE_TABLES static view that comes with Big Data SQL on Oracle Database 12c:

SQL> select database_name, table_name, location
  2  from dba_hive_tables
  3  where table_name like 'access_per_post%';

DATABASE_N TABLE_NAME             LOCATION
---------- ------------------------------ --------------------------------------------------
default    access_per_post        hdfs://bigdatalite.localdomain:8020/user/hive/ware
                      house/access_per_post

default    access_per_post_categories     hdfs://bigdatalite.localdomain:8020/user/hive/ware
                      house/access_per_post_categories

default    access_per_post_full       hdfs://bigdatalite.localdomain:8020/user/hive/ware
                      house/access_per_post_full

There are various ways to create the Oracle external tables over Hive tables in the linked Hadoop cluster, including using the new DBMS_HADOOP package to create the Oracle DDL from the Hive metastore table definitions or using SQL*Developer Data Modeler to generate the DDL from modelled Hive tables, but if you know the Hive table definition and it’s not too complicated, you might as well just write the DDL statement yourself using the new ORACLE_HIVE external table access driver. In my case, to create the corresponding external table for the Hive table I want to geo-code, it looks like this:

CREATE TABLE access_per_post_categories(
  hostname varchar2(100), 
  request_date varchar2(100), 
  post_id varchar2(10), 
  title varchar2(200), 
  author varchar2(100), 
  category varchar2(100),
  ip_integer number)
organization external
(type oracle_hive
 default directory default_dir
 access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));

Then it’s just a case of importing the metadata for the external table over Hive – along with the tables I’m going to join to and then load the results into – into ODI’s repository, and then creating a mapping to bring them all together.

NewImage

Importantly, I can create the join between the tables using the BETWEEN clause, something I just couldn’t do when working with Hive tables on their own.
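
Conceptually the join is just along these lines – note that the geocoding table and its column names here are hypothetical stand-ins for the MaxMind lookup table, not the actual names used in the mapping:

SELECT a.hostname,
       a.request_date,
       g.country_name
FROM   access_per_post_categories a
JOIN   geoip_country_ranges g
ON     a.ip_integer BETWEEN g.start_ip_integer AND g.end_ip_integer;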

NewImage

Running the mapping then joins the webserver log table to the geocoding IP address range lookup table through the Oracle SQL engine, removing all the complexity of using Hive streaming, Pig or the other workaround solutions I used before. What I can then do is add a further step to the mapping to take the output of my join and use that to load the results back into Hive, like this:

NewImage

I’ll then use the IKM SQL to Hive-HBase-File (SQOOP) knowledge module to set up the export from Oracle into Hive.

NewImage

Now, when I run the mapping I can see the initial table join taking place between the Oracle native table and the Hive-sourced external table, and the results then being exported back into Hadoop at the end using the Sqoop KM.

NewImage

Finally, I can view the contents of the downstream Hive table loaded via Sqoop, and see that it does in fact contain the country name for each of the page accesses.

NewImage

Oracle Big Data SQL isn’t a solution suitable for everyone; it only runs on the BDA and requires Exadata for the database access, and it’s an additional license cost on top of the base BDA software bundle. But if you’ve got it available it’s an excellent way to blend Hive and Oracle data, and a great way around some of the restrictions around HiveQL and the Hive JDBC/ODBC drivers. More on this topic later next week, when I’ll look at using Big Data SQL in conjunction with OBIEE 11g.

Categories: BI & Warehousing

News and Updates from Oracle Openworld 2014

Sat, 2014-10-04 08:48

It’s the Saturday after Oracle Openworld 2014, and I’m now home from San Francisco and back in the UK. It’s been a great week as usual, with lots of product announcements and updates to the BI, DW and Big Data products we use on current projects. Here’s my take on what was announced this last week.

New Products Announced

From a BI and DW perspective, the most significant product announcements were around Hadoop and Big Data. Up to this point most parts of an analytics-focused big data project required you to code the solution yourself, with the diagram below showing the typical three steps in a big data project – data ingestion, analysis and sharing the results.

NewImage

At the moment, all of these steps are typically performed from the command-line using languages such as Python, R, Pig, Hive and so on, with tools like Apache Flume and Apache Sqoop used to bring data into and out of the Hadoop cluster. Under the covers, these tools use technologies such as MapReduce or Spark to do their work, automatically running jobs in parallel across the cluster and making use of the easy scalability of Hadoop and NoSQL databases.

You can also neatly divide the work on a big data project into two phases: the “discovery” phase typically performed by a data scientist, where data is loaded, analysed, correlated and otherwise “understood” to provide the initial insights, and then an “exploitation” phase where we apply governance, provide the output data in a format usable by BI tools and otherwise share the results with the wider corporate audience. The updated Information Management Reference Architecture we collaborated on with Oracle and launched in June this year had distinct discovery and exploitation phases, and the architecture itself made a clear distinction between the Innovation part that enabled the discovery phase of a project and the Execution part that delivered the insights and data in a more governed, production setting.

NewImage

This was the theme of the product announcements around analytics, BI, data warehousing and big data during Openworld 2014, with Oracle’s Omri Traub in the photo below taking us through Oracle’s big data product strategy. What Oracle are doing here is productising and “democratising” big data, putting it clearly in context of their existing database, engineered systems and BI products and linking them all together into an overall information management architecture and delivery process.

NewImage

So, working from ingestion through to data analysis, these steps have typically been performed by data scientists using scripting tools and rudimentary data visualisation engines, making them labour-intensive and reliant on a small set of people conversant with these tools and processes. Oracle Big Data Discovery is aimed squarely at these steps, and combines Apache Spark-based data preparation and transformation capabilities with an analysis and visualisation engine based on Endeca Server.

NewImage

Key features of Big Data Discovery include:

  • Ability to analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based transformation engine
  • Create a catalog of the data on your Hadoop cluster, and then search that catalog using Endeca Server search technologies
  • Create recommendations of other datasets that might interest you, based on what you’re looking at now
  • Visualize your datasets to help understand what they contain, and discover new insights

Under the covers it comprises two parts; the data loading, transformation and profiling part that uses Apache Spark to do its work in parallel across all the nodes in the cluster, and the analysis part, which takes data prepared by Apache Spark and loads into the Endeca Server in-memory engine to perform the analysis, aggregation and data visualisation. Unlike the Spark part the Endeca server element runs just on one node and limits the size of the analysis dataset to what can run in-memory in the Endeca Server engine, but in practice you’re going to work with a sample of the data rather than the entire dataset at that stage (in time the assumption is that the Endeca Server engine will be unbundled and run natively on YARN, giving it the same scalability as the Spark-based data ingestion and transformation part). Initially Big Data Discovery will run on-premise with a cloud version later on, and it’s not dependent on Big Data Appliance – expect to see something later this year / early next year.

Another new product that addresses the discovery phase and discovery lab part of a big data project is Oracle Data Enrichment Cloud Service, from the Oracle Data Integration team and designed to complement ODI and Oracle EDQ. Whilst Oracle positioned ODECS as something you’d use as well as Big Data Discovery and typically upstream from BDD, to me there seemed to be a fair bit of overlap between the products, with both tools doing data profiling and transformation but BDD being more focused on the exploration and discovery part, and ODECS being more focused on early-stage data profiling and transformation.

NewImage

ODECS is clearly more of an ETL tool complement and runs natively in the cloud, right from the start. It’s most probably aimed at customers with their Hadoop dataset already in the cloud, maybe using Amazon Elastic MapReduce or Oracle’s new Hadoop-as-a-Service and has more in common with the old Data Quality Option for Oracle Warehouse Builder than Endeca’s search-first analytic interface. It’s got a very nice interface including a mobile-enabled website and the ability to include and merge in external datasets, including Oracle’s own Data as a Service platform offering. Along with the new Metadata Management tool Oracle also launched at Openworld it’s a great addition to the Oracle Data Integration product suite, but I can’t help thinking that its initial availability only on Oracle’s public cloud platform is going to limit its use with Oracle’s typical customers – we’ll have to just wait and see.

The other major product that addresses big data projects was Oracle Big Data SQL. Partly addressing the discovery phase of big data projects but mostly (to my mind) addressing the exploitation phase, and the execution part of the information management architecture, Big Data SQL gives Oracle Exadata the ability to return data from Hive and NoSQL stores on the Big Data Appliance as well as data from its normal relational store. I covered Big Data SQL on the blog a few weeks ago and I’ll be posting some more in-depth articles on it next week, but the other main technical innovation with the product is that it brings Exadata’s SmartScan feature to Hadoop, projecting and filtering data at the Hadoop storage node level, and also gives Hadoop the ability to understand regular Oracle SQL rather than the cut-down version you get with HiveQL.

NewImage
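To give a flavour of what that looks like in practice, here’s a minimal sketch of the sort of SQL involved – the schema, table, cluster and column names are all made up for illustration, and the exact access parameters may vary by release, but the pattern is an external table defined over a Hive table using the ORACLE_HIVE access driver, which ordinary Oracle SQL can then join to relational tables:

    -- Hypothetical example: expose a Hive table on the BDA as an external
    -- table in the Exadata database (names and cluster are invented)
    CREATE TABLE web_logs_ext (
      session_id   NUMBER,
      customer_id  NUMBER,
      page_url     VARCHAR2(500)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY default_dir
      ACCESS PARAMETERS (
        com.oracle.bigdata.cluster=bda_cluster1
        com.oracle.bigdata.tablename=weblogs.page_views
      )
    )
    REJECT LIMIT UNLIMITED;

    -- ...then join it to relational data with regular Oracle SQL; the
    -- projection and filtering can be offloaded to the Hadoop storage nodes
    SELECT c.cust_segment, COUNT(*) AS page_views
    FROM   web_logs_ext w
           JOIN customers c ON c.customer_id = w.customer_id
    WHERE  w.page_url LIKE '%/checkout%'
    GROUP  BY c.cust_segment;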

Where this then leaves us is with the ability to do most of a big data project using (Oracle) tools, bringing big data analysis within reach of organisations with Oracle-style budgets but without access to rare data scientist-type resources. Going back to my diagram earlier, a post-OOW big data project using the new products launched in this last week could look something like this:

NewImage

Big Data SQL is out now and depends on BDA and Exadata for its use; Big Data Discovery should be out in a few months’ time and runs on-premise but doesn’t require BDA, whilst ODECS is cloud-only and runs on a BDA in the background. Expect more news and more integration/alignment between the products as 2014 ends and 2015 starts, and we’re looking forward to using them on Oracle-centric Hadoop projects in the near future.

Product Updates for BI, Data Integration, Exalytics, BI Applications and OBIEE

Other news announced over the week for products we more commonly use on projects include:

Finally, something that we were particularly pleased to see was the updated Oracle Information Management Architecture I mentioned earlier being referenced in most of the analytics sessions, with Oracle’s Balaji Yelamanchili, for example, introducing it in his big data and business analytics general session mid-way through the week.

NewImage  

We love the way this brings together the big data components and puts them in the context of the wider data warehouse and analytic processes; compared to a few years ago, when Hadoop and big data were considered completely separate to data warehousing and BI and were done by staff completely different to the core business analytics team, this new reference architecture puts it squarely within the world of BI and analytics we work in. It also emphasises the new abilities Hadoop, NoSQL databases and big data can bring us – support for wider sets of data sources with dynamic schemas, the ability to economically work with and analyse much larger datasets, and support for discovery-type upfront analysis work. Finally, it recognises that to get true value out of the analysis you start on Hadoop, you eventually need to add proper data governance, make the results more widely available using full SQL tools, and use the right tools – relational databases, OLAP servers and the like – to analyse the data once it’s in a more structured form.

If you missed our write-up on the updated Information Management Reference Architecture you can read our two-part blog post here and here, read the Oracle white paper, or listen to the podcast with OTN ArchBeat’s Bob Rhubart. For now though I’m looking forward to seeing the family after a week and a half away in San Francisco – thanks to OTN and the Oracle ACE Director Program for sponsoring my visit over to SF for Openworld, and we’ll post our conference presentation slides later next week when we’re back in the UK and US offices.

Categories: BI & Warehousing

EPM and BI Meetup at Next Week’s Openworld (and details of our Oracle DI Speakeasy)

Fri, 2014-09-26 10:08

Just a short note to help publicise the Oracle Openworld 2014 EPM and BI Meetup that’s running next week, organised by Cameron Lackpour and Tim Tow from the ODTUG board.

This is an excellent opportunity for EPM and BI developers and customers to get together and network over drinks and food, and chat with members of the ODTUG board and maybe some of the EPM and BI product management team. It’s running at Piattini, located at 2331 Mission St. (between 19th St & 20th St), San Francisco, CA 94110 from 7pm to late and there’s more details at this blog post by Cameron. The turnout should be pretty good, and if you’re an EPM or BI developer looking to meet up with others in your area this is a great opportunity to do so. Attendance is free and you just need to register using this form.

Similarly, if you’re into data warehousing and data integration you might be interested in our Rittman Mead / Oracle Data Integration’s Speakeasy event, running on the same evening (Tuesday September 30th 2014) from 7pm – 9pm at Local Edition, 691 Market St, San Francisco, CA. Aimed at ODI, OWB and data integration developers and customers and featuring members of the Rittman Mead team and Oracle’s Data Integration product team, again this is a great opportunity to meet with your peers and share stories and experiences. Registration is free and done through this registration form, with spaces still open at the time of posting.

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Service Administration

Thu, 2014-09-25 20:17

Earlier in the week we looked at the developer features within Oracle BI Cloud Service (BICS), aimed at departmental users who want the power of OBIEE 11g without the need to stand up their own infrastructure. We looked at the process of uploading spreadsheets and other data to the Oracle Database Schema Service that accompanies BICS, how you create the BI Repository that translates the tables and columns you upload into measures, attributes and hierarchies, and then took a brief look at how dashboards and reports are created and then shared with other users in your department. If you’re coming in late, here’s the links to the previous posts in the series:

One of the design goals for BICS was to reduce the amount of administration work an end-user has to perform, and to simplify and consolidate any tasks that they do have to do. Behind the scenes BICS actually comprises a BI environment, and a database environment, with most of the administration work being concerned with the BI one. Let’s start by looking at the service administration page that you see when you first log into the BICS environment as an administrator, with the screenshot below showing the overview page for the overall service.

NewImage

Oracle BI Cloud Service is part of Oracle’s overall Oracle Platform-as-a-Service (PaaS) offering, with BICS being made up of a database service and a BI service. The screenshot above shows the overall availability of these two services over the past two weeks, and you click on either the database service or the BI service to drill into more detail. Let’s click on the BI service first.

NewImage

The BI service dashboard page shows the same availability statuses again, along with a few graphs showing usage over that period. Also on this page are details of the start and end date for the service contract, details of the SFTP user account you’ll need for some import/archive operations, and a link to Presentation Services for this instance, to launch the OBIEE Home Page.

The OBIEE home page, as we saw in previous posts in this series, has menu items for model editing, data uploading and creating reports and dashboards. What it also has though is a Manage menu item, as shown in the screenshot below, that takes you through to an administration function that lets you set up application roles and backup/restore the system.

NewImage

Application roles are the way that OBIEE groups permissions and privileges and then assigns them to sets of users. With on-premise OBIEE the only way to manage application roles is through Enterprise Manager Fusion Middleware Control, but with BICS this functionality has been moved into OBIEE proper so that non-system administrators can perform this task. The users you work with are the ones defined for your service (tenancy), and using this tool you can assign them to existing application roles, create new ones, or group one set of roles within another. Users themselves are created as part of the instance creation process, with the minimum (licensed) number of users for an instance being 10.

NewImage

The Snapshots tab on this same Service Console page provides access to a new, system-wide snapshot and restore function that provides the means to version your system, restore it from a backup and transport a dev/test environment to your production instance. As I mentioned in previous postings in the series, each tenant for BICS comes with two instances, one for dev/test and one for prod, and the snapshot facility gives you a means to copy everything from one environment into another for when you’ve completed development and testing and want to put your dashboards into production.

NewImage

Taking a snapshot, as shown in the screenshot above, creates an archive file containing your RPD, the catalog and all the security settings, and you can store a number of snapshots within each environment, giving you a (very coarse-grained) versioning ability. What you can also do is download these snapshots as what are called “BI Archive” files, as shown in the screenshot below, and it’s these archive files that you can then upload into your other instance to give you your code promotion process. Note however that applying an archive file overwrites everything that was there before, so you’ll need to be careful doing this when users start creating reports in your production environment – really, it’s just a once-only code promotion facility followed then by a way of backing up and restoring your environments.

NewImage

Note also that you’ll separately need to back up and restore any database elements, as these aren’t automatically included in the BI Archive process. Backup and restoration of database elements is done via the separate database instance service page shown below, where you can export the whole schema or just parts of it, and then retrieve the export file via an SFTP transfer.

NewImage

So that’s it in terms of BICS administration, and for our initial look at the BI Cloud Service platform. Rittman Mead are of course offering services around BICS and cloud BI in general, so contact us if you’d like to give BICS a spin, and keep an eye on the blog over the next few weeks where we’ll take you through the example BICS application we built, reporting against Salesforce.com data using their REST API.

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Building Dashboards & Reports

Thu, 2014-09-25 04:00

This week we’ve been looking at the new Oracle BI Cloud Service (BICS), the cloud version of OBIEE 11g that went GA at the start of this week. Rittman Mead were part of the beta program for BICS and spent a couple of weeks building a sample BICS application to put the product through its paces, creating a reporting application for Salesforce.com that pulled in its data via the Salesforce REST API and staged it in the Oracle Database Schema Service that comes with BICS. Earlier in the week we looked at how data was uploaded or transferred into the accompanying database schema, and yesterday we looked at how the repository was created using the new thin-client data modeller. Today, we’ll look at how you create the dashboards and reports that your users will use, using the Analysis and Dashboard Editors that are part of the service. If you’re arriving mid-way through the series, here’s the links to the other posts in the series:

In fact creating analyses and dashboards is the part of BICS that has least changed compared to the on-premise version. In keeping with the “self-service” theme for BICS there’s an introductory set of guidance notes when you first connect to BICS, like this:

NewImage

and the dashboard and analysis editors are available as menu options on the Home page, along with a link to the Catalog view, like this:

NewImage

From that point on though it’s standard Answers and Dashboards, with the normal four-tab editor view within Answers (the Analysis Editor) and the ability to create views, calculations, filters and so on. Anyone familiar with Answers will be at home within the cloud version, and there’s a new visualisation – the heat map view, as shown in the final screenshot later in this article – that hints at other visualisations that we’ll see featured first in the cloud version of OBIEE, expected to be updated more frequently than the on-premise version (one of the major selling points for customers looking to adopt new features as soon as possible with OBIEE).

NewImage

What’s missing from this environment though are features like Agents and alerts, scorecards and BI Publisher, or the ability to create actions other than links to other web pages or catalog content.

NewImage

These are features that Oracle are saying they’ll add back in time though as the underlying infrastructure for BICS builds out, and of course the whole UI is likely to go through a rev with the 12c release of OBIEE due sometime in 2015. Dashboards are also created in the same way as with on-premise OBIEE, with the same Dashboard Editor and access to features like conditional display of sections and support for presentation variables.

NewImage

So, that wraps-up our quick tour around the analysis and dashboard creation parts of Oracle BI Cloud Service; tomorrow, to finish-up the series we’ll look at the administration elements of BICS including new self-service application role provisioning, tools for administering and monitoring the instance and for backing-up and migrating content from one instance to another.

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Creating the Repository

Wed, 2014-09-24 04:00

Earlier in this series we’ve looked at the overall product proposition for Oracle BI Cloud Service (BICS), and how you upload data to the Database Schema Service that comes with it. Today, we’re going to look at what’s involved in creating the BI Repository that holds the metadata about your logical tables, calculations and dimension hierarchies, using the new thin-client data modeller that like the rest of BICS runs entirely within your web browser. For anyone coming into the series mid-way, here’s the links to the other posts in the series:

So anyone familiar with OBIEE will know that a central part of the product, and the part of it that makes it easy for users to work with their data, is the business-orientated semantic model that you create over your source data. Held within what’s called the “BI Repository” and made up of physical, logical and presentation layers, the semantic model turns what can be a complex set of source tables, joins and cross-application links into a simple-to-understand set of subject areas made up of fact tables and dimensions. Regular on-premise OBIEE semantic models can get pretty complex, with joins across different database types and logical tables whose data can be provided in several different ways – for example, at detail level from an Oracle data warehouse and at summary level from an Essbase cube – and to edit them you use a dedicated Windows development tool called BI Administration.

Allowing these complex data models, and having a dependency on a Windows-based development tool, poses two main issues for any consumer-style version of OBIEE. First, if the aim of the service is to attract customers who want to create their systems “self-service”, you’ve got to make the repository development process a lot simpler than it currently is – you can’t expect customers to go on a course or buy my excellent book when they just want to get a dashboard up and running with the minimum of fuss. Second, you can’t realistically expect them to install a Windows-only development tool back at the office, as most of the target customers won’t have admin privileges on their workstations, or they might even be using Macs or working out of a browser; and then, even if they get it installed, you’ll need to ensure there’s a network connection available to the BI Server in the cloud through their corporate firewall. Clearly, a browser-based repository creation tool was needed, ideally one that did some of the basic work automatically for the user and didn’t need hours or days of training to understand. Of course, the risk to this is that you create a repository editing tool that’s too “dumbed-down” for most developers to find useful, and we’ll consider that possibility later in the article.

So following the data upload process that we covered in yesterday’s post, we’re now in a position where we’ve got a number of tables sitting in Oracle Database Schema Service, and we’re ready to build a repository to report against them. To access the thin-client data modeller you click on the Model menu item on the BICS homepage, as shown in the screenshot below.

NewImage

The modeller itself supports a simplified subset of what you can create with the full BI Administration tool. You’ve got a single source, the Oracle Database Schema Service, and a single business model. Business model tables have a logical table source as you’d normally expect, but just the one LTS is currently supported. Calculations within logical tables are supported, but they’re logical-level only (i.e. post-aggregation), with no support for physical-level (pre-aggregation) calculations at this point.

NewImage
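One way around the lack of pre-aggregation calculations – an assumption on our part rather than a documented workaround, but it fits the ApEx-view pattern we describe later in this post – is to push the row-level arithmetic into a database view in the Schema Service and model against that instead. A minimal sketch, with hypothetical table and column names:

    -- Hypothetical view: MARGIN_AMT is calculated per transaction row,
    -- before aggregation, so the thin-client modeller only needs to SUM() it
    CREATE OR REPLACE VIEW sales_txns_vw AS
    SELECT order_id,
           product_id,
           order_date,
           revenue_amt,
           cost_amt,
           revenue_amt - cost_amt AS margin_amt
    FROM   sales_txns;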

Level-based hierarchies within the business model are supported, including skip-level and ragged ones, and there’s support for time-series dimensions including their own editor.

NewImage

Where possible, introspection is used when creating the business model components, with table joins and matching column names used to create candidate logical joins. Static and dynamic repository variables, along with session variables are supported, with the front-end also supporting presentation and request variables – so all good there.

NewImage
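In on-premise OBIEE these dynamic and session variables would be populated by initialization-block queries against a source database, and the same idea applies here against the Schema Service – as a hedged sketch (the table and column names below are invented), the kind of query you might use to drive a session variable for later row-level filtering looks like this:

    -- Hypothetical query behind a USER_REGION session variable;
    -- :USER is the signed-in BICS user
    SELECT region_name
    FROM   user_region_map
    WHERE  UPPER(user_name) = UPPER(':USER');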

Under the covers, each tenant within BICS has their own RPD and their own catalog, and any edits to the repository that you perform are effectively “online” edits. To make edits to an existing model the developer therefore has to first “lock” the model, make their changes and add their new entries, validate them, and then either revert the model or publish the changes.

NewImage

In the background BICS updates the RPD using the metadata web service API for the BI Server, with the RPD it creates being the same format as the ones we create on-premise, just with a smaller set of features supported through the thin-client admin tool.

As I mentioned in the first post in the series, each tenant install of BICS comes with two instances: one for development or pre-prod, and one for production. To move a completed repository out of one environment into another, a new feature called a “BI Archive” is used – a snapshot of your BICS system that includes the repository, the catalog and any security objects you create. In this first version of BICS each import is total and overwrites everything that was in the instance beforehand, so there’s no incremental import or ability to selectively import just certain objects or certain reports into a new environment, meaning that you’ll lose any reports or dashboards created in production if you subsequently refresh it from dev/pre-prod – something to bear in mind.

One other thing to be aware of is that there’s no ability to create alias tables or opaque views in the thin-client modeller, so if you want to create additional copies of a dimension table for more than one dimension role, or you want to create a table using an arbitrary SELECT statement, you’ll need to go into ApEx and create a database view instead – not a huge imposition, as ApEx comes with tools for creating these pretty easily, but something that will lead to a more complex database model in time. The screenshot below shows one such database view then exposed through the thin-client modeller, where you can see the SELECT statement behind it (but not alter or amend it except through ApEx).

NewImage
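As a hedged sketch of the kind of view you might create in ApEx for the role-playing case – the DATES table and its columns here are hypothetical – two views over the same table let it appear as both an order-date and a ship-date dimension in the business model:

    -- Hypothetical example: the same DATES table exposed twice so it can
    -- play two different dimension roles in the business model
    CREATE OR REPLACE VIEW order_dates_vw AS
      SELECT date_key, calendar_date, calendar_month, calendar_year
      FROM   dates;

    CREATE OR REPLACE VIEW ship_dates_vw AS
      SELECT date_key, calendar_date, calendar_month, calendar_year
      FROM   dates;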

Finally, the thin-client modeller supports row-level and subject area security, using filters or object permissions that you either set up manually or create by reference to application roles granted to your users. We’ll look at what’s involved in setting up security and application roles in the final post in this series, where we look at administering your BICS instance.

So, that’s a high-level view of the repository creation process; in tomorrow’s post, we’ll look at what’s involved in creating reports and dashboards.

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Provisioning Data

Tue, 2014-09-23 04:00

In the first post in this series I looked at the new Oracle BI Cloud Service, which went GA over the weekend and which Rittman Mead have been using these past few weeks as part of a beta release. In that first post I looked at what BICS is and who it’s aimed at in this initial release, and went through the features at a high level; over the rest of the week I’ll be looking at the features in detail, starting today with the data upload and provisioning process. Here’s the links to the rest of the series, with the items getting updated over the week as I post each entry in the series:

As I mentioned in that first post, “Introduction to Oracle BI Cloud Service : Product Overview”, BICS in this initial release to my mind is aimed at departmental use-cases where someone wants to quickly upload and analyse an offline dataset and share the results with other members of their team. BICS comes bundled with Oracle Database Schema Service and 50GB of storage, and OBIEE in this setup reports just against this data source with no ability to reach-out dynamically to other data sources or blend those sources with the main one in Oracle’s cloud database. It’s aimed really at users with a single source of data to work with, who’ve probably obtained it as an export from some other system and just want to be able to report against it, though as we’ll see later in this post it is possible to link to other SaaS sources with a bit of PL/SQL wizardry.

So the first task you’re likely to perform when working with BICS is to upload some data to report on. There are three main options for uploading data to BICS, two of which are browser-based and aimed at end-users, and one that uses SQL Developer and is more aimed at developers. BICS itself comes with a menu item on the home page for uploading data, and this is what we think users will use most as it’s built into the tool and fairly prominent.

NewImage

Clicking on this menu item launches an ApEx application hosted in the Database Schema Service that comes with BICS, and which allows you to upload and parse XLS and delimited file-types to the database cloud instance and then store the contents in database tables.

NewImage

Oracle Database Schema Service also comes with Application Express (ApEx) as a front-end, and ApEx has similar tools for uploading datasets into the service, with additional features for creating views and PL/SQL packages to process and manipulate the data – something we used in our beta program example to connect to Salesforce.com and download data using their REST API. In theory you shouldn’t need to use these features much, but SIs and partners such as ourselves will no doubt use ApEx a lot to build out the loading infrastructure, data cleansing and other features that you might want for a packaged cloud app – so get your PL/SQL books out and brush up on ApEx development.

NewImage
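To give a flavour of that PL/SQL approach, here’s a heavily simplified sketch of pulling Salesforce data over REST with the APEX_WEB_SERVICE package and staging the raw response – the endpoint, token handling, staging table and parsing step are all simplified or hypothetical, and the real beta application also dealt with OAuth and paging:

    DECLARE
      l_response CLOB;
    BEGIN
      -- the OAuth access token would normally be obtained first; placeholder here
      apex_web_service.g_request_headers(1).name  := 'Authorization';
      apex_web_service.g_request_headers(1).value := 'Bearer <access_token>';

      l_response := apex_web_service.make_rest_request(
                      p_url         => 'https://na1.salesforce.com/services/data/v31.0/'
                                       || 'query?q=SELECT+Id,Name,Industry+FROM+Account',
                      p_http_method => 'GET');

      -- stage the raw JSON payload; a separate routine would parse it into
      -- relational columns before the tables are modelled in BICS
      INSERT INTO sfdc_staging (loaded_at, payload)
      VALUES (SYSTIMESTAMP, l_response);

      COMMIT;
    END;
    /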

The other way to get data into BICS is to use Oracle SQL Developer, which has a special Oracle Cloud connection type that allows you to view and work with cloud database objects as if they were regular local ones, and to upload data to the cloud in the form of “carts”. I’d imagine these options will get extended over time, either by tools or utilities Oracle release for this v1.0 BICS release, or by BICS eventually supporting the full Oracle Database Instance Service that’ll support regular SQL*Net connections from ETL tools.

NewImage
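Whichever route the data takes, it’s worth a quick sanity check from the ApEx SQL Workshop (or SQL Developer) once an upload completes, before you start modelling – a trivial sketch, with a hypothetical table name:

    -- List the tables that have arrived in the schema...
    SELECT table_name
    FROM   user_tables
    ORDER  BY table_name;

    -- ...and confirm a row count looks sensible (table name is made up)
    SELECT COUNT(*) AS row_count
    FROM   sales_txns;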

So once you’ve got some data uploaded into the Database Schema Service, you’ll end up with a set of source tables from which you can create your BI Repository. Check back tomorrow for more details on how BICS’s new thin-client data modeller works and how you create your business model against this cloud data source, including how the repository editing and checkout process works in this new, potentially multi-user development environment.

 

Categories: BI & Warehousing