
Rittman Mead Consulting

Delivering Oracle Business Intelligence

Introducing Oracle Big Data Discovery Part 3: Data Exploration and Visualization

16 hours 11 min ago

In the first two posts in this series, we looked at what Oracle Big Data Discovery is and how you can use it to sample, cleanse and then catalog data in your Hadoop-based data reservoir. At the end of that second post we’d loaded some webserver log data into BDD, and then uploaded some additional reference data that we then joined to the log file dataset to provide descriptive attributes to add to the base log activity. Once you’ve loaded the datasets into BDD you can do some basic searching and graphing of your data directly from the “Explore” part of the interface, selecting and locating attribute values from the search bar and displaying individual attributes in the “Scratchpad” area.

NewImage

With Big Data Discovery though you can go one step further and build complete applications to search and analyse your data, using the “Discover” part of the application. Using this feature you can add one or more charts to a dashboard page that go much further than the simple data visualisations you get on the Explore part of the application, based on the chart types and UI interactions that you first saw in Oracle Endeca Information Discovery Studio.

NewImage

Components you can add include thematic maps, summary bars (like OBIEE’s performance tiles, but for multiple measures), various bar, line and bubble charts, all of which can then be faceted-searched using an OEID-like search component.

NewImage

Each visualisation component is tied to a particular “view” that points to one or more underlying BDD datasets – samples of the full dataset held in the Hadoop cluster stored in the Endeca Server-based DGraph engine. For example, the thematic map above was created against the post comments dataset, with the theme colours defined using the number of comments metric and each country defined by a country name attribute derived from the calling host IP address.

NewImage

Views are auto-generated by BDD when you import a dataset, or when you join two or more datasets together. You can also use Endeca EQL, a SQL-like query language, to define your own views, and then define which columns represent attributes, which ones are metrics (measures) and how those metrics are aggregated.

NewImage

Like OEID before it, Big Data Discovery isn’t a substitute for a regular BI tool like OBIEE – beyond simple charts and visualizations it’s tricky to create more complex data selections, drill-paths in hierarchies, subtotals and so forth, and users will need to understand the concept of multiple views and datatypes, when to drop into EQL and so on – but for non-technical users working in an organization’s big data team it’s a great way to put a visual front-end onto the data in the data reservoir without having to understand tools like R Studio.

So that’s it for this three-part overview of Oracle Big Data Discovery and how it works with the Hadoop-based data reservoir. Keep an eye on the blog over the next few weeks as we get to grips with this new tool, and we’ll be covering it as part of the optional masterclass at the Brighton and Atlanta Rittman Mead BI Forum 2015 events this May.

Categories: BI & Warehousing

Why and How to use Oracle Metadata Management 12c. Part 2: Importing and Relating Metadata

Thu, 2015-02-26 05:27

In the first post of this series we saw how to install and configure OEMM to start working with it. In this post we are going to see how to import metadata into OEMM from different sources such as Oracle Database, OBIEE, ODI and OWB, and then relate all of it inside OEMM.

oemm_main_page

After we have installed and configured OEMM, we need to start adding all the metadata from the different sources and applications that we use. In this example the sources will be some Oracle schemas and our applications will be ODI, OWB and OBIEE. To import the metadata for all of them we need to create one model in OEMM for each. A model in OEMM holds all the connection details for a specific source or metadata provider (e.g. a database schema, an ODI repository, etc.), and is also the container for the metadata of that specific source after the import process. So one model connects to one specific source or application.

First, for organisational purposes, we will create a folder to contain the future models. You can also create your models first, then create the folders you want and move the models under the corresponding folders. In addition, folders can be nested within other folders.

To create a folder, right-click on the Repository entry under the Repository panel in the OEMM main page. Select New > Folder in the pop-up menu, enter a name and press the Create button.

 

oemm_folder_comb

The next step is to create the models and import the metadata from the different sources. The import, or reverse-engineering, process is called harvesting in OEMM. We will start with the model for the Oracle Database; in this particular example I used Oracle 12c.

To create a model, right-click on the folder or the repository entry and select New > Model. In the Create Model window that appears, enter a name for the new model and select the type of source that you want to import, or more precisely, the import bridge that this model will use.

The import bridge is part of the Meta Integration® Model Bridge (MIMB) software and is the way that OEMM connects to the sources and applications to reverse-engineer the metadata. You will find import bridges for a wide range of technologies: different databases, Business Intelligence and Data Integration products from different vendors, Big Data stores, and so on.

oemm_model_comb

For this first example we will select the Oracle Database (via JDBC) import bridge, and in the Import Setup tab we will add all the usual connection details: host, port, service, and the user and password to connect to the database. This user should have at least the CONNECT privilege and the SELECT_CATALOG_ROLE role. We can also restrict the model to specific schemas, either using the magnifying glass to choose the schemas to show or simply typing the schema names (in uppercase) separated by “;”. We can also decide whether stored procedures should be included in the imported metadata.
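
As a minimal sketch of setting up such a harvesting account (the username and password here are made up; adapt them to your own standards), you could run something like this from SQL*Plus as a DBA on the source database:

sqlplus / as sysdba <<EOF
-- hypothetical read-only account for OEMM harvesting
CREATE USER oemm_harvest IDENTIFIED BY Welcome1;
GRANT CONNECT, SELECT_CATALOG_ROLE TO oemm_harvest;
EXIT;
EOF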

oemm_model_conn_db

After all the connection details have been set, we test the connection, wait until we receive the Connection Successful message, and finally press the Create button. A message window will appear asking “Import a new version now?”; press Yes to start the harvesting process. A log window will show the progress of the import, which can take several minutes. After the process is finished a new message window asks if we want to open the model.

oemm_imp_succ_open_model

Choose Yes to see all the objects imported for this model, as shown in the figure below.

ORCL_training_model

We need to repeat the process explained above to create the models for the rest of the sources and applications that we are going to use in this example. The process is the same for all of them, but of course the connection details required differ once we choose the specific import bridge for each one.

In the next screenshot you will find the connection details for the ODI model after choosing the Oracle Data Integrator (ODI) import bridge. In the Import Setup tab, you need to select the appropriate driver to connect to the database hosting the ODI repository (which could be Oracle, SQL Server, DB2, etc.), the ODI home folder, the URL to connect to the database, the schema and password for the master repository, the user and password for the ODI user (SUPERVISOR, for example), the name of the work repository from which we want to select the ODI objects, and the context.

oemm_odi_import_setup

We need to select the Scope for this model from two options: Projects, which includes packages and mappings (or interfaces for versions before 12c), or Load Plans and Scenarios.

After we choose the Scope we can also filter its Content by pressing the magnifying glass icon and selecting the specific objects that we want for this model.

After you press the Create button to start the harvesting process, open the model that was created; it will look similar to this if you chose Projects as the Scope.

oemm_odi_projects

For the connection details to create the OWB model, you need to take a couple of things into account. First, the version of OWB from which you want to import the metadata: if it is 11.2.0.3 or later you will need to perform these two steps first:

  1. Copy the following .JAR files from MetaIntegrationInstallationDir\java\ to %OWB_HOME%\owb\lib\ext\:
  • jsr173_1.0_api.jar
  • MIR.jar
  • MIRModelBridge.jar
  • MIROracleWarehouseBuilderOmb.jar
  • MIRUtil.jar
  • stax-1.1.1-dev.jar
  2. Copy the MetaIntegrationInstallationDir\bin\mimbexec.bat file into the same OWB directory.

As the version I have is 11.2.0.4, I copied the files detailed above, set the connection parameters as shown in the following image and tested the connection.

owb model config

When I started the import process the following error message appeared in the log window:

error_owb2

 

After trying many things unsuccessfully, I asked David Allan for help and he sent me another mimbexec.bat because apparently between 11.2.0.3 and 11.2.0.4 there were directory name changes. This is a temporary fix and a proper one is being worked on.

I substituted the .bat file and received another error message, as shown in the next screenshot.

error_importing_owb_externaltable

 

After a while, I realised that the issue OEMM reported was because I was using an external table in one of the mappings. I changed it to a regular table and the import process worked well. This has been reported as a bug and a solution is being worked on. I really want to thank David for all his invaluable help with this.

This is how the OWB model looks after the metadata import.

owb_model

The last model that we need to create is the one based on OBIEE. There are different import bridges depending on the metadata that we need to import from OBIEE: Oracle Business Intelligence (OBI) Server, Oracle Business Intelligence (OBI) Enterprise Edition and Oracle Business Intelligence (OBI) Answers.

The OBI Server import bridge needs the OBI repository in XML format as a parameter, and the resulting model will contain all the objects defined in the three layers of the repository (Presentation, Business Model, Physical) as well as the repository connections, variables and initialisation blocks defined in the OBI repository.

oemm_biserver_comb

To use the OBI Enterprise Edition import bridge we need to set the login user and password to connect to OBIEE (usually weblogic or a user with admin privileges) and the repository file in XML format, and we can also filter the number of reports retrieved from the OBI Presentation Server.

There are a couple of interesting optional settings. One optimises the import of large models: if it is set to true, some objects such as joins, relationships, logical foreign keys, etc. are not returned, so less memory is consumed at run time. Another option controls whether to do an incremental import, bringing in only the changes from the source, or to import everything each time.

oemm_obiee_comb

The last import bridge to use with OBI is OBI Answers, which imports the content of a particular analysis or KPI report. This bridge needs the specific analysis in XML format.

oemm_bi_analysis_comb

 

Regarding models, there are a couple of additional things to take note of. First, if you want to see the configuration details, right-click the model and choose the Settings option from the pop-up menu. If you want to open the model to see the objects it contains, double-click on it.

Also, for every parameter in a model you will find very detailed help on the right of the Import Setup tab; and if you click on the name of the import bridge in the same tab, you get the documentation for that particular bridge, which I find very useful.

There are two more tabs in the folder and model definitions that we won’t use in this example but will cover in future posts: one for security and another for executing scripts when an event happens to the object. Models also have an additional tab, Import Schedule, for creating a plan to run the harvesting process.

Relate the models

Once we have defined our models we need to relate them and validate their relationships. The automated process of relating these models through validation is called stitching. In order to do that we must first create a Configuration. A configuration in OEMM is a collection of models and other objects such as mappings, glossaries, etc. that are related in some way.

According to the online documentation we need to consider a configuration as any of these options:

  • Repository workspace: a collection of Repository Objects to be analyzed together (search, browse, reports, etc.) as a technical scope, or business area under the same access permission scope.
  • Enterprise architecture – a collection of data store Models (ODS, data staging areas, data warehouses, data marts, etc.) and data process Models (ETL/DI, and BI) connected together through data flow stitching.
  • Design workflow – a collection of conceptual, logical and physical Models connected (semantically stitched) together through semantic mappings modeling the design process.

To create a Configuration, just right-click on a selected folder or the repository entry and choose New > Configuration. Enter a name for the configuration and press the Create button.

oemm_new_config

The configuration opens and you need to drag the models that you want stitched into the configuration, as shown in the following screenshot.

oemm_models_config

As you drag and drop your models, you may see that some of them show a warning icon after being included in the configuration; that is because we need to connect the model with the appropriate data source.

To do that, select the model in the configuration and press Edit Connection. Choose the corresponding store for each connection and press OK.

oemm_edit_conn_config

oemm_conn_edit1

After you finish with all the models, press the Validate button to start stitching (relating) them.

oemm_validate_config

In most cases, OEMM can assign the corresponding default schema for each of the connections in the model. Where it cannot, as with OWB, you need to do it manually.

oemm_conn_config_owb

In the following image you can see all the models validated. For this example, I’ve created four database models: one containing the source (transactional system), one for the staging schema, and another two containing different data warehouses; plus an ODI model, an OWB model and an OBIEE model.

oemm_models_validated

You can also see the relationships between the models that belong to a configuration in a graphical view, in the Architecture Diagram tab of the configuration. If the diagram looks a little messy, you can press the Edit button and then Layout to tidy the way the components are shown.

oemm_conf_arch_diag

In summary, we created and harvested (reverse-engineered) the models and then related (stitched) them so that we can analyse them together. In the next post, we will see some of the interesting things we can do with configurations and models, such as tracing data lineage and data impact.

Categories: BI & Warehousing

Introducing Oracle Big Data Discovery Part 2: Data Transformation, Wrangling and Exploration

Wed, 2015-02-25 09:51

In yesterday’s post I looked at Oracle Big Data Discovery and how it brought the search and analytic capabilities of Endeca to Hadoop. We looked at how the Oracle Endeca Information Discovery Studio application works with a version of the Endeca Server engine to analyse and visualise sample sets of data from the Hadoop cluster, and how it uses Apache Spark to retrieve data from Hadoop and then transform that data to make it more suitable for data discovery and data analysis applications. Oracle Big Data Discovery is designed to work alongside ODI and GoldenGate for Big Data once you’ve decided on your main data flows, and Oracle Big Data SQL for BI tool and application access to the entire “data reservoir”. So how does Big Data Discovery work, and what role does it play in the overall big data project workflow?

The best way to think of Big Data Discovery, to my mind, is “Endeca on Hadoop”. Endeca Information Discovery had three main parts to it: the data loading part, performed using Endeca Information Discovery Integrator and, more recently, the personal data upload feature in Endeca Information Discovery Studio; the Endeca Server engine, into which data was ingested and stored in a key/value-store NoSQL database, indexed, parsed and enriched; and the Studio graphical user interface used to analyse it. As I explained in more detail in my first post in the series yesterday, Big Data Discovery runs the Studio and DGraph (Endeca Server) elements on one or more dedicated nodes, then reads data in from Hadoop and writes it back in transformed states using Apache Spark, as shown in the diagram below:

NewImage

As the data discovery and analysis features in Big Data Discovery rely on getting data into the DGraph (Endeca Server) engine first of all, this implies two things; first, we’ll need to take a subset or sample of the entire Hadoop dataset and load just that into the DGraph engine, and second we’ll need some means of transforming and “massaging” that data so it works well as a data discovery set, and then writing those changes back to the full Hadoop dataset if we want to use it with some other tool – OBIEE or Big Data SQL, for example. To see how this process works, let’s use the same Rittman Mead Apache webserver logs that I’ve used in my previous examples, and bring that data and some additional reference data into Big Data Discovery.

The log data from the RM webserver is in Apache Combined Log Format and a sample of the rows looks like this:

NewImage

For data to be eligible to be ingested into Big Data Discovery, it has to be registered in the Hive Metastore, with the metadata available for use by external tools via the HCatalog service. This means that you already need to have created a Hive table over each datasource, either pointing this table to regular fixed-width or delimited files, or using a SerDe to translate another file format – say a compressed/column-store format like Parquet – into a format that Hive can understand. In our case I can use the RegEx SerDe that I first used in this blog post a while ago to create a Hive table over the log file and split out the various log file elements, with the resulting DDL looking like this:

CREATE EXTERNAL TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) 
([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"
[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION '/user/oracle/rm_logs';

If I then register the SerDe with Big Data Discovery I could ingest the table and file at this point, or I can use a Hive CTAS statement to remove the dependency on the SerDe and ingest into BDD without any further configuration.

create table access_logs as
select * 
from apachelog;

At this point, if you’ve got the BDD Hive Table Detector running, it should pick up the presence of the new hive table and ingest it into BDD (you can whitelist table names, and restrict it to certain Hive databases if needed). Or, you can manually trigger the ingestion from the Data Processing CLI on the BDD node, like this:

[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_logs;

The data processing process then creates an Apache Oozie job to sample a statistically relevant sample set of data into Apache Spark – with a 1% sample providing 95% sample accuracy – that is then profiled, enriched and loaded into the Big Data Discovery DGraph engine for further transformation, then exploration and analysis within Big Data Discovery Studio.

NewImage

The profiling step in this process scans the incoming data and helps BDD determine the datatype of each Hive table column, the distribution of values within the column and so on, whilst the enrichment part identifies key words and phrases and other key lexical facts about the dataset. A key concept here also is that BDD typically works with a representative sample of your Hive table contents, not the whole contents, as all the data you analyse has to fit within the memory space of the DGraph engine, just like it used to with Endeca Server. At some point it's likely that the functionality of the DGraph engine will be unbundled from the Endeca Server and run natively across the actual Hadoop cluster, but for now you have to separately ingest data into the DGraph engine (which can run clustered on BDD nodes) and analyse it there. However, the rules of sampling are that if you've got a sufficiently big sample – say, 1m rows – then regardless of the actual main dataset size that sample set is considered sufficiently representative – 95% in this case – as to make loading a bigger sample set not really worth the effort. But bear in mind when working with a BDD dataset that you're working with a sample, not the full set, so if a value you're looking for is missing it might be because it's not in this particular sample.

Once you’ve ingested the new dataset into BDD, you see it listed amongst the others that have previously been ingested, like this:

NewImage

At this point you can explore the dataset, to take an initial look at the patterns and values in the dataset in its raw form.

NewImage

Unfortunately, in this raw form the data in the access_logs table isn't all that useful – details of the page request URL are mixed in with the HTTP protocol and method, for example; dates are stored as strings; details of the person accessing the site are in IP address format rather than a geographical location, and so on. In previous examples on this blog I've looked at various methods to cleanse, transform and enhance the data in log file tables like this, using tools and techniques such as Hive table transformations, Pig and Apache Spark scripts, and ODI mappings, but all of these typically require some IT involvement, whereas one of the hallmarks of recent versions of Endeca Information Discovery Studio was giving power-users the ability to transform and enrich data themselves. Big Data Discovery provides tools to cleanse, transform and enrich data, with menu items for common transformations and a Groovy script editor for more complex ones, including deriving sentiment values from textual data and stripping out HTML and formatting characters from text.

NewImage

Once you’ve finished transforming and enriching the dataset, you can either save (commit) the changes back to the sample dataset in the BDD DGraph engine, or you can use the transformation rules you’ve defined to apply those transformations to the entire Hive table contents back on Hadoop, with the transformation work being done using Apache Spark. Datasets are loaded into “projects” and each project can have its own transformed view of the raw data, with copies of the dataset being kept in the BDD DGraph engine to represent each team’s specific view onto the raw datasets.

NewImage

In practice I found this didn't, in the product's current state, completely replace the need for a Hadoop developer or R data analyst – you need to get your data files into Hive and HCatalog at the start, which involves parsing and interpreting semi-structured data files, and I often did some transformations in BDD, then applied the transformations to the whole Hive dataset and then re-imported the results back into BDD to start from a simple known state. But it certainly made tasks such as turning IP addresses into countries and cities, splitting out URLs and removing HTML tags much easier, and I got the data cleansing process done in a matter of hours compared to the days it would have taken with manual Hive, Pig and Spark scripting.

Now the data in my log file dataset is much more usable and easy to understand, with URLs split out, status codes grouped into high-level descriptors, and other descriptive and formatting changes made.

NewImage

I can also at this point bring in additional datasets, either created manually outside of BDD and ingested into the DGraph from Hive, or manually uploaded using the Studio interface. These dataset uploads then live in the BDD DGraph engine, and are then written back to Hive for long-term persistence or for sharing with other tools and processes.

NewImage

These datasets can then be joined to the main dataset on matching dataset columns, giving you a table-join interface not unlike OBIEE’s physical model editor.

NewImage

So now we’re in a position where our datasets have been ingested into BDD, and we’ve cleansed, transformed and joined them into a combined web activity dataset. In tomorrow’s final post I’ll look at the data visualisation part of Big Data Discovery and see how it brings the capabilities of Endeca Information Discovery Studio to Hadoop.

Categories: BI & Warehousing

Introducing Oracle Big Data Discovery Part 1: “The Visual Face of Hadoop”

Mon, 2015-02-23 20:38

Oracle Big Data Discovery was released last week, the latest addition to Oracle’s big data tools suite that includes Oracle Big Data SQL, ODI and its Hadoop capabilities and Oracle GoldenGate for Big Data 12c. Introduced by Oracle as “the visual face of Hadoop”, Big Data Discovery combines the data discovery and visualisation elements of Oracle Endeca Information Discovery with data loading and transformation features built on Apache Spark to deliver a tool aimed at the “Discovery Lab” part of the Oracle Big Data and Information Management Reference Architecture.

NewImage

Most readers of this blog will probably be aware of Oracle Endeca Information Discovery, based on the Endeca Latitude product acquired as part of the Endeca acquisition. Oracle positioned Endeca Information Discovery (OEID) in two main ways; on the one hand as a data discovery tool for textual and unstructured data that complemented the more structured analysis capabilities of Oracle Business Intelligence, and on the other hand, as a fast click-and-refine data exploration tool similar to Qlikview and Tableau.

The problem for Oracle though was that data discovery against files and documents is a bit of a “solution looking for a problem” and doesn’t have a naturally huge market (especially considering the license cost of OEID Studio and the Endeca Server engine that stores and analyzes the data), whereas Qlikview and Tableau are significantly cheaper than OEID (at least at the start) and are more focused on BI-type tasks, making OEID a good tool but not one with a mass market. To address this, whilst OEID will continue as a standalone tool, the data discovery and unstructured data analysis parts of OEID are making their way into this new product called Oracle Big Data Discovery, whilst the fast click-and-refine features will surface as part of Visual Analyzer in OBIEE12c.

More importantly, Big Data Discovery will run on Hadoop, making it a solution for a real problem – how to catalog, explore, refine and visualise the data in the data reservoir, where data has been landed that might be in schema-on-read databases, might need further analysis and understanding, and users need large-scale tooling to extract the nuggets of information that in time make their way into the “Execution” part of the Big Data and Information Management Reference Architecture. As someone who’s admired the technology behind Endeca Information Discovery but sometimes struggled to find real-life use-cases or customers for it, I’m really pleased to see its core technology applied to a problem space that I’m encountering every day with Rittman Mead’s customers.

NewImage

In this first post, I’ll look at how Big Data Discovery is architected and how it works with Cloudera CDH5, the Hadoop distribution we use with our customers (Hortonworks HDP support is coming soon). In the next post I’ll look at how data is loaded into Big Data Discovery and then cataloged and transformed using the BDD front-end; then finally, we’ll take a look at exploring and analysing data using the visual capabilities of BDD evolved from the Studio tool within OEID. Oracle Big Data Discovery 1.0 is now GA (Generally Available) but as you’ll see in a moment you do need a fairly powerful setup to run it, at least until such time as Oracle release a compact install version running on a VM.

To run Big Data Discovery you’ll need access to a Hadoop install, which in most cases will consist of 6 (minimum 3 or 4, but 6 is the minimum we use) to 18 or so Hadoop nodes running Cloudera CDH5.3. BDD generally runs on its own server nodes and itself can be clustered, but for our setup we ran 1 BDD node alongside 6 CDH5.3 Hadoop nodes looking like this:

NewImage

Oracle Big Data Discovery is made up of three component types highlighted in red in the above diagram, two of which typically run on their own dedicated BDD nodes and another which runs on each node in the Hadoop cluster (though there are various install types, including all on one node for demo purposes):

  • The Studio web user interface, which combines the faceted search and data discovery parts of Endeca Information Discovery Studio with a lightweight data transformation capability
  • The DGraph Gateway, which brings Endeca Server search/analytics capabilities to the world of Hadoop, and
  • The Data Processing component that runs on each of the Hadoop nodes, and uses Hive’s HCatalog feature to read Hive table metadata and Apache Spark to load and transform data in the cluster

The Studio component can run across several nodes for high availability and load balancing, whilst the DGraph element can run on a single node, as I’ve set it up, or in a cluster with a single “leader” node and multiple “follower” nodes, again for enhanced availability and throughput. The DGraph part then works alongside Apache Spark to run intensive search and analytics on subsets of the whole Hadoop dataset, with sample sets of data being moved into the DGraph engine and any resulting transformations then being applied to the whole Hadoop dataset using Apache Spark. All of this then runs as part of the wider Oracle Big Data product architecture, which uses Big Data Discovery and Oracle R for the discovery lab and Oracle Exadata, Oracle Big Data Appliance and Oracle Big Data SQL to take discovery lab innovations to the wider enterprise audience.

NewImage

So how does Oracle Big Data Discovery work in practice, and what’s a typical workflow? How does it give us the capability to make sense of structured, semi-structured and unstructured data in the Hadoop data reservoir, and how does it look from the perspective of an Oracle Endeca Information Discovery developer, or an OBIEE/ODI developer? Check back for the next parts in this three part series where I’ll first look at the data transformation and exploration capabilities of Big Data Discovery, and then look at how the Studio web interface brings data discovery and data visualisation to Hadoop.

Categories: BI & Warehousing

Data Integration Tips: ODI – Use Scenarios in packages

Tue, 2015-02-10 07:16

 

Here is another question I often hear during ODI Bootcamp or I often read in the ODI Space on OTN :
In a package, should I directly use mappings (or interfaces prior to 12c) or is it better to use the scenarios generated from these mappings?

An ODI package is a workflow used to sequence the different steps of execution of a data integration process. Some of these steps might change the value of a variable, evaluate a variable, perform an administrative task like sending an email or retrieving a file from an FTP. But the core thing is the execution of a mapping and this can be done in two different ways.

 

Direct Use

It is possible to directly drag and drop a mapping in the package.

Mapping in a Package

In this example the mapping M_FACT_SALES_10 is directly executed within the package, in the same session. This means that the value of the variables will be the same as the ones in the package.

Execution of a package holding a mapping

 

If there are several mappings, they will be executed sequentially, there is no parallelism here. (Damn, I will never get the double L at the right place in that word… Thanks autocorrect!)

The code for the mapping is generated using the current state of the mapping in the work repository, including any changes done since the package was created. But if a scenario is created for the package PCK_FACT_LOAD_MAP, it will need to be regenerated – or a new version should be generated – to take any change of M_FACT_SALES_10 into account.

One good point for this approach is that we can build a data integration process that will be entirely rolled back in case of failure. If one of the steps fails, the data will be exactly as it was before the execution started. This can be done by using the same transaction for all the mappings and disabling the commit option for all of them but the last one. So it’s only when everything succeeds that the changes will be committed and available for everyone. Actually it’s not even required to commit on the last step as ODI will anyway issue a commit at the end of a successful session. This technique works of course only if you don’t have any DDL statements on the target tables. You will find more information about it in this nice blog post from Bhabani Rajan: Transaction Control in ODI. Thanks to David Allan for reminding me of this!

 

Using Scenarios

But you can also generate a scenario and then drag and drop it in a package.

Scenario in a package

Sessions and variables

In this case, the execution of the mapping will not be done within the package session. Instead, one step of this session will use the OdiStartScen command to trigger a new session to execute the scenario.

Execution of a package holding a scenario

We can see here that the second step of the package session (401) has only one task, which runs the command to start a new session. The only step in the new session (402) is the mapping, and it has the same three tasks as in the previous example. Because this is a different session, you could choose to execute it using another context or agent. My colleague Michael brought an interesting use case to me: when one step of your package must extract data from a file on a file server that has its own standalone agent installed to avoid permission issues, you could specify that agent in the scenario step of your package. The session for that scenario will then be started on that standalone agent while all the other sessions use the execution agent of the parent session.

 

But what about variables configured to keep no history? As this is a different session, their values are lost outside the scope of a single session. We therefore need to pass them as startup parameters to the scenario of the mapping.

Passing variables as startup parameters

To do this, I go to the Command tab on the scenario step and add my variables there with the syntax

"-<PROJECT_CODE>.<VARIABLE_NAME>=<VALUE_TO_ASSIGN>"

where <VALUE_TO_ASSIGN> can be the value of the variable in the current session itself (401 in our case).
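
As a hedged illustration (the project code and variable name here are made up), passing the parent session's value of a variable called LOAD_DATE in a project with code DWH would look something like this, relying on ODI's standard #PROJECT.VARIABLE substitution to pick up the current value:

"-DWH.LOAD_DATE=#DWH.LOAD_DATE"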

 Code executed

The scenario step name was originally “Execution of the Scenario M_FACT_SALES_10 version 001” and I slightly modified it to remove the mention of a version number. As I want to always execute the last version of the scenario, I also changed the version in the parameters and set it to -1 instead of 001. There is a small bug in the user interface so if you click somewhere else, it will change back to 001. The workaround is to press the Return key (or Enter) after typing your value.

So by using a scenario, you can choose which version of the scenario you want to execute: either a specific version (e.g. 003) or the latest one, using the value -1. It also means that you won’t break anything if the package is executed while you are working on the mapping; the package will still use the frozen code of the scenario even if you have changed the mapping.

If you create a scenario for the package PCK_FACT_LOAD_SCEN, there is no need to regenerate it when a new scenario is (re-)generated for the mapping. By using -1, it will reference the newly generated scenario.

 Asynchronous Execution

Another advantage of using scenarios is that it supports parallelism (Yes, I got it right!!).

Asynchronous scenario executions

Here I set the “Synchronous / Asynchronous” option to “Asynchronous Mode” for my two scenarios so the two sessions will start at the same time. By adding an OdiWaitForChildSessions step, I can wait for the end of all the sessions before doing anything else. This step also defines in which cases an error is reported. By default, it’s when all the child sessions are in error, but I usually change the parameter to zero, so that any failure will cause my package execution to fail as well.

getPrevStepLog()

Just a short note: be careful when using the method getPrevStepLog() from the substitution API in a step after executing a scenario. That method retrieves information about the previous step in the current session – that is, about the OdiStartScen command execution, and not about the execution of the scenario itself.

 

Summary

Here is a small recap table comparing both approaches:

comparison table

In conclusion, development is generally more robust when using scenarios in packages. There is more control over what is executed and over the parallelism. The only good reason to use mappings directly is to keep everything within the same transaction.

Keep also in mind that Load Plans might be a better alternative than packages with better parallelism, error handling and restartability. Unless you need loops or a persistent session… A comparison between Load Plans and Packages can be found in the Oracle blog post introducing Load Plans.

 

More tips and core principles demystification coming soon so stay tuned. And follow @mRainey, @markrittman and myself – @JeromeFr – on twitter to be informed of anything happening in the ODI world.

Categories: BI & Warehousing

OBIEE Monitoring and Diagnostics with InfluxDB and Grafana

Sun, 2015-02-08 15:21

In this article I’m going to look at collecting time-series metrics into the InfluxDB database and visualising them in snazzy Grafana dashboards. The datasets I’m going to use are OS metrics (CPU, Disk, etc) and the DMS metrics from OBIEE, both of which are collected using the support for a Carbon/Graphite listener in InfluxDB.

The Dynamic Monitoring System (DMS) in OBIEE is one of the best ways of being able to peer into the internals of the product and find out quite what’s going on. Whether performing diagnostics on a specific issue or just generally monitoring to make sure things are ticking over nicely, using the DMS metrics you can level-up your OBIEE sysadmin skills beyond what you’d get with Fusion Middleware Control out of the box. In fact, the DMS metrics are what you can get access to with Cloud Control 12c (EM12c) – but for that you need EM12c and the BI Management Pack. In this article we’re going to see how to easily set up our DMS dashboard.
N.B. if you’ve read my previous articles, what I write here (use InfluxDB/Grafana) supersedes what I wrote in those (use Graphite) as my recommended approach to working with arbitrary time-series metrics.

Overview

To get the DMS data out of OBIEE we’re going to use the obi-metrics-agent tool that Rittman Mead open-sourced last year. This connects to OPMN and pulls the data out. We’ll store the data in InfluxDB, and then visualise it in Grafana. Whilst not mandatory for the DMS stats, we’ll also setup collectl so that we can show OS stats alongside the DMS ones.

InfluxDB

InfluxDB is a database, but unlike a RDBMS such as Oracle – good for generally everything – it is what’s called a Time-Series Database (TSDB). This category of database focuses on storing data for a series, holding a given value for a point in time. Generally they’re optimised for handling large quantities of inbound metrics (think Internet of Things), rather than necessarily excelling at handling changes to the data (update/delete) – but that’s fine here since metric events in the past don’t generally change.

I’m using InfluxDB here for a few reasons:

  1. Grafana supports it as a source, with lots of active development for its specific features.
  2. It’s not Graphite. Whilst I have spent many a happy hour using Graphite I’ve spent many a frustrating day and night trying to install the damn thing – every time I want to use it on a new installation. It’s fundamentally long in the tooth, and whilst good for its time it is now legacy in my mind. Graphite is also several things – a data store (whisper), a web application (graphite web), and a data collector (carbon). Since we’re using Grafana, the web front end that Graphite provides is redundant, and is where a lot of the installation problems come from.
  3. KISS! Yes I could store time series data in Oracle/mySQL/DB2/yadayada, but InfluxDB does one thing (storing time series metrics) and one thing only, very well and very easily with almost no setup.

For an eloquent discussion of Time-Series Databases read these couple of excellent articles by Baron Schwarz here and here.

Grafana

On the front-end we have Grafana which is a web application that is rapidly becoming accepted as one of the best time-series metric visualisation tools available. It is a fork of Kibana, and can work with data held in a variety of sources including Graphite and InfluxDB. To run Grafana you need to have a web server in place – I’m using Apache just because it’s familiar, but Grafana probably works with whatever your favourite is too.

OS

This article is based around the OBIEE SampleApp v406 VM, but should work without modification on any OL/CentOS/RHEL 6 environment.

InfluxDB and Grafana run on both RHEL and Debian based Linux distros, as well as Mac OS. The specific setup steps detailed here might need some changes depending on the OS.

Getting Started with InfluxDB

InfluxDB Installation and Configuration as a Graphite/Carbon Endpoint

InfluxDB is a doddle to install. Simply download the rpm, install it, and run. BOOM. Compared to Graphite, this makes it a massive winner already.

wget http://s3.amazonaws.com/influxdb/influxdb-latest-1.x86_64.rpm
sudo rpm -ivh influxdb-latest-1.x86_64.rpm

This downloads and installs InfluxDB into /opt/influxdb and configures it as a service that will start at boot time.

Before we go ahead and start it, let’s configure it to work with existing applications that are sending data to Graphite using the Carbon protocol. InfluxDB can support this and enables you to literally switch Graphite out in favour of InfluxDB with no changes required on the source.

Edit the configuration file that you’ll find at /opt/influxdb/shared/config.toml and locate the line that reads:

[input_plugins.graphite]

In v0.8.8 this is at line 41. In the following stanza set the plugin to enabled, specify the listener port, and give the name of the database that you want to store data in, so that it looks like this.

# Configure the graphite api
[input_plugins.graphite]
enabled = true
# address = "0.0.0.0" # If not set, is actually set to bind-address.
port = 2003
database = "carbon"  # store graphite data in this database
# udp_enabled = true # enable udp interface on the same port as the tcp interface

Note that the file is owned by a user created at installation time, influxdb, so you’ll need to use sudo to edit the file.

Now start up InfluxDB:

sudo service influxdb start

You should see it start up successfully:

[oracle@demo influxdb]$ sudo service influxdb start
Setting ulimit -n 65536
Starting the process influxdb [ OK ]
influxdb process was started [ OK ]

You can see the InfluxDB log file and confirm that the Graphite/Carbon listener has started:

[oracle@demo shared]$ tail -f /opt/influxdb/shared/log.txt
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/cluster.func·005:1187) Recovered local server
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:133) recovered
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/coordinator.(*Coordinator).ConnectToProtobufServers:898) Connecting to other nodes in the cluster
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:139) Starting admin interface on port 8083
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:152) Starting Graphite Listener on 0.0.0.0:2003
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:178) Collectd input plugins is disabled
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:187) UDP server is disabled
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:187) UDP server is disabled
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:216) Starting Http Api server on port 8086
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).reportStats:254) Reporting stats: &client.Series{Name:"reports", Columns:[]string{"os", "arch", "id", "version"}, Points:[][]interface {}{[]interface {}{"linux", "amd64", "e7d3d5cf69a4faf2", "0.8.8"}}}

At this point if you’re using the stock SampleApp v406 image, or if indeed any machine with a firewall configured, you need to open up ports 8083 and 8086 for InfluxDB. Edit /etc/sysconfig/iptables (using sudo) and add:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 8083 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8086 -j ACCEPT

immediately after the existing ACCEPT rules. Restart iptables to pick up the change:

sudo service iptables restart

If you now go to http://localhost:8083/ (replace localhost with the hostname of the server on which you’ve installed InfluxDB), you’ll get the InfluxDB web interface. It’s fairly rudimentary, but suffices just fine:

Log in as root/root, and you’ll see a list of nothing much, since we’ve not got any databases yet. You can create a database from here, but for repeatability, and a general preference for using the command line, here is how to create a database called carbon with the HTTP API called from curl (assuming you’re running it locally; change localhost if not):

curl -X POST 'http://localhost:8086/db?u=root&p=root' -d '{"name": "carbon"}'

Simple huh? Now hit refresh on the web UI and after logging back in again you’ll see the new database:

You can call the database anything you want, just make sure what you create in InfluxDB matches what you put in the configuration file for the graphite/carbon listener.

Now we’ll create a second database that we’ll need later on to hold the internal dashboard definitions from Grafana:

curl -X POST 'http://localhost:8086/db?u=root&p=root' -d '{"name": "grafana"}'

You should now have two InfluxDB databases, primed and ready for data:

Validating the InfluxDB Carbon Listener

To make sure that InfluxDB is accepting data on the carbon listener use the NetCat (nc) utility to send some dummy data to it:

echo "example.foo.bar 3 `date +%s`"|nc localhost 2003

Now go to the InfluxDB web interface and click Explore Data ». In the query field enter

list series

To see the first five rows of data itself use the query

select * from /.*/ limit 5

InfluxDB Queries

You’ll notice that what we’re doing here (“SELECT … FROM …”) looks pretty SQL-like. Indeed, InfluxDB supports a SQL-like query language, which if you’re coming from an RDBMS background is nicely comforting ;-)

The syntax is documented, but what I would point out is that the apparently odd /.*/ construct for the “table” is in fact a regular expression (regex) to match the series for which to return values. We could have written select * from example.foo.bar but the .* wildcard enclosed in the / / regex delimiters is a quick way to check all the series we’ve got.

Going off on a bit of a tangent (but hey, why not), let’s write a quick Python script to stick some randomised data into InfluxDB. Paste the following into a terminal window to create the script and make it executable:

cat >~/test_carbon.py<<EOF
#!/usr/bin/env python
import socket
import time
import random
import sys

CARBON_SERVER = sys.argv[1]
CARBON_PORT = int(sys.argv[2])

while True:
        message = 'test.data.foo.bar %d %d\n' % (random.randint(1,20),int(time.time()))
        print 'sending message:\n%s' % message
        sock = socket.socket()
        sock.connect((CARBON_SERVER, CARBON_PORT))
        sock.sendall(message)
        time.sleep(1)
        sock.close()
EOF
chmod u+x ~/test_carbon.py

And run it: (hit Ctrl-C when you’ve had enough)

$ ~/test_carbon.py localhost 2003
sending message:
test.data.foo.bar 3 1422910401

sending message:
test.data.foo.bar 5 1422910402
[...]

Now we’ve got two series in InfluxDB:

  • example.foo.bar – that we sent using nc
  • test.data.foo.bar – using the python script

Let’s go back to the InfluxDB web UI and have a look at the new data, using the literal series name in the query:

select * from test.data.foo.bar

Well fancy that – InfluxDB has done us a nice little graph of the data. But more to the point, we can see all the values in the series.

And a regex shows us both series, matching on the ‘foo’ part of the name:

select * from /foo/ limit 3

Let’s take it a step further. InfluxDB supports aggregate functions, such as max, min, and so on:

select count(value), max(value),mean(value),min(value) from test.data.foo.bar

Whilst we’re at it, let’s bring in another way to get data out – with the HTTP API, just like we used for creating the database above. Given a query, it returns the data in json format. There’s a nice little utility called jq which we can use to pretty-print the json, so let’s install that first:

sudo yum install -y jq

and then call the InfluxDB API, piping the return into jq:

curl --silent --get 'http://localhost:8086/db/carbon/series?u=root&p=root' --data-urlencode "q=select count(value), max(value),mean(value),min(value) from test.data.foo.bar"|jq '.'

The result should look something like this:

[
  {
    "name": "test.data.foo.bar",
    "columns": [
      "time",
      "count",
      "max",
      "mean",
      "min"
    ],
    "points": [
      [
        0,
        12,
        14,
        5.666666666666665,
        1
      ]
    ]
  }
]

We could have used the Web UI for this, but to be honest the inclusion of the graphs just confuses things because there’s nothing to graph and the table of data that we want gets hidden lower down the page.
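
Since jq is already installed, you can also use it to pull out just the data points from the structure shown above rather than the full response (a convenience only; the query itself is unchanged):

curl --silent --get 'http://localhost:8086/db/carbon/series?u=root&p=root' --data-urlencode "q=select count(value), max(value),mean(value),min(value) from test.data.foo.bar"|jq '.[0].points'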

Setting up obi-metrics-agent to Send OBIEE DMS metrics to InfluxDB

obi-metrics-agent is an open-source tool from Rittman Mead that polls your OBIEE system to pull out all the lovely juicy DMS metrics from it. It can write them to file, insert them to an RDBMS, or as we’re using it here, send them to a carbon-compatible endpoint (such as Graphite, or in our case, InfluxDB).

To install it simply clone the git repository (I’m doing it to /opt but you can put it where you want)

# Install pre-requisite
sudo yum install -y libxml2-devel python-devel libxslt-devel python-pip
sudo pip install lxml
# Clone the git repository
git clone https://github.com/RittmanMead/obi-metrics-agent.git ~/obi-metrics-agent
# Move it to /opt folder
sudo mv ~/obi-metrics-agent /opt

and then run it:

cd /opt/obi-metrics-agent

./obi-metrics-agent.py \
--opmnbin /app/oracle/biee/instances/instance1/bin/opmnctl \
--output carbon \
--carbon-server localhost

I’ve used the line continuation character \ here to make the statement clearer. Make sure you update opmnbin with the correct path to your OPMN binary as necessary, and change localhost if your InfluxDB server is not local to where you are running obi-metrics-agent.
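
Since obi-metrics-agent keeps polling OPMN for as long as it runs, in practice you’ll probably want to run it in the background rather than tying up a terminal. A simple (if unsophisticated) way of doing that, assuming the same paths as above and a log file location of your own choosing:

# run in the background, surviving logout, with output captured to a log file
nohup ./obi-metrics-agent.py \
--opmnbin /app/oracle/biee/instances/instance1/bin/opmnctl \
--output carbon \
--carbon-server localhost > /tmp/obi-metrics-agent.log 2>&1 &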

After running this you should be able to see the metrics in InfluxDB. For example:

select * from /Oracle_BI_DB_Connection_Pool\..+\.*Busy/ limit 5

Setting up collectl to Send OS metrics to InfluxDB

collectl is an excellent tool written by Mark Seger and reports on all sorts of OS-level metrics. It can run interactively, write metrics to file, and/or send them on to a carbon endpoint such as InfluxDB.

Installation is a piece of cake, using the EPEL yum repository:

# Install the EPEL yum repository
sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/`uname -p`/epel-release-6-8.noarch.rpm
# Install collectl
sudo yum install -y collectl
# Set it to start at boot
sudo chkconfig --level 35 collectl on

Configuration to enable logging to InfluxDB is a simple matter of modifying the /etc/collectl.conf configuration file either by hand or using this set of sed statements to do it automagically.
The localhost in the second sed command is the hostname of the server on which InfluxDB is running:

sudo sed -i.bak -e 's/^DaemonCommands/#DaemonCommands/g' /etc/collectl.conf
sudo sed -i -e '/^#DaemonCommands/a DaemonCommands = -f \/var\/log\/collectl -P -m -scdmnCDZ --export graphite,localhost:2003,p=.os,s=cdmnCDZ' /etc/collectl.conf

If you want to log more frequently than every ten seconds, make this change (for five-second intervals here):

sudo sed -i -e '/#Interval =     10/a Interval = 5' /etc/collectl.conf

Restart collectl for the changes to take effect:

sudo service collectl restart
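
If you want to eyeball what collectl itself is gathering, independently of what ends up in InfluxDB, you can also run it interactively for a few samples and Ctrl-C out of it (a quick sanity check only; the subsystem and interval switches follow the same convention as the config above):

collectl -scmn -i 5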

As above, a quick check through the web UI should confirm we’re getting data through into InfluxDB:

Note the very handy regex lets us be lazy with the series naming. We know there is a metric called in part ‘cputotal’, so using /cputotal/ can match anything with it in.
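
For example, reusing the query style from earlier in the article, something like this in the Explore Data » query box should bring back the collectl CPU series:

select * from /cputotal/ limit 5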

Installing and Configuring Grafana

Like InfluxDB, Grafana is also easy to install, although it does require a bit of setting up. It needs to be hooked into a web server, as well as configured to connect to a source for metrics and to a store for its dashboard definitions.

First, download the binary (this is based on v1.9.1, but releases are frequent so check the downloads page for the latest):

cd ~
wget http://grafanarel.s3.amazonaws.com/grafana-1.9.1.zip

Unzip it and move it to /opt:

unzip grafana-1.9.1.zip
sudo mv grafana-1.9.1 /opt

Configuring Grafana to Connect to InfluxDB

We need to do a bit of configuration, so first create the configuration file based on the template given:

cd /opt/grafana-1.9.1
cp config.sample.js config.js

And now open the config.js file in your favourite text editor. Grafana supports various sources for metrics data, as well as various targets to which it can save the dashboard definitions. The configuration file helpfully comes with configuration elements for many of these, but all commented out. Uncomment the InfluxDB stanzas and amend them as follows:

datasources: {
    influxdb: {
        type: 'influxdb',
        url: "http://sampleapp:8086/db/carbon",
        username: 'root',
        password: 'root',
    },
    grafana: {
        type: 'influxdb',
        url: "http://sampleapp:8086/db/grafana",
        username: 'root',
        password: 'root',
        grafanaDB: true
    },
},

Points to note:

  1. The servername is the server host as you will be accessing it from your web browser. So whilst the configuration we did earlier was all based around ‘localhost’, since it was just communication within components on the same server, the Grafana configuration is what the web application from your web browser uses. So unless you are using a web browser on the same machine as where InfluxDB is running, you must put in the server address of your InfluxDB machine here.
  2. The default InfluxDB username/password is root/root, not admin/admin
  3. Edit the database names in the url, either as shown if you've followed the same names used earlier in the article, or your own versions of them if not. If you've not yet created the grafana dashboard database, see the note just below this list.
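
If you've not yet created the grafana database that Grafana will use to store its dashboard definitions, one way to do so (again assuming the InfluxDB 0.8-style HTTP API and the default root/root credentials used elsewhere in this article) is:

curl -X POST 'http://localhost:8086/db?u=root&p=root' -d '{"name": "grafana"}'
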
Setting Grafana up in Apache

Grafana runs within a web server, such as Apache or nginx. Here I’m using Apache, so first off install it:

sudo yum install -y httpd

And then set up an entry for Grafana in the configuration folder by pasting the following to the command line:

cat > /tmp/grafana.conf <<EOF
Alias /grafana /opt/grafana-1.9.1

<Location /grafana>
Order deny,allow
Allow from 127.0.0.1
Allow from ::1
Allow from all
</Location>
EOF

sudo mv /tmp/grafana.conf /etc/httpd/conf.d/grafana.conf

Now restart Apache:

sudo service httpd restart

And if the gods of bits and bytes are smiling on you, when you go to http://yourserver/grafana you should see:

Note that as with InfluxDB, you may well need to open your firewall for Apache which is on port 80 by default. Follow the same iptables instructions as above to do this.
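
For reference, on an EL6-style host using iptables the commands would be along these lines (adjust to suit your own firewall setup):

# Open port 80 for Apache
sudo iptables -I INPUT -p tcp --dport 80 -j ACCEPT
# Persist the rule across reboots
sudo service iptables save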

Building Grafana Dashboards on Metrics Held in InfluxDB

So now we’ve set up our metric collectors, sending data into InfluxDB.

Let’s see now how to produce some swanky dashboards in Grafana.

Grafana has a concept of Dashboards, which are made up of Rows, and within those, Panels. A Panel can have on it a metric Graph (duh), but also static text or single-figure metrics.

To create a new dashboard click the folder icon and select New:

You get a fairly minimal blank dashboard. On the left you'll notice a little green tab: hover over that and it pops out to form a menu box, from where you can choose the option to add a graph panel:

Grafana Graph Basics

On the blank graph that’s created click on the title (with the accurate text “click here”) and select edit from the options that appear. This takes you to the graph editing page, which looks equally blank but from here we can now start adding metrics:

In the box labelled series start typing Active_Sessions and notice that Grafana will autocomplete it to any available metrics matching this:

Select Oracle_BI_PS_Sessions.Active_Sessions and your graph should now display the metric.

To change the time period shown in the graph, use the time picker at the top of the screen. You can also click & drag ("brushing") on any graph to select a particular slice of time.

So, set the time filter to 15 minutes ago and from the Auto-refresh submenu set it to refresh every 5 seconds. Now login to your OBIEE instance, and you should see the Active Sessions value increase (one per session login):

To add another metric to the graph you can click on Add query at the bottom right of the page, or if it's closely related to the one you've defined already, click on the cog next to it and select duplicate:

In the second query add Oracle_BI_General.Total_sessions (remember, you can just type part of the string and Grafana autocompletes based on the metric series stored in InfluxDB). Run a query in OBIEE to cause sessions to be created on the BI Server, and you should now see the Total sessions increase:

To save the graph, and the dashboard, click the Save icon. To return to the dashboard to see how your graph looks alongside others, or to add a new graph, click on Back to dashboard.

Grafana Graph Formatting

Let’s now take a look at the options we’ve got for modifying the styling of the graph. There are several tabs/sections to the graph editor – General, Metrics (the default), Axes & Grid, and Display Styles. The first obvious thing to change is the graph title, which can be changed on the General tab:

From here you can also change how the graph is sized on the dashboard using the Span and Height options. A new feature in recent versions of Grafana is the ability to link dashboards to help with analysis paths – guided navigation as we’d call it in OBIEE – and it’s from the General tab here that you can define this.

On the Metrics tab you can specify what text to use in the legend. By default you get the full series name, which is usually too big to be useful as well as containing a lot of redundant repeating text. You can either specify literal text in the alias field, or you can use segments of the series name identified by $x where x is the zero-based segment number. In the example I’ve hardcoded the literal value for the second metric query, and used a dynamic segment name for the first:

On the Axes & Grid tab you can specify the obvious stuff like min/max scales for the axes and the scale to use (bits, bytes, etc). To put metrics on the right axis (and to change the colour of the metric line too) click on the legend line, and from there select the axis/colour as required:

You can set thresholds to overlay on the graph (to highlight warning/critical values, for example), as well as customise the legend to show an aggregate value for each metric, show it in a table, or not at all:

The last tab, Display Styles, has even more goodies. One of my favourite new additions to Grafana is the Tooltip. Enabling this gives you a tooltip when you hover over the graph, displaying the value of all the series at that point in time:

You can change the presentation of the graph, which by default is a line, adding bars and/or points, as well as changing the line width and fill.

  • Solid Fill:

  • Bars only

  • Points and translucent fill:

Advanced InfluxDB Query Building in Grafana

Identifying Metric Series with RegEx

In the example above there were two fairly specific metrics that we wanted to report against. What is much more common is wanting to graph out a set of metrics from the same 'family'. For example, OBIEE DMS metrics include a great deal of information about each Connection Pool that's defined. They're all in a hierarchy that looks like this:

obi11-01.OBI.Oracle_BI_DB_Connection_Pool.Star_01_-_Sample_App_Data_ORCL_Sample_Relational_Connection

Under which you’ve got

Capacity
Current Connection Count
Current Queued Requests

and so on.

So rather than creating an individual metric query for each of these (similar to how we did for the two session metrics previously) we’ll use InfluxDB’s rather smart regex method for identifying metric series in a query. And because Grafana is awesome, writing the regex isn’t as painful as it could be because the autocomplete validates your expression in realtime. Let’s get started.

First up, let’s work out the root of the metric series that we want. In this case, it’s the orcl connection pool. So in the series box, enter /orcl/. The / delimiters indicate that it is a regex query. As soon as you enter the second / you’ll get the autocomplete showing you the matching series:

/orcl/

If you scroll down the list you’ll notice there’s other metrics in there beside Connection Pool ones, so let’s refine our query a bit

/orcl_Connection_Pool/

That’s better, but we’ve now got all the Connection Pool metrics, which whilst are fascinating to study (no, really) complicate our view of the data a bit, so let’s pick out just the ones we want. First up we’ll put in the dot that’s going to precede any of the final identifiers for the series (.Capacity, .Current Connection Count, etc). A dot is a special character in regex so we need to escape it \.

/orcl_Connection_Pool\./

And now let’s check we’re on the right lines by specifying just Capacity to match:

/orcl_Connection_Pool\.Capacity/

Excellent. So we can now add in more permutations, with a bracketed list of options separated with the pipe (regex OR) character:

/orcl_Connection_Pool\.(Capacity|Current)/

We can use a wildcard .* for expressions that are not directly after the dot that we specified in the match pattern. For example, let’s add any metric that includes Queued:

/orcl_Connection_Pool\.(Capacity|Current|.*Queued)/

But now we’ve a rather long list of matches, so let’s refine the regex to narrow it down:

/orcl_Connection_Pool\.(Capacity|Current|Peak.*Queued).+(Requests|Connection)/

(Something else I tried before this was regex negative look-behind, but it looks like Go (which InfluxDB is written in) doesn’t support it).

Setting the Alias to $4, and the legend to include values in a table format gives us this:

Now to be honest here, in this specific example, I could have created four separate metric queries in a fraction of the time it took to construct that regex. That doesn't detract from the usefulness and power of regex though; it simply illustrates the point of using the right tool for the right job, and where there are a few easily identified and static metrics, a manual selection may be quicker.

Aggregates

By default Grafana will request the mean of a series at the defined grain of time from InfluxDB. The grain of time is calculated automatically based on the time window you’ve got shown in your graph. If you’re collecting data every five seconds, and build a graph to show a week’s worth of data, showing all 120960 data points will end up in a very indistinct line:

So instead Grafana generates an InfluxDB query that rolls the data up to more sensible intervals – in the case of a week’s worth of data, every 10 minutes:

You can see, and override, the time grouping in the metric panel. By default it’s dynamic and you can see the current value in use in lighter text, like this:

You can also set an optional minimum time grouping in the second of the "group by time" boxes (beneath the first). This is a time grouping under which Grafana will never go, so if you always want to roll up to, say, at least a minute (but higher if the duration of the graph requires it), you'd set that here.

So I've said that InfluxDB can roll up the figures – but how does it roll up multiple values into one? By default, it takes the mean of all the values. Depending on what you're looking at, this can be less than desirable, because you may miss important spikes and troughs in your data. So you can change the aggregate rule, to look at the maximum value, minimum, and so on. Do this by clicking on the aggregation in the metric panel:

This is the same series of data, but shown as 5 second samples rolled up to a minute, using the mean, max, and min aggregate rules:

For a look at how all three series can be better rendered together see the discussion of Series Specific Overrides later in this article.

You can also use aggregate functions with measures that are not simple point-in-time values. For example, with an incrementing/accumulating measure (such as a counter like "number of requests since launch") you actually want to graph the rate of change, the delta between each point. To do this, use the derivative function. In this graph you can see the default aggregation (mean, in green) against derivative, in yellow. One is in effect the "actual" value of the measure; the other is the rate of change, which is much more useful to see in a time series.

Note that if you are using derivative you may need to fix the group by time to the grain at which you are storing data. In my example I am storing data every 5 seconds, but if the default time grain on the graph is 1s then it won’t show the derivative data.
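
As a rough illustration, the raw InfluxDB query behind such a graph might look something like this, with the group by time pinned to the 5 second storage grain. The series name is just one of the accumulating DMS metrics that appears later in this article, and $timeFilter is the placeholder that Grafana substitutes at runtime:

select derivative(value) from "obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Accumulated_Requests" where $timeFilter group by time(5s)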

See more details about the aggregations available in InfluxDB in the docs here. If you want to use an aggregation (or any query) that isn’t supported in the Grafana interface simply click on the cog icon and select Raw query mode from where you can customise the query to your heart’s content.

Drawing inverse graphs

As mentioned just above, you can customise the query sent to InfluxDB, which means you can do this neat trick to render multiple related series that would otherwise overlap by inverting one of them. In this example I’ve got the network I/O drawn conventionally:

But since metrics like network I/O, disk I/O and so on have a concept of adding and taking, it feels much more natural to see the input as ‘positive’ and output as ‘negative’.

This, certainly for my money, makes it easier to see at a glance whether we've got data coming or going, and at what volume. To implement this simply set up your series as usual, and then for the series you want to invert click on the cog icon and select Raw query mode. Then in place of

mean(value)

put

mean(value*-1)
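
Putting that in context, the full raw query for the inverted series might end up looking something like the sketch below. The series name here is purely illustrative, and $timeFilter and $interval are the placeholders Grafana fills in for you:

select mean(value*-1) from "monitoring-01.os.nettotals.kbout" where $timeFilter group by time($interval)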

Series Specific Overrides

The presentation options that you specify for a graph will by default apply to all series shown in the graph. As we saw previously you can change the colour, width, fill etc of a line, or render the graph as bars and/or points instead. This is all good stuff, but presumes that all measures are created equal – that every piece of data on the graph has the same meaning and importance. Often we'll want to change how we display a particular set of data, and we can use Series Specific Overrides in Grafana to do that.

For example in this graph we can see the number of busy connections and the available capacity:

But the actual (Busy Connections) is the piece of data we want to see at a glance, against the context of the available Capacity. So by setting up a Series Specific Override we can change the formatting of each line individually – calling out the actual (thick green) and making the threshold more muted (purple):

To configure a Series Specific Override go to the Display Styles panel and click Add series override rule. Pick the specific series or use a regex to identify it, and then use the + button to add formatting options:

A very useful formatting option is Z-index, which enables you to define the layering on the graph so that a given series is rendered on top (or below) another. To bring something to the very front use a Z-index of 3; for the very back use -3. Series Specific Overrides are also a good way of dynamically assigning multiple Y axes.

Another great use of Series Specific Overrides is to show the min/max range for data as a shaded area behind the main line, thus providing more context for aggregate data. I discussed above how Grafana can get InfluxDB to roll up (aggregate) values across time periods to make graphs more readable when shown for long time frames – and how this can mask data exceptions. If you only show the mean, you miss small spikes and troughs; if you only show the max or min then you over or under count the actual impact of the measure. But, we can have the best of all worlds! The next two graphs show the starting point – showing just the mean (missing the subtleties of a data series) and showing all three versions of a measure (ugly and unusable):

Instead of this, let’s bring out the mean, but still show it in context of the range of the values within the aggregate:

I hope you'd agree that this is a much cleaner and clearer way of presenting the data. To do it we need two steps:

  1. Make sure that each metric has an alias. This is used in the label but importantly is also used in the next step to identify each data series. You can skip this bit if you really want and regex the series to match directly in the next step, but setting an alias is much easier

  2. On the Display Styles tab click Add series override rule at the bottom of the page. In the alias or regex box you should see your aliases listed. Select the one which is the maximum series. Then choose the formatting option Fill below to and select the minimum series

    You’ll notice that Grafana automagically adds in a second rule to disable lines for the minimum series, as well as on the existing maximum series rule.

    Optionally, add another rule for your mean series, setting the Z-index to 3 to bring it right to the front.

    All pretty simple really, and a nice result:

Variables in Grafana (a.k.a. Templating)

In lots of metric series there are often going to be groups of measures that are associated with recurring instances of a parent. For example, CPU details for multiple servers, or in the OBIEE world, connection pool details for multiple connection pools.

centos-base.os.cputotals.user
db12c-01.os.cputotals.user
gitserver.os.cputotals.user
media02.os.cputotals.user
monitoring-01.os.cputotals.user
etc

Instead of creating a graph for each permutation, or modifying the graph each time you want to see a different instance, you can instead use Templating, which is basically creating a variable that can be incorporated into query definitions.

To create a template you first need to enable it per dashboard, using the cog icon in the top-right of the dashboard:

Then open the Templating option from the menu opened by clicking on the cog on the left side of the screen.

Now set up the name of the variable, and specify a full (not partial, as you would in the graph panel) InfluxDB query that will return all the values for the variable – or rather, the list of all series from which you’re going to take the variable name.

Let’s have a look at an example. Within the OBIEE DMS metrics you have details about the thread pools within the BI Server, and there are different thread pool types, and it is that type that I want to store. Here’s a snippet of the series:

[...]
obi11-01.OBI.Oracle_BI_Thread_Pool.DB_Gateway.Peak_Queued_Requests
obi11-01.OBI.Oracle_BI_Thread_Pool.DB_Gateway.Peak_Queued_Time_milliseconds
obi11-01.OBI.Oracle_BI_Thread_Pool.DB_Gateway.Peak_Thread_Count
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Accumulated_Requests
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Average_Execution_Time_milliseconds
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Average_Queued_Requests
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Average_Queued_Time_milliseconds
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Avg_Request_per_sec
[...]

Looking down the list, it’s the DB_Gateway and Server values that I want to extract. First up is some regex to return the series with the thread pool name in:

/.*Oracle_BI_Thread_Pool.*/

and now build it as part of an InfluxDB query:

list series /.*Oracle_BI_Thread_Pool.*/

You can validate this against InfluxDB directly using the web UI for InfluxDB or curl as described much earlier in this article. Put the query into the Grafana Template definition and hit the green play button. You’ll get a list back of all series returned by the query:
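
For example, a quick check of that query with curl (assuming, as before, the carbon database and default root/root credentials) might look like this:

curl -G 'http://localhost:8086/db/carbon/series' -d u=root -d p=root --data-urlencode "q=list series /.*Oracle_BI_Thread_Pool.*/"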

Now we want to extract out the threadpool names and we do this using the regex capture group ( ):

/.*Oracle_BI_Thread_Pool\.(.*)\./

Hit play again and the results from the first query are parsed through the regex and you should have just the values you need:

If the values are likely to change (for example, Connection Pool names will change in OBIEE depending on the RPD) then make sure you select Refresh on load. Click Add and you’re done.

You can also define variables with fixed values, which is good if they’re never going to change, or they are but you’ve not got your head around RegEx. Simply change the Type to Custom and enter comma-separated values.

To use the variable simply reference it, prefixed with a dollar sign, in the metric definition:
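
For example, assuming you named the variable ThreadPool, the series definition in a graph's metric query might look something like this sketch, with Grafana substituting the currently selected value at query time:

/.*Oracle_BI_Thread_Pool\.$ThreadPool\..*/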

or in the title:

To change the value selected just use the dropdown from the top of the screen:

Annotations

Another very nice feature of Grafana is Annotations. These are overlays on each graph at a given point in time to provide additional context to the data. How I use it is when analysing test data to be able to see what script I ran when:

There’s two elements to Annotations – setting them up in Grafana, and getting the data into the backend (InfluxDB in this case, but they work with other data sources such as Graphite too).

Storing an Annotation

An annotation is nothing more than some time series data, but typically a string at a given point in time rather than a continually changing value (measure) over time.

To store it just chuck the data at InfluxDB and it creates the necessary series. In this example I'm using one called events but it could be called foobar for all it matters. You can read more about putting data into InfluxDB here and choose the method most suitable to the event that it is you want to record to display as an annotation. I'm running some bash-based testing, so curl fits well here, but if you were using a python program you could use the python InfluxDB client, and so on.

Sending data with curl is easy, and looks like this:

curl -X POST -d '[{"name":"events","columns":["id","action"],"points":[["big load test","start"]]}]' 'http://monitoring-server.foo.com:8086/db/carbon/series?u=root&p=root'

The main bit of interest, other than the obvious server name and credentials, is the JSON payload that we’re sending. Pulling it out and formatting it a bit more nicely:

{
   "name":"events",
   "columns":[
      "id",
      "action"
   ],
   "points":[  
      [  
         "big load test",
         "start"
      ]
   ]
}

So the series ("table") we're loading is called events, and we're going to store an entry for this point in time with two columns, id and action, storing the values big load test and start respectively. Interestingly (and something that's very powerful), InfluxDB's schema can evolve in a way that no traditional RDBMS could. Never mind that we've not had to define events before loading it, we could even load it at subsequent time points with more columns if we want, simply by sending them in the data payload.
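
To illustrate, a later load could simply include an extra column (the comment column here is made up purely for the example) and InfluxDB will absorb it without any up-front schema change:

curl -X POST -d '[{"name":"events","columns":["id","action","comment"],"points":[["big load test","end","completed with no errors"]]}]' 'http://monitoring-server.foo.com:8086/db/carbon/series?u=root&p=root'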

Coming back to real-world usage, we want to make the load as dynamic as possible, so with a few variables and a bit of bash magic we have something like this that will automatically load to InfluxDB the start and end time of every load test that gets run, along with the name of the script that ran it and the host on which it ran:

INFLUXDB_HOST=monitoring-server.foo.com
INFLUXDB_PORT=8086
INFLUXDB_USER=root
INFLUXDB_PW=root
HOSTNAME=$(hostname)
SCRIPT=`basename $0`
curl -X POST -d '[{"name":"events","columns":["host","id","action"],"points":[["'"$HOSTNAME"'","'"$SCRIPT"'","start"]]}]' "http://$INFLUXDB_HOST:$INFLUXDB_PORT/db/carbon/series?u=$INFLUXDB_USER&p=$INFLUXDB_PW"

echo 'Load testing bash code goes here. For now let us just go to sleep'
sleep 60

curl -X POST -d '[{"name":"events","columns":["host","id","action"],"points":[["'"$HOSTNAME"'","'"$SCRIPT"'","end"]]}]' "http://$INFLUXDB_HOST:$INFLUXDB_PORT/db/carbon/series?u=$INFLUXDB_USER&p=$INFLUXDB_PW"

Displaying annotations in Grafana

Once we’ve got a series (“table”) in InfluxDB with our events in, pulling them through into Grafana is pretty simple. Let’s first check the data we’ve got, by going to the InfluxDB web UI (http://influxdb:8083) and from the Explore Data » page running a query against the series we’ve loaded:

select * from events

The time value is in epoch milliseconds, and the remaining values are whatever you sent to it.

Now in Grafana enable Annotations for the dashboard (via the cog in the top-right corner)

Once enabled use the cog in the top-left corner to open the menu from which you select the Annotations dialog. Click on the Add tab. Give the event group a name, and then the InfluxDB query that pulls back the relevant data. All you need to do is take the above query that you used to test out the data and append the necessary time predicate where $timeFilter so that only events for the time window currently being shown are returned:

select * from events where $timeFilter

Click Add and then set your time window to include a period when an event was recorded. You should see a nice clear vertical line and a marker on the x-axis that when you hover over it gives you some more information:

You can use the Column Mapping options in the Annotations window to bring in additional information into the tooltip. For example, in my event series I have the id of the test, action (start/end), and the hostname. I can get this overlaid onto the tooltip by mapping the columns thus:

Which then looks like this on the graph tooltips:

N.B. currently (Grafana v1.9.1) when making changes to an Annotation definition you need to refresh the graph views after clicking Update on the annotation definition, otherwise you won’t see the change reflected in the annotations on the graphs.

Sparklines

Everything I’ve written about Grafana so far has revolved around the graphs that it creates, and unsurprisingly because this is the core feature of the tool, the bread and butter. But there are other visualisation options available – “Singlestat”, and “Text”. The latter is pretty obvious and I’m not going to discuss it here, but Singlestat, a.k.a. Sparkline and/or Performance Tiles, is awesome and well worth a look. First, an illustration of what I’m blathering about:

A nice headline figure of the current number of active sessions, along with a sparkline to show the trend of the metric.

To add one of these to your dashboard go to the green row menu icon on the left (it’s mostly hidden and will pop out when you hover over it) and select Add Panel -> singlestat.

On the panel that appears go to the edit screen as shown:

In the Metrics panel specify the series as you would with a graph, but remember you need to pull back just a single series – no point writing a regex to match multiple ones. Here I’m going to show the number of queued requests on a connection pool. Note that because I want to show the latest value I change the aggregation to last:

In the General tab set a title for the panel, as well as the width of it – unlike graphs you typically want these panels to be fairly narrow since the point is to show a figure not lots of detail. You’ll notice that I’ve also defined a Drilldown / detail link so that a user can click on the summary figure and go to another dashboard to see more detail.

The Options tab gives you the option to set font size, prefixes/suffixes, and is also where you set up sparkline and conditional formatting.

Tick the Spark line box to draw a sparkline within the panel – if you've not seen them before, sparklines are great visualisations for showing the trend of a metric without fussing with axes and specific values. Tick the Background mode to use the entire height of the panel for the graph and overlay the summary figure on top.

Now for the bit I think is particularly nice – conditional formatting of the singlestat panel. It's dead easy and not a new concept, but it is a really great way to let a user see at a glance if there's something that needs their attention. In the case of this example here, queueing connections, any queueing is dodgy and more than a few is bad (m'kay). So let's colour code it:

You can even substitute values for words – maybe the difference between 61 queued sessions and 65 is fairly irrelevant; it's the fact that there are queued sessions of that magnitude at all that is the real problem:

Note that the values are absolutes, not ranges. There is an open issue for this so hopefully that will change. The effect is nice though:

Conclusion

Hopefully this article has given you a good idea of what is possible with data stored in InfluxDB and visualised in Grafana, and how to go about doing it.

If you’re interested in OBIEE monitoring you might also be interested in the ELK suite of tools that complements what I have described here well, giving an overall setup like this:

You can read more about its use with OBIEE here, or indeed get in touch with us if you’d like to learn more or have us come and help with your OBIEE monitoring and diagnostics.

Categories: BI & Warehousing

Rittman Mead BI Forum Abstract Voting Now Open – For One Week Only!

Mon, 2015-02-02 06:52

The call for papers for the Rittman Mead BI Forum 2015 closed a couple of weeks ago and we’ve had some excellent submissions on topics ranging from OBIEE, Visualizations and data discovery through to in-memory analytics, cloud, big data and data integration. As always, we’re now opening up the abstract submission list for scoring, so that anyone considering coming to either the Brighton or Atlanta events can have a say in what abstracts are selected.

The voting forms, and event details, are below:

Voting is open for just one week, and will close at 5pm this coming Sunday (Feb 8th 2015). Shortly afterwards we'll announce the speaker line-up, and open up registrations for both events. For info, we've got a couple of additional abstracts coming in from Oracle on OBIEE12c, BI Cloud and Big Data Discovery, which I'll be considering as part of the final line-up for the two events.

Keep an eye on the blog later in February for the final speaker line-up, and details of how to register for Brighton and Atlanta.

Categories: BI & Warehousing

Data Lineage and Impact Analysis in Oracle BI Apps 11.1.1.8.1

Sun, 2015-02-01 09:41

In yesterday’s post I took a look at one of the new features in the 11.1.1.8.1 release of the BI Applications; integration between BI Apps 11g and Oracle Endeca Information Discovery. Whilst we’re on the topic then, I thought it’d be worth taking a look at another new feature introduced with BI Apps 11.1.1.8.1 – data lineage and impact analysis.

So what exactly is data lineage, and impact analysis? Data lineage is the path that data takes through your system from source to the final target reports and dashboards, and describes the lifecycle from raw data through to processed, validated and transformed information presented to your users. Impact analysis is what you do to determine what downstream data items will be affected by a change to a source table, column or data mapping, and has been a feature in many Oracle data integration tools in the past including Oracle Warehouse Builder and ODI11g, as shown in the screenshots below.

NewImage

You could also trace data lineage through the earlier 7.9.x releases of Oracle BI Applications by starting at the DAC Console, which recorded the source table and columns for a particular target warehouse table, and the DAC tasks (usually corresponding to Informatica workflows of the same name) used to load those tables. The DAC stopped at the warehouse layer though and didn’t contain any details of the dashboards and reports that used the warehouse data, and so if you wanted to trace data lineage back from a particular report or presentation layer column you had to step through the process manually in a way that I illustrated in this blog post from a few years ago. What these new data lineage and impact analysis features in BI Apps 11.1.1.8.1 do for us is bring the two sets of metadata together, along with configuration data from the BI Applications Configuration Manager application, to create an end-to-end data lineage view of the BI Apps dataset.

Data within the BI Apps 11.1.1.8.1 system can be thought of as going through seven different layers, as shown in the diagram below. Starting at the top-level dashboards and reports, these map onto OBIEE presentation tables and columns, which in-turn are selected from business model columns that then map back to the physical tables and columns in the Oracle Business Intelligence Applications data warehouse. These data warehouse tables are loaded in two stages, first using source-specific data mappings from, for example, Oracle E-Business Suite 12.1.3, and then using a set of source-independent mappings that take standardised staging datasets from these sources and map them into the target data warehouse tables. Our data lineage and impact analysis routines have to be aware of these seven stages and show us how data moves and is transformed between each stage.

NewImage

The way that BI Apps 11.1.1.8.1 data lineage works is to use ODI11g to extract metadata from the BI Apps Configuration Manager underlying tables, and the ODI11g repository, and combine that with RPD and catalog metadata you have to manually extract and copy into files for ODI to also upload. An ODI load plan supplied by Oracle then combines these datasets into a final set of data lineage tables also stored on the target data warehouse schema, and you can create your own data lineage and impact analysis reports or start with the ones Oracle also provide with this new feature.

NewImage

To load these data lineage tables, you run a predefined load plan from ODI Studio or ODI console after checking all connections to the various sources are set up correctly. The load plan in-turn runs a number of interfaces that load lineage information from the RPD and catalog extracts, BI Configuration Manager tables and ODI repository tables, with this load plan having to run outside of the main BI Apps managed data loads – which makes sense as you have to manually re-extract the RPD and catalog metadata anyway, and you’ll probably want to run the data lineage reload after every development release of the BI Apps system rather than every day, for example.

NewImage

Once you’ve loaded the data lineage tables, the subject area you can then select from to create lineage and impact reports covers all the stages in the data load, and also extends to OTBI (Oracle Transactional Business Intelligence, more on that in a future post) if you use that in combination with the BI Apps (or OTBI EE, as it’s called for cloud-based Fusion installations).

NewImage

You also get a set of starter dashboards and analyses for displaying the lineage for dashboard objects, presentation tables and columns, down to tables and columns in the BI Apps data warehouse, and impact for source models, columns, variables and so on.

NewImage

It's definitely a good start, a useful resource. Going back to the days of OWB it'd be nice if this were built directly into ODI Studio, and the steps to identify and then export the RPD and catalog metadata are pretty manual, but it's better than having to step through the metadata layers yourself as you had to do with the previous 7.9.x versions of the BI Apps. More details on data lineage and impact analysis in BI Apps 11.1.1.8.1 can be found in the online docs, including the configuration steps you'll need to carry out before doing the first data lineage load.

Categories: BI & Warehousing

Endeca Information Discovery Integration with Oracle BI Apps 11.1.1.8.1

Sat, 2015-01-31 10:46

One of the new features that made its way into Oracle BI Apps 11.1.1.8.1, and that completely passed me by at the time, was integration with Oracle Endeca Information Discovery 3.1 via a new ODI11g plug-in. As this new integration brings the Endeca “faceted search” and semi-structured data analysis capabilities to the BI Apps this is actually a pretty big deal, so let’s take a quick look now at what this integration brings and how it works under-the-covers.

Oracle BI Apps 11.1.1.8.1 integration with Oracle Endeca Information Discovery 3.1 requires OEID to be installed either on the same host as BI Apps or on its own server, and uses a custom ODI11g KM called “IKM SQL to Endeca Server 3.1.0” that needs to be separately downloaded from Oracle’s Edelivery site. The KM download also comes with an ODI Technology XML import file that adds Endeca Server as a technology type in the ODI Topology Navigator, and its here that the connection to the OEID 3.1 server is set up for subsequent data loads from BI Apps 11g.

NewImage

As the IKM has “SQL” as the loading technology specified this means any source that can be extracted from using standard JDBC drivers can be used to load the Endeca Server, and the KM itself has options for creating the schema to go with the data upload, creating the Endeca Server record spec, specifying the collection name and so on. BI Apps 11.1.1.8.1 ships with a number of interfaces (mappings) and ODI load plans to populate Endeca Server domains for specific BI Apps fact groups, with each interface taking all of the relevant repository fact measures and dimension attributes and loading them into one big Endeca Server denormalized schema.

NewImage

When you take a look at the physical (flow) tab for the interface, you can see that the data is actually extracted via the BI Apps RPD and the BI Server, then staged in the BI Apps DW database before being loaded into the Endeca Server domain. Column metadata in the BI Server repository is then used to define the datatypes, names, sizes and other details of the Endeca Server domain schema, an improvement over just pointing Endeca Information Discovery Integrator at the BI Apps database schema.

NewImage

The 11.1.1.8.1 ODI repository also provides a pre-built Endeca Server load plan that allows you to turn-on and turn-off loads for particular fact groups, and sets up the data domain name and other variables needed for the Endeca Server data load.

NewImage

Once you’ve loaded the BI Apps data into one or more Endeca Server data domains, there are a number of pre-build sample Endeca Studio applications you can use to get a feel for the capabilities of the Endeca Information Discovery Studio interface. For example, the Student Information Analytics Admissions and Recruiting sample application lets you filter on any attribute from the dataset on the left-hand side, and then returns instantaneously the filtered dataset displayed in the form of a number of graphs, word clouds and other visualizations.

NewImage

Another interesting sample application is around employee expense analysis. Using features such as word clouds, it's relatively easy to see which expense categories are being used the most, and you can use Endeca Server's search and text parsing capabilities to extract keywords and other useful bits of information out of free-text areas, for example supporting text for expense claims.

NewImage

Typically, you’d use Endeca Information Discovery as a “dig around the data, let’s look for something interesting” tool, with all of your data attributes listed in one place and automatic filtering and focusing on the filtered dataset using the Endeca Server in-memory search and aggregation engine. The sample applications that ship with 11.1.1.8.1 are just meant as a starting point though (and unlike the regular BI Apps dashboards and reports, aren’t themselves supported as an official part of the product), but as you can see from the Manufacturing Endeca Studio application below, there’s plenty of scope to create returns, production and warranty-type applications that let you get into the detail of the attribute and textual information held in the BI Apps data warehouse.

NewImage

More details on the setup process are in the online docs, and if you’re new to Endeca Information Discovery take a look at our past articles on the blog on this product.

Categories: BI & Warehousing

Rittman Mead at BIWA Summit 2015

Sun, 2015-01-25 20:16

I’m writing this in my hotel room in downtown San Francisco, with my colleague Francesco Tisiot flying in tonight and US colleagues Jordan Meyer, Daniel Adams and Andy Rocha travelling down tomorrow and Monday for next week’s BIWA Summit 2015. The Business Intelligence, Warehousing and Analytics SIG is a part of IOUG and this year also hosts the 11th Annual Oracle Spatial Summit, giving us three days of database-centric content touching most areas of the Oracle BI+DW stack. 

NewImage

Apart from our own sessions (more in a moment), BIWA Summit 2015 has a great line-up of speakers from the Oracle Database and also Hadoop worlds, featuring Cloudera's Doug Cutting and Oracle's Paul Sondereggar, and with most of the key names from the Oracle BI+DW community including Christian Screen, Tim & Dan Vlamis, Tony Heljula, Kevin McGinley and Stewart Bryson, Brendan Tierney, Eric Helmer, Kyle Hailey and Rene Kuipers. From a Rittman Mead perspective we're delivering a number of sessions over the three days, details below:

  • Jordan Meyer, “Intro to Data Science for Oracle Professionals” – Tuesday Jan 27th, 9.00am – 9.50am
  • Mark Rittman, “Bringing Oracle Big Data SQL to OBIEE and ODI” – Wednesday Jan 28th 11.00am – 11.50am
  • Daniel Adams & Andy Rocha, “OBIEE Data Visualization: The How and the Why” – Wednesday Jan 28th, 1.00pm – 1.50pm (pt.1) and 2.00 – 2.50pm (pt.2)
  • Francesco Tisiot, “Oracle BI Cloud Service : What is It, and Where Will it Be Useful?” – Wednesday Jan 28th, 4.00pm – 4.50pm
  • Mark Rittman, “End-to-End Hadoop Development using OBIEE, ODI and Oracle Big Data” – Thursday Jan 29th, 9.30am – 10.20am

Rittman Mead are also proud to be one of the media sponsors for the BIWA Summit 2015, so look out for blogs and other activity from us, and if you're coming to the event we'll look forward to seeing you there.

Categories: BI & Warehousing

Rittman Mead’s Development Cluster, EM12c and the Blue Mendora VMware EM Plugin

Fri, 2015-01-23 15:56

For development and testing purposes, Rittman Mead run a VMWare VSphere cluster made up of a number of bare-metal servers hosting Linux, Windows and other VMs. Our setup has grown over the years from a bunch of VMs running on Mac Mini servers to where we are now, and was added-to considerably over the past twelve months as we started Hadoop development – a typical Cloudera CDH deployment we work with requires six or more nodes along with the associated LDAP server, Oracle OBIEE + ODI VMs and NAS storage for the data files. Last week we added our Exalytics server as a repurposed 1TB ESXi VM server giving us the topology shown in the diagram below.

NewImage

One of the purposes of setting up a development cluster like this was to mirror the types of datacenter environments our customers run, and we use VMWare VSphere and VCenter Server to manage the cluster as a whole, using technologies such as VMWare VMotion to test out alternatives to WebLogic, OBIEE and Oracle Database HA. The screenshot below shows the cluster setup in VMWare VCenter.

NewImage

We're also big advocates of Oracle Enterprise Manager as a way of managing and monitoring a customer's entire Oracle BI & data warehousing estate, using the BI Management Pack to manage OBIEE installations as a whole, building alerts off of OBIEE Usage Tracking data, and creating composite systems and services to monitor a DW, ETL and BI system from end-to-end. We register the VMs on the VMWare cluster as hosts and services in a separate EM12cR4 install and use it to monitor our own development work, and show the various EM Management Packs to customers and prospective clients.

NewImage

Something we’ve wanted to do for a while though is bring the actual VM management into Enterprise Manager as well, and to do this we’ve also now setup the Blue Mendora VMWare Plugin for Enterprise Manager, which connects to your VMWare VCenter, ESXi, Virtual Machines and other infrastructure components and brings them into EM as monitorable and manageable components. The plugin connects to VCenter and the various ESXi hosts and gives you the ability to list out the VMs, Hosts, Clusters and so on, monitor them for resource usage and set up EM alerts as you’d do with other EM targets, and perform VCenter actions such as stopping, starting and cloning VMs.

NewImage

What’s particularly useful with such a virtualised environment though is being able to include the VM hypervisors, VM hosts and other VMWare infrastructure in the composite systems we define; for example, with a CDH Hadoop cluster that authenticates via LDAP and Kerberos, is used by OBIEE and ODI and is hosted on two VMWare ESXi hosts part of a VSphere cluster, we can get an overall picture of the system health that doesn’t stop at the host level.

NewImage

If your organization is using VMWare to host your Oracle development, test or production environments and you’re interested in how Enterprise Manager can help you monitor and manage the whole estate, including the use of Blue Mendora’s VMWare EM Plugin, drop me a line and I’d be happy to take you through what’s involved.

Categories: BI & Warehousing

Enable Your Dashboard Designers to Concentrate on User Experience Rather Than Syntax (or How to Add a Treemap in Two Lines)

Mon, 2015-01-19 08:07

JavaScript is a powerful tool that can be used to add functionality to OBIEE dashboards.  However, for many whose wheelhouses are more naturally aligned with Stephen Few rather than John Resig, adding JavaScript to a dashboard can be intimidating. To facilitate this process, steps can be taken to centralize and simplify the invocation of this code.  In this post, I will demonstrate how to create your very own library of custom HTML tags.  These tags will empower anyone to add 3rd party visualizations from libraries like D3 without a lick of JavaScript experience.

What is a “Custom Tag”?

Most standard HTML tags provide very simple behaviors. Complex behaviors have typically been reserved for JavaScript. While, for the most part, this is still the case, custom tags can be used to provide a more intuitive interface to the JavaScript. The term "custom tag library" refers to a developer-defined library of HTML tags that are not natively supported by the HTML standard, but are instead included at run-time. For example, one might implement a <RM-MODAL> tag to produce a button that opens a modal dialog. Behind the scenes, JavaScript will be calling the shots, but the code in your narrative view or dashboard text section will look like plain old HTML tags.

Developing a JavaScript Library

The first step when incorporating an external library onto your dashboard is to load it.  To do so, it’s often necessary to add JavaScript libraries and css files to the <head> of a document to ensure they have been loaded prior to being called.  However, in OBIEE we don’t have direct access to the <head> from the Dashboard editor.  By accessing the DOM, we can create style and script src objects on the fly and append them to the <head>.  The code below appends external scripts to the document’s <head> section.

Figure 1. dashboard.js

01 function loadExtFiles(srcname, srctype){
02    if (srctype=="js"){
03      var src=document.createElement('script')
04      src.setAttribute("type","text/JavaScript")
05      src.setAttribute("src", srcname)
06    } else if (srctype=="css"){
07       var src=document.createElement("link")
08       src.setAttribute("rel", "stylesheet")
09       src.setAttribute("type", "text/css")
10       src.setAttribute("href", srcname)
11    }
12
13    if ((typeof src!=="undefined") && (src!==false)) {
14       parent.document.getElementsByTagName("head")[0].appendChild(src)
15    }
16 }
17
18 window.onload = function() {
19   loadExtFiles("/rm/js/d3.v3.min.js", "js")
20   loadExtFiles("/rm/css/visualizations.css", "css")
21   loadExtFiles("/rm/js/visualizations.js", "js")
22 }

In addition to including the D3 library, we have included a CSS file and a JavaScript file, named visualizations.css and visualizations.js respectively.  The visualizations.css file contains the default formatting for the visualizations and visualizations.js is our library of functions that collect parameters and render visualizations.

The D3 gallery provides a plethora of useful and not so useful examples to fulfill all your visualizations needs.  If you have a background in programming, these examples are simple enough to customize.  If not, this is a tall order.  Typically the process would go something like this:

  1. Determine how the data is currently being sourced.
  2. Rewrite that section of the code to accept data in a format that can be produced by OBIEE.  Often this requires a bit more effort in the refactoring as many of the examples are sourced from CSV files or JSON.  This step will typically involve writing code to create objects and add those objects to an array or some other container.  You will then have to determine how you are passing this data container to the D3 code.  Will the D3 code be rewritten as a function that takes in the array as a parameter? Will the array be scoped in a way that the D3 code can simply reference it?
  3. Identify how configurations like colors, sizing, etc. are set and determine how to customize them as per your requirements.
  4. Determine what elements need to be added to the narrative view to render the visualization.

If you are writing your own visualization from scratch, these same steps are applied in the design phase.  Either way, the JavaScript code that results from performing these steps should not be the interface exposed to a dashboard designer.  The interface should be as simple and understandable as possible to promote re-usability and avoid implementation syntax errors.  That’s where custom HTML tags come in.

Wait…  Why use tags rather than exposing the Javascript function calls?

Using custom tags allows for a more intuitive implementation than JavaScript functions. Simple JavaScript functions do not support named arguments. What this means is JavaScript depends on order to differentiate arguments.

<script>renderTreemap("@1", "@2", @3, null, null, "Y");</script>

In the example above, anyone viewing this call without being familiar with the function definition would have a hard time deciphering the parameters.  By using a tag library to invoke the function, the parameters are more clear.  Parameters that are not applicable for the current invocation are simply left out.

<rm-treemap name="@1" grouping="@2" measure=@3 showValues="Y"/>
 

That being said, you should still familiarize yourself with the correct usage prior to using them.

Now some of you may be saying that named arguments can be done using object literals, but the whole point of this exercise is to reduce complexity for front end designers, so I wouldn’t recommend this approach within the context of OBIEE.

What do these tags look like and how do they pass the data to the JavaScript?

For this example, we will be providing a Treemap visualization.  As could be expected, the example provided by the link is sourced by a JSON object.  For our use, we will have to rewrite that code to source the data from the attributes in our custom HTML tags.  The D3 code is expecting a hierarchical object made up of leaf node objects contained within grouping objects.  The leaf node objects consists of a “name” field and a “size” field.  The grouping object consists of a “name” field and a “children” field that contains an array of leaf node objects.  By default, the size values, or measures, are not displayed and are only used to size the nodes.  Additionally, the dimensions of the treemap are hard coded values.  Inevitably users will want to change these settings, so for each of the settings which we want to expose for configuration we will provide attribute fields on the custom tag we build. Ultimately, that is the purpose of this design pattern.

  1. Name your custom tag
  2. Identify all your inputs
  3. Create a tag attribute for each input
  4. Within a JavaScript library, extract and organize the values
  5. Pass those values to D3

For this example we will configure behaviours for a tag called <rm-treemap>.  Note: It is a good practice to add a dash to your custom tags to ensure they will not match an existing HTML tag.  This tag will support the following attributes:

  • name – Name of the dimension being measured
  • measure – Used to size the node boxes
  • grouping – Used to determine color for node boxes
  • width: Width in pixels
  • height: Height in pixels
  • showValues: Y/N

It will be implemented within a narrative view like so:

<rm-treemap name="@1" grouping="@2" measure=@3 width="700" height="500" showValues="Y"/>
 

In order to make this tag useful, we need to bind behaviors to it that are controlled by the tag attributes. To extract the attribute values from <rm-treemap>, the JavaScript code in visualizations.js will use two methods from the Element Web API, Element.getElementsByTagName and Element.getAttribute.

Fig 2. Lines 8-11 use these methods to identify the first <rm-treemap> tag and extract the values for width, height and showValues. It was necessary to specify a single element, in this case the first one, as getElementsByTagName returns a collection of all matching elements within the HTML document. There will most likely be multiple matches as the OBIEE narrative field will loop through query results and produce a <rm-treemap> tag for each row.

In Fig 2. Lines 14-41, the attributes for name, measure and grouping will be extracted and bound to either leaf node objects or grouping objects.  Additionally lines 11 and 49-50 configure the displayed values and the size of the treemap.  The original code was further modified on line 62 to use the first <rm-treemap> element to display the output.

Finally, lines 99-101 ensure that this code is only executed when an <rm-treemap> tag is detected on the page. The last step before deployment is documentation. If you are going to go through all the trouble of building a library of custom tags, you need to set aside the time to document their usage. Otherwise, regardless of how much you simplified the usage, no one will be able to use them.

Figure 2. visualizations.js

01 var renderTreemap = function () {
 02    // Outer Container (Tree)
 03    var input = {};
 04    input.name = "TreeMap";
 05    input.children = [];
 06
 07    //Collect parameters from first element
 08    var treeProps = document.getElementsByTagName("rm-treemap")[0];
 09    canvasWidth = treeProps.getAttribute("width") ? treeProps.getAttribute("width") : 960;
 10    canvasHeight = treeProps.getAttribute("height") ? treeProps.getAttribute("height") : 500;
 11    showValues = treeProps.getAttribute("showValues").toUpperCase();
 12    
 13    // Populate collection of data objects with parameters
 14    var mapping = document.getElementsByTagName("rm-treemap");
 15    for (var i = 0; i < mapping.length; i++) {
 16          var el = mapping[i];
 17          var box = {};
 18          var found = false;
 19
 20          box.name = (showValues == "Y") ? el.getAttribute("name") +
 21                                          "<br> " +
 22                               el.getAttribute("measure") : el.getAttribute("name");
 23          box.size = el.getAttribute("measure");
 24          curGroup = el.getAttribute("grouping");
 25
 26          // Add individual items to groups
 27         for (var j = 0; j < input.children.length; j++) {
 28            if (input.children[j].name === curGroup) {
 29                input.children[j].children.push(box);
 30                found = true;
 31            }
 32          }
 33
 34          if (!found) {
 35            var grouping = {};
 36            grouping.name = curGroup;
 37            grouping.children = [];
 38            grouping.children.push(box);
 39            input.children.push(grouping);
 40          }
 41    }
 42
 43    var margin = {
 44        top: 10,
 45        right: 10,
 46        bottom: 10,
 47        left: 10
 48    },
 49    width = canvasWidth - margin.left - margin.right,
 50    height = canvasHeight - margin.top - margin.bottom;
 51
 52    // Begin D3 visualization
 53     var color = d3.scale.category20c();
 54
 55     var treemap = d3.layout.treemap()
 56        .size([width, height])
 57        .sticky(true)
 58        .value(function (d) {
 59        return d.size;
 60    });
 61
 62    var div = d3.select("rm-treemap").append("div")
 63        .style("position", "relative")
 64        .style("width", (width + margin.left + margin.right) + "px")
 65        .style("height", (height + margin.top + margin.bottom) + "px")
 66        .style("left", margin.left + "px")
 67        .style("top", margin.top + "px");
 68
 69    var node = div.datum(input).selectAll(".treeMapNode")
 70        .data(treemap.nodes)
 71        .enter().append("div")
 72        .attr("class", "treeMapNode")
 73        .call(position)
 74        .style("background", function (d) {
 75        return d.children ? color(d.name) : null;
 76    })
 77        .html(function (d) {
 78        return d.children ? null : d.name;
 79    });
 80
 81    function position() {
 82        this.style("left", function (d) {
 83            return d.x + "px";
 84        })
 85            .style("top", function (d) {
 86            return d.y + "px";
 87        })
 88            .style("width", function (d) {
 89            return Math.max(0, d.dx - 1) + "px";
 90        })
 91            .style("height", function (d) {
 92            return Math.max(0, d.dy - 1) + "px";
 93        });
 94    }
 95    //End D3 visualization
 96 }
 97
 98  // Invoke visualization code only if rm-treemap tag exists
 99  var doTreemap = document.getElementsByTagName("rm-treemap");
100  if (doTreemap.length > 0) {
101    renderTreemap();
102  }

 

Figure 3. visualizations.css

01  .treeMapNode {
02     border: solid 1px white;
03    border-radius: 5px;
04    font: 10px sans-serif;
05    line-height: 12px;
06    overflow: hidden;
07    position: absolute;
08    text-indent: 2px;
09  }
Putting it all together

The first step to implementing this code is to make it accessible. To do this, you will need to deploy your code to the weblogic server. Many years ago, Venkatakrishnan Janakiraman detailed how to deploy code to weblogic in his blog about skinning. For this application the process still applies; however, you don't need to be concerned with the bits about modifying the instanceconfig.xml or skinning.
 

Once the code has been deployed to the server, there are literally only two lines of code required to implement this visualization.  First, the libraries need to be included, which is done by sourcing in the dashboard.js file.  This can be done within the Narrative view's prefix field, but I have chosen to add it to a text section on the dashboard.  This allows multiple analyses to use the libraries without duplicating the load process in multiple places.

blog-edit-dashboard

The text section should be configured as follows. (Note: the path to dashboard.js is relative to the root path specified in your deployment.)

blog-text-section
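In plain text, the contents of that text section would be roughly the two lines below. This is a sketch rather than a copy of the screenshot, and the /analyticsRes path is an assumption based on a typical static-file deployment, so substitute the root path you actually deployed to. Remember to tick the "Contains HTML Markup" checkbox so the tag is rendered rather than displayed as text.

<!-- pulls in dashboard.js, which in turn loads visualization.js and the CSS -->
<script type="text/javascript" src="/analyticsRes/dashboard.js"></script>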

 

From the Narrative View, add the <rm-treemap> tag to the Narrative field and populate the attributes with the appropriate data bind variables and your desired settings.

blog-narrative-view
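As a rough illustration only – the exact attribute list depends on the top of visualization.js, which isn't reproduced here – the Narrative field would contain something along these lines, where @1, @2 and @3 are the Narrative view's column bind variables and the attribute names match the getAttribute() calls in the script above:

<!-- one element is emitted per result row; name, measure and grouping feed the treemap boxes -->
<rm-treemap name="@1" measure="@2" grouping="@3"></rm-treemap>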

 

This should result in the following analysis.

blog-treemap

In summary:

  1. Deploy the dashboard.js, visualization.js and visualization.css files to WebLogic
  2. From a dashboard text section, source in dashboard.js, which will in turn include visualization.js and visualization.css
  3. Add the <rm-treemap> tag to the Narrative field of a Narrative view.

As you can see, implementing custom HTML tags to serve as the interface for a D3 visualization will save your dashboard designers from having to sift through dozens, if not hundreds, of lines of confusing code.  This will reduce implementation errors, as the syntax is much simpler than JavaScript, and will promote conformity, as all visualizations will be sourced from a common library.  Hopefully this post was informative and will inspire you to consider this pattern, or a similarly purposed one, to make your code easier to implement.

Categories: BI & Warehousing

Why and How to use Oracle Metadata Management 12c. Part 1: Getting Started

Thu, 2015-01-15 07:34

At OOW 2014, Oracle announced the new Oracle Metadata Management solution, and in mid-October released its first version – OMM 12.1.3.0.0.

At the end of November 2014 the second version was released – OMM 12.1.3.1.0 – with new features and some bug fixes.

But first things first: what is Oracle Metadata Management? And why would we want to use it?

One of the biggest problems that we face today is the proliferation of different systems, data sources and solutions for BI, ETL and so on within the same company. Not only for end users but also for technical people (from sysadmins, data scientists and data stewards to developers), it is quite difficult to track which data is used by which applications. In some cases it is almost impossible to perform an impact analysis if someone wants to change a table, or if the way that a sales measure is calculated needs to change. The more systems involved, the bigger the problem.

Oracle Metadata Management (OMM) provides a solution to this problem. It is a complete metadata management platform that can reverse-engineer (harvest) and catalog metadata from almost any source: relational, big data, ETL, BI, data modelling, etc.

OMM allows us to perform interactive searching, data lineage, impact analysis, semantic definition and semantic usage analysis within the catalog. Most importantly, metadata from different providers (Oracle and/or third-party) can be related (stitched) together, so you have the complete path of the data from source to report, or vice versa. In addition, it manages versioning and comparison of metadata models.

The Oracle Metadata Management solution offers two products: OEMM (Oracle Enterprise Metadata Management) and OMM for OBI (Oracle Metadata Management for Oracle Business Intelligence). With the first we can use metadata providers from Oracle and third-party technologies, while OMM for OBI allows us to use metadata from databases, OBIEE, ODI and DAC.

We will see in this series of posts how to use each of these options, the differences between them, and which will be the best option depending on your environment.

In this first post we will focus on the installation process and the requirements for it.

Minimum Requirements for a small test environment 

It is important to note, as is also well explained in the Readme document, that the following are the minimum requirements for a tutorial or a small business case, not for a larger system.

Browser

Any of these browsers, or newer versions of them, with at least the Adobe Flash v8 plugin can be used: Microsoft Internet Explorer (IE) v10, Mozilla Firefox v30, Google Chrome v30, Apple Safari v6.

Hardware

2 GHz or higher quad-core processor

4 GB RAM (8 GB for a 64-bit OS using a 64-bit web application server)

10 GB of disk space (storage is primarily in the database server)

Operating System

Microsoft Windows 2008 Server, Windows 2012 Server, Windows 7, Windows 8, or Windows 8.1. Be sure that you have full Administrator privileges when running the installer and that Microsoft .NET Framework 3.5 or higher is installed.

Other operating systems require manual install/setup, so they are not supported by this version.

Web Application Server

The installer comes with Apache Tomcat as the web application server and Oracle JRE 6 as the Java Runtime Environment. Other web application servers (including Oracle WebLogic) require manual install/setup and are not supported by this version.

Database Server

For the database server you can only use an Oracle Database, from 10gR2 to 12c 64-bit, as the repository for OMM. You can create a new instance or reuse your existing Oracle database server, but you need admin privileges in the database.

A very important observation is that the character set MUST be AL32UTF8 (UTF8). This is because the Oracle interMedia search can only index columns of type VARCHAR or CLOB (not the national variants NVARCHAR and NCLOB respectively). Otherwise you will receive this error message when you run OMM for the first time:

erroromm2

To solve this, you can create a new instance of the database, or if your database already has data in it, there are a couple of notes in My Oracle Support (260192.1 and 788156.1) describing how to change any character set to AL32UTF8.
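If you are not sure which character set your candidate repository database is currently using, this quick query against the standard data dictionary will tell you (shown here purely as a sanity check):

select value
from   nls_database_parameters
where  parameter = 'NLS_CHARACTERSET';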

In addition, the CTXSYS user must exist in the database. If it doesn't exist, the script that creates it and grants it the required privileges can be found in <ORACLE_HOME>/ctx/admin/catctx.sql.

Preparing to install

Step 1 – Download the software. You can download the software from the OTN site, http://www.oracle.com/technetwork/middleware/oemm/downloads/index.html, or from e-delivery.oracle.com instead.

Step 2 – Create a Database Schema as Repository. Before starting the installation, a database schema needs to be created as a repository for OMM to keep all of its objects, like models, configurations, etc. (we will see all of these objects in the next posts).

To do that, create a user in the database:

create user MIR identified by <password> quota unlimited on users

And give it the following grants:

grant create session to MIR;
grant create procedure to MIR;
grant create sequence to MIR;
grant create table to MIR;
grant create trigger to MIR;
grant create type to MIR;
grant create view to MIR;

We also need to grant the new user execute privileges on one package from CTXSYS and another from SYS:

grant execute on CTXSYS.CTX_DDL to MIR;
grant execute on SYS.DBMS_LOCK to MIR;

If you prefer (and it can also be a tidier solution), you can create specific tablespaces (a user tablespace and a temp tablespace) for that user, along the lines of the sketch below. I asked David Allan, who is always very generous with his time and knowledge, if this schema will be part of the RCU in future releases, but there is no plan to incorporate the MIR schema into it.
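A minimal sketch of that variation follows – the tablespace names, file names and sizes here are illustrative assumptions only, so size them for your own environment; the grants listed above still apply to the user:

create tablespace mir_data datafile 'mir_data01.dbf' size 1g autoextend on next 100m;
create temporary tablespace mir_temp tempfile 'mir_temp01.dbf' size 512m autoextend on next 100m;

create user MIR identified by <password>
  default tablespace mir_data
  temporary tablespace mir_temp
  quota unlimited on mir_data;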

Installation and Post-Install tasks

Step 3 – Install the software. We can now run the installation. The downloaded zip file contains an exe file; double-click on it to start the installation.

In the first screen, select the type of product that you want to install: OEMM or OMM for OBI. We choose Oracle Enterprise Metadata Management and press Next.

install1

In the next screen, you have access to the Readme document and release notes by pressing the View Readme button. After the installation you can find them in the OMM_Home/Documentation folder.

omm_install2

 

The next screen shows you the destination location, which you can change if you want. Keep the port numbers suggested on the following screen.

oemm_comb_screens

The last screen of the installation asks you to restart the computer in order to use the product.

omm_install5

Step 4 – Start OMM Server as a service. After you restart the computer, you need to configure the OMM Server as a service and start it. You can do this through the option shown in the Start menu, pressing the Start button, or by going directly to the Windows Services screen, right-clicking the OMM service and starting it.

oemm_config_service oemm_service1

Step 5 – Initialize OEMM. Run OEMM for the first time. We have everything ready to start using Oracle Metadata Management. Go to the URL http://localhost:11580/MM, execute the shortcut that was created on your desktop after the installation, or use the Windows Start menu.

We need to enter the connection details using the schema that we created in the database. Enter MIR as the Database User Id, its password and the database URL, and then press the Test Connection button. After you receive the Successful message, press the Save button to run the initialization process, where OEMM creates the objects in the database schema to manage the repository.
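As a reference point, a standard Oracle JDBC thin URL has the shape below – the host, port and service name here are assumptions for a local default install, so adjust them for your own database:

jdbc:oracle:thin:@//localhost:1521/orcl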

oemm_comb_screens2

This process takes a few minutes, after which you get confirmation that the initialization process was successful.

oemm_initialization_succ

Step 6 – Start OEMM. Close the browser tab and open the OEMM URL again (http://localhost:11580/MM). A login page appears; the default username and password are Administrator/Administrator.

oemm_loginpage

This is the main page of OEMM, where we are going to harvest (reverse-engineer) the metadata from different providers in the next posts.

oemm_main_page

If you want to change the password of the Administrator user, go to Tools > Administration at the top right of the page. Select the Administrator user and the user details will appear below.

oemm_change_pass

If you prefer to create another user with Administration privileges, just press the Add User button (plus icon) in the Administration page and enter the details for the new user:

oemm_add_new_user

We are using the Native LDAP authentication approach for this demo, but OEMM can also use an External LDAP for authentication.

As for the product documentation, you can access it through the Help option at the top right of the page. In the Contents tab you have all the topics (Harvesting, Administration, etc.) separated into folders, and within each of them all the details about that specific topic.

Installation of OMM for OBI

There are no differences between the installation processes for OEMM and OMM for OBI; just be sure to select the one that you want on the first screen of the installer. This is the login page for OMM for OBI.

 

omm4bi_loginpage

In the next post, we will see how the harvest (metadata import) process works using different metadata providers such as OBIEE, ODI and others.

Categories: BI & Warehousing

Concurrent RPD Development in OBIEE

Wed, 2015-01-14 07:06

OBIEE is a well established product, having been around in various incarnations for well over a decade. The latest version, OBIEE 11g, was released 3.5 years ago, and there are mutterings of OBIEE 12c already. In all of this time however, one thing it has never quite nailed is the ability for multiple developers to work with the core metadata model – the repository, known as the RPD – concurrently and in isolation. Without this, development is doomed to be serialised – with the associated bottlenecks and inability to scale in line with the number of developers available.

My former colleague Stewart Bryson wrote a series of posts back in 2013 in which he outlined the criteria for a successful OBIEE SDLC (Software Development LifeCycle) method. The key points were:

  • There should be a source control tool (a.k.a version control system, VCS) that enables us to store all artefacts of the BI environment, including RPD, Presentation Catalog, etc etc. From here we can tag snapshots of the environment at a given point as being ready for release, and as markers for rollback if we take a wrong turn during development.
  • Developers should be able to do concurrent development in isolation.
    • To do this, source control is mandatory in order to enable branch-based development, also known as feature-driven development, which is a central tenet of an Agile method.

Oracle’s only answer to the SDLC question for OBIEE has always been MUDE. But MUDE falls short in several respects:

  • It only manages the RPD – there is no handling of the Presentation Catalog etc
  • It does not natively integrate with any source control
  • It puts the onus of conflict resolution on the developer rather than the “source master” who is better placed to decide the outcome.

Whilst it wasn’t great, it wasn’t bad, and MUDE was all we had. Either that, or manual integration into source control (1, 2) tools, which was clunky to say the least. The RPD remained a single object that could not be merged or managed except through the Administration Tool itself, so any kind of automatic merge strategies that the rest of the software world were adopting with source control tools were inapplicable to OBIEE. The merge would always require the manual launching of the Administration Tool, figuring out the merge candidates, before slowly dying in despair at having to repeat such a tortuous and error-prone process on a regular basis…

Then back in early 2012 Oracle introduced a new storage format for the RPD. Instead of storing it as a single binary file, closed to prying eyes, it was instead burst into a set of individual files in MDS XML format.

For example, one Logical Table was now one XML file on disk, made up of entities such as LogicalColumn, ExprText, LogicalKey and so on:

It even came with a set of configuration screens for integration with source control. It looked like the answer to all our SDLC prayers – now us OBIEE developers could truly join in with the big boys at their game. The reasoning went something like:

  1. An RPD stored in MDS XML is no longer binary
  2. git can merge code that is plain text from multiple branches
  3. Let’s merge MDS XML with git!

But how viable is MDS XML as a storage format for the RPD used in conjunction with a source control tool such as git? As we will see, it comes down to the Good, the Bad, and the Ugly…

The Good

As described here, concurrent and unrelated developments on an RPD in MDS XML format can be merged successfully by a source control tool such as git. Each logical object is a file, so git just munges (that's the technical term) the files modified in each branch together to come up with a resulting MDS XML structure with the changes from each development in it.

The Bad

This is where the wheels start to come off. See, our automagic merging fairy dust is based on the idea that individually changed files can be spliced together, and that since MDS XML is not binary, we can trust a source control tool such as git to also work well with changes within the files themselves too.

Unfortunately this is a fallacy, and by using MDS XML we expose ourselves to greater complications than we would if we just stuck to a simple binary RPD merged through the OBIEE toolset. The problem is that whilst MDS XML is not binary, it is not unstructured either. It is structured, and it has application logic within it (mdsid, of which more below).

Within the MDS XML structure, individual first-class objects such as Logical Tables are individual files, and structured within them in the XML are child-objects such as Logical Columns:

Source control tools such as git cannot parse it, and therefore do not understand what is a real conflict versus an unrelated change within the same object. If you stop and think for a moment (or longer) quite what would be involved in accurately parsing XML (let alone MDS XML), you’ll realise that you basically need to reverse-engineer the Administration Tool to come up with an accurate engine.

We kind of get away with merging when the file differences are within an element in the XML itself. For example, the expression for a logical column is changed in two branches, causing clashing values within ExprText and ExprTextDesc. When this happens git will throw a conflict and we can easily resolve it, because the difference is within the element(s) themselves:

Easy enough, right?

But taking a similarly “simple” merge conflict where two independent developers add or modify different columns within the same Logical Table we see what a problem there is when we try to merge it back together relying on source control alone.

Obvious to a human, and obvious to the Administration Tool is that these two new columns are unrelated and can be merged into a single Logical Table without problem. In a paraphrased version of MDS XML the two versions of the file look something like this, and the merge resolution is obvious:

But a source control tool such as git looks at the MDS XML as a plaintext file, not understanding the concept of an XML tree and sibling nodes, and throws its toys out of the pram with a big scary merge conflict:

Now the developer has to roll up his or her sleeves and try to reconcile two XML files – with no GUI to support or validate the change made except loading it back into the Administration Tool each time.

So if we want to use MDS XML as the basis for merging, we need to restrict our concurrent developments to completely independent objects. But, that kind of hampers the ideal of more rapid delivery through an Agile method if we’re imposing rules and restrictions like this.

The Ugly

This is where it gets a bit grim. Above we saw that MDS XML can cause unnecessary (and painful) merge conflicts. But what about if two developers inadvertently create the same object concurrently? The behaviour we'd expect to see is a single resulting object. But what we actually get is both versions of the object, and a dodgy RPD. Uh oh.

Here are the two concurrently developed RPDs, produced in separate branches isolated from each other:

And here’s what happens when you leave it to git to merge the MDS XML:

The duplicated objects now cannot be edited in the Administration Tool in the resulting merged RPD – any attempt to save them throws the above error.

Why does it do this? Because the MDS XML files are named after a globally unique identifier known as the mdsid, and not their corresponding RPD qualified name. And because the mdsid is unique across developments, two concurrent creations of the same object end up with different mdsid values, and thus different filenames.

Two files from separate branches with different names are going to be seen by source control as being unrelated, and so both are brought through in the resulting merge.

As with the unnecessary merge conflict above, we could define process around same object creation, or add in a manual equalise step. The issue really here is that the duplicates can arise without us being aware because there is no conflict seen by the source control tool. It’s not like merging an un-equalised repository in the Administration Tool where we’d get #1 suffixes on the duplicate object so that at least (a) we spot the duplication and (b) the repository remains valid and the duplicate objects available to edit.

MDS XML Repository opening times

Whether a development strategy based on MDS XML is for you or not, another issue to be aware of is that for anything beyond a medium sized RPD opening times of an MDS XML repository are considerable. As in, a minute from binary RPD, and 20 minutes from MDS XML. And to be fair, after 20 minutes I gave up on the basis that no sane developer would write off that amount of their day simply waiting for the repository to open before they can even do any work on it. This rules out working with any big repositories such as that from BI Apps in MDS XML format.

So is MDS XML viable as a Repository storage format?

MDS XML does have two redeeming features:

  1. It reduces the size of your source control repository, because on each commit you will be storing just a delta of the overall repository change, rather than the whole binary RPD each time.
  2. For tracking granular development progress and changes you can identify what was modified through the source control tool alone – because the new & modified objects will be shown as changes:

But the above screenshots both give a hint of the trouble in store. The mdsid unique identifier is used not only in filenames – causing object duplication and strange RPD behaviour – but also within the MDS XML itself, referencing other files and objects. This means that as an RPD developer, or RPD source control overseer, you need to be confident that each time you perform a merge of branches you are correctly putting Humpty Dumpty back together in a valid manner.

If you want to use MDS XML with source control you need to view it as part of a larger solution, involving clear process and almost certainly a hybrid approach with the binary RPD still playing a part — and whatever you do, the Administration Tool within short reach. You need to be aware of the issues detailed above, decide on a process that will avoid them, and make sure you have dedicated resource that understands how it all fits together.

If not MDS XML, then what?…

Source control (e.g. git) is mandatory for any kind of SDLC, concurrent development included. But instead of storing the RPD in MDS XML, we store it as a binary RPD.

Wait wait wait, don't go yet! … it gets better

By following the git-flow method, which dictates how feature-driven development is done in source control (git), we can write a simple script that determines, when merging branches, what the candidates are for an OBIEE three-way RPD merge.

In this simple example we have two concurrent developments – coded “RM–1” and “RM–2”. First off, we create two branches which take the code from our “mainline”. Development is done on the two separate features in each branch independently, and committed frequently per good source control practice. The circles represent commit points:

The first feature to be completed is “RM–1”, so it is merged back into “develop”, the mainline. Because nothing has changed in develop since RM–1 was created from it, the binary RPD file and all other artefacts can simply ‘overwrite’ what is there in develop:

Now at this point we could take "develop" and start its deployment into System Test etc, but the second feature we were working on, RM–2, is also tested and ready to go. Here comes the fancy bit! Git recognises that both RM–1 and RM–2 have made changes to the binary RPD, which, being binary, git cannot merge itself. But now, instead of just collapsing in a heap and leaving it for the user to figure out, our script makes use of git and the git-flow method we have followed to work out the merge candidates for the OBIEE Administration Tool:

Even better, it invokes the Administration Tool (which can be run from the command line, or alternatively use command line tools comparerpd/patchrpd) to automatically perform the merge. If the merge is successful, it goes ahead with the commit in git of the merge into the “develop” branch. The developer has not had to do any kind of interaction to complete the merge and commit.

If the merge is not a slam-dunk, then we can launch the Administration Tool and graphically figure out the correct resolution – but using the already-identified merge candidates in order to shorten the process.
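Pulling those steps together, the git side of this workflow is nothing more exotic than standard git-flow usage. The commands below are a sketch only, assuming the git-flow extension is installed and initialised on the repository, with the feature names purely illustrative and the RPD three-way merge handled by the script and Administration Tool as described above:

# create the two feature branches from develop, the mainline
git flow feature start RM-1
git flow feature start RM-2

# ... develop and commit on each feature branch, including the binary RPD ...

# merge each finished feature back into develop
git flow feature finish RM-1
git flow feature finish RM-2   # this second merge is where the three-way RPD merge is needed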

This is not perfect, but there is no perfect solution. It is the closest thing that there is to perfection though, because it will handle merges of:

  • Unique objects
  • Same objects, different modifications (c.f. two new columns on same table example above)
  • Duplicate objects – by equalisation
Conclusion

There is no single right answer here, nor are any of the options overly appealing.

If you want to work with OBIEE in an Agile method, using feature-driven development, you will have to adopt and learn specific processes for working with OBIEE. The decision you have to make is on how you store the RPD (binary or multiple MDS XML files, or maybe both) and how you handle merging it (git vs Administration Tool).

My personal view is that taking advantage of git-flow logic, combined with the OBIEE toolset to perform three-way merges, is sufficiently practical to warrant leaving the RPD in binary format. The MDS XML format is a lovely idea but there are too few safeguards against dodgy/corrupt RPD (and too many unnecessary merge conflicts) for me to see it as a viable option.

Whatever option you go for, make sure you are using regression testing to test the RPD after you merge changes together, and ideally automate the testing too. Here at Rittman Mead we’ve written our own suite of tools that do just this – get in touch to find out more.

Categories: BI & Warehousing

Using Oracle Big Data SQL to Add Dimensions and Attributes to Hadoop Reporting

Sun, 2015-01-11 06:18

In a previous post I looked at using Oracle’s new Big Data SQL product with ODI12c, where I used Big Data SQL to expose two Hive tables as Oracle external tables, and then join them using the BETWEEN operator, something that’s not possible with regular HiveQL. In this post I’m going to look at using Oracle Big Data SQL with OBIEE11g, to enable reporting against Hive tables without the need to use Hive ODBC drivers and to bring in reference data without having to stage it in Hive tables in the Hadoop cluster.

In this example I’ve got some webserver log activity from the Rittman Mead Blog stored as a Hive table in Hadoop, which in its raw form only has a limited amount of descriptive data and wouldn’t be all that useful to users reporting against it using OBIEE. Here’s the contents of the Hive table as displayed via SQL*Developer:

NewImage

When I bring this table into OBIEE, I really want to add details of the country that each user is visiting from, and also details of the category that each post referenced in the webserver logs belongs to. Tables for these reference data items can be found in an accompanying Oracle database, like this:

NewImage

The idea then is to create an ORACLE_HIVE external table over the Hive table containing the log activity, and then import all of these tables into the OBIEE RPD as regular Oracle tables. Back in SQL*Developer, connected to the database that has the link setup to the Hadoop cluster via Big Data SQL, I create the external table using the new ORACLE_HIVE external table access driver:

NewImage
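For reference, the DDL behind that screenshot follows the general pattern below. The column names and types are illustrative assumptions for the web log data rather than a copy of the real table, and the cluster name and Hive table name are also assumptions, but the ORGANIZATION EXTERNAL / TYPE ORACLE_HIVE clauses and the com.oracle.bigdata access parameters are the standard Big Data SQL syntax:

CREATE TABLE bda_output.access_per_post_exttab (
  host          VARCHAR2(100),
  request_date  VARCHAR2(50),
  post_id       VARCHAR2(50),
  cnt           NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (
    com.oracle.bigdata.cluster=bda
    com.oracle.bigdata.tablename=default.access_per_post
  )
)
REJECT LIMIT UNLIMITED;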

And now with the Hive table exposed as the Oracle external table BDA_OUTPUT.ACCESS_PER_POST_EXTTAB, I can import all four tables into the OBIEE repository.

NewImage

I can now create joins across the two Oracle schemas and four tables:

NewImage

and then create a business model and presentation model to define a simple star schema against the combined dataset:

NewImage

Once the RPD is saved and made available to the Presentation layer, I can now go and create some simple reports against the Hive and Oracle tables, with the Big Data SQL feature retrieving the Hive data using SmartScan technology running directly on the Hadoop cluster – bypassing MapReduce and filtering, projecting and just returning the results dataset back to the Exadata server running the Oracle SQL query.

NewImage

In the previous ODI12c and Big Data SQL posting, I used the Big Data SQL feature to enable a join between the Hive table and a table containing IP address range lookups using the BETWEEN operator, so that I could return the country name for each visitor to the website. I can do a similar thing with OBIEE, by first recreating the main incoming fact table source as a view over the ORACLE_HIVE external table and adding an IP integer calculation that I can then use for the join to the IP range lookup table (and also take the opportunity to convert the log-format date string into a proper Oracle DATE datatype):

NewImage
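The screenshot shows the real view; as a rough sketch of the two derivations it describes – with hypothetical column names, since the actual ones aren't listed here, and assuming the log timestamp string has already been trimmed to DD/Mon/YYYY:HH24:MI:SS – the IP-to-integer calculation and the date conversion look something like this:

CREATE OR REPLACE VIEW bda_output.access_per_post_vw AS
SELECT t.*,
       -- convert the dotted-quad host IP into a single integer for the BETWEEN join
         TO_NUMBER(REGEXP_SUBSTR(t.host, '[^.]+', 1, 1)) * 16777216
       + TO_NUMBER(REGEXP_SUBSTR(t.host, '[^.]+', 1, 2)) * 65536
       + TO_NUMBER(REGEXP_SUBSTR(t.host, '[^.]+', 1, 3)) * 256
       + TO_NUMBER(REGEXP_SUBSTR(t.host, '[^.]+', 1, 4))   AS ip_integer,
       -- turn the log-format date string into a proper Oracle DATE
       TO_DATE(t.request_date, 'DD/MON/YYYY:HH24:MI:SS')   AS request_dt
FROM   bda_output.access_per_post_exttab t;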

and then using that to join to a new table I’ve imported from the BLOG_REFDATA Oracle schema that contains the IP range lookups:

NewImage

Now I can add country as a dimension, and create reports that break down site visits by country of access.

NewImage

Similarly, I can break the date column in the view over the Hive external table out into its own logical dimension table, and then create some reports to show site access over time.

NewImage

and with the final RPD looking like this:

NewImage

If you’re interested in reading more about Oracle Big Data SQL I also covered it earlier on the blog around the launch date, with this post introducing the feature and another looking at how it extends Oracle security over your Hadoop cluster.

Categories: BI & Warehousing

Rittman Mead BI Forum 2015 Call for Papers Now Open – Closes on Jan 18th 2015

Thu, 2015-01-08 16:05

The Call for Papers for the Rittman Mead BI Forum 2015 is currently open, with abstract submissions open to January 18th 2015. As in previous years the BI Forum will run over consecutive weeks in Brighton, UK and Atlanta, GA, with the provisional dates and venues as below:

  • Brighton, UK : Hotel Seattle, Brighton, UK : May 6th – 8th 2015
  • Atlanta, GA : Renaissance Atlanta Midtown Hotel, Atlanta, USA : May 13th-15th 2015

Now in its seventh year, the Rittman Mead BI Forum is the only conference dedicated entirely to Oracle Business Intelligence, Oracle Business Analytics and the technologies and processes that support it – data warehousing, data analysis, data visualisation, big data and OLAP analysis. We're looking for sessions around tips & techniques, project case-studies and success stories, and sessions where you've taken Oracle's BI products and used them in new and innovative ways. Each year we select around eight to ten speakers for each event along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead, Charles Elliott and Jordan Meyer.

NewImage

Last year we had a big focus on cloud, and a masterclass and several sessions on bringing Hadoop and big data to the world of OBIEE. This year we’re interested in project stories and experiences around cloud and Hadoop, and we’re keen to hear about any Oracle BI Apps 11g implementations or migrations from the earlier 7.9.x releases. Getting back to basics we’re always interested in sessions around OBIEE, Essbase and data warehouse data modelling, and we’d particularly like to encourage session abstracts on data visualization, BI project methodologies and the incorporation of unstructured, semi-structured and external (public) data sources into your BI dashboards. For an idea of the types of presentations that have been selected in the past, check out the BI Forum 2014, 2013 and 2012 homepages, or feel free to get in touch via email at mark.rittman@rittmanmead.com

The Call for Papers entry form is here, and we're looking for speakers for Brighton, Atlanta, or both venues if you can speak at both. All sessions this year will be 45 minutes long, and we'll be publishing submissions and inviting potential attendees to vote on their favourite sessions towards the end of January. Other than that – have a think about abstract ideas now, and make sure you get them in by January 18th 2015 – just over a week from now!

Categories: BI & Warehousing

Top 10 Rittman Mead Blog Posts from 2014

Wed, 2014-12-31 09:27

It's the afternoon of New Year's Eve over in the UK, so to round the year off here are the top 10 blog posts from 2014 on the Rittman Mead blog, based on Google Analytics stats (page views for 2014 in brackets; only includes articles posted in 2014).

  1. Using Sqoop for Loading Oracle Data into Hadoop on the BigDataLite VM – Mark Rittman, March 22, 2014 (8466)
  2. OBIEE Dashboard prompt: at least one mandatory – Gianni Ceresa, March 17th 2014 (7683)
  3. Thoughts on Using Amazon Redshift as a Replacement for an Oracle Data Warehouse – Peter Scott, February 20th 2014 (6993)
  4. The Secret Life of Conditional Formatting in OBIEE – Gianni Ceresa, March 26th 2014 (5606)
  5. Trickle-Feeding Log Files to HDFS using Apache Flume – Mark Rittman, May 18th 2014 (5494)
  6. The State of the OBIEE11g World as of May 2014 – Mark Rittman, May 12th 2014 (4932)
  7. Date formatting in OBIEE 11g – setting the default Locale for users  – Robin Moffatt, February 12th 2014 (4840)
  8. Automated Regression Testing for OBIEE – Robin Moffatt, Jan 23rd 2014 (4040)
  9. OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 2 : Load Data into Hive Tables, Analyze using Hive & Impala – Mark Rittman, Jan 18th 2014 (3439)
  10. Introduction to Oracle BI Cloud Service : Product Overview – Mark Rittman, Sep 22nd 2014 (3190)

In all, the blog in one form or another has been going for 10 years now, and our most popular post of all time over the same period is Robin Moffatt’s “Upgrading OBIEE to 11.1.1.7” – well done Robin. To everyone else, have a Happy New Year and a prosperous 2015, and see you next year when it all starts again!

Categories: BI & Warehousing

Data Integration Tips: ODI – One Data Server with several Physical Schemas

Tue, 2014-12-30 16:25

Yes, I’m hijacking the “Data Integration Tips” series of my colleague Michael Rainey (@mRainey) and I have no shame!

DISCLAIMER
This tip is intended for newcomers to the ODI world and is valid for all versions of ODI. It's nothing new, and it has been posted by other authors on different blogs. But I see so many people struggling with this on the ODI Space on OTN that I wanted to explain it in full detail, with all the context and in my own words. So next time I can just post a link to this instead of explaining it from scratch.

The Problem

I'm loading data from one schema to another on the same Oracle database, but it's slower than when I write a SQL insert statement manually. The bottleneck of the execution is in the steps from the LKM SQL to SQL. What should I do?

Why does it happen?

Loading Knowledge Modules (LKMs) are used to load data from one Data Server to another. An LKM usually connects to both the source and the target Data Server to execute some steps on each of them. This is required when working with different technologies or different database instances, for instance. So if we define two Data Servers to connect to our two database schemas, we will need an LKM.

In this example, I will load a star schema model in the HR_DW schema, using the HR schema from the same database as a source. Let's start with the approach using two Data Servers. Note that here we connect to our Data Servers directly with the database schema credentials.

Two Data Servers connecting to the same database instance, using the database schemas directly to connect.

And here are the definitions of the physical schemas :

Physical Schemas

Let's build a simple mapping using LOCATIONS, COUNTRIES and REGIONS as sources, denormalising them and loading them into a single flattened DIM_LOCATIONS table. We will use left outer joins to be sure we don't miss any location, even if there is no country or region associated. We will populate LOCATION_SK with a sequence and use an SCD2 IKM.

Mapping - Logical tab

If we check the Physical tab, we can see two different Execution Groups. This means the Datastores are in two different Data Servers and therefore an LKM is required. Here I used LKM SQL to SQL (Built-In), which is quite a generic one, not particularly designed for Oracle databases. Performance might be better with a technology-specific KM, like LKM Oracle to Oracle Pull (DB Link). By choosing the right KM we can leverage technology-specific concepts – here Oracle database links – which often improve performance. But still, we shouldn't need any database link, as everything lies in the same database instance.

Mapping - Physical tab

 

Another issue is that temporary objects needed by the LKM and the IKM are created in the HR_DW schema. These objects are the C$_DIM_LOCATIONS table, created by the LKM to bring the data into the target Data Server, and the I$_DIM_LOCATIONS table, created by the IKM to detect when a new row is needed or when a row needs to be updated according to the SCD2 rules. Even though these objects are deleted in the clean-up steps at the end of the mapping execution, it would be better to use another schema for these temporary objects instead of the target schema, which we want to keep clean.

The Solution

If the source and target Physical Schemas are located on the same Data Server – and the technology can execute code – there is no need for an LKM. So it's a good idea to reuse the same Data Server as much as possible for data coming from the same place. Actually, the Oracle documentation about setting up the topology recommends creating an ODI_TEMP user/schema on any RDBMS and using it to connect.

This time, let's create only one Data Server with two Physical Schemas under it, and map them to the existing Logical Schemas. Here I will use the name ODI_STAGING instead of ODI_TEMP, because I'm using the excellent ODI Getting Started virtual machine and it's already in there.

One Data Server with two Physical Schemas under it

As you can see in the Physical Schema definitions, there is no other password provided to connect with HR or HR_DW directly. At run-time, our agent will only use one connection, to ODI_STAGING, and execute code through it, even if it needs to populate HR_DW tables. This means that we need to be sure that ODI_STAGING has all the required privileges to do so.

Physical schemas

Here are the privileges I had to grant to ODI_STAGING:

GRANT SELECT on HR.LOCATIONS TO ODI_STAGING;
GRANT SELECT on HR.COUNTRIES TO ODI_STAGING;
GRANT SELECT on HR.REGIONS TO ODI_STAGING;

GRANT SELECT, INSERT, UPDATE, DELETE on HR_DW.DIM_LOCATIONS to ODI_STAGING;
GRANT SELECT on HR_DW.DIM_LOCATIONS_SEQ to ODI_STAGING;

Let's now open our mapping again and go to the Physical tab. We now have only one Execution Group and there is no LKM involved. The code generated is a simple INSERT AS SELECT (IAS) statement, selecting directly from the HR schema and loading into the HR_DW schema without any database link. Data is loaded faster and our first problem is addressed.

Mapping - Physical tab without LKM

Now let's tackle the second issue we had, with temporary objects being created in the HR_DW schema. If you scroll upwards to the Physical Schema definitions (or click this link, if you are lazy…) you can see that I used ODI_STAGING as the Work Schema in all my Physical Schemas for that Data Server. This way, all the temporary objects are created in ODI_STAGING instead of the source or target schema. We are also sure that we won't have any issues with missing privileges, because our agent connects directly as ODI_STAGING.

So you can see there are a lot of advantages to using a single Data Server when sources come from the same place. We get rid of the LKM, and the schema used to connect can also be used as the Work Schema, so we keep the other schemas clean, without any temporary objects.

The only thing you need to remember is to give the right privileges to ODI_STAGING (or ODI_TEMP) on all the objects it needs to handle. If your IKM has a step to gather statistics, you might also want to grant ANALYZE ANY. If you need to truncate a table before loading it, you have two approaches. You can grant DROP ANY TABLE to ODI_STAGING, but this might be a dangerous privilege to give in production. A safer way is to create a stored procedure ODI_TRUNCATE in every target database schema, along the lines of the sketch below. This procedure takes a table name as a parameter and truncates that table using an EXECUTE IMMEDIATE statement. Then you can grant execute on that procedure to ODI_STAGING and edit your IKM step to execute that procedure instead of using the truncate syntax.
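A minimal sketch of such a procedure, created here in the HR_DW target schema – the DBMS_ASSERT call is my own addition to guard against anything other than a simple table name being passed in, and the procedure relies on definer's rights so that ODI_STAGING only needs EXECUTE on it:

CREATE OR REPLACE PROCEDURE HR_DW.ODI_TRUNCATE (p_table_name IN VARCHAR2) AS
BEGIN
  -- validate the name, then truncate the table owned by this schema
  EXECUTE IMMEDIATE 'TRUNCATE TABLE ' || DBMS_ASSERT.SIMPLE_SQL_NAME(p_table_name);
END ODI_TRUNCATE;
/

GRANT EXECUTE ON HR_DW.ODI_TRUNCATE TO ODI_STAGING;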

 

That's it for today. I hope this article helps some people to understand the reason for that Oracle recommendation and how to implement it. Stay tuned on this blog and on Twitter (@rittmanmead, @mRainey, @markrittman, @JeromeFr, …) for more tips about Data Integration!

Categories: BI & Warehousing

Connecting OBIEE11g on Windows to a Kerberos-Secured CDH5 Hadoop Cluster using Cloudera HiveServer2 ODBC Drivers

Sun, 2014-12-28 16:55

In a few previous posts and magazine articles I've covered connecting OBIEE11g to a Hadoop cluster, using OBIEE 11.1.1.7 and Cloudera CDH4 and CDH5 as the examples. Things get a bit complicated in that the DataDirect Apache Hive ODBC drivers that Oracle ship are only for HiveServer1 and not the HiveServer2 version that CDH4 and CDH5 use, and the Linux version of OBIEE 11.1.1.7 won't work with the Cloudera Hive ODBC drivers that you have to use to connect to Hive on CDH4/5. You can however connect OBIEE 11.1.1.7 on Windows to HiveServer2 on CDH4 and CDH5 if you use the Cloudera Hive ODBC drivers for Windows, and although this isn't supported by Oracle, in my experience it does work, albeit with the general OBIEE11g Hive restrictions and caveats detailed in the Metadata Repository Builder's Guide, and the fact that in practice Hive is too slow to use for ad-hoc reporting.

However … most enterprise-type customers who run Hadoop on their internal networks have their clusters configured as “secured”, rather than the unsecured cluster examples that you see in most OBIEE connection examples. By default, Hadoop clusters are very trusting of incoming network and client connections and assume that whoever’s connecting is who they say they are, and HDFS and the other cluster components don’t perform any authentication themselves of incoming client connections. In addition, by default all network connections between Hadoop cluster components run in clear text and without any mutual authentication, which is great for a research cluster or PoC but not really appropriate for enterprise customers looking to use Hadoop to store and analyse customer data.

Instead, these customers configure their clusters to run in secured mode, using Kerberos authentication to secure incoming connections, encrypt network traffic and secure connections between the various services in the cluster. How this affects OBIEE though is that your Hive connections through to the cluster also need to use Kerberos authentication, and you (and the OBIEE BI Server) need to have a valid Kerberos ticket when connecting through the Hive ODBC driver. So how do we set this up, and how do we get hold of a secure Hadoop cluster using Kerberos authentication to test against? A few of our customers have asked this question recently, so I thought it’d be worth jotting down a few notes on how to set this up.

At a high-level, if you want to connect OBIEE 11.1.1.7 to a secure, Kerberos-authenticated CDH cluster, there are three main steps you need to carry out:

  1. Get hold of a Kerberos-secured CDH cluster, and establish the connection details you’ll need to use to connect to it
  2. Make sure the Kerberos server has the correct entries/principals/user details for the user you’re going to securely-connect as
  3. Configure the host environment for OBIEE to work with Kerberos authentication, and then create the connection from OBIEE to the CDH cluster using the correct Kerberos credentials for your user

In my case, I've got a Cloudera CDH5.3.0 cluster running in the office that's been configured to use MIT Kerberos 5 for authentication, set up using an OEL6 VM as the KDC (Key Distribution Centre) and with the cluster configured using the new Kerberos setup wizard that was introduced with CDH5.1. This wizard automates the creation of the various Kerberos service account and host principals in the Kerberos database, and configures each of the cluster components – YARN, Hive, HDFS and so on – to authenticate with each other using Kerberos authentication and use encrypted network connections for inter-service and inter-node communication.

NewImage

Along with the secured Hadoop cluster, key bits of information and configuration data you’ll need for the OBIEE side are:

  • The krb5.conf file from the Kerberos KDC, which contains details of the Kerberos realm, URL for the KDC server, and other key connection details
  • The name of the Kerberos principal used for the Hive service name on the Hadoop cluster – typically this is “hive”; if you want to connect to Hive first using a JDBC tool such as beeline, you’ll also need the full principal name for this service, in my case “hive/bda3node2.rittmandev.com@RITTMANDEV.COM”
  • The hostname (FQDN) of the node in the CDH cluster that contains the HiveServer2 RPC interface that OBIEE connects to, to run HiveQL queries
  • The port that HiveServer2 is running on – typically this is "10000", and the Hive database name (for example, "default")
  • The name of the Kerberos realm you'll be connecting to – for example, MYCOMPANY.COM or in my case, RITTMANDEV.COM (usually in capitals)

In my case, the krb5.conf file that is used to configure Kerberos connections to my KDC looks like this – in your company it might be a bit more complex, but this example defines a simple MIT Kerberos 5 domain:

[logging]
    default = FILE:/var/log/krb5libs.log
    kdc = FILE:/var/log/krb5kdc.log
    admin_server = FILE:/var/log/kadmind.log
[libdefaults]
    default_realm = RITTMANDEV.COM
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
[realms]
    RITTMANDEV.COM = {
    kdc = auth.rittmandev.com
    admin_server = auth.rittmandev.com
}
[domain_realm]
    .rittmandev.com = RITTMANDEV.COM
    rittmandev.com = RITTMANDEV.COM

In my setup, the CDH Hadoop cluster has been configured to use Kerberos authentication for all communications between cluster components and any connections from the outside that use those components; the cluster itself though can still be accessed via unsecured (non-Kerberos authenticated) SSH, though of course this aspect could be secured too. To test out the Hive connectivity before we get into the OBIEE details you can use the beeline CLI that ships with CDH5. To do this you'll need to be able to SSH into one of the cluster nodes (if you've not got beeline installed on your own workstation), and you'll need an account (principal) created for you in the Kerberos database to correspond to the Linux user and HDFS/Hive user that has access to the Hive tables you're interested in. To create such a Kerberos principal for my setup, I used the kadmin.local command on the KDC VM to create a user that matched my Linux/HDFS username and gave it a password:

kadmin.local:  addprinc mrittman
WARNING: no policy specified for mrittman@RITTMANDEV.COM; defaulting to no policy
Enter password for principal "mrittman@RITTMANDEV.COM":
Re-enter password for principal "mrittman@RITTMANDEV.COM":
Principal "mrittman@RITTMANDEV.COM" created.

SSH'ing into one of the secure CDH cluster nodes, I first have to authenticate using the kinit command which, when successful, creates a Kerberos ticket that gets cached for a set amount of time and that beeline can thereafter use as part of its own authentication process:

officeimac:.ssh markrittman$ ssh mrittman@bda3node4
mrittman@bda3node4's password: 
[mrittman@bda3node4 ~]$ kinit -p mrittman
Password for mrittman@RITTMANDEV.COM: 
[mrittman@bda3node4 ~]$

Now I can use beeline, and pass the Hive service principal name in the connection details along with the usual host, port and database name. When beeline prompts for my username and password, I use the Kerberos principal name that matches the Linux/HDFS one, and enter that principal’s password:

[mrittman@bda3node4 ~]$ beeline
Beeline version 0.13.1-cdh5.3.0 by Apache Hive
beeline> !connect jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM
scan complete in 2ms
Connecting to jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM
Enter username for jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM: mrittman
Enter password for jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM: ********
Connected to: Apache Hive (version 0.13.1-cdh5.3.0)
Driver: Hive JDBC (version 0.13.1-cdh5.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://bda3node2:10000/default> show tables;
+------------------+--+
|     tab_name     |
+------------------+--+
| posts            |
| things_mrittman  |
+------------------+--+
2 rows selected (0.162 seconds)
0: jdbc:hive2://bda3node2:10000/default> select * from things_mrittman;
+---------------------------+-----------------------------+--+
| things_mrittman.thing_id  | things_mrittman.thing_name  |
+---------------------------+-----------------------------+--+
| 1                         | Car                         |
| 2                         | Dog                         |
| 3                         | Hat                         |
+---------------------------+-----------------------------+--+
3 rows selected (0.251 seconds)

So at this point I’ve covered off the first two steps; established the connection details for the secure CDH cluster, and got hold of and confirmed the Kerberos principal details that I’ll need to connect to Hive – now its time to set up the OBIEE element.

In this particular example we're using Windows to host OBIEE 11.1.1.7, as this is the only platform on which we can get the HiveServer2 ODBC drivers to work, in this case the Cloudera Hive ODBC drivers available on their website (a free download, but registration may be needed). Before we can get this ODBC driver to work though, we need to install the Kerberos client software on the Windows machine so that we can generate the Kerberos ticket that the ODBC driver will need to pass over as part of the authentication process.

To configure the Windows environment for Kerberos authentication, in my case I used the Kerberos for Windows 4.x client software, downloadable for free from the MIT website, and copied across the krb5.conf file from the KDC server, renaming it to krb5.ini and storing it in the default location of c:\ProgramData\MIT\Kerberos5.

NewImage

You also need to define a system environment variable, KRB5CCNAME, to point to a directory where the Kerberos tickets can be cached; in my case I used c:\temp\krb5cache. Once this is done, reboot the Windows environment and you should then be prompted after login to authenticate yourself to the Kerberos KDC.
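If you prefer to set the variable from the command line rather than through the System control panel, something like the following would do it – a sketch only, run from an elevated Command Prompt since /M writes a machine-wide variable, and with the path matching the c:\temp\krb5cache example above:

REM persist KRB5CCNAME for all users (requires an elevated prompt)
setx /M KRB5CCNAME "c:\temp\krb5cache"

A reboot, or at least a new logon session, is still needed for the value to be picked up.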

NewImage

The ticket then stays valid for a set number of days/hours, or you can configure OBIEE itself to authenticate and cache its own ticket – for now though, we’ll create the ticket manually and connect to the secured cluster using these cached ticket details.

After installing the Cloudera Hive ODBC drivers, I create the connection using Kerberos as the Authentication Mechanism, and enter the realm name, HiveServer2 host and the Hive Kerberos principal name, like this:

NewImage

In my case both the BI Administration tool and the OBIEE BI Server were on the same Windows VM, and therefore shared the same ODBC driver install, so I then moved over to the BI Administration tool to import the Hive table metadata details into the RPD and create the physical, logical and presentation layer RPD elements. Depending on how your CDH cluster is set up you might be able to test the connection now by using the View Data… menu item in BI Administration, but in my case I had to do two more things on the CDH cluster itself before I could get Hive queries under this Kerberos principal to run properly.

NewImage

First, as secured CDH Hadoop clusters usually configure HiveServer2 to use "user impersonation" (connecting to Hive as the user you authenticate as, not the user that HiveServer2 authenticates to the Hive service as), YARN and MapReduce jobs run under your account and not the usual "hive" account that unsecured Hive connections use. Where this causes a problem on CDH installations on RHEL-derived platforms (RHEL, OEL, CentOS etc) is that YARN normally blocks jobs running on behalf of users with a UID of <1000, as on other Linux distributions this typically signifies a system account; RHEL starts user UIDs at 500, and YARN therefore blocks those users from running jobs. To fix this, you need to go into Cloudera Manager and edit the YARN configuration settings to lower this UID threshold to something under 500, for example 250:

NewImage

I also needed to alter the group ownership of the temporary directory each node used for the YARN NodeManager’s user files so that YARN could write its temporary files correctly; on each node in the cluster I ran the following Linux commands as root to clear down any files YARN had created before, and recreate the directories with the correct permissions (Hive jobs would fail until I did this, with OBIEE just reporting an ODBC error):

rm -rf /yarn
mkdir -p /yarn/nm
chown -R yarn /yarn
chgrp -R yarn /yarn

Once this is done, queries from the BI Administration tool and from the OBIEE BI Server should connect to the Kerberos-secured CDH cluster successfully, using the Kerberos ticket you obtained via the MIT Kerberos Ticket Manager on login, and then passing across the user details under which the YARN, and then Hive, jobs should run.

NewImage

If you're interested, you can go back to the MIT Kerberos Ticket Manager and see the other Kerberos tickets that were requested and then cached by the Cloudera Hive ODBC driver when it mutually authenticated with the HiveServer2 RPC interface – Kerberos authenticates both ways to ensure that who you're connecting to is actually who they say they are, in this case checking that the HiveServer2 connection you're connecting to isn't being spoofed by someone else.

NewImage

So that’s the process for connecting OBIEE to a Kerberos-secured CDH Hadoop cluster in a nutshell; in the New Year I’ll put something together on using Apache Sentry to provide role-based access control for Hive and Impala tables and as of CDH 5.3, HDFS directories, and I’ll also take a look at the new extended ACLs feature in CDH5.2 that goes beyond HDFS’s standard POSIX security model.

Categories: BI & Warehousing

Happy Christmas from Rittman Mead, and Where To Find Us in 2015

Tue, 2014-12-23 09:35

It’s the day before Christmas Eve over in the UK, and we’re finishing up for the Christmas break and looking forward to seeing friends and family over the next few days. To all of our readers, customers, friends and partners, have a great Holiday season and New Year, and we’re looking forward to seeing you at various events and on customer projects in 2015.

If you're thinking of submitting an abstract for the Rittman Mead BI Forum 2015, the call for papers is now open, with abstracts accepted through to January 18th, 2015. In addition, as well as the BI Forum in May you can also catch up with us at these events in the first half of the New Year:

That’s it for now though – have a great Christmas and New Year, and see you in 2015!

Categories: BI & Warehousing