
Feed aggregator

Starting a Pivotal GemFireXD server from Java

Pas Apicella - Mon, 2014-11-03 21:09
The FabricServer interface provides an easy way to start an embedded GemFire XD server process in an existing Java application.

In short, the code below will get you started. Use it in DEV/TEST scenarios, not for production use.
  
package pivotal.au.gemfirexd.demos.startup;

import com.pivotal.gemfirexd.FabricServer;
import com.pivotal.gemfirexd.FabricServiceManager;

import java.sql.SQLException;
import java.util.Properties;

public class StartServer1
{
    public static void main(String[] args) throws SQLException, InterruptedException
    {
        FabricServer server = FabricServiceManager.getFabricServerInstance();

        Properties serverProps = new Properties();
        serverProps.setProperty("server-groups", "mygroup");
        serverProps.setProperty("persist-dd", "false");
        serverProps.setProperty("sys-disk-dir", "./gfxd/server1");
        serverProps.setProperty("host-data", "true");

        // Boot the embedded GemFire XD server with the properties above
        server.start(serverProps);

        // Start a network listener so clients can connect on port 1527
        server.startNetworkServer("127.0.0.1", 1527, null);

        // Keep the JVM alive so the embedded server stays up
        Object lock = new Object();
        synchronized (lock) {
            while (true) {
                lock.wait();
            }
        }
    }
}

More Information

http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#developers_guide/topics/server-side/fabricserver.html
Categories: Fusion Middleware

RDX Services: Optimization [VIDEO]

Chris Foot - Mon, 2014-11-03 15:57

Transcript

Hi, welcome to RDX. When searching for a database administration service, it's important to look for a company that prioritizes performance, security and availability.

How does RDX deliver such a service? First, we assess all vulnerabilities and drawbacks that are preventing your environments from operating efficiently. Second, we make any applicable changes that will ensure your business software is running optimally. From there, we regularly conduct quality assurance audits to prevent any performance discrepancies from arising. 

In addition, we offer 24/7 support for every day of the year. We recognize that systems need to remain online on a continuous basis, and we're committed to making sure they remain accessible. 

Thanks for watching!

The post RDX Services: Optimization [VIDEO] appeared first on Remote DBA Experts.

Analytics with Kibana and Elasticsearch through Hadoop – part 1 – Introduction

Rittman Mead Consulting - Mon, 2014-11-03 15:21
Introduction

I’ve recently started learning more about the tools and technologies that fall under the loose umbrella term of Big Data, following a lot of the blogs that Mark Rittman has written, including getting Apache log data into Hadoop, and bringing Twitter data into Hadoop via Mongodb.

What I wanted to do was visualise the data I’d brought in, looking for patterns and correlations. Obviously the de facto choice at our shop would be Oracle BI, which Mark previously demonstrated reporting on data in Hadoop through Hive and Impala. But, this was more at the “Data Discovery” phase that is discussed in the new Information Management and Big Data Reference Architecture that Rittman Mead helped write with Oracle. I basically wanted a quick and dirty way to start chucking around columns of data without yet being ready to impose the structure of the OBIEE metadata model on it. One of the tools I’ve worked with recently is a visualisation tool called Kibana which is part of the ELK stack (that I wrote about previously for use in building a monitoring solution for OBIEE). In this article we’ll take a look at making data available to Kibana and then the kind of analytics and visualisations you can do with it. In addition, we’ll see how loading the data into ElasticSearch has the benefit of extremely fast query times compared to through Hive alone.

The Data

I’ve got three sources of data I’m going to work with, all related to the Rittman Mead website:

  • Website logs, from Apache webserver
  • Tweets about Rittman Mead blog articles, via Datasift
  • Metadata about blog posts, extracted from the WordPress MySQL database

At the moment I’ve focussed on just getting the data in, so it’s mostly coming from static files, with the exception of the tweets which are held in a noSQL database (MongoDB).

The Tools

This is where ‘big data’ gets fun, because instead of “Acme DI” and “Acme Database” and “Acme BI”, we have the much more interesting – if somewhat silly – naming convention of “the whackier the better”. Here I’m using:

  • Kibana – data visualisation tool for Elasticsearch
  • Elasticsearch – data store & analytics / search engine
  • HDFS – Hadoop’s distributed file system
  • MongoDB – NoSQL database
  • Hive – enables querying data held in various places including HDFS (and Elasticsearch, and MongoDB) with a SQL-like query language
  • Beeline – Hive command line interface
  • Datasift – online service that streams tweets matching a given pattern to a nominated datastore (such as MongoDB)
  • mongo-hadoop – a connector for MongoDB to Hadoop including Hive
  • elasticsearch-hadoop – a connector for Elasticsearch to Hadoop including Hive

Kibana only queries data held in Elasticsearch, which acts as both the data store and the analytics engine. There are various ways to get data into Elasticsearch directly from source but I’ve opted not to do that here, instead bringing it all in via HDFS and Hive. I’ve done that because my – albeit fairly limited – experience is that Elasticsearch is great once you’ve settled on your data and schema, but in the same way I’m not building a full OBIEE metadata model (RPD) yet, nor did I want to design my Elasticsearch schema up front and have to reload from source if it changed. Options for reprocessing and wrangling data once in Elasticsearch seem limited and complex, and by making all my data available through Hive first I could supplement it and mash it up as I wanted, loading it into Elasticsearch only when I had a chunk of data to explore. Another approach that I haven’t tried but could be useful if the requirement fits it would be to load the individual data elements directly into their own Elasticsearch area and then using the elasticsearch-hadoop connector run the required mashups with other data through Hive, loading the results back into Elasticsearch. It all depends on where you’re coming from with the data.
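
For orientation before we get there: loading from Hive into Elasticsearch with the elasticsearch-hadoop connector is typically done through a Hive table defined with its storage handler, which can then be populated with a normal INSERT…SELECT. The index/type name and columns below are hypothetical placeholders for illustration, not the definitions used later in this series:

CREATE EXTERNAL TABLE blog_activity_es (
  url     STRING,
  ts      STRING,
  source  STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes'    = 'bigdatalite.localdomain',
  'es.resource' = 'blog/activity'
);

-- INSERT INTO TABLE blog_activity_es SELECT ... FROM ... ;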

Overview

Here’s a diagram of what I’m building:

I’ll explain it in steps as follows:

  1. Loading the data and making it accessible through Hive
  2. Loading data from Hive to Elasticsearch
  3. Visualising and analysing data in Kibana

Getting the data into Hive

Strictly speaking we’re not getting the data into Hive, so much as making it available through Hive. Hive simply enables you to define and query tables sitting on top of data held in places including HDFS. The beauty of the Hadoop ecosystem is that you can physicalise data in a bunch of tools and the components will most often support interoperability with each other. It’s only when you get started playing with it that you realise how powerful this is.

The Apache log files and WordPress metadata suit themselves fairly well to a traditional RDBMS format of [de]normalised tables, so we can store them in HDFS with simple RDBMS tables defined on top through Hive. But the twitter data comes in JSON format (like this), and if we were going to store the Twitter data in a traditional RDBMS we’d have to work out how to explode the document into a normalised schema, catering for varying structures depending on the type of tweet and data payload within it. At the moment we just want to collect all the data that looks useful, and then look at different ways to analyse it afterwards. Instead of having to compromise one way (force a structure over the variable JSON) or another (not put a relational schema over obviously relational data) we can do both, and decide at run-time how to best use it. From there, we can identify important bits of data and refactor our design as necessary. This “schema on read” approach is one of the real essences of Hadoop and ‘big data’ in general.

So with that said, let’s see how we get the data in. This bit is the easy part of the article to write, because a lot of it is pretty much what Mark Rittman has already written up in his articles, so I’ll refer to those rather than duplicate here.

Apache log data

References:

I’ve used a variation on the standard Apache log SerDe that the interwebs offers, because I’m going to need to work with the timestamp quite closely (we’ll see why later) so I’ve burst it out into individual fields.

The DDL is:

CREATE EXTERNAL TABLE apachelog (
host STRING,    identity STRING,    user STRING,
time_dayDD STRING,  time_monthMMM STRING,   time_yearYYYY STRING,
time_hourHH STRING, time_minmm STRING,  time_secss STRING,  time_tzZ STRING,
http_call STRING,   url STRING, http_status STRING, status STRING,  size STRING,    referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(\\d{2})\\/(\\w{3})\\/(\\d{4}):(\\d{2}):(\\d{2}):(\\d{2}) (.*?)\\] \\\"(\\w*) ([^ ]*?)(?:\\/)? ([^ \\\"]*)\\\" (\\d*) (\\d*) \\\"(.*?)\\\" \\\"(.*?)\\\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s")
STORED AS TEXTFILE LOCATION '/user/oracle/apache_logs';

The EXTERNAL is important on the table definition as it stops Hive moving the HDFS files into its own area on HDFS. If Hive does move the files it is annoying if you want to also access them through another program (or Hive table), and downright destructive if you DROP the table since it’ll delete the HDFS files too – unless it’s EXTERNAL. Note the LOCATION must be an HDFS folder, even if it just holds one file.

For building and testing the SerDe regex Rubular is most excellent, but note that it’s Java regex you’re specifying in the SerDe which has its differences from Python or Ruby regex that Rubular (and most other online regex testers) support. For the final validation of Java regex I use the slightly ugly but still useful regexplanet, which also gives you the fully escaped version of your regex which you’ll need to use for the actual Hive DDL/DML.

A sample row from the apache log on disk looks like this:

74.208.161.70 - - [12/Oct/2014:03:47:43 +0000] "GET /2014/09/sunday-times-tech-track-100/ HTTP/1.0" 301 247 "-" "-"

and now in Hive:

0: jdbc:hive2://bigdatalite:10000> !outputformat vertical
0: jdbc:hive2://bigdatalite:10000> select * from apachelog limit 1;
host           74.208.161.70
identity       -
user           -
time_daydd     12
time_monthmmm  Oct
time_yearyyyy  2014
time_hourhh    03
time_minmm     47
time_secss     43
time_tzz       +0000
http_call      GET
url            /2014/09/sunday-times-tech-track-100/
http_status    HTTP/1.0
status         301
size           247
referer        -
agent          -

Twitter data

Reference:

The twitter data we’ve got includes the Hive ARRAY datatype for the collections of hashtag(s) and referenced url(s) from within a tweet. A point to note here is that the author_followers data appears in different locations of the JSON document depending on whether it’s a retweet or not. I ended up with two variations of this table and a UNION on top.

The table is mapped on data held in MongoDB and as with the HDFS data above the EXTERNAL is crucial to ensure you don’t trash your data when you drop your table.

CREATE EXTERNAL TABLE tweets
(
id string,
url string,
author string,
content string,
created_at string,
hashtags ARRAY<string>,
referenced_urls ARRAY<string>,
sentiment STRING,
author_handle string,
author_id string,
author_followers string,
author_friends string
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","url":"interaction.interaction.link","author":"interaction.interaction.author.name","content":"interaction.interaction.content","created_at":"interaction.interaction.created_at","hashtags":"interaction.interaction.hashtags","referenced_urls":"interaction.links.url","sentiment":"interaction.salience.content.sentiment","author_handle":"interaction.interaction.author.username","author_id":"interaction.interaction.author.id","author_followers":"interaction.twitter.user.followers_count","author_friends":"interaction.twitter.user.friends_count"}')
TBLPROPERTIES('mongo.uri'='mongodb://bigdatalite.localdomain:27017/rm_tweets.rm_tweets')
;
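
For illustration, with a second table defined for the retweet variant (where author_followers is mapped from its alternative location in the JSON), the UNION mentioned above might look something like the sketch below. The retweets table and its column list are assumed to mirror the tweets table shown here; this is not the exact DDL used.

CREATE VIEW all_tweets AS
SELECT id, url, author, content, created_at, hashtags, referenced_urls,
       sentiment, author_handle, author_id, author_followers, author_friends
FROM   tweets
UNION ALL
SELECT id, url, author, content, created_at, hashtags, referenced_urls,
       sentiment, author_handle, author_id, author_followers, author_friends
FROM   retweets;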

The other point to note is that we’re now using mongo-hadoop for Hive to connect to MongoDB. I found that I had to first build the full set of jar files by running ./gradlew jar -PclusterVersion='cdh5', and also download the MongoDB java driver, before copying the whole lot into /usr/lib/hadoop/lib. This is what I had by the end of it:

[oracle@bigdatalite mongo-hadoop-r1.3.0]$ ls -l /usr/lib/hadoop/lib/mongo-*
-rw-r--r--. 1 root root 105446 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-core-1.3.0.jar
-rw-r--r--. 1 root root  21259 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-hive-1.3.0.jar
-rw-r--r--. 1 root root 723219 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-pig-1.3.0.jar
-rw-r--r--. 1 root root    261 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-r1.3.0.jar
-rw-r--r--. 1 root root 697644 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-streaming-1.3.0.jar
-rw-r--r--. 1 root root 591189 Oct 24 00:44 /usr/lib/hadoop/lib/mongo-java-driver-2.12.4.jar

After all that, the data as it appears in Hive looks like this:

id                5441097d591f90cf2c8b45a1
url               https://twitter.com/rmoff/status/523085961681317889
author            Robin Moffatt
content           Blogged: Using #rlwrap with Apache #Hive #beeline for improved readline functionality http://t.co/IoMML2UDxp
created_at        Fri, 17 Oct 2014 12:19:46 +0000
hashtags          ["rlwrap","Hive","beeline"]
referenced_urls   ["http://www.rittmanmead.com/2014/10/using-rlwrap-with-apache-hive-beeline-for-improved-readline-functionality/"]
sentiment         4
author_handle     rmoff
author_id         82564066
author_followers  790
author_friends    375

For reference, without the mongo-hadoop connectors I was getting the error

Error in loading storage handler.com.mongodb.hadoop.hive.MongoStorageHandler

and with them installed but without the MongoDB java driver I got:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON (state=08S01,code=1)
Caused by: java.lang.ClassNotFoundException: com.mongodb.util.JSON

WordPress metadata

WordPress holds its metadata in a MySQL database, so it’s easy to extract out:

  1. Run a query in MySQL to generate the CSV export files, such as:

    SELECT p.ID, p.POST_TITLE,p.POST_DATE_GMT,
           p.POST_TYPE,a.DISPLAY_NAME,p.POST_NAME,
           CONCAT('/', DATE_FORMAT(POST_DATE_GMT, '%Y'), '/', LPAD(
           DATE_FORMAT(POST_DATE_GMT, '%c'), 2, '0'), '/', p.POST_NAME) AS
           generated_url
    FROM   posts p
           INNER JOIN users a
                   ON p.POST_AUTHOR = a.ID
    WHERE  p.POST_TYPE IN ( 'page', 'post' )
           AND p.POST_STATUS = 'publish' 
    into outfile '/tmp/posts.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';

  2. Copy the CSV file to your Hadoop machine, and copy it onto HDFS. Make sure each type of data goes in its own HDFS folder:

    hadoop fs -mkdir posts
    hadoop fs -copyFromLocal /tmp/posts.csv posts

  3. Define the Hive table on top of it:

    CREATE EXTERNAL TABLE posts 
    ( post_id string,title string,post_date string,post_type string,author string,url string ,generated_url string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
    "input.regex" = "^(\\d*),\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\"",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s")
    location '/user/oracle/posts'
    ;

Rinse & repeat for the category data, and post->category relationships.
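
For illustration, the categories table could be defined along the same lines as the posts table; the sketch below infers the column names from the query output further down, and the regex is indicative only (it would need adjusting to match the actual CSV export).

CREATE EXTERNAL TABLE categories
( category_id string, cat2_id string, category_name string, category_code string, catslug string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\d*),(\\d*),\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s")
location '/user/oracle/categories'
;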

The data once modelled in Hive looks like this:

0: jdbc:hive2://bigdatalite:10000> select * from posts limit 1;
post_id        788
title          Blog
post_date      2007-03-07 17:45:07
post_type      page
author         Mark Rittman
url            blog
generated_url  /2007/03/blog

0: jdbc:hive2://bigdatalite:10000> select * from categories limit 1;
category_id    5
cat2_id        5
category_name  category
category_code  BI (General)
catslug        bi

0: jdbc:hive2://bigdatalite:10000> select * from post_cats limit 5;
post_id      8046
category_id  1

The WordPress metadata quite obviously joins together, as it is already from the relational schema in which it was held in MySQL. Here is an example of where “schema on read” comes into play, because you could look at the above three tables (posts / post_cats / categories) and conclude it was redundant to export all three from WordPress and that a single query listing posts and their respective category would be sufficient. But some posts have more than one category, which then leads to a design/requirements decision. Either we retain one row per post – and collapse down the categories, but in doing so lose the ability to easily treat categories as individual data – or we have one row per post/category, and end up with multiple rows per post, which complicates matters if we’re doing a simple count of posts. So we bring it all in raw from source, and then decide how we’re going to use it afterwards.
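
To make the trade-off concrete, the “one row per post” option could be expressed in Hive by collapsing the categories into an array per post, something like this sketch (using the three tables as defined above):

SELECT p.post_id,
       p.title,
       collect_set(c.category_code) AS categories
FROM   posts p
LEFT OUTER JOIN post_cats  pc ON p.post_id      = pc.post_id
LEFT OUTER JOIN categories c  ON pc.category_id = c.category_id
GROUP BY p.post_id, p.title;

The alternative – one row per post/category – is just the same join without the collect_set and GROUP BY.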

Bringing the data together

At this point I have six tables in Hive that I can query (albeit slowly) through the Hive client Beeline, using HiveQL – a close relation to SQL with a few interesting differences. The data is tweets, website visits, and details about the blog posts themselves.

0: jdbc:hive2://bigdatalite:10000> show tables;
+------------------------+
|        tab_name        |
+------------------------+
| apachelog              |
| categories             |
| post_cats              |
| posts                  |
| retweets               |
| tweets                 |
+------------------------+

As well as time, the other common element running throughout all the data is the blog article URL, whether it is a post, a visit to the website, or a tweet about it. But to join on it is not quite as simple as you’d hope, because all the following are examples of recorded instances of the data for the same blog post:

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So whether it’s querying the data within Hive, or loading it joined together to another platform, we need to be able to unify the values of this field.

Tangent: RegEx

And now it’s time, if you hadn’t already done so for your SerDe against the Apache log file, to really immerse yourself in Regular Expressions (RegEx). Part of the “schema on read” approach is that it can get messy. You need to juggle and wrangle and munge data in ways that it really might not want to go, and RegEx is an essential tool with which to do this. Regex isn’t specific to Hadoop – it’s used throughout the computing world.

My journey with regex over quite a few years in computing has gone in stages something like this:

  1. To be a fully rounded geek, I should learn regex. Looks up regex. Hmm, looks complicated….Squirrel!
    1. To be a fully round (geddit?!) geek, I should keep eating these big breakfasts
  2. I’ve got a problem, I’ve got a feeling regex will help me. But my word it looks complicated … I’ll just do it by hand.
  3. I’ve got another problem, I need to find this text in a file but with certain patterns around it. Here’s a regex I found on google. Neat!
  4. Hmmm another text matching problem, maybe I should really learn regex instead of googling it to death each time
  5. Mastered the basic concepts of regex
  6. Still a long way to go…

If you think you’ll nail RegEx overnight, you won’t (or at least, you’re a better geek than me). It’s one of those techniques, maybe a bit like SQL, that to fully grok takes a period of exposure and gradually increasing usage before you have an “ah hah!” moment. There’s a great site explaining regex here: www.regular-expressions.info. My best advice is to take a real example of text that you want to work with (match on, replace bits of, etc), stick it in one of the online parsers mentioned above (such as Rubular or regexplanet), and experiment with the regex.

Oh and finally, watch out for variations in regex – what works in a Java-based program (most of the Hadoop world) may not in Python and vice versa. The same goes for PHP, Ruby, and so on – they all have different regex engines that may or may not behave as you’d expect.

Back on track : joining data on non-matching columns

So to recap, we want to be able to analyse our blog data across tweets, site hits and postings, using the common field of the post URL, which from the various sources can look like any of the following (and more):

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So out comes the RegEx. First off, we’ll do the easy one – strip the http:// and server bit. Using the Hive function REGEXP_REPLACE we can use this in the query:

regexp_replace(ref_url,'http:\\/\\/www.rittmanmead.com','')

This means: take the ref_url column and if you find http://www.rittmanmead.com then replace it with nothing, i.e. delete it. The two backslashes before each forward slash simply escape them, since a forward slash on its own has a special meaning in regex. Just to keep you on your toes – Java regex requires double backslash escaping, but all other regex (including the online parser I link to below) uses a single one.

So now our list of possible join candidates has shrunk by one to look like this:

/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

The variation, as you can see, is whether there is a trailing forward slash (/) after the post ‘slug’, and whether there is additional cruft after that too (feed, foobar+foobar, etc). So let’s build it up a piece at a time. On each one, I’ve linked to an online parser that you can use to see it in action.

  1. We’ll match on the year and month (/2014/01/) because they’re a fixed pattern, using \d to match digits and {x} to match x repetitions: (see example on Rubular.com)

    \/\d{4}\/\d{2}\/

    This will match /2014/01/.

  2. Now we need to match the slug, but we’re going to ditch the forward slash suffix if there is one. This is done in two steps.

    First, we define a “match anything except x” group, which is what the square brackets (group) and the caret ^ (negate) do, and in this case x is the forward slash character, escaped.

    Secondly, the plus symbol + tells regex to match at least one repetition of the preceding group – i.e. any character that is not a forward slash. (example)

    [^\/]+

    Combined with the above regex from the first step we will now match /2014/01/automated-regression-testing-for-obiee.

  3. The final step is to turn the previous REGEXP_REPLACE on its head: instead of replacing the content from the string that we don’t want, we’ll extract the content that we do want, using a regex capture group, which is defined by round brackets (parentheses, just like these). We’ve now brought in a couple of extra bits to make it hang together, seen in the completed regex here:

    \S*(\/\d{4}\/\d{2}\/[^\/]+).*$

    1. The \S* at the beginning means match any non-whitespace character, which will replace the previous regex replace we were doing to strip out the http://www.rittmanmead.com
    2. After the capture group, which is the content from steps one and two above surrounded by parentheses (\/\d{4}\/\d{2}\/[^\/]+), there is a final .* to match anything else that might be present (e.g. trailing forward slash, feed, foobar, etc.)

    Now all we need to do is escape it for Java regex, and stick it in the Hive REGEXP_EXTRACT function, specifying 1 as the capture group number to extract: (example)

    regexp_extract(url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)

So now all our URLs will look like this, regardless of whether they’re from tweet data, website hits, or wordpress:

/2014/01/automated-regression-testing-for-obiee

Which is nice, because it means we can use it as the common join in our queries. For example, to look up the title of the blog post that someone has tweeted about, and who wrote the post:

SELECT
x.author AS tweet_author, 
x.tweet ,
x.tweet_url, 
x.created_at, 
p.author as post_author, 
p.title as post_title
FROM            ( 
SELECT 'tweets' , 
t.url AS tweet_url , 
t.author , 
t.content AS tweet , 
t.created_at ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
FROM   tweets t 
LATERAL VIEW EXPLODE (referenced_urls) refs as ref_url 
WHERE  t.author_followers IS NOT NULL 
AND    ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*' ) x 
INNER JOIN posts p 
ON regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url ;

[...]
tweet_author  Dain Hansen
tweet         Like a Big Data kid in a Hadoop candy store: Presos on #bigdata for BI, DW, Data Integration http://t.co/06DLnvxINx via @markrittman
tweet_url     https://twitter.com/dainsworld/status/520463199447961600
created_at    Fri, 10 Oct 2014 06:37:51 +0000
post_author   Mark Rittman
post_title    Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations

tweet_author  Robin Moffatt
tweet         Analyzing Twitter Data using Datasift, MongoDB and Pig http://t.co/h67cd4kJo2 via @rittmanmead
tweet_url     https://twitter.com/rmoff/status/524197131276406785
created_at    Mon, 20 Oct 2014 13:55:09 +0000
post_author   Mark Rittman
post_title    Analyzing Twitter Data using Datasift, MongoDB and Pig
[...]

Note here also the use of LATERAL VIEW EXPLODE () as a way of denormalising out the Hive ARRAY of referenced url(s) in the tweet so there is one row returned per value.
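
As a minimal illustration of the construct on its own, exploding the hashtags array from the tweets table gives one row per hashtag per tweet:

SELECT t.id, ht.hashtag
FROM   tweets t
LATERAL VIEW EXPLODE(t.hashtags) ht AS hashtag;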

Summary

We’ve got our three sources of data available to us in Hive, and can query across them. Next we’ll take a look at loading the data into Elasticsearch, taking advantage of our conformed url column to join data that we load. Stay tuned!

Categories: BI & Warehousing

Filtering PeopleTools SQL from Performance Monitor Traces

David Kurtz - Mon, 2014-11-03 15:01

I have been doing some on-line performance tuning on a PeopleSoft Financials system using PeopleSoft Performance Monitor (PPM).  End-users have collected verbose PPM traces. Usually, when I use PPM in a production system, all the components are fully cached by the normal activity of the users (except when the application server caches have recently been cleared).  However, when working in a user test environment it is common to find that the components are not fully cached. This presents two problems.
  • The application servers spend quite a lot of time executing queries on the PeopleTools tables to load the components, pages and PeopleCode into their caches. We can see in the screenshot of the component trace that there is a warning message that component objects are not fully cached, and that these  cache misses skew timings.
  • In verbose mode, the PPM traces collect a lot of additional transactions capturing executions and fetches against PeopleTools tables. The PPM analytic components cannot always manage the resultant volume of transactions.
    Figure 1. Component trace as collected by PPM

    If I go further down the same page and look in the SQL Summary, I can see SQL operations against PeopleTools tables (they are easily identifiable in that they generally do not have an underscore in the third character). Not only are 5 of the top 8 SQL operations related to PeopleTools tables, we can also see that they account for over 13,000 executions, which means there are at least 13,000 rows of additional data to be read from PSPMTRANSHIST.
    Figure 2. SQL Summary of PPM trace with PeopleTools SQL

    When I open the longest running server round trip (this is also referred to as a Performance Monitoring Unit or PMU), I can only load 1001 rows before I get a message warning that the maximum row limit has been reached. The duration summary and the number of executions and fetches cannot be calculated and hence 0 is displayed.
    Figure 3: Details of longest PMU with PeopleTools SQL
    Another consequence of the PeopleTools data is that it can take a long time to open the PMU tree. There is no screenshot of the PMU tree here because in this case I had so much data that I couldn't open it before the transaction timed out!
    Solution

    My solution to this problem is to delete the transactions that relate to PeopleTools SQL and correct the durations, and the number of executions and fetches held in summary transactions. The rationale is that these transactions would not normally occur in significant quantities in a real production system, and there is not much I can do about them when they do.
    The first step is to clone the trace. I could work on the trace directly, but I want to preserve the original data.
    PPM transactions are held in the table PSPMTRANSHIST. They have a unique identifier PM_INSTANCE_ID. A single server round trip, also called a Performance Monitoring Unit (PMU), will consist of many transactions. They can be shown as a tree and each transaction has another field PM_PARENT_INST_ID which holds the instance of the parent. This links the data together and we can use hierarchical queries in Oracle SQL to walk the tree. Another field PM_TOP_INST_ID identifies the root transaction in the tree.
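
    For example, a single PMU tree can be walked with a hierarchical query along these lines (a sketch; the root instance ID is a placeholder bind variable):

    SELECT LEVEL, h.pm_instance_id, h.pm_parent_inst_id,
           h.pm_trans_defn_id, h.pm_trans_duration
    FROM   pspmtranshist h
    START WITH h.pm_instance_id = :root_instance_id /*the PM_TOP_INST_ID of the PMU*/
    CONNECT BY PRIOR h.pm_instance_id = h.pm_parent_inst_id
    ORDER SIBLINGS BY h.pm_instance_id;
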
    Cloning a PPM trace is simply a matter of inserting data into PSPMTRANSHIST. However, when I clone a PPM trace I have to make sure that the instance numbers are distinct but still link correctly. In my system I can take a very simple approach. All the instance numbers actually collected by PPM are greater than 1E16. So, I will simply use the modulus function to consistently alter the instances to be different. This approach may break down in future, but it will do for now.
    On an Oracle database, PL/SQL is a simple and effective way to write procedural processes.  I have written two anonymous blocks of code.
    Note that the cloned trace will be purged from PPM like any other data by the delivered PPM archive process.

    REM xPT.sql
    BEGIN --duplicate PPM traces
    FOR i IN (
    SELECT h.*
    FROM pspmtranshist h
    WHERE pm_perf_trace != ' ' /*rows must have a trace name*/
    -- AND pm_perf_trace = '9b. XXXXXXXXXX' /*I could specify a specific trace by name*/
    AND pm_instance_id > 1E16 /*only look at instance > 1e16 so I do not clone cloned traces*/
    ) LOOP
    INSERT INTO pspmtranshist
    (PM_INSTANCE_ID, PM_TRANS_DEFN_SET, PM_TRANS_DEFN_ID, PM_AGENTID, PM_TRANS_STATUS,
    OPRID, PM_PERF_TRACE, PM_CONTEXT_VALUE1, PM_CONTEXT_VALUE2, PM_CONTEXT_VALUE3,
    PM_CONTEXTID_1, PM_CONTEXTID_2, PM_CONTEXTID_3, PM_PROCESS_ID, PM_AGENT_STRT_DTTM,
    PM_MON_STRT_DTTM, PM_TRANS_DURATION, PM_PARENT_INST_ID, PM_TOP_INST_ID, PM_METRIC_VALUE1,
    PM_METRIC_VALUE2, PM_METRIC_VALUE3, PM_METRIC_VALUE4, PM_METRIC_VALUE5, PM_METRIC_VALUE6,
    PM_METRIC_VALUE7, PM_ADDTNL_DESCR)
    VALUES
    (MOD(i.PM_INSTANCE_ID,1E16) /*apply modulus to instance number*/
    ,i.PM_TRANS_DEFN_SET, i.PM_TRANS_DEFN_ID, i.PM_AGENTID, i.PM_TRANS_STATUS,
    i.OPRID,
    SUBSTR('xPT'||i.PM_PERF_TRACE,1,30) /*adjust trace name*/,
    i.PM_CONTEXT_VALUE1, i.PM_CONTEXT_VALUE2, i.PM_CONTEXT_VALUE3,
    i.PM_CONTEXTID_1, i.PM_CONTEXTID_2, i.PM_CONTEXTID_3, i.PM_PROCESS_ID, i.PM_AGENT_STRT_DTTM,
    i.PM_MON_STRT_DTTM, i.PM_TRANS_DURATION,
    MOD(i.PM_PARENT_INST_ID,1E16), MOD(i.PM_TOP_INST_ID,1E16), /*apply modulus to parent and top instance number*/
    i.PM_METRIC_VALUE1, i.PM_METRIC_VALUE2, i.PM_METRIC_VALUE3, i.PM_METRIC_VALUE4, i.PM_METRIC_VALUE5,
    i.PM_METRIC_VALUE6, i.PM_METRIC_VALUE7, i.PM_ADDTNL_DESCR);
    END LOOP;
    COMMIT;
    END;
    /
    Now I will work on the cloned trace. I want to remove certain transactions.
    • PeopleTools SQL. Metric value 7 reports the SQL operation and SQL table name. So if the first word is SELECT and the second word is a PeopleTools table name then it is a PeopleTools SQL operation. A list of PeopleTools tables can be obtained from the object security table PSOBJGROUP.
    • Implicit Commit transactions. This is easy - it is just transaction type 425. 
    Having deleted the PeopleTools transactions, I must also
    • Correct the transaction duration for any parents of the transaction. I work up the hierarchy of transactions and deduct the duration of the transaction that I am deleting from all of its parents.
    • Transaction types 400, 427 and 428 all record PeopleTools SQL time (metric 66). When I come to such a transaction I also deduct the duration of the deleted transaction from the PeopleTools SQL time metric in any parent transaction.
    • Delete any children of the transactions that I delete. 
    • I must also count each PeopleTools SQL Execution transaction (type 408) and each PeopleTools SQL Fetch transaction (type 414) that I delete. These counts are also deducted from the summaries on the parent transaction 400. 
    The summaries in transaction 400 are used on the 'Round Trip Details' components, and if they are not adjusted you can get misleading results. Without the adjustments, I have encountered PMUs where more than 100% of the total duration is spent in SQL - which is obviously impossible.
    Although this technique of first cloning the whole trace and then deleting the PeopleTools operations can be quite slow, it is not something that you are going to do very often. 
    REM xPT.sql
    REM (c)Go-Faster Consultancy Ltd. 2014
    set serveroutput on echo on
    DECLARE
    l_pm_instance_id_m4 INTEGER;
    l_fetch_count INTEGER;
    l_exec_count INTEGER;
    BEGIN /*now remove PeopleTools SQL transaction and any children and adjust trans durations*/
    FOR i IN (
    WITH x AS ( /*returns PeopleTools tables as defined in Object security*/
    SELECT o.entname recname
    FROM psobjgroup o
    WHERE o.objgroupid = 'PEOPLETOOLS'
    AND o.enttype = 'R'
    )
    SELECT h.pm_instance_id, h.pm_parent_inst_id, h.pm_trans_duration, h.pm_trans_defn_id
    FROM pspmtranshist h
    LEFT OUTER JOIN x
    ON h.pm_metric_value7 LIKE 'SELECT '||x.recname||'%'
    AND x.recname = upper(regexp_substr(pm_metric_value7,'[^ ,]+',8,1)) /*first word after select*/
    WHERE pm_perf_trace like 'xPT%' /*restrict to cloned traces*/
    -- AND pm_perf_trace = 'xPT9b. XXXXXXXXXX' /*work on a specific trace*/
    AND pm_instance_id < 1E16 /*restrict to cloned traces*/
    AND ( x.recname IS NOT NULL
    OR h.pm_trans_defn_id IN(425 /*Implicit Commit*/))
    ORDER BY pm_instance_id DESC
    ) LOOP
    l_pm_instance_id_m4 := TO_NUMBER(NULL);
     
        IF i.pm_parent_inst_id>0 AND i.pm_trans_duration>0 THEN
    FOR j IN(
    SELECT h.pm_instance_id, h.pm_parent_inst_id, h.pm_top_inst_id, h.pm_trans_defn_id
    , d.pm_metricid_3, d.pm_metricid_4
    FROM pspmtranshist h
    INNER JOIN pspmtransdefn d
    ON d.pm_trans_defn_set = h.pm_trans_defn_set
    AND d.pm_trans_defn_id = h.pm_trans_Defn_id
    START WITH h.pm_instance_id = i.pm_parent_inst_id
    CONNECT BY prior h.pm_parent_inst_id = h.pm_instance_id
    ) LOOP
    /*decrement parent transaction times*/
    IF j.pm_metricid_4 = 66 /*PeopleTools SQL Time (ms)*/ THEN --decrement metric 4 on transaction 400
    --dbms_output.put_line('ID:'||i.pm_instance_id||' Type:'||i.pm_trans_defn_id||' decrement metric_value4 by '||i.pm_trans_duration);
    UPDATE pspmtranshist
    SET pm_metric_value4 = pm_metric_value4 - i.pm_trans_duration
    WHERE pm_instance_id = j.pm_instance_id
    AND pm_trans_Defn_id = j.pm_trans_defn_id
    AND pm_metric_value4 >= i.pm_trans_duration
    RETURNING pm_instance_id INTO l_pm_instance_id_m4;
    ELSIF j.pm_metricid_3 = 66 /*PeopleTools SQL Time (ms)*/ THEN --SQL time on serialisation
    --dbms_output.put_line('ID:'||i.pm_instance_id||' Type:'||i.pm_trans_defn_id||' decrement metric_value3 by '||i.pm_trans_duration);
    UPDATE pspmtranshist
    SET pm_metric_value3 = pm_metric_value3 - i.pm_trans_duration
    WHERE pm_instance_id = j.pm_instance_id
    AND pm_trans_Defn_id = j.pm_trans_defn_id
    AND pm_metric_value3 >= i.pm_trans_duration;
    END IF;

    UPDATE pspmtranshist
    SET pm_trans_duration = pm_trans_duration - i.pm_trans_duration
    WHERE pm_instance_id = j.pm_instance_id
    AND pm_trans_duration >= i.pm_trans_duration;
    END LOOP;
    END IF;

    l_fetch_count := 0;
    l_exec_count := 0;
    FOR j IN( /*identify transaction to be deleted and any children*/
    SELECT pm_instance_id, pm_parent_inst_id, pm_top_inst_id, pm_trans_defn_id, pm_metric_value3
    FROM pspmtranshist
    START WITH pm_instance_id = i.pm_instance_id
    CONNECT BY PRIOR pm_instance_id = pm_parent_inst_id
    ) LOOP
    IF j.pm_trans_defn_id = 408 THEN /*if PeopleTools SQL*/
    l_exec_count := l_exec_count + 1;
    ELSIF j.pm_trans_defn_id = 414 THEN /*if PeopleTools SQL Fetch*/
    l_fetch_count := l_fetch_count + j.pm_metric_value3;
    END IF;
    DELETE FROM pspmtranshist h /*delete tools transaction*/
    WHERE h.pm_instance_id = j.pm_instance_id;
    END LOOP;

    IF l_pm_instance_id_m4 > 0 THEN
    --dbms_output.put_line('ID:'||l_pm_instance_id_m4||' Decrement '||l_exec_Count||' executions, '||l_fetch_count||' fetches');
    UPDATE pspmtranshist
    SET pm_metric_value5 = pm_metric_value5 - l_exec_count
    , pm_metric_value6 = pm_metric_value6 - l_fetch_count
    WHERE pm_instance_id = l_pm_instance_id_m4;
    l_fetch_count := 0;
    l_exec_count := 0;
    END IF;

    END LOOP;
    END;
    /
    Now, I have a second PPM trace that I can open in the analytic component.

    Figure 4: Original and Cloned PPM traces

    When I open the cloned trace, both timings in the duration summary have reduced as have the number of executions and fetches.  The durations of the individual server round trips have also reduced.
    Figure 5: Component Trace without PeopleTools transactions
    All of the PeopleTools SQL operations have disappeared from the SQL summary.
    Figure 6: SQL Summary of PPM trace after removing PeopleTools SQL transactions
    The SQL summary now only has 125 rows of data.
    Figure 7: SQL Summary of PMU without PeopleTools SQL
    Now, the PPM tree component opens quickly and without error.
    Figure 8: PMU Tree after removing PeopleTools SQL
    There may still be more transactions in a PMU than I can show in a screenshot, but I can now find the statement that took the most time quite quickly.

    Figure 9: Long SQL transaction further down same PMU tree
    Conclusions

    I think that it is reasonable and useful to remove PeopleTools SQL operations from a PPM trace.
    In normal production operation, components will mostly be cached, and this approach renders traces collected in non-production environments both usable in the PPM analytic components and more realistic for performance tuning. However, it is essential that, when deleting some transactions from a PMU, the summary data held in other transactions in the same PMU is also corrected so that the metrics remain consistent.

    ©David Kurtz, Go-Faster Consultancy Ltd.

    Upgrades

    Jonathan Lewis - Mon, 2014-11-03 12:31

    One of the worst problems with upgrades is that things sometimes stop working. A particular nuisance is the execution plan that suddenly stops appearing, to be replaced by an alternative plan that is much less efficient.

    Apart from the nuisance of the time spent trying to force the old plan to re-appear, plus the time spent working out a way of rewriting the query when you finally decide the old plan simply isn’t going to re-appear, there’s also the worry about WHY the old plan won’t appear. Is it some sort of bug, is it that some new optimizer feature has disabled some older optimizer feature, or is it that someone in the optimizer group realised that the old plan was capable of producing the wrong results in some circumstances … it’s that last possibility that I find most worrying.

    Here’s an example that appeared recently on OTN that’s still got me wondering about the possibility of wrong results (in the general case). We start with a couple of tables, a view, and a pipelined function. This example is a simple model of the problem that showed up on OTN; it’s based on generated data so that anyone who wants to can play around with it to see if they can bypass the problem without making any significant changes to the shape of the code:

    
    create table t1
    as
    with generator as (
    	select	--+ materialize
    		rownum id
    	from dual
    	connect by
    		level <= 1e4
    )
    select
    	rownum			id,
    	rownum			n1,
    	mod(rownum,100)		n_100,
    	rpad('x',100)		padding
    from
    	generator	v1
    ;
    
    create table t2
    as
    with generator as (
    	select	--+ materialize
    		rownum id
    	from dual
    	connect by
    		level <= 1e4
    )
    select
    	rownum			id,
    	rownum			n1,
    	mod(rownum,100)		n_100,
    	rpad('x',100)		padding
    from
    	generator	v1
    ;
    
    alter table t2 add constraint t2_pk primary key(id);
    
    begin
    	dbms_stats.gather_table_stats(
    		ownname		 => user,
    		tabname		 =>'T1',
    		method_opt	 => 'for all columns size 1'
    	);
    
    	dbms_stats.gather_table_stats(
    		ownname		 => user,
    		tabname		 =>'T2',
    		method_opt	 => 'for all columns size 1'
    	);
    
    end;
    /
    
    create or replace type myScalarType as object (
            x int,
            y varchar2(15),
            d date
    )
    /
    
    create or replace type myArrayType as table of myScalarType
    /
    
    create or replace function t_fun1(i_in number)
    return myArrayType
    pipelined
    as
    begin
    	pipe row (myscalartype(i_in,     lpad(i_in,15),     trunc(sysdate) + i_in    ));
    	pipe row (myscalartype(i_in + 1, lpad(i_in + 1,15), trunc(sysdate) + i_in + 1));
    	return;
    end;
    /
    
    create or replace view v1
    as
    select
    	--+ leading(t2 x) index(t2)
    	x.x, x.y, x.d,
    	t2.id, t2.n1
    from
    	t2,
    	table(t_fun1(t2.n_100)) x
    where
    	mod(t2.n1,3) = 1
    union all
    select
    	--+ leading(t2 x) index(t2)
    	x.x, x.y, x.d,
    	t2.id, t2.n1
    from
    	t2,
    	table(t_fun1(t2.n_100)) x
    where
    	mod(t2.n1,3) = 2
    ;
    
    

    A key part of the problem is the UNION ALL view, where each subquery holds a join to a pipelined function. We’re about to write a query that joins to this view, and we want to push a join predicate into the view. Here’s the SQL:

    
    select
    	/*+ leading(t1 v1) use_nl(v1) */
    	v1.x, v1.y, v1.d,
    	v1.n1,
    	t1.n1
    from
    	t1,
    	v1
    where
    	t1.n_100 = 0
    and	v1.id = t1.n1
    ;
    
    

    You’ll notice that the join v1.id = t1.n1 could (in principle) be pushed inside the view to become t2.id = t1.n1 in the two branches of the UNION ALL; this would make it possible for the nested loop that I’ve hinted between t1 and v1 to operate efficiently – and in 11.1.0.7 this is exactly what happens:

    
    ------------------------------------------------------------------------------------------------
    | Id  | Operation                             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                      |        | 16336 |   733K|   123   (1)| 00:00:01 |
    |   1 |  NESTED LOOPS                         |        | 16336 |   733K|   123   (1)| 00:00:01 |
    |*  2 |   TABLE ACCESS FULL                   | T1     |   100 |   700 |    23   (5)| 00:00:01 |
    |   3 |   VIEW                                | V1     |   163 |  6357 |     1   (0)| 00:00:01 |
    |   4 |    UNION-ALL PARTITION                |        |       |       |            |          |
    |   5 |     NESTED LOOPS                      |        |  8168 |   103K|    16   (0)| 00:00:01 |
    |*  6 |      TABLE ACCESS BY INDEX ROWID      | T2     |     1 |    11 |     2   (0)| 00:00:01 |
    |*  7 |       INDEX UNIQUE SCAN               | T2_PK  |     1 |       |     1   (0)| 00:00:01 |
    |   8 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |       |       |            |          |
    |   9 |     NESTED LOOPS                      |        |  8168 |   103K|    16   (0)| 00:00:01 |
    |* 10 |      TABLE ACCESS BY INDEX ROWID      | T2     |     1 |    11 |     2   (0)| 00:00:01 |
    |* 11 |       INDEX UNIQUE SCAN               | T2_PK  |     1 |       |     1   (0)| 00:00:01 |
    |  12 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |       |       |            |          |
    ------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - filter("T1"."N_100"=0)
       6 - filter(MOD("T2"."N1",3)=1)
       7 - access("T2"."ID"="T1"."N1")
      10 - filter(MOD("T2"."N1",3)=2)
      11 - access("T2"."ID"="T1"."N1")
    
    

    For each row returned by the tablescan at line 2 we call the view operator at line 3 to generate a rowsource, but we can see in the predicate sections for lines 7 and 11 that the join value has been pushed inside the view, allowing us to access t2 through its primary key index. Depending on the data definitions, constraints, view definition, and version of Oracle, you might see the UNION ALL operator displaying the PARTITION option or the PUSHED PREDICATE option in cases of this type.

    So now we upgrade to 11.2.0.4 (probably any 11.2.x.x version) and get the following plan:

    
    ------------------------------------------------------------------------------------------------
    | Id  | Operation                             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                      |        |  1633K|    99M|   296K  (4)| 00:24:43 |
    |   1 |  NESTED LOOPS                         |        |  1633K|    99M|   296K  (4)| 00:24:43 |
    |*  2 |   TABLE ACCESS FULL                   | T1     |   100 |   700 |    23   (5)| 00:00:01 |
    |*  3 |   VIEW                                | V1     | 16336 |   909K|  2966   (4)| 00:00:15 |
    |   4 |    UNION-ALL                          |        |       |       |            |          |
    |   5 |     NESTED LOOPS                      |        |   816K|    10M|  1483   (4)| 00:00:08 |
    |*  6 |      TABLE ACCESS BY INDEX ROWID      | T2     |   100 |  1100 |   187   (2)| 00:00:01 |
    |   7 |       INDEX FULL SCAN                 | T2_PK  | 10000 |       |    21   (0)| 00:00:01 |
    |   8 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |  8168 | 16336 |    13   (0)| 00:00:01 |
    |   9 |     NESTED LOOPS                      |        |   816K|    10M|  1483   (4)| 00:00:08 |
    |* 10 |      TABLE ACCESS BY INDEX ROWID      | T2     |   100 |  1100 |   187   (2)| 00:00:01 |
    |  11 |       INDEX FULL SCAN                 | T2_PK  | 10000 |       |    21   (0)| 00:00:01 |
    |  12 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |  8168 | 16336 |    13   (0)| 00:00:01 |
    ------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - filter("T1"."N_100"=0)
       3 - filter("V1"."ID"="T1"."N1")
       6 - filter(MOD("T2"."N1",3)=1)
      10 - filter(MOD("T2"."N1",3)=2)
    
    

    In this plan the critical join predicate appears at line 3; the predicate hasn’t been pushed. On the other hand the index() hints in the view have, inevitably, been obeyed (resulting in index full scans), as has the use_nl() hint in the main query – leading to a rather more expensive and time-consuming execution plan.

    The first, quick, debugging step is simply to set the optimizer_features_enable back to 11.1.0.7 – with no effect; the second is to try adding the push_pred() hint to the query – with no effect; the third is to generate the outline section of the execution plans and copy the entire set of hints from the good plan into the bad plan, noting as we do so that the good plan actually uses the hint OLD_PUSH_PRED(@"SEL$1" "V1"@"SEL$1" ("T2"."ID")) – still no effect.
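
    For reference, one convenient way to generate the outline section (the full hint set) for the most recently executed statement is dbms_xplan with the OUTLINE format option – a generic example rather than the exact command used here:

    select * from table(dbms_xplan.display_cursor(null, null, 'OUTLINE'));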

    Since I happen to know a few things about what is likely to appear in the 10053 (optimizer) trace file, my next step would be to flush the shared pool, enable the trace, and then check the trace file (using grep or find depending on whether I was running UNIX or Windows) for the phrase “JPPD bypassed”; this is what I got:

    
    test_ora_9897.trc:OJPPD:     OJPPD bypassed: View contains TABLE expression.
    test_ora_9897.trc:JPPD:     JPPD bypassed: View not on right-side of outer-join.
    test_ora_9897.trc:JPPD:     JPPD bypassed: View not on right-side of outer-join.
    
    

    So 11.1.0.7 had a plan that used the old_push_pred() hint, but 11.2.0.4 explicitly bypassed the option (the rubric near the top of the trace file translates OJPPD to “old-style (non-cost-based) JPPD”, where JPPD translates to “join predicate push-down”). It looks like the plan we got from 11.1.0.7 has been deliberately blocked in 11.2.0.4. So now it’s time to worry whether or not that means I could have been getting wrong results from 11.1.0.7.
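
    For anyone wanting to reproduce the check, the trace setup described above uses the standard 10053 event – the following is a minimal sketch rather than the exact commands used:

    alter system flush shared_pool;
    alter session set events '10053 trace name context forever, level 1';
    -- re-run the problem query, then:
    alter session set events '10053 trace name context off';
    -- and search the resulting trace file for the phrase "JPPD bypassed"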

    In my test case, of course, I can bypass the problem by explicitly rewriting the query – but I’ll have to move the join with t1 inside the view for both subqueries; alternatively, given the trivial nature of the pipeline function, I could replace the table() operator with a join to another union all view. In real life such changes are not always so easy to implement.

    Footnote: the restriction is still in place on 12.1.0.2.

    Footnote 2: somewhere I’ve probably published a short note explaining that one of my standard pre-emptive strikes on an upgrade is to run the following command to extract useful information from the executable: “strings -a oracle | grep -v bypass”: it can be very helpful to have a list of situations in which some query transformation is bypassed.

     


    Oracle Roundtables: Next Gen Digital Experience & Engagement (Dallas & Chicago)

    WebCenter Team - Mon, 2014-11-03 10:34
    Oracle Corporation Next Gen Digital Experience & Engagement

    Connecting Experiences to Outcomes

    The world has changed to one that’s always on, always-engaged, requiring organizations to rapidly become “digital businesses.” In order to thrive and survive in this new economy, having the right digital experience and engagement strategy and speed of execution is crucial. 

    But where do you start? How do you accelerate this transformation? 

    Attend this roundtable to hear directly from leading industry analysts from Forrester Research, Inc., Blast Radius, client companies, and solution experts as they outline the best practice strategies to seize the full potential of a digital experience and engagement platform. Gain insights on how your business can deliver exceptional and engaging digital experiences and drive the next wave of revenue growth, service excellence and business efficiency. 

    We look forward to your participation at the Solution Roundtable. 

    Register now for the November 12 event or call 1.800.820.5592 ext. 12830.

    Register now for the November 13 event or call 1.800.820.5592 ext. 12864.


    Register Now (Dallas) | Register Now (Chicago)

    November 12, 2014
    10:30 a.m. - 11:45 a.m.
    Renaissance Dallas
    2222 N. Stemmons Fwy.
    Dallas, TX 75207

    November 13, 2014
    10:30 a.m. - 11:45 a.m.
    The Westin O'Hare
    6100 N River Rd
    Rosemont, IL 60018

    Featuring:

    James L. McQuivey, Ph.D.
    Vice President, Principal Analyst serving CMO Professionals, Forrester



    UKOUG Partner of the Year Awards

    Rittman Mead Consulting - Mon, 2014-11-03 10:23

    A few days ago Rittman Mead won 5 awards at the UKOUG Partner of the Year Awards.

    • Business Intelligence Partner of the Year (Silver)
    • Training Partner of the Year (Silver)
    • Managed Services (Outsourcing and Operations) Partner of the Year (Silver)
    • Emerging (New Products) Partner of the Year (Silver)
    • Operating Systems, Storage and Hardware Partner of the Year (Gold)

    We have consistently done well at this event and feel the awards reflect the effort we put in to both our clients and the user community alike.

    The number and diversity of the awards demonstrates how much Rittman Mead has grown over the years. Enterprise Business Intelligence is still at the heart of the work we do, however awards for Operating Systems, Storage and Hardware Partner of the Year and Emerging (New Products) Partner of the Year show the investment we have made in the engineered systems, big data and cloud markets.

    Likewise the Training Partner of the Year and Managed Services (Outsourcing and Operations) Partner of the Year awards show the end to end services we now offer.

    I would like to say thank you very much to everyone who voted for us and most of all thanks to all our staff who have put in the effort to make this happen.

    Categories: BI & Warehousing

    50 years of killing Deer – Data visualization and analysis

    Nilesh Jethwa - Mon, 2014-11-03 10:04

    Virginia maintains a summary of deer kills going all the way back to 1947.

    The stacked bar chart gives a total view of the kills and how they have grown over the years.

    [Image: stacked bar chart of deer kills by year]

     

    By comparing the kills on a line chart, we can see that female deer kills have an uptick from 2008 onwards.

    [Image: line chart of deer kills over the years]

    Watch: The Most Underrated Features of SQL Server 2014 — Part 1

    Pythian Group - Mon, 2014-11-03 09:59

    Since its release back in April, SQL Server experts across the globe are becoming familiar with the top features in Microsoft SQL Server 2014—the In-Memory OLTP engine, the AlwaysOn enhancements, and more. But we couldn’t help but notice that there are a few features that aren’t getting the same attention. Warner Chaves, a Microsoft Certified Master and SQL Server Principal Consultant at Pythian has filmed a video series sharing the most underrated features of SQL Server 2014.

    In his first video, Warner emphasizes the importance of Microsoft’s new feature, delayed durability. “Why do we care about delayed durability?” Warner asks. “The reason we care about it is because waits on the transaction log get a lot shorter. It’s a very, very small change on the database that makes a big impact.” Learn about the specific use cases for delayed durability, and how to implement it in Warner’s video The Most Underrated Features of SQL Server 2014 — Part 1 down below.
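
    For context, delayed durability is enabled with a simple database-level setting, and can also be allowed per transaction; the database name below is a placeholder:

    -- Force delayed durability for every transaction in the database
    ALTER DATABASE SalesDB SET DELAYED_DURABILITY = FORCED;

    -- Or allow it and opt in per transaction at commit time
    ALTER DATABASE SalesDB SET DELAYED_DURABILITY = ALLOWED;
    -- COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);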

    What are your thoughts? Are there any features that you feel were quietly announced, but deserve more recognition? Let us know.

    Watch the rest of the series here:

     

    Pythian is a global leader in data consulting and managed services. We specialize in optimizing and managing mission-critical data systems, combining the world’s leading data experts with advanced, secure service delivery. Learn more about Pythian’s Microsoft SQL Server expertise.

     

    Categories: DBA Blogs

    Useful Carlos Sierra post about queries changing performance

    Bobby Durrett's DBA Blog - Mon, 2014-11-03 09:26

    Saw an interesting post related to work I am doing on locking in plans and finding plans that have changed: Carlos Sierra's post on queries with changing performance.

    Our DBA team uses versions of my sqlstat.sql script to find queries whose elapsed time changes, and we use a script from Carlos Sierra's SQLT collection to force the execution plan back to an efficient plan that the optimizer chose in the past.  Carlos Sierra's new post includes scripts that resemble my sqlstat script but with more functionality, including an interesting way of helping people figure out which queries are now running slower.  At this point we just look at the top queries that are running and run sqlstat.sql to see if they have run with less elapsed time in the past.  We eyeball it and usually it is obvious because the bad plan takes 100 times longer than the good one.  But the newly posted Carlos Sierra script seems to use some math to help figure out whether the elapsed time is really bad enough for us to consider it a problem.

    I want to set up a monitoring script that pages us when a plan changes for the worse so we can decide whether to intervene and force an older plan to improve performance.  I have a simple script running on one of our databases but, as I expected, it is returning a lot of false positives.  So, I'm hoping to have some time this week to improve that script so that we get fewer false alarms, and I'm going to take a close look at Carlos Sierra's newly posted script for ideas.

    – Bobby

     




    Categories: DBA Blogs

    Logging Actual Application User Names for Oracle E-Business Suite, SAP, PeopleSoft, and OBIEE

    Knowing which person, not just which database account, performed an action has been a challenge for database logging and auditing when working with enterprise software applications such as the Oracle E-Business Suite, SAP, PeopleSoft, and OBIEE.  Knowing which application user did what and when is now much easier because of the adoption of standard Oracle functionality.

    The CLIENT_IDENTIFIER attribute is standard functionality of the Oracle database.  CLIENT_IDENTIFIER is a predefined attribute of the built-in application context namespace, USERENV, and can be used to capture the application user name.

    CLIENT_IDENTIFIER is set using the DBMS_SESSION.SET_IDENTIFIER procedure to store the application username.  The CLIENT_IDENTIFIER attribute is one and the same as V$SESSION.CLIENT_IDENTIFIER.  Once set, you can query V$SESSION or run select sys_context('userenv','client_identifier') from dual.
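    For example, a middle-tier program can tag its pooled database session with the real application user and read the value back. The snippet below is a minimal sketch using Python with the cx_Oracle driver (an assumption on our part; any client that can call PL/SQL works the same way), and the connection details and the user name JSMITH are placeholders.

    import cx_Oracle

    # Placeholders: schema, password and connect string for your database
    conn = cx_Oracle.connect("app_schema", "app_password", "dbhost/ORCL")
    cur = conn.cursor()

    # Store the application user name in the session's CLIENT_IDENTIFIER
    cur.callproc("dbms_session.set_identifier", ["JSMITH"])

    # Verify it: the same value is now visible in V$SESSION.CLIENT_IDENTIFIER
    cur.execute("select sys_context('userenv','client_identifier') from dual")
    print(cur.fetchone()[0])   # JSMITH

    # Clear the identifier before handing the pooled connection back
    cur.callproc("dbms_session.clear_identifier")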

    The list below offers examples of how CLIENT_IDENTIFIER is now being used by the Oracle E-Business Suite, PeopleSoft, SAP, and OBIEE. If you are running one of these software packages, Integrigy highly recommends that you incorporate the information that the CLIENT_IDENTIFIER provides into your logging and auditing solution.

    Oracle CLIENT_IDENTIFIER usage by application:

    • Oracle E-Business Suite: As of Release 12, the Oracle E-Business Suite automatically sets and updates CLIENT_IDENTIFIER to the FND_USER.USERNAME of the user logged on.  Prior to Release 12, follow Support Note "How to add DBMS_SESSION.SET_IDENTIFIER(FND_GLOBAL.USER_NAME) to FND_GLOBAL.APPS_INITIALIZE procedure" (Doc ID 1130254.1).

    • Oracle PeopleSoft: Starting with PeopleTools 8.50, the PSOPRID is now additionally set in the Oracle database CLIENT_IDENTIFIER attribute.

    • SAP: With SAP version 7.10 and above, the SAP user name is stored in the CLIENT_IDENTIFIER.

    • Oracle Business Intelligence Enterprise Edition (OBIEE): When querying an Oracle database using OBIEE, the connection pool username is passed to the database.  To also pass the middle-tier username, set the user identifier on the session.  To do this in OBIEE, open the RPD, edit the connection pool settings and create a new connection script to run at connect time.  Add the following line to the connect script:

    CALL DBMS_SESSION.SET_IDENTIFIER('VALUEOF(NQ_SESSION.USER)')

    If you have questions, please contact us at info@integrigy.com

    Reference Tags: Auditing, Oracle Database, Oracle E-Business Suite, Oracle PeopleSoft, SAP, Oracle Business Intelligence (OBIEE)
    Categories: APPS Blogs, Security Blogs

    Digital Learning: LVC customer quotes

    The Oracle Instructor - Mon, 2014-11-03 03:34

    LVC

    Here’s a collection of customer quotes as a follow-up to my last post about the importance of attitude towards Live Virtual Classes (LVCs). They are from courses that I have taught personally this year with an average delivery score of about 96%:

    Oracle Database 12c: Managing Multitenant Architecture

    Oracle Grid Infrastructure 11g: Manage Clusterware and ASM

    Oracle Database 12c: New Features for Administrators

    Exadata Database Machine Administration Workshop

    Oracle 11g: RAC and Grid Infrastructure Administration Accelerated

    “My first experience of an LVC. Pleased to say it was very positive. Introduction and start on Monday morning was smooth. I would recommend to my colleagues.”

    Always important to make a good first impression!

    “The whole LVC package just worked. From the comfort of my own environment with a great instructor makes for happy learning :)”

    And that is exactly what we strive to deliver.

    “Both, host and producer were very professional and guided the students through the course.”

    An LVC producer takes care of all technical aspects apart from the course itself, such as access to the learning platform. The instructor appears as the "host" on the learning platform.

    “Instructor professionally answered students’ questions and kept up a positive mood in the community!”

    LVCs can be funny too :-)

    “I appreciate the way how the course was presented. Very well controlled time, organization of presentation, exercises. Interaction with us was great. Always ready to answer a question, give an examples to difficult topic, illustrating topics.”

    So much for the allegedly missing interaction in LVCs.

    “I work few years on RAC databases, my knowledge was not so clear regarding some topic on RAC and Grid after completing this training I’m sure that I will handle our RAC and Grid environment differently and for sure will have positive impact in our production environment. Great thank!”

    You cannot top that with a classroom course either :-)

    “LVC is offering great and flexible way to gain knowledge without travel or hotels etc.” “LVCs reduce travel costs and help the students to manage their time on their own, i.e. to join the classes from home and focus on the presented content.”

    Trust me, I didn’t make up the last two although they may sound like your manager talking – or mine, for that matter ;-)


    Tagged: Digital Learning, LVC
    Categories: DBA Blogs

    Announcement: DBaaS Web Seminar

    Jean-Philippe Pinte - Mon, 2014-11-03 03:32

    Dramatically accelerate the delivery of your DB services with a Private Cloud infrastructure for your databases
    (Database as a Service).

    Tuesday, November 18, 2014, from 11:00 a.m. to 12:00 p.m.

    Oracle's solutions for building a Private Cloud transform IT organizations by providing an agile, scalable and shared architecture that meets their IT needs and objectives.
    Register for the webinar on November 18, 2014 to understand the Database private cloud solution: its characteristics, the challenges, the benefits, and more.
    During this webinar, we will present:
    • The main features of the solution: automation, self-service portal, service catalog, snapclone, and usage metering and chargeback.
    • The main benefits: reduction of the time to deliver DB services from days/weeks to minutes/hours, drastic reduction of non-production data volumes, more agility, better control of resources, and consolidation.
    • A demonstration of our DBaaS Private Cloud solution.
    • Customer examples.
    This webinar is aimed at anyone involved in planning, deploying and administering a private cloud, as well as non-production teams that need DB services: DBAs, system administrators, technical architects, QA engineers, developers, application/web project managers, testers, and so on.
    Register now by sending an e-mail to alexandre.lacreuse@oracle.com.

    MdsMetadataResourceProvider Error in ADF 11.1.1.7

    Andrejus Baranovski - Mon, 2014-11-03 02:59
    If you have migrated an ADF application to ADF 11.1.1.7 recently, you may see an error in the log related to internal MetadataResourceXmlImpl class usage. Runtime behaviour will not be affected, but there could be many redundant error messages in the log, like this one:


    The error UNEXPECTED_CLIENT_OBJECT_TYPE found oracle.adf.controller.internal.metadata.xml.MetadataResourceXmlImpl points to the use of an internal ADF class instead of a public one. Look through the source code and you should find illegal internal imports for the MetadataService and TaskFlowDefinition classes. Typically an ADF developer would use such classes in previous ADF versions to check whether a specific ADF task flow exists:


    In ADF 11.1.1.7 you can use the public MetadataService and TaskFlowDefinition classes instead of the internal ones. To fix the error, you only need to change the class imports:


    Here you can download sample application with correct API usage - ADFMetadataInternalApp.zip.

    Oracle BI Cloud Service for SaaS Application Reporting Part 1: Integrating BICS to Salesforce.com using REST APIs

    Rittman Mead Consulting - Mon, 2014-11-03 01:30

    Last month Mark Rittman published a series of posts detailing the Oracle BI Cloud Service (BICS), aimed at departmental users who want the power of OBIEE 11g without the need to stand up their own infrastructure. If you're coming in late, here's the link to the series.

    Before the GA, Rittman Mead participated in the beta program for release one of Oracle's Business Intelligence Cloud Service (BICS); the overall aim of the beta was both to understand the capabilities and to identify potential use cases of the new BICS platform.  As Mark wrote, an excellent use case for BICS is to report on top of any SaaS application that exposes its data (stored in the cloud) using REST APIs, by taking advantage of the ApEx capabilities hosted in the Database Schema Service that comes with BICS. SaaS applications like Oracle's Fusion CRM, Taleo or Salesforce.com – which was used during the beta program – can be easily integrated and queried with BICS.

    The technical goal of our beta program was to check the features, options and limitations of Oracle BICS by connecting it to a Salesforce.com instance, accessing the data exposed through REST APIs by using the ApEx native functions, storing the data in the Database Schema Service, creating the Repository using the new Model Editor and showing the data in dashboards, while keeping the same data security settings configured in the source platform.

    Salesforce.com to Oracle BI Cloud Service

    This series of posts is going to explain all the details of our successful PoC. Over the next few days we'll be covering the following topics, and we'll update the list with hyperlinks once the articles are published:

    • Oracle BI Cloud Service for SaaS Application Reporting Part 1: Integrating BICS to Salesforce.com using REST APIs
    • Oracle BI Cloud Service for SaaS Application Reporting Part 2: BICS ApEx components
    • Oracle BI Cloud Service for SaaS Application Reporting Part 3: BICS Repository and Front-end configurations

    In the first post of the series we're going to define the steps required to set up a Salesforce.com demo environment that can be accessed with REST APIs. Later in the post we analyse the most interesting Salesforce.com REST APIs that are used to extract the data from the platform. Before digging into the BICS-Salesforce.com integration, some definitions may be needed:

    • RESTful APIs: A Web service API can be defined as RESTful if it conforms to the REST architectural constraints. Some more details about RESTful APIs can be found here.
    • Salesforce.com: Salesforce.com is a company specialised in software as a service (SaaS) and one of the major providers of CRM in the cloud. In addition to CRM, Salesforce.com offers Force.com, a cloud platform as a service (PaaS) that developers can use in order to build multitenant applications hosted on Salesforce.com servers. All the Force.com applications can be accessed by using RESTful APIs.
    Environments Setup

    In a new BI project it is never safe to extract data directly from a live (production) environment. Thus the first step needed in order to test the BICS-Salesforce.com connection is to obtain a Salesforce.com environment that we can use without the risk of slowing it down, or even worse crashing it and impacting users. A free demo Salesforce.com platform can be obtained by subscribing to the developer program and can be customised as needed. After requesting the Salesforce.com demo environment, the creation of a security token is needed; the token is associated with the user invoking the RESTful API. To generate the token: log in to Salesforce.com with the specific user -> click on the username -> My Settings -> Reset My Security Token. The new token is then sent to the user's email.

    Salesforce Reset token Procedure

    The third step in order to access Salesforce.com data with REST APIs is to define a Connected App. The creation of a Connected App generates a Consumer Key and a Consumer Secret that are going to be used later during the REST login calls. Once the Connected App is defined, the last bit missing is to populate the Salesforce.com instance, since it's empty by default; it can be populated manually or by writing population scripts based on the create REST API statements explained here. An Oracle BICS instance is needed in order to analyse the Salesforce.com data; all the info regarding Oracle BICS and how to activate an instance can be found at this link. Having covered all the basics, it's time to start analysing how the Salesforce.com data can be extracted.

    Salesforce.com REST APIs

    In the following sections we are going to analyse some of the REST APIs used in our process. The Salesforce.com REST APIs will be called with cURL commands; cURL is a command line tool and library for transferring data with URL syntax.

    Salesforce.com Authentication

    The first step in order to download the data from any Salesforce.com instance is to execute the authentication and retrieve the access token; the access token will then be used in all the following REST calls. There are various authentication mechanisms that can be used against Salesforce.com; for the aim of the beta program we used the one called "Username-Password OAuth Authentication Flow" that is described in detail here and in the image below.

    Salesforce.com User Password OAuth

    The REST API authentication command is

    curl -v https://login.salesforce.com/services/oauth2/token -d "grant_type=password" \
        -d "client_id=CLIENT_ID" -d "client_secret=CLIENT_SECRET" \
        -d "username=USERNAME" -d "password=PASSWORD_TOKEN"

    where the following parameters must be assigned:

    • CLIENT_ID:  The client ID generated during the creation of the Connected App (step described in Basic Setup)
    • CLIENT_SECRET: The client Secret generated during the creation of the Connected App (step described in Basic Setup)
    • USERNAME: The username you want to use in order to access Salesforce.com data
    • PASSWORD_TOKEN: The concatenation of password and user security token (e.g. if the password is ABC and the security token is 123 then PASSWORD_TOKEN is ABC123)

    The response (if all the parameters are correct) should be like the following

    {  
       "id":"https://login.salesforce.com/id/00Dx0000000BV7z/005x00000012Q9P",
       "issued_at":"1278448832702",
       "instance_url":"https://na1.salesforce.com",
       "signature":"0CmxinZir53Yex7nE0TD+zMpvIWYGb/bdJh6XfOH6EQ=",
       "access_token":"00Dx0000000BV7z!AR8AQAxo9UfVkh8AlV0Gomt9Czx9LjHnSSpwBMmbRcgKFmxOtvxjTrKW19ye6PE3Ds1eQz3z8jr3W7_VbWmEu4Q8TVGSTHxs"
    }

    where the most interesting parameters are:

    • instance_url: defines the salesforce instance to use.
    • access_token: defines the time-limited security token to use for all the following calls.
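    The same flow can also be scripted outside of cURL. The sketch below uses Python with the requests library (an assumption; any HTTP client works), with CLIENT_ID, CLIENT_SECRET, USERNAME and PASSWORD_TOKEN being the same placeholders described above.

    import requests

    # Username-Password OAuth flow: exchange credentials for an access token
    resp = requests.post(
        "https://login.salesforce.com/services/oauth2/token",
        data={
            "grant_type": "password",
            "client_id": "CLIENT_ID",
            "client_secret": "CLIENT_SECRET",
            "username": "USERNAME",
            "password": "PASSWORD_TOKEN",
        },
    )
    resp.raise_for_status()
    auth = resp.json()

    instance_url = auth["instance_url"]    # base URL for all following REST calls
    access_token = auth["access_token"]    # time-limited token for the Authorization header

    # Every subsequent call passes the token as a Bearer header, for example:
    headers = {"Authorization": "Bearer " + access_token}
    objects = requests.get(instance_url + "/services/data/v29.0/sobjects/",
                           headers=headers).json()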
    Salesforce.com List of Objects and Retrieve Metadata for an Object

    Two of the Salesforce.com REST APIs were very useful when trying to build a general ETL that could be applied to any customised Salesforce.com instance:

    • Get list of Objects: retrieves the list of all objects available in Salesforce.com (custom or not) with some additional information for each object.
    • Retrieve Metadata for an Object: retrieves the list of columns for the selected object with related data types and additional information.

    The Get list of Objects call is the following

    curl INSTANCE/services/data/VERSION/sobjects/ -H "Authorization: Bearer TOKEN"

    where

    • INSTANCE is the instance_url parameter retrieved from the authentication call.
    • TOKEN is the access_token parameter retrieved from the authentication call.
    • VERSION is the REST API version used; in our beta test we used v29.0.

    The response should be similar to the following, in which you can see, for each object (identified by the "name" field), all the metadata available.

    {  
       "encoding":"UTF-8",
       "maxBatchSize":200,
       "sobjects":[  
          {  
             "name":"Account",
             "label":"Account",
             "keyPrefix":"001",
             "labelPlural":"Accounts",
             "custom":false,
             "layoutable":true,
             "activateable":false,
             "urls":{  
                "sobject":"/services/data/v26.0/sobjects/Account",
                "describe":"/services/data/v26.0/sobjects/Account/describe",
                "rowTemplate":"/services/data/v26.0/sobjects/Account/{ID}"
             },
             "searchable":true,
             "updateable":true,
             "createable":true,
             "deprecatedAndHidden":false,
             "customSetting":false,
             "deletable":true,
             "feedEnabled":true,
             "mergeable":true,
             "queryable":true,
             "replicateable":true,
             "retrieveable":true,
             "undeletable":true,
             "triggerable":true
          },
          ...
       ]
    }

    The Retrieve Metadata for an Object call can be made for each object listed in the Get list of Objects response; the code is the following:

    curl INSTANCE/services/data/VERSION/sobjects/OBJECT/ \
        -H "Authorization: Bearer TOKEN"

    where:

    • INSTANCE, TOKEN and VERSION are the same parameters defined for the Get list of Objects call
    • OBJECT is the Salesforce.com object to retrieve

    The response for the Account object is similar to the following

    {  
       "objectDescribe":{  
          "name":"Account",
          "updateable":true,
          "label":"Account",
          "keyPrefix":"001",
          ...  
              
            "replicateable":true,
          "retrieveable":true,
          "undeletable":true,
          "triggerable":true
       },
       "recentItems":[  
          {  
             "attributes":{  
                "type":"Account",
                "url":"/services/data/v20.0/sobjects/Account/001D000000INjVeIAL"
             },
             "Id":"001D000000INjVeIAL",
             "Name":"asdasdasd"
          },
          ...
       ]
    }

    Interesting parameters for each column are:

    • name: is the columns name.
    • retrievable: if true it means that the particular column can be part of a query
    • soapType and length or byteLength: these fields provide the field type information and the length of the field.

    With the three REST APIs analysed we could recreate all the objects (custom or not) in any Salesforce.com (or Force.com) instance as tables in the Oracle Database Schema Service. In the next section we will see how to extract the data from the list of objects by using the Salesforce.com query capabilities.

    Salesforce.com query

    Salesforce provides two methods of querying the objects: including or excluding deleted objects. The difference in the code is minimal: the first uses the /queryAll suffix while the second uses /query. A Salesforce.com REST API query call is:

    curl INSTANCE/services/data/VERSION/query/?q=SELECT+LIST_OF_COLUMNS+from+OBJECT \
        -H "Authorization: Bearer TOKEN"

    where:

    • INSTANCE and TOKEN parameters are the ones retrieved from the Authentication call
    • VERSION is the Salesforce.com REST API version used
    • OBJECT is the Salesforce.com object to query
    • LIST_OF_COLUMNS is the comma delimited list of columns to retrieve

    The response of a query “select+Name+from+Account” is:

    {  
       "done":true,
       "totalSize":14,
       "records":[  
          {  
             "attributes":{  
                "type":"Account",
                "url":"/services/data/v20.0/sobjects/Account/001D000000IRFmaIAH"
             },
             "Name":"Test 1"
          },
          {  
             "attributes":{  
                "type":"Account",
                "url":"/services/data/v20.0/sobjects/Account/001D000000IomazIAB"
             },
             "Name":"Test 2"
          },
          ...
       ]
    }

    In the JSON result you can find:

    • totalSize: the number of records retrieved
    • Name: for each record the name of the Account

    If the resultSet is too big, Salesforce.com will start paging results. If the initial query returns only part of the results, the end of the response will contain a field called nextRecordsUrl. For example:

    "nextRecordsUrl" : "/services/data/v20.0/query/01gD0000002HU6KIAW-2000"

    In order to get the next page of data, a call like the following is needed, passing the access token in the Authorization header.

    curl INSTANCE/services/data/v20.0/query/01gD0000002HU6KIAW-2000 \
        -H "Authorization: Bearer TOKEN"

    Salesforce.com limits

    It’s important to be aware of the limit set by Salesforce.com on the number of REST API calls per day. In order to check at any time the remaining number of calls available execute the following code.

    curl INSTANCE/services/data/VERSION/limits/ -H "Authorization: Bearer TOKEN" -H "X-PrettyPrint:1"

    The DailyApiRequest -> Remaining field contained in the response shows the amount of calls still available.

    {
        "DailyApiRequests":
        {
            "Remaining":"4980",
            "Max":"5000"
        },
        "DailyAsyncApexExecutions":
        {
            "Remaining":"250000",
            "Max":"250000"
        },
        ...
    }

    Resultset output format

    Salesforce.com's default output format is JSON, which is a common format for most cloud applications. However, there is the opportunity to also retrieve the result in XML format by setting the HTTP ACCEPT header to "application/xml". This first release of the tool is based on Oracle DB 11g, which natively supports XML but not JSON. For this reason, during our beta program with Oracle BI Cloud Service we decided to use the XML output. Once BICS is bundled with Oracle DB 12c, which supports both JSON and XML natively, the default resultset output format could be kept as JSON.

    In this post we defined the steps required to set up a Salesforce.com demo environment that can be accessed with REST APIs, and looked at the most interesting Salesforce.com REST APIs that we used to extract the data in order to analyse it with the Oracle BI Cloud Service platform. In the next post we will look in more detail at the BICS ApEx part by analysing how the Salesforce.com REST APIs can be called from ApEx.

    Categories: BI & Warehousing

    Science needs to explain this?

    FeuerThoughts - Sun, 2014-11-02 08:57
    Christopher Nolan of Dark Knight fame releasing new sci-fi movie: Interstellar.

    In a Chicago Tribune interview, he says:
    I could be wrong, but science needs to cross a threshold and explain why a monkey typing infinitely would never type the works of Shakespeare.
    Well, I could be wrong, but maybe Nolan is a bit of an idiot when it comes to science.
    Please, Mr. Nolan, tell me which scientists make this claim?
    I guess he read somewhere about infinity and how incredibly awesome and big and never-ending it is, and so eventually anything would be done by anybody or anything and so even monkeys would "eventually" write Shakespeare and and and....
    Produce a movie called Interstellar with Matthew McConaughey. 
    In fact, maybe Chris Nolan is actually a monkey who crossed over from that obelisk in 2001, and got super smart and so a monkey already has produced a movie called Interstellar.
    Damn, that is just so cool and so weird and it's like, that's never going to happen, man, no way.
    So scientists had better figure out WHY that is not going to happen when they obviously really believe that it WILL happen (go, monkey, go!).
    And to do that, they are going to have a cross a threshold, 'cause clearly science has hit its limit here. Just like with souls. Science can't explain souls, so I guess scientists had better cross over - maybe into a parallel universe -
    Because really what could be cooler than parallel universes?



    Categories: Development

    Notes on predictive modeling, November 2, 2014

    DBMS2 - Sun, 2014-11-02 05:49

    Following up on my notes on predictive modeling post from three weeks ago, I’d like to tackle some areas of recurring confusion.

    Why are we modeling?

    Ultimately, there are two reasons to model some aspect of your business:

    • You generally want insight and understanding.
      • This is analogous to why you might want to do business intelligence.
      • It commonly includes a search for causality, whether or not “root cause analysis” is exactly the right phrase to describe the process.
    • You want to do calculations from the model to drive wholly or partially automated decisions.
      • A big set of examples can be found in website recommenders and personalizers.
      • Another big set of examples can be found in marketing campaigns.
      • For an example of partial automation, consider a tool that advises call center workers.

    How precise do models need to be?

    Use cases vary greatly with respect to the importance of modeling precision. If you’re doing an expensive mass mailing, 1% additional accuracy is a big deal. But if you’re doing root cause analysis, a 10% error may be immaterial.

    Who is doing the work?

    It is traditional to have a modeling department, of “data scientists” or SAS programmers as the case may be. While it seems cool to put predictive modeling straight in the hands of business users — some business users, at least — it’s rare for them to use predictive modeling tools more sophisticated than Excel. For example, KXEN never did all that well.

    That said, I support the idea of putting more modeling in the hands of business users. Just be aware that doing so is still a small business at this time.

    “Operationalizing” predictive models

    The topic of “operationalizing” models arises often, and it turns out to be rather complex. Usually, to operationalize a model, you need:

    • A program that generates scores, based on the model.
    • A program that consumes scores (for example a recommender or fraud alerter).

    In some cases, the two programs might be viewed as different modules of the same system.

    While it is not actually necessary for there to be a numerical score — or scores — in the process, it seems pretty common that there are such. Certainly the score calculations can create a boundary for loose-coupling between model evaluation and the rest of the system.

    That said:

    • Sometimes the scoring is done on the fly. In that case, the two programs mentioned above are closely integrated.
    • Sometimes the scoring is done in batch. In that case, loose coupling seems likely. Often, there will be ETL (Extract/Transform/Load) to make the scores available to the program that will eventually use them.
    • PMML (Predictive Model Markup Language) is good for some kinds of scoring but not others. (I’m not clear on the details.)

    In any case, operationalizing a predictive model can or should include:

    • A process for creating the model.
    • A process for validating and refreshing the model.
    • A flow of derived data.
    • A program that consumes the model’s outputs.

    Traditional IT considerations, such as testing and versioning, apply.
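    To make the batch flavour of this concrete, here is a deliberately tiny sketch of the two programs and the loose coupling between them. It is Python with a hand-written stand-in for a model; the file name, field names, coefficients and threshold are all illustrative, not anyone's real scoring logic.

    import csv
    import math

    def score(features):
        # Stand-in for model evaluation: a logistic function of two made-up features.
        z = 0.8 * features["recency"] - 0.3 * features["tenure"] + 0.1
        return 1.0 / (1.0 + math.exp(-z))

    def generate_scores(customers, path="scores.csv"):
        # Program 1: evaluate the model in batch and publish the scores.
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["customer_id", "score"])
            for c in customers:
                writer.writerow([c["id"], round(score(c), 4)])

    def consume_scores(path="scores.csv", threshold=0.7):
        # Program 2: read the published scores and drive a decision,
        # for example which customers go into a marketing campaign.
        with open(path, newline="") as f:
            return [row["customer_id"] for row in csv.DictReader(f)
                    if float(row["score"]) >= threshold]

    customers = [{"id": "C1", "recency": 2.0, "tenure": 5.0},
                 {"id": "C2", "recency": 9.0, "tenure": 1.0}]
    generate_scores(customers)
    print(consume_scores())   # ['C2'] with these made-up numbers

    In a real deployment the CSV hand-off would more likely be a database table fed by ETL, and the scoring function would come from a modeling tool or a PMML export rather than a hard-coded formula; the point is only that the score file acts as a loose-coupling boundary between the two programs.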

    What do we call it anyway?

    The term “predictive analytics” was coined by SPSS. It basically won. However, some folks — including whoever named PMML — like the term “predictive modeling” better. I’m in that camp, since “modeling” seems to be a somewhat more accurate description of what’s going on, but I’m fine with either phrase.

    Some marketers now use the term “prescriptive analytics”. In theory that makes sense, since:

    • “Prescriptive” can be taken to mean “operationalized predictive”, saving precious syllables and pixels.
    • What's going on is usually more directly about prescription than prediction anyway.

    Edit: Ack! I left the final paragraph out of the post, namely:

    In practice, however, the term “prescriptive analytics” is a strong indicator of marketing nonsense. Predictive modeling has long been used to — as it were — prescribe business decisions; marketers who use the term “prescriptive analytics” are usually trying to deny that very obvious fact.

    Categories: Other

    Analytics for lots and lots of business users

    DBMS2 - Sun, 2014-11-02 05:45

    A common marketing theme in the 2010s decade has been to claim that you make analytics available to many business users, as opposed to your competition, who only make analytics available to (pick one):

    • Specialists (with “PhD”s).
    • Fewer business users (a thinner part of the horizontally segmented pyramid — perhaps inverted — on your marketing slide, not to be confused with the horizontally segmented pyramids — perhaps inverted — on your competition’s marketing slides).

    Versions of this claim were also common in the 1970s, 1980s, 1990s and 2000s.

    Some of that is real. In particular:

    • Early adoption of analytic technology is often in line-of-business departments.
    • Business users on average really do get more numerate over time, my three favorite examples of that being:
      • Statistics is taught much more in business schools than it used to be.
      • Statistics is taught much more in high schools than it used to be.
      • Many people use Excel.

    Even so, for most analytic tools, power users tend to be:

    • People with titles or roles like “business analyst”.
    • More junior folks pulling things together for their bosses.
    • A hardcore minority who fall into neither of the first two categories.

    Asserting otherwise is rarely more than marketing hype.

    Related link

    Categories: Other

    Does Increasing An Oracle Background Process OS Priority Improve Performance?

    Does increasing an Oracle Database background process operating system priority improve performance? As you might expect, the answer is, "It depends."

    In this posting I will explain the results of an experiment where I increase the Oracle Database 12c log writer background processes operating system priority.

    In my experiment I created a clear CPU bottleneck and the top wait event was log file parallel write. I gathered some data. Next I increased all the log writer background process priorities. Again, I gathered some data and then I analyzed the two data sets. The results were disappointing, not surprising, but a fundamental rule in performance tuning was demonstrated and reinforced.

    You can download the "analysis pack", which contains the raw experimental data, histogram, statistical R results and the R statistical package commands, HERE. You can download the total time delta reporting script (ttpctx.sql) I show below, which is contained within my OraPub System Monitor (OSM) Toolkit, HERE.

    Before I could gather some data, I needed to create the appropriate system load, the data gather scripts and the data analysis scripts. Here's a quick overview of each.

    Increasing The LGWR Processes Priority
    If you are not familiar with changing Oracle Database background OS process priority, I blogged about how to do this HERE.

    My experiment consisted of creating two situations and statistically comparing them to see if increasing the LGWR background process OS priority increased performance. The only difference in the "two situations" was the instance parameter, _high_priority_processes. For the "normal priority" part of the experiment, the default "LMS*|VKTM" was used. For the "high priority" part of the experiment the parameter was changed to "LMS*|VKTM|LG*". The "LG*" caused the increase in the Linux OS priority of all log writer processes from the default 19 to 41.

    Ready for some version specifics? Oracle continues to make OS priority related instance parameter changes... even within 12c releases. Since this experiment was done with Oracle Database version 12.1.0.1.0 the parameter default was "LMS*|VKTM" not "LMS*" as with version 12.1.0.2.0. Also, in 12.1.0.2.0 VKTM is placed into a new parameter, _highest_priority_processes.

    Generating The DML With CPU Bottleneck Load
    To generate the DML workload, I used my OP Load Generator. You can download it HERE. It's quick, easy and I'm familiar with it. I kept increasing the number of DML processes until the CPU utilization was pegged at 100%. While the database server was clearly out of CPU power, the log writer background processes were rarely seen as the top CPU consuming process. I used the "top" program to monitor the process activity. The top CPU consuming processes were almost always the Oracle server/foreground/shadow processes. As I'll explain in the Results section below, this has significant ramifications on the results.

    Oracle Time Based Analysis Summary
    The data collection script was similar to the one I used back in 2012 when collecting data regarding Oracle's commit write facility. Essentially, I collected multiple three minute samples of the delta elapsed time, user commits, total non-idle wait time and CPU consumption. In both the normal and high priority runs, around 99% of the database time was CPU consumption and 1% of the time Oracle non-idle wait time. Also the top wait event (~70%) was log file parallel write (display name: log file redo write).

    If you want to master an Oracle Time Based Analysis, check out my online seminar, Tuning Oracle Using An AWR Report. It will teach you how to use an AWR report to optimize Oracle performance so users will feel the difference.

    The OraPub System Monitor Toolkit script ttpctx.sql report below was taken during the "normal priority" log writer load.

    SQL> @ttpctx

    Database: prod35 16-MAY-14 06:07pm
    Report: ttpctx.sql OSM by OraPub, Inc. Page 1
    Total Time Activity (39 sec interval)

    Avg Time Time Wait
    Time Component % TT % WT Waited (ms) (sec) Count(k)
    ------------------------------------- ------- ------- ----------- ----------- --------
    CPU consumption: Oracle SP + BG procs 98.91 0.00 0.000 238.716 0
    log file redo write 0.73 67.56 8.082 1.770 0
    control file parallel write 0.18 16.41 33.077 0.430 0
    target log write size 0.05 4.20 6.111 0.110 0
    oracle thread bootstrap 0.03 3.05 40.000 0.080 0
    os thread creation 0.02 1.53 20.000 0.040 0
    commit: log file sync 0.01 0.76 10.000 0.020 0
    enq: CR - block range reuse ckpt 0.01 0.76 20.000 0.020 0
    Disk file operations I/O 0.00 0.00 0.000 0.000 0

    The OSM report below was taken during the "high priority" log writer load.

    SQL> @ttpctx

    Database: prod35 16-MAY-14 09:25pm
    Report: ttpctx.sql OSM by OraPub, Inc. Page 1
    Total Time Activity (41 sec interval)

    Avg Time Time Wait
    Time Component % TT % WT Waited (ms) (sec) Count(k)
    ------------------------------------- ------- ------- ----------- ----------- --------
    CPU consumption: Oracle SP + BG procs 98.92 0.00 0.000 238.733 0
    log file redo write 0.83 77.01 8.272 2.010 0
    control file parallel write 0.08 7.28 14.615 0.190 0
    target log write size 0.05 4.98 5.909 0.130 0
    oracle thread bootstrap 0.03 3.07 40.000 0.080 0
    os thread creation 0.02 1.92 25.000 0.050 0
    commit: log file sync 0.01 0.77 10.000 0.020 0
    enq: CR - block range reuse ckpt 0.01 0.77 20.000 0.020 0
    enq: RO - fast object reuse 0.00 0.38 10.000 0.010 0


    Data Collection
    For the normal priority load, 27 three-minute samples were collected. For the high priority situation, 30 three-minute samples were collected. (I forgot why only 27 samples were collected for the normal priority.)  I collected the elapsed time, total non-idle wait time, total CPU consumption (v$sys_time_model: db_cpu + background cpu time) and total user commits.

    In this experiment more user commits processed per second means better performance.


    Experimental Results
    I used the free statistics package "R" (www.r-project.org) to analyze the data. I demonstrate how to get, install and use "R" in my online video seminar, Using Skewed Data To Your Advantage.

    With the normal log writer process priority, an average of 984.5 commits/sec and a median of 983.0 commits/sec occurred. With the LG* high process priority, an average of 993.6 commits/sec and a median of 991.0 commits/sec occurred. While the "high priority" situation was able to process more commits per second, is this statistically significant?

    The red "smoothed" histogram is the normal priority situation and the blue smoothed histogram is when the log writers were set to the higher priority. The more separated the two histograms the more "different" the sample sets, the more likely there is a statistically significant difference and the more likely a user would feel the difference. Looking at the above histograms plot, there does not appear to be a real difference. But let's do a proper significance test!

    Because both sample sets are normally distributed (details are in the Analysis Pack), I could use a simple t-test. R produced a p-value of 0.04451. To be statistically "different" I want the p-value to be less than 0.05 and it is. What does this mean?
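    For readers who are not R users, the same kind of two-sample significance test can be sketched with Python's scipy package. Note that the author's analysis was done in R, and the short commit-rate arrays below are placeholders rather than the actual samples, which are in the analysis pack.

    from scipy import stats

    # Placeholder samples of commits/sec; the real data is in the analysis pack
    normal_priority = [983.0, 981.5, 986.2, 984.9]   # default LGWR priority
    high_priority   = [991.0, 994.3, 992.8, 995.1]   # LG* high priority

    t_stat, p_value = stats.ttest_ind(normal_priority, high_priority)
    print(p_value)   # p < 0.05 suggests a statistically significant difference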

    While statistically and numerically the commit rates are different, I wouldn't expect any special performance tuning award! In fact, the hassles with cycling a production instance and setting underscore/hidden parameters would make it very unlikely I would increase the OS priority of the log writer background processes. I want to see a big performance difference.

    To Summarize... What I Learned... Again
    This situation is a perfect example of focusing on the wrong thing! While there is a clear operating system CPU bottleneck and the top wait event is about redo, the log writers are not suffering from a want/need of CPU resources. I suspect the server processes want more CPU resources, but they are NOT the processes whose OS priority we increased.

    If the log writers were suffering from a lack of CPU resources and fighting for CPU resources, I would expect to see them consuming CPU resources along with the Oracle server processes. And I would definitely expect to see them near the top of the "top" process monitor... especially when their priority has been increased!

    Because of this "misguided" tuning effort, this experiment does not build a case for or against changing the log writer priority. What it reinforces is that, in our quest to optimize performance, we must make sure we focus on the right thing.

    As a side note, this is a great statistical analysis example for two reasons. First, our sample sets look similar, but statistically they are not. Second, while they are statistically different, the performance impact will not be very different. And my guess is the users won't feel a thing... except frustration and anger.

    To Super Summarize
    When considering increasing a background process's operating system priority, make sure the process needs CPU and is not able to get it. In this situation, the DBA could have been misled by the CPU bottleneck. But upon closer inspection of the log writers from an operating system perspective, and knowing that the wait event "log file parallel write" is probably more about IO than CPU (from an Oracle perspective), it is unlikely that increasing the log writer processes' OS priority would help increase the commits per second.

    Thanks for reading!

    Craig.












    Categories: DBA Blogs

    StatsPack and AWR Reports -- Bits and Pieces -- 2

    Hemant K Chitale - Sat, 2014-11-01 08:52
    This is the second post in a series on reading StatsPack and AWR reports.
    (The first is available here)


    Comparing Reports :

    Here are two 9.2 StatsPack extracts from one database:

    Extract A  : 9.2 StatsPack
    Load Profile
    ~~~~~~~~~~~~ Per Second Per Transaction
    --------------- ---------------
    Redo size: 56,031.63 3,084.68
    Logical reads: 68,286.24 3,759.32
    Block changes: 314.88 17.33
    Physical reads: 842.92 46.40
    Physical writes: 134.76 7.42
    User calls: 271.32 14.94
    Parses: 146.46 8.06
    Hard parses: 7.37 0.41
    Sorts: 93.83 5.17
    Logons: 0.33 0.02
    Executes: 296.70 16.33
    Transactions: 18.16

    Extract B : 9.2 StatsPack
    Load Profile
    ~~~~~~~~~~~~ Per Second Per Transaction
    --------------- ---------------
    Redo size: 89,615.76 2,960.48
    Logical reads: 210,302.81 6,947.42
    Block changes: 541.83 17.90
    Physical reads: 1,465.04 48.40
    Physical writes: 161.68 5.34
    User calls: 213.82 7.06
    Parses: 125.28 4.14
    Hard parses: 6.13 0.20
    Sorts: 104.31 3.45
    Logons: 0.35 0.01
    Executes: 664.81 21.96
    Transactions: 30.27

    Extract B shows a higher volume of activity -- Transactions per second, Redo Size per second, Logical Reads per second.
    Actually, these are both from the same database but for two different time-windows on different dates.  The fact that the time-windows aren't the same makes the comparisons difficult.  Extract A is for an 11-hour window on day 1 and Extract B is for a 6-hour window on day 6.  The two windows aren't strictly comparable.  You have to be very careful when comparing StatsPack or AWR reports, even when they are from the same database, if they don't cover comparable time-windows with similar loads.  The first report includes non-peak hours, the second is for peak hours only.  The level of activity averaged out over 11 hours that include non-peak hours isn't strictly comparable with a 6-hour peak window.

    Extract C : 10.2 AWR
    Load Profile
    ~~~~~~~~~~~~ Per Second Per Transaction
    --------------- ---------------
    Redo size: 520,776.15 50,948.36
    Logical reads: 353,525.71 34,585.98
    Block changes: 1,854.93 181.47
    Physical reads: 14,285.23 1,397.55
    Physical writes: 295.84 28.94
    User calls: 1,265.14 123.77
    Parses: 822.64 80.48
    Hard parses: 15.40 1.51
    Sorts: 168.09 16.44
    Logons: 0.16 0.02
    Executes: 1,040.31 101.78
    Transactions: 10.22

    Extract D : 10.2 AWR
    Load Profile
    ~~~~~~~~~~~~ Per Second Per Transaction
    --------------- ---------------
    Redo size: 517,862.01 54,681.39
    Logical reads: 288,341.09 30,446.12
    Block changes: 1,879.27 198.43
    Physical reads: 12,820.96 1,353.77
    Physical writes: 323.90 34.20
    User calls: 1,115.78 117.82
    Parses: 719.39 75.96
    Hard parses: 17.28 1.82
    Sorts: 95.74 10.11
    Logons: 0.15 0.02
    Executes: 935.33 98.76
    Transactions: 9.47
    Extracts C and D are from the same database for the same time-window on 2 consecutive workdays.  They are comparable.  A difference of 10% for some statistic may not be significant.  Understand that every business, every activity, every system will have some "normal" fluctuations day-to-day.
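    As a rough illustration of that kind of day-to-day comparison, a small script can flag which Load Profile metrics moved by more than a chosen tolerance. This is only a sketch in Python; the 10% tolerance is arbitrary and just a few per-second metrics from Extracts C and D are shown.

    # Per-second Load Profile values taken from Extracts C and D above
    extract_c = {"Redo size": 520776.15, "Logical reads": 353525.71,
                 "Physical reads": 14285.23, "Transactions": 10.22}
    extract_d = {"Redo size": 517862.01, "Logical reads": 288341.09,
                 "Physical reads": 12820.96, "Transactions": 9.47}

    tolerance = 0.10   # treat anything within ~10% as normal fluctuation
    for metric, c_value in extract_c.items():
        change = (extract_d[metric] - c_value) / c_value
        flag = "INVESTIGATE" if abs(change) > tolerance else "ok"
        print(f"{metric:15} {change:+7.1%}  {flag}")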

    .
    .
    .



    Categories: DBA Blogs