
Feed aggregator

Cardinality Feedback

Jonathan Lewis - Wed, 2014-11-05 12:43

A fairly important question, and a little surprise, appeared on Oracle-L a couple of days ago. Running 11.2.0.3, a query completed quickly on the first execution, then ran very slowly on the second execution because Oracle had used cardinality feedback to change the plan. This shouldn’t really be entirely surprising – if you read all the notes that Oracle has published about cardinality feedback – but it’s certainly a little counter-intuitive.

Of course there are several known bugs related to cardinality feedback that could cause this anomaly to appear – a common complaint seems to relate to views on the right-hand (inner table) side of nested loop joins, and cardinality feedback being used on a table inside the view; but there’s an inherent limitation to cardinality feedback that makes it fairly easy to produce an example of a query doing more work on the second execution.

The limitation is that cardinality feedback generally can’t be used (sensibly) on all the tables where better information is needed. This blog describes the simplest example I can come up with to demonstrate the point. Inevitably it’s a little contrived, but it captures the type of guesswork and mis-estimation that can make the problem appear in real data sets. Here’s the query I’m going to use:


select
	t1.n1, t1.n2, t2.n1, t2.n2
from
	t1, t2
where
	t1.n1 = 0
and	t1.n2 = 1000
and	t2.id = t1.id
and	t2.n1 = 0
and	t2.n2 = 400
;

You’ll notice that I’ve got two predicates on both tables so, in the absence of “column-group” extended stats the optimizer will enable cardinality feedback as the query runs to check whether or not its “independent columns” treatment of the predicates gives a suitably accurate estimate of cardinality and a reasonable execution plan. If the estimates are bad enough the optimizer will use information it has gathered as the query ran as an input to re-optimising the query on the next execution.
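As an aside, if you wanted to give the optimizer the combined-column information up front, this is roughly what creating the column-group extended stats would look like (a minimal sketch, using the t1/t2 tables defined below):

select dbms_stats.create_extended_stats(user, 'T1', '(N1,N2)') from dual;
select dbms_stats.create_extended_stats(user, 'T2', '(N1,N2)') from dual;

-- re-gather so the new (hidden) column-group columns acquire statistics
exec dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size 1')
exec dbms_stats.gather_table_stats(user, 'T2', method_opt => 'for all columns size 1')

Bear in mind, though, that this demo deliberately pushes the data out of sync with the statistics after gathering, so a column group on its own wouldn’t rescue it.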

So here’s the trick. I’m going to set up the data so that there seem to be only two sensible plans: (a) full scan of t1, with nested loop unique index access to t2; (b) full scan of t2, with nested loop unique index access to t1. But I’m going to make sure that the optimizer thinks that (a) is more efficient than (b) by making the stats look as if (on average) the predicates on t1 should return 100 rows while the predicates on t2 return 200 rows.

On the other hand I’ve set the data up so that (for this specific set of values) t1 returns 2,000 rows, which means Oracle will decide that its estimate was so far out that it will re-optimize with 2,000 as the estimated single table access cardinality for t1 – and that means it will decide to do the nested loop from t2 to t1. But what the optimizer doesn’t know (and hasn’t been able to find out by running the first plan) is that with this set of predicates t2 will return 20,000 rows to drive the nested loop into t1 – and the new execution plan will do more buffer gets and use more CPU (and time) than the old plan. Since cardinality feedback is applied only once, the optimizer won’t be able to take advantage of the second execution to change the plan again, or even to switch back to the first plan.

Here’s the setup so you can test the behaviour for yourselves:


create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	rownum			id,
	mod(rownum,2)		n1,
	mod(rownum,2000)	n2,	-- 200 rows for each value on average
	rpad('x',100)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 4e5
;

alter table t1 add constraint t1_pk primary key(id);

create table t2
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	rownum			id,
	mod(rownum,2)		n1,
	2 * mod(rownum,1000)	n2,	-- 400 rows for each value on average, same range as t1
	rpad('x',100)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 4e5
;

alter table t2 add constraint t2_pk primary key(id);

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T2',
		method_opt	 => 'for all columns size 1'
	);
end;
/

--
-- Now update both tables to put the data out of sync with the statistics
-- We need a skewed value in t1 that is out by a factor of at least 8 (triggers use of CF)
-- We need a skewed value in t2 that is so bad that the second plan is more resource intensive than the first
--

update t1 set n2 = 1000 where n2 between 1001 and 1019;
update t2 set n2 =  400 where n2 between 402 and 598;
commit;

Here are the execution plans for the first and second executions (with rowsource execution statistics enabled, and the “allstats last” option used in a call to dbms_xplan.display_cursor()).
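In case you want to reproduce the output, a minimal sketch of the capture (leaving the query text unchanged so that cardinality feedback still sees the same cursor) is:

alter session set statistics_level = all;

-- run the query (twice, to see the effect of cardinality feedback), then:

select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));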


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |  1049 (100)|      0 |00:00:00.36 |   11000 |   6588 |
|   1 |  NESTED LOOPS                |       |      1 |    100 |  1049   (3)|      0 |00:00:00.36 |   11000 |   6588 |
|   2 |   NESTED LOOPS               |       |      1 |    100 |  1049   (3)|   2000 |00:00:00.35 |    9000 |   6552 |
|*  3 |    TABLE ACCESS FULL         | T1    |      1 |    100 |   849   (4)|   2000 |00:00:00.30 |    6554 |   6551 |
|*  4 |    INDEX UNIQUE SCAN         | T2_PK |   2000 |      1 |     1   (0)|   2000 |00:00:00.02 |    2446 |      1 |
|*  5 |   TABLE ACCESS BY INDEX ROWID| T2    |   2000 |      1 |     2   (0)|      0 |00:00:00.01 |    2000 |     36 |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("T1"."N2"=1000 AND "T1"."N1"=0))
   4 - access("T2"."ID"="T1"."ID")
   5 - filter(("T2"."N2"=400 AND "T2"."N1"=0))

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |  1249 (100)|      0 |00:00:00.66 |   32268 |   1246 |
|   1 |  NESTED LOOPS                |       |      1 |    200 |  1249   (3)|      0 |00:00:00.66 |   32268 |   1246 |
|   2 |   NESTED LOOPS               |       |      1 |    200 |  1249   (3)|  20000 |00:00:00.56 |   12268 |    687 |
|*  3 |    TABLE ACCESS FULL         | T2    |      1 |    200 |   849   (4)|  20000 |00:00:00.12 |    6559 |    686 |
|*  4 |    INDEX UNIQUE SCAN         | T1_PK |  20000 |      1 |     1   (0)|  20000 |00:00:00.19 |    5709 |      1 |
|*  5 |   TABLE ACCESS BY INDEX ROWID| T1    |  20000 |      1 |     2   (0)|      0 |00:00:00.15 |   20000 |    559 |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("T2"."N2"=400 AND "T2"."N1"=0))
   4 - access("T2"."ID"="T1"."ID")
   5 - filter(("T1"."N2"=1000 AND "T1"."N1"=0))

Note
-----
   - cardinality feedback used for this statement

The second plan does fewer reads because of the buffering side effects from the first plan – but that’s not what the optimizer is looking at. The key feature is that the first plan predicts 100 rows for t1, with 100 starts for the index probe, but discovers 2,000 rows and does 2,000 probes. Applying cardinality feedback the optimizer decides that fetching 200 rows from t2 and probing t1 200 times will be lower cost than running the join the other way round with the 2,000 rows it now knows it will get – but at runtime Oracle actually gets 20,000 rows, does three times as many buffer gets, and spends twice as much time as it did on the first plan.

Hinting

Oracle hasn’t been able to learn (in time) that t2 will supply 20,000 rows – but if you knew this would happen you could use the cardinality() hint to tell the optimizer the truth about both tables with /*+ cardinality(t1 2000) cardinality(t2 20000) */.
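For clarity, here’s the hinted version of the query; it’s simply the original query with the hint text above added:

select
	/*+ cardinality(t1 2000) cardinality(t2 20000) */
	t1.n1, t1.n2, t2.n1, t2.n2
from
	t1, t2
where
	t1.n1 = 0
and	t1.n2 = 1000
and	t2.id = t1.id
and	t2.n1 = 0
and	t2.n2 = 400
;

With both hints in place this is the plan you would get: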

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  1698 (100)|      0 |00:00:00.06 |   13109 |  13105 |       |       |          |
|*  1 |  HASH JOIN         |      |      1 |   2000 |  1698   (4)|      0 |00:00:00.06 |   13109 |  13105 |  1696K|  1696K| 1647K (0)|
|*  2 |   TABLE ACCESS FULL| T1   |      1 |   2000 |   849   (4)|   2000 |00:00:00.05 |    6554 |   6552 |       |       |          |
|*  3 |   TABLE ACCESS FULL| T2   |      1 |  20000 |   849   (4)|  20000 |00:00:00.09 |    6555 |   6553 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."ID"="T1"."ID")
   2 - filter(("T1"."N2"=1000 AND "T1"."N1"=0))
   3 - filter(("T2"."N2"=400 AND "T2"."N1"=0))

Unfortunately, unless you have used hints, it doesn’t matter how many times you re-run the query after cardinality feedback has pushed you into the bad plan – it’s not going to change again (unless you mess around with flushing the shared pool or using dbms_shared_pool.purge() to kick out the specific statement).
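If you do decide to kick out just the one statement, a minimal sketch of the purge call (using SQL*Plus substitution variables; supply the sql_id of your query) looks like this:

-- find the address and hash_value of the cursor for the statement
column	purge_target new_value m_purge_target

select	address || ',' || hash_value	purge_target
from	v$sqlarea
where	sql_id = '&m_sql_id';

-- purge that single cursor ('C') from the shared pool
execute dbms_shared_pool.purge('&m_purge_target', 'C')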

Upgrade

If you upgrade to 12c the optimizer does a much better job of handling this query – it produces an adaptive execution plan (starting with the nested loop join but dynamically switching to the hash join as the query runs). Here’s the full adaptive plan pulled from memory after the first execution – as you can see both the t1/t2 nested loop and hash joins were considered, then the nested loop was discarded in mid-execution. Checking the 10053 trace file I found that Oracle has set the inflexion point (cross-over from NLJ to HJ) at 431 rows.
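For reference, pulling the adaptive plan complete with its inactive operations is just a variation on the earlier dbms_xplan call; a sketch, the key extra being the 12c '+adaptive' format option:

select * from table(dbms_xplan.display_cursor(null, null, 'allstats last +adaptive'));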


----------------------------------------------------------------------------------------------------------------------------------------------------
|   Id  | Operation                     | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT              |       |      1 |        |  1063 (100)|      0 |00:00:00.06 |   13113 |  13107 |       |       |          |
|  *  1 |  HASH JOIN                    |       |      1 |    100 |  1063   (3)|      0 |00:00:00.06 |   13113 |  13107 |  1519K|  1519K| 1349K (0)|
|-    2 |   NESTED LOOPS                |       |      1 |    100 |  1063   (3)|   2000 |00:00:00.11 |    6556 |   6553 |       |       |          |
|-    3 |    NESTED LOOPS               |       |      1 |    100 |  1063   (3)|   2000 |00:00:00.10 |    6556 |   6553 |       |       |          |
|-    4 |     STATISTICS COLLECTOR      |       |      1 |        |            |   2000 |00:00:00.09 |    6556 |   6553 |       |       |          |
|  *  5 |      TABLE ACCESS FULL        | T1    |      1 |    100 |   863   (4)|   2000 |00:00:00.08 |    6556 |   6553 |       |       |          |
|- *  6 |     INDEX UNIQUE SCAN         | T2_PK |      0 |      1 |     1   (0)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|- *  7 |    TABLE ACCESS BY INDEX ROWID| T2    |      0 |      1 |     2   (0)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|  *  8 |   TABLE ACCESS FULL           | T2    |      1 |      1 |     2   (0)|  20000 |00:00:00.07 |    6557 |   6554 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."ID"="T1"."ID")
   5 - filter(("T1"."N2"=1000 AND "T1"."N1"=0))
   6 - access("T2"."ID"="T1"."ID")
   7 - filter(("T2"."N2"=400 AND "T2"."N1"=0))
   8 - filter(("T2"."N2"=400 AND "T2"."N1"=0))

Note
-----
   - this is an adaptive plan (rows marked '-' are inactive)

Footnote:

For reference, here are a few of the bug (or patch) numbers associated with cardinality feedback:

  • Patch 13454409: BAD CARDINALITY FROM FEEDBACK (CFB) ON VIEW RHS OF NLJ
  • Bug 16837274 (fixed in 12.2): Bad cost estimate for object on RHS of NLJ
  • Bug 12557401: The table that is being incorrectly adjusted is in the right hand side of a nested loops.
  • Bug 8521689: Bad cardinality feedback estimate for view on right-hand side of NLJ

 


Pseudo-Philosophical Observations on Wearables, Part 1

Oracle AppsLab - Wed, 2014-11-05 11:53

Jawbone announced the Up3 today, reportedly its most advanced fitness tracker to date.

As with all fitness trackers, the Up3 has an accelerometer, but it also has sensors for measuring skin and ambient temperature, as well as something called bioimpedance. All these data collected by the Up3 are used by a new feature called Smart Coach.

You can imagine what the Smart Coach does. It sounds like a cool, possibly creepy, feature.

This post is not about the Up3.

This post is about my journey into the dark heart of the quantified self. The Up3 has just reminded me to coalesce my thoughts.

Earlier this year, I started wearing my first fitness tracker, the Misfit Shine. I happily wore it for about two months before the battery died, and then I realized it had control of me.

Misfit calculates activity based on points, and my personal goal of 1,000 points was relatively easy to reach every day, even for someone who works from home. What I realized quickly was that the Shine pushed me to chase points, not activity.


My high score.

 

The Shine uses its accelerometer to measure activity, so depending on where I wore it on my person, a run could be worth more points. This isn’t unique to the Shine. I’ve seen people spinning at the gym wearing their fitness trackers on their ankles.

As the weeks passed, I found myself avoiding activities that didn’t register a lot of points, definitely not good behavior, and even though my goal was 1,000 points, I avoided raising it for fear of missing my daily goal-achievement dopamine high.

Then, mid-Summer, Misfit dropped an update that added some new game mechanics, and one day, my Shine app happily informed me that I’d hit my goal 22 days in a row.

This streak was the beginning of the end for me.

On the 29th day of my streak, the battery died. I replaced it, crisis averted, streak intact. Then, later that day, the Shine inexplicably died. I tried several new batteries and finally had to contact support.

All the while, I worried about my streak. I went to the gym, but it felt hollow and meaningless without the tangible representation, the coaching, as it were, from my Shine.

This is not a good look.

Misfit replaced my Shine, but in the days that elapsed, during my detox, I decided to let it go. Turns out the quantified self isn’t for obsessive, overly-competitive personality types like me.

And I’m not the only one in this group.

In September, I read an article called Stepping Out: Living the Fitbit Life, in which the author, David Sedaris, describes a similar obsession with his Fitbit. As I read it, I commiserated, but I also felt a little jealous of the level of his commitment. This dude makes me look like a rank amateur.

Definitely worth a read.

Anyway, this is not in any way meant to be an indictment of the Shine, Fitbit, Jawbone or any fitness tracker. Overall, these devices offer people a positive and effective way to reinforce healthy behavior and habits.

But for people like me, they lead to unanticipated side effects. As I read about the Up3, its sensors and Smart Coach, all of which sound very cool, I had to remind myself of the bad places where I went with the Shine.

And the colloquial, functionally-incorrect but very memorable, definition of insanity.

In Part 2, when I get around to it, I’ll discuss the flaws in the game mechanics these companies use.

Find the comments.

Pythian at LISA14

Pythian Group - Wed, 2014-11-05 10:27

Pythian is a sponsor at the LISA conference this year, where we’ll be participating in a panel discussion, invited talk, and a birds-of-a-feather session.

Bill Fraser, Principal Consultant in Pythian’s SRE practice, notes that this year is different for Pythian. “While members of our team have attended LISA in the past, this marks the second year Pythian will be a sponsor for the event, and the first time we have been accepted to speak.” Bill will be speaking at one of the birds-of-a-feather sessions alongside Pierig Le Saux, another Principal Consultant in Pythian’s SRE practice.

“One of the longest running technical conferences of its kind (this will be the 28th incarnation), the LISA conference is an opportunity to meet, learn from, and network with some of the most respected technical leaders and researchers in the industry,” Bill says. “The conference program features talks, panels, and tutorials on the topics of DevOps, monitoring and metrics, and security, and provides the attendee with an opportunity to learn about the newest tools and emerging technologies in the field.”

“For Pythian, the conference provides us with an opportunity to give back to the community, by showing our support of the LISA conference and USENIX organization, and allowing us to share the experiences of members of our team. We look forward to seeing you there. Please stick around after our talks and introduce yourself, and / or stop by our booth and say hi!”

 

Birds-of-a-Feather Session featuring Bill Fraser and Pierig Le Saux
Wednesday November 12, 2014 — 7:00-8:00 PM
Grand Ballroom A

Bill Fraser and Pierig Le Saux, Principal Consultants for the SRE practice at Pythian, will be discussing what it really means to be practicing Infrastructure as Code. They will provide examples of concepts, tools and workflows, and encourage attendees to engage in a dialogue about how their day-to-day work is changing.

 

Remote Work panel featuring Bill Lincoln
Thursday November 13, 2014 — 11:00 AM-12:30 PM

Bill Lincoln, Service Delivery Manager and business advocate at Pythian, will be participating in the Remote Work panel. This panel will focus on how companies handle remote workers in Ops roles. “When you’re looking for the best talent in the world, remote work is a requirement—not an option,” Bill says. “Finding ways to effectively manage large, remote, teams across the globe is a challenge for any organization and Pythian has built its entire business around this model.” He will be presenting alongside folks from DigitalOcean, TeamSnap, and Etsy.

 

Invited Talk presented by Chris Stankaitis
Friday November 14, 2014 — 9:00-9:45 AM

Chris Stankaitis, Team Lead for the Enterprise Infrastructure Services practice at Pythian, will be presenting an invited talk called Embracing Checklists as a Tool for Human Reliability.

“A pilot cannot fly a plane, and a surgeon cannot cut into a person without first going through a checklist,” Chris says. “These are some of the most well-educated and highly skilled people in the world, and they have embraced the value of checklists as a tool that can dramatically reduce human error.”

 

EXPO Hall
Wednesday November 12, 12:00 -7:00 PM
Thursday November 13, 10:00 AM-2:00 PM

The LISA EXPO hall opens at noon on Wednesday, so be sure to stop by Pythian’s booth #204 (we’ll be in good company, right next to Google!). You could win a Sonos Play:1, and all you have to do is take a selfie. Learn the full contest details at our booth, and follow us on Twitter @Pythian to stay updated!

 

Pythian is a global leader in data consulting and managed services. We specialize in optimizing and managing mission-critical data systems, combining the world’s leading data experts with advanced, secure service delivery. Learn more about Pythian’s Data Infrastructure expertise.

 

Categories: DBA Blogs

Watch: The Most Underrated Features of SQL Server 2014 — Part 2

Pythian Group - Wed, 2014-11-05 10:09

Since its release back in April, SQL Server experts across the globe are becoming familiar with the top features in Microsoft SQL Server 2014—the In-Memory OLTP engine, the AlwaysOn enhancements, and more. But we couldn’t help but notice that there are a few features that aren’t getting the same attention. Warner Chaves, a Microsoft Certified Master and SQL Server Principal Consultant at Pythian has filmed a video series sharing the most underrated features of SQL Server 2014.

In the second video in his series, Warner discusses the significance of the new partition operations, making online index rebuilds and incremental statistics much more efficient. “For many clients, it was really hard to find maintenance windows that were big enough to actually rebuild the entire set [of partition tables] when they had fragmentation issues,” said Warner. “2014 now implements online index rebuild on a partition level.” Learn how incremental statistics became more efficient, and when you can start using the new partition operations, by watching his video The Most Underrated Features of SQL Server 2014 — Part 2 below.
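As a rough sketch of the two features Warner mentions (the table, index, statistics and partition names here are purely illustrative), partition-level online index rebuild and incremental statistics look like this:

-- SQL Server 2014: rebuild a single partition of an index online
ALTER INDEX IX_Sales_OrderDate ON dbo.Sales
    REBUILD PARTITION = 5
    WITH (ONLINE = ON);

-- Create statistics that can be maintained per partition
CREATE STATISTICS ST_Sales_OrderDate ON dbo.Sales (OrderDate)
    WITH INCREMENTAL = ON;

-- Later, refresh statistics only for the partition that changed
UPDATE STATISTICS dbo.Sales (ST_Sales_OrderDate)
    WITH RESAMPLE ON PARTITIONS (5);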

Watch the rest of the series here:

 

Pythian is a global leader in data consulting and managed services. We specialize in optimizing and managing mission-critical data systems, combining the world’s leading data experts with advanced, secure service delivery. Learn more about Pythian’s Microsoft SQL Server expertise.

 

Categories: DBA Blogs

Understanding PeopleSoft Global Payroll Identification

Javier Delgado - Wed, 2014-11-05 08:49
The first stage in PeopleSoft Global Payroll processing is the identification of the employees to be calculated. Several criteria are used to determine which employees should be selected, and why a given employee was selected is not always evident to users. In this post I'm sharing how I normally determine the identification reason.

Once you run the identification stage, the employees to be processed are stored in the GP_PYE_PRC_STAT table. This table not only shows which employees are going to be calculated, but also indicates which calendars will be considered. This is particularly important when running retroactive calculations, as it allows you to understand the impact of this type of calculation.

In any case, going back to the identification, in this table you will find the SEL_RSN field, which contains a code that translates into the reason behind the employee identification. The valid values that this field may take are:

  • 01: The employee is active during the calendar period and included in the Payee List associated to the calendar.
  • 02: The employee is inactive (but was active before the start of the calendar period) and included in the Payee List associated to the calendar.
  • 03: The employee is active during the calendar period and has a positive input associated to him/her.
  • 04: The employee is active during the calendar period and has a retro trigger associated to him/her.
  • 05: The employee is active during the calendar period and associated to the calendar pay group.
  • 06: The employee is inactive during the calendar period and associated to a positive input in the current calendar.
  • 07: The employee is inactive (but still associated to the calendar pay group) and has a retro trigger associated to him/her.
  • 08: The employee is inactive but has a retroactive calculation delta from a previous calendar which has not been picked yet.
  • 09: The employee is inactive but has a retroactive calculation correction from a previous calendar which has not been picked yet.
  • 0A: The employee is active and linked to the calendar using an override.
  • 0B: The employee is inactive and linked to the calendar using an override.
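Putting this together, a quick way to review why payees were picked up for a given calendar group run is a query along these lines (a sketch only; SEL_RSN is the field described above, while the other column names are the usual Global Payroll keys, so verify them against the record definition in your environment):

SELECT EMPLID, CAL_RUN_ID, CAL_ID, SEL_RSN
FROM   PS_GP_PYE_PRC_STAT
WHERE  CAL_RUN_ID = :cal_run_id
ORDER  BY EMPLID, CAL_ID;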

From a technical standpoint, you can check the SQL used to select each reason by looking at the stored statement named GPPIDNT2_I_PRCnn, where nn is the SEL_RSN value.
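If you want to review those statements themselves, something like the following should list them (again a sketch; it assumes the standard stored statement catalog record PS_SQLSTMT_TBL and its usual columns):

SELECT PGM_NAME, STMT_TYPE, STMT_NAME, STMT_TEXT
FROM   PS_SQLSTMT_TBL
WHERE  PGM_NAME LIKE 'GPPIDNT2%';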

Do you use another way to understand why an employee was identified? If so, please feel free to share your method in the comments, as I'm afraid my approach is a little bit too technical. ;)

Webinar: Building Secure Apps with PL/SQL and Formspider

Gerger Consulting - Wed, 2014-11-05 00:58
We are hosting a free Formspider webinar on November 25th. Join in and find out how Formspider helps PL/SQL developers build secure web applications.

In the webinar, the following topics will be covered:

- Formspider Security Architecture
- Built-in countermeasures in Formspider for OWASP Top 10 Security Vulnerabilities
- Introduction to Formspider Authentication and Authorization Repository

Sign up now.
Categories: Development

Analytics with Kibana and Elasticsearch through Hadoop – part 3 – Visualising the data in Kibana

Rittman Mead Consulting - Tue, 2014-11-04 16:02

In this post we will see how Kibana can be used to create visualisations over the various sets of data that we have combined. Kibana is a graphical front end for data held in Elasticsearch, which also provides the analytic capabilities. Previously we looked at where the data came from and how to expose it through Hive, and then at loading it into Elasticsearch. Here’s what we’ve built so far, the borders denoting what was covered in the previous two blog articles and what we’ll cover here:


Now that we’ve got all the data into Elasticsearch, via Hive, we can start putting some pictures around it. Kibana works by directly querying Elasticsearch, generating the same kind of queries that you can run yourself through the Elasticsearch REST API (similar to what we saw when defining the mappings in the previous article). In this sense there is a loose parallel between OBIEE’s Presentation Services and the BI Server – one does the fancy front end stuff, generating queries to the hard-working backend.

I’ve been looking at both the current release version of Kibana (3.x), and also the beta of Kibana 4 which brings with it a very smart visualiser that we’ll look at in detail. It looks like Kibana 4 is a ground-up rewrite rather than modifications to Kibana 3, which means that at the moment it is a long way from parity of functionality – which is why I’m flitting between the two. For a primer in Kibana 3 and its interface see my article on using it to monitor OBIEE.

Installing Kibana is pretty easy in Kibana 3, involving a simple config change to a web server of your choice that you need to provide (details in my previous blog), and has been made even easier in Kibana 4 which actually ships with its own web server so you literally just download it, unarchive it and run it.

So the starting point is the assumption we have all the data in a single Elasticsearch index all_blog, with three different mappings which Kibana refers to accurately as “types”: blog posts, blog visits, and blog tweets.

Kibana 3

Starting with a simple example first, and to illustrate the “analysed” vs “non-analysed” mapping configuration that I mentioned previously, let’s look at the Term visualisation in Kibana 3. This displays the results of an Elasticsearch analysis against a given field. If the field has been marked as “not analysed” we get a listing of the literal values, ranking by the number of times they repeat. This is useful, for example, to show who has blogged the most:

But it’s less useful if we want to analyse the use of words in blog titles, since with a non-analysed field we just get a listing of the complete titles:

(there are indeed two blog posts entitled “Odds and Ends” from quite a while ago)

Building the Term visualisation against the post title field that has been analysed gives us a more interesting, although hardly surprising, result:

Here I’ve weeded out the obvious words that will appear all the time (‘the’, ‘a’, etc), using the Exclude Term(s) option.

Term visualisations are really useful for displaying any kind of top/bottom ranked values, and also because they are interactive – if you click on the value it is applied as a filter to the data on the page. What that means is that we can take a simple dashboard using the two Term objects above, plus a histogram of posts made over time:

And by clicking on one of the terms (for example, my name in the authors list) it shows that I only started posting on the Rittman Mead blog three years ago, and that I write about OBIEE, performance, and exalytics.

Taking another tack, we can search for any term and add it into the histogram. Here we can see when interest in 11g (the green line), as well as big data (red), started:

Note here we’re just analysing post titles, not content, so it’s not 100% representative. Maybe loading in our post contents to Elasticsearch will be my next blog post. But that does then start to get a little bit meta…

Adding in a Table view gives us the ability to show the actual posts and links to them.

Let’s explore the data a bit. Clicking on an entry in the table gives us the option to filter down further:

Here we can see for a selected blog post, what its traffic was and when (if at all) it was tweeted:

Interesting in the profile of blog hits is a second peak that looks like it might correlate with tweets. Let’s drill further by drag-clicking (brushing) on the graph to select the range we want, and bring in details of those tweets:

So this is all pretty interesting, and importantly, very rapid in terms of both the user experience and the response time.

Kibana 4

Now let’s take a look at what Kibana 4 offers us. As well as a snazzier interface (think hipster data explorer vs hairy ops guy parsing logs), its new Visualiser builder is great. Kibana 3 dumped you on a dashboard in which you have to build rows and panels and so on. Kibana 4 has a nice big “Visualize” button. Let’s see what this does for us. To start with it’s a nice “guided” build process:

By default we get a single bar, counting all the ‘documents’ for the time period. We can use the Search option at the top to filter just the ‘type’ of document we want, which in this case is going to be tweets about our blog articles.

Obviously, a single bar on its own isn’t that interesting, so let’s improve it. We’ll click the “Add Aggregation” button (even though to my pedantic mind the data is already aggregated to total), and add an X-Axis of date:

The bucket size in the histogram defaults to automatic, and the axis label tells us it’s per three hours. At the volume of tweets we’re analysing, we’d see patterns better at a higher grain such as daily (the penultimate bar to the right of the graph shows a busy day of tweets that’s lost in the graph at 3-hour intervals):

NB at the moment in Kibana 4 intervals are fixed (in Kibana 3 they were freeform).

Let’s dig into the tweets a bit deeper. Adding a “Sub Aggregation” to split the bars based on top two tweet authors per day gives us this:

You can hover over the legend to highlight the relevant bar block too:

Now with a nifty function in the Visualizer we can change the order of this question. So instead of, “by day, who were the top two tweeters”, we can ask “who were the top two tweeters over the time period, and what was their tweet count by day” – all just by rearranging the buckets/aggregation with a single click:

Let’s take another angle on the data, looking not at time but which blog links were most tweeted, and by whom. Turns out I’m a self-publicist, tweeting four times about my OOW article. Note that I’ve also included some filtering on my data to exclude automated tweets:

Broadening out the tweets to all those from accounts we were capturing during the sample we can see the most active tweeters, and also what proportion are original content vs retweets:

Turning our attention to the blog hits, it’s easy to break it down by top five articles in a period, accesses by day:

Having combined (dare I say, mashed up) post metadata with apache logs, we can overlay information about which author gets the most hits. Unsurprisingly Mark Rittman gets the lion’s share, but interestingly Venkat, who has not blogged for quite a while, is still in the top three authors (based on blog page hits) in the time period analysed:

It’s in the current lack of a table visualisation that Kibana 4 is currently limited (although it is planned), because this analysis here (of the top three authors, what were their respective two most popular posts) just makes no sense as a graph:

but would be nice and easy to read off a table. You can access a table view of sorts from the arrow at the bottom of the screen, but this feels more like a debug option than an equal method for presenting the data.

Whilst you can access the table on a dashboard, it doesn’t persist as the default option of the view, always showing the graph initially. As noted above, a table visualisation is planned and under development for Kibana 4.

Speaking of dashboards, Kibana 4 has a very nice dashboard builder with interactive resizing of objects both within rows and columns – quite a departure from Kibana 3 which has a rigid system of rows and panels:

Summary

Kibana 3 is great for properly analysing data and trends as you find them in the data, if you don’t mind working your way through the slightly rough interface. In contrast, Kibana 4 has a pretty slick UI but being an early beta is missing features like Term and Table from Kibana 3 that would enable tables of data as well as the pretty graphs. It’ll be great to see how it develops.

Putting the data in Elasticsearch makes it very fast to query. I’m doing this on the Big Data Lite VM, which admittedly is not very representative of a real-world Hadoop cluster, but the relative speeds are interesting – dozens of seconds for any kind of Hive query, subsecond for any kind of Kibana/Elasticsearch query. The advantage of the latter of course being very interesting from a data exploration point of view, because you not only have the speed but also the visualisation and interactions with those visuals to dig and drill further into it.

Whilst Elasticsearch is extremely fast to query, I’ve not compared it to other options that are designed for speed (eg Impala) and which support a more standard interface, such as ODBC or JDBC so you can bring your own data visualisation tool (eg T-who-shall-not-be-named). In addition, there is the architectural consideration of Elasticsearch’s fit with the rest of the Hadoop stack. Whilst the elasticsearch-hadoop connector is two-way, I’m not sure if you would necessarily site your data in Elasticsearch alone, opting instead to duplicate all or part of it from somewhere like HDFS.

What would be interesting is to look at a similar analysis exercise using the updated Hue Search in CDH 5.2 which uses Apache Solr and therefore based on the same project as Elasticsearch (Apache Lucene). Another angle on this is Oracle’s forthcoming Big Data Discovery tool which also looks like it covers a similar purpose.

Categories: BI & Warehousing

Analytics with Kibana and Elasticsearch through Hadoop – part 2 – Getting data into Elasticsearch

Rittman Mead Consulting - Tue, 2014-11-04 15:17
Introduction

In the first part of this series I described how I made several sets of data relating to the Rittman Mead blog from various sources available through Hive. This included blog hits from the Apache webserver log, tweets, and metadata from WordPress. Having got it into Hive I now need to get it into Elasticsearch as a pre-requisite for using Kibana, to see how it holds up as an analysis tool or as a “data discovery” option. Here’s a reminder of the high-level architecture, with the parts I’ve divided it up into spread over the three blog posts indicated:


In this article we will see how to go about doing that load into ElasticSearch, before getting into some hands-on with Kibana in the final article of this series.

Loading data from Hive to Elasticsearch

We need to get the data into Elasticsearch itself since that is where Kibana requires it to be for generating the visualisations. Elasticsearch holds the data and provides the analytics engine, and Kibana provides the visualisation rendering and the generation of queries into Elasticsearch. Kibana and Elasticsearch are the ‘E’ and ‘K’ of the ELK stack, which I have written about previously (the ‘L’ being Logstash but we’re not using that here).

Using the elasticsearch-hadoop connector we can load data exposed through Hive into Elasticsearch. It’s possible to load data directly from origin into Elasticsearch (using, for example, Logstash) but here we’re wanting to bring together several sets of data using Hadoop/Hive as the common point of integration.

Elasticsearch has a concept of an ‘index’ within which data is stored, held under a schema known as a ‘mapping’. Each index can have multiple mappings. It’s dead easy to run Elasticsearch – simply download it, unpack the archive, and then run it – it really is as easy as that:

[oracle@bigdatalite ~]$ /opt/elasticsearch-1.4.0.Beta1/bin/elasticsearch
[2014-10-30 16:59:39,078][INFO ][node                     ] [Master] version[1.4.0.Beta1], pid[13467], build[1f25669/2014-10-01T14:58:15Z]
[2014-10-30 16:59:39,080][INFO ][node                     ] [Master] initializing ...
[2014-10-30 16:59:39,094][INFO ][plugins                  ] [Master] loaded [], sites [kopf, gui]
[2014-10-30 16:59:43,184][INFO ][node                     ] [Master] initialized
[2014-10-30 16:59:43,184][INFO ][node                     ] [Master] starting ...
[2014-10-30 16:59:43,419][INFO ][transport                ] [Master] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.57.3:9300]}
[2014-10-30 16:59:43,446][INFO ][discovery                ] [Master] elasticsearch/mkQYgr4bSiG-FqEVRkB_iw
[2014-10-30 16:59:46,501][INFO ][cluster.service          ] [Master] new_master [Master][mkQYgr4bSiG-FqEVRkB_iw][bigdatalite.localdomain][inet[/192.168.57.3:9300]], reason: zen-disco-join (elected_as_master)
[2014-10-30 16:59:46,552][INFO ][http                     ] [Master] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.57.3:9200]}
[2014-10-30 16:59:46,552][INFO ][node                     ] [Master] started

You can load data directly across into Elasticsearch from Hive without having to prepare anything on Elasticsearch – it will create the index and mapping for you. But, for it to work how we want, we do need to specify the mapping in advance because we want to tell Elasticsearch two important things:

  • To treat the date field as a date – crucial for Kibana to do its time series-based magic
  • Not to “analyze” certain fields. By default Elasticsearch will analyze each string field so that you can display most common terms within it etc. However if we want to report things like blog title, breaking it down into individual words doesn’t make sense.

This means that the process is as follows:

  1. Define the Elasticsearch table in Hive
  2. Load a small sample of data into Elasticsearch from Hive
  3. Extract the mapping and amend the date field and mark required fields as non-analysed
  4. Load the new mapping definition to Elasticsearch
  5. Do a full load from Hive into Elasticsearch

Steps 2 and 3 can be sidestepped by crafting the mapping by hand from the outset but it’s typically quicker not to.

Before we can do anything in terms of shifting data around, we need to make elasticsearch-hadoop available to Hadoop. Download it from the github site, and copy the jar file to /usr/lib/hadoop and add it to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive-env.sh.

Defining the Hive table over Elasticsearch

The Hive definition for a table stored in Elasticsearch is pretty simple. Here’s a basic example of a table that’s going to hold a list of all blog posts made. Note the _es suffix, a convention I’m using to differentiate the Hive table from others with the same data and denoting that it’s in Elasticsearch (es). Also note the use of EXTERNAL as previously discussed, to stop Hive trashing the underlying data if you drop the Hive table:

CREATE EXTERNAL TABLE all_blog_posts_es (
ts_epoch bigint ,
post_title string ,
post_title_a string ,
post_author string ,
url string ,
post_type string )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
'es.nodes'='bigdatalite.localdomain',
'es.resource'='all_blog/posts'
) ;

The ROW FORMAT and STORED BY are standard, but the TBLPROPERTIES values should be explained (you’ll find full details in the manual):

  1. es.nodes – this is the hostname of the Elasticsearch server. If you have multiple nodes it will discover the others from this.
  2. es.resource – this is the index and mapping where the data should be stored. We’ll see more about these later, because they’re important.
Time for a tangent …

The biggest issue I had getting data from Hive into Elasticsearch was timestamps. To cut a very long story (involving lots of random jiggling, hi Christian!) short, I found it was easiest to convert timestamps into Unix epoch (number of seconds since Jan 1st 1970), rather than prat about with format strings (and prat about I did). For timestamps already matching the ISO8601 standard, such as those in my WordPress data, I could leverage the Hive function UNIX_TIMESTAMP, which returns exactly that:

0: jdbc:hive2://bigdatalite:10000> select post_date, unix_timestamp(post_date) as post_date_epoch from posts limit 1;
post_date        2007-03-07 17:45:07
post_date_epoch  1173289507

For others though that included the month name as text such as Wed, 17 Sep 2014 08:31:20 +0000 I had to write a very kludgy CASE statement to first switch the month names for numbers and then concatenate the whole lot into a ISO8601 that could be converted to unix epoch. This is why I also split the apache log SerDe so that it would bring in the timestamp components (time_dayDD, time_monthMMM, etc) individually, making the epoch conversion a little bit neater:

unix_timestamp(concat(concat(concat(concat(concat(concat(
a.time_yearyyyy,'-')
,case a.time_monthmmm when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
,a.time_daydd,' ')
,a.time_hourhh,':')
,a.time_minmm,':')
,a.time_secss,'')
)

Because if you thought this was bad, check out what I had to do to the twitter timestamp:

unix_timestamp(
    concat(concat(concat(concat(regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-')
    ,case regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} ',''),' .*$','') 
    when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
    ,regexp_replace(regexp_replace(created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))
)

As with a few things here, this was all more for experimentation than streamlined production usage, so it probably could be rewritten more efficiently or solved in a better way – suggestions welcome!

So the net result of all of this is the timestamp as epoch in seconds – but note that Elasticsearch works with millisecond epoch, so they all need multiplying by 1000.

As I’ve noted above, this feels more complex than it needed to have been, and maybe with a bit more perseverance I could have got it to work without resorting to epoch. The issue I continued to hit with passing timestamps across as non-epoch values (i.e. as strings using the format option of the Elasticsearch mapping definition, or Hive Timestamp, and even specifying es.mapping.timestamp) was org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: TimestampParsingException, regardless of the careful format masks that I applied.

Back on track – loading a sample row into Elasticsearch

We want to send a sample row of data to Elasticsearch now for two reasons:

  1. As a canary to prove the “plumbing” – no point chucking thousands of rows across through MapReduce if it’s going to fall over for a simple problem (I learnt my lesson during the timestamp fiddling above).
  2. Automagically generate the Elasticsearch mapping, which we subsequently need to modify by hand and is easier if it’s been created for us first.

Since the table is defined in Hive, we can just run a straightforward INSERT to send some data across, making use of the LIMIT clause of HiveQL to just send a couple of rows:

INSERT INTO TABLE all_blog_posts_es 
SELECT UNIX_TIMESTAMP(post_date) * 1000 AS post_date_epoch, 
       title, 
       title, 
       author, 
       REGEXP_EXTRACT(generated_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) , 
       post_type 
FROM   posts 
WHERE  post_date IS NOT NULL
LIMIT 2
;

Hive will generate a MapReduce job that pushes the resulting data over to Elasticsearch. You can see the log for the job – essential for troubleshooting – at /var/log/hive/hive-server2.log (by default). In this snippet you can see a successful completion:

2014-10-30 22:35:14,977 INFO  exec.Task (SessionState.java:printInfo(417)) - Starting Job = job_1414451727442_0011, Tracking URL = http://bigdatalite.localdomain:8088/proxy/application_1414451727442_0011/
2014-10-30 22:35:14,977 INFO  exec.Task (SessionState.java:printInfo(417)) - Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414451727442_0011
2014-10-30 22:35:22,244 INFO  exec.Task (SessionState.java:printInfo(417)) - Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 1
2014-10-30 22:35:22,275 WARN  mapreduce.Counters (AbstractCounters.java:getGroup(234)) - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2014-10-30 22:35:22,276 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:22,276 Stage-0 map = 0%,  reduce = 0%
2014-10-30 22:35:30,757 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:30,757 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.51 sec
2014-10-30 22:35:40,098 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:40,098 Stage-0 map = 100%,  reduce = 100%, Cumulative CPU 4.44 sec
2014-10-30 22:35:40,100 INFO  exec.Task (SessionState.java:printInfo(417)) - MapReduce Total cumulative CPU time: 4 seconds 440 msec
2014-10-30 22:35:40,132 INFO  exec.Task (SessionState.java:printInfo(417)) - Ended Job = job_1414451727442_0011
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - MapReduce Jobs Launched:
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.44 sec   HDFS Read: 4313 HDFS Write: 0 SUCCESS
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - Total MapReduce CPU Time Spent: 4 seconds 440 msec
2014-10-30 22:35:40,159 INFO  ql.Driver (SessionState.java:printInfo(417)) - OK

But if you’ve a problem with your setup you’ll most likely see this generic error instead passed back to beeline prompt:

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

Meaning that you need to go to the Hive log file for the full diagnostics.

Amending the Elasticsearch mapping

So assuming the previous step worked (if you got the innocuous No rows affected from beeline then it did) you now have an index and mapping (and a couple of “documents” of data) in Elasticsearch. You can inspect the mapping in several ways, including with the GUI for Elasticsearch admin kopf.

You can also interrogate Elasticsearch directly with its REST API, which is what we’re going to use to update the mapping, so let’s use it also to view it. I’m going to use curl to do the HTTP call, and then pipe it | straight to jq to prettify the resulting JSON that Elasticsearch sends back.

[oracle@bigdatalite ~]$ curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/posts/_mapping' | jq '.'
{
  "all_blog": {
    "mappings": {
      "posts": {
        "properties": {
          "url": {
            "type": "string"
          },
          "ts_epoch": {
            "type": "long"
          },
          "post_type": {
            "type": "string"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string"
          },
          "post_author": {
            "type": "string"
          }
        }
      }
    }
  }
}

We can see from this that Elasticsearch has generated the mapping to match the data that we’ve sent across from Hive (note how it’s picked up the ts_epoch type as being numeric not string, per our Hive table DDL). But, as mentioned previously, there are two things we need to rectify here:

  1. ts_epoch needs to be a date type, not long. Without the correct type, Kibana won’t recognise it as a date field.
  2. Fields that we don’t want broken down for analysis need marking as such. We’ll see the real difference that this makes when we get on to Kibana later.

To amend the mapping we just take the JSON document, make the changes, and then push it back with curl again. You can use any editor with the JSON (I’ve found Atom on the Mac to be great for its syntax highlighting, brace matching, etc). To change the type of the date field just change long to date. To mark a field not for analysis add "index": "not_analyzed" to the column definition. After these changes, the amended fields in my mapping JSON look like this:

[...]
          "url": {
            "type": "string","index": "not_analyzed"
          },
          "ts_epoch": {
            "type": "date"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string","index": "not_analyzed"
          },
          "post_author": {
            "type": "string","index": "not_analyzed"
            [...]

The particularly eagle-eyed of you will notice that I am loading post_title in twice. This is because I want to use the field both as a label but also to analyse it as a field itself, looking at which terms get used most. So in the updated mapping, only post_title is set to not_analyzed; the post_title_a is left alone.

To remove the existing mapping, use this API call:

curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/posts'

and then the amended mapping put back. Note that the "all_blog" / "mappings" outer levels of the JSON have been removed from the JSON that we send back to Elasticsearch:

curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts' -d '
{
      "posts": {
        "properties": {
          "url": {
            "type": "string","index": "not_analyzed"
          },
          "ts_epoch": {
            "type": "date"
          },
          "post_type": {
            "type": "string"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string","index": "not_analyzed"
          },
          "post_author": {
            "type": "string","index": "not_analyzed"
          }
        }
      }
    }
'

Full load into Elasticsearch

Now we can go ahead and run a full INSERT from Hive, and this time the existing mapping will be used. Depending on how much data you’re loading, it might take a while but you can always tail the hive-server2.log file to monitor progress. So that we don’t duplicate the ‘canary’ data that we sent across, use the INSERT OVERWRITE statement:

INSERT OVERWRITE table all_blog_posts_es 
SELECT UNIX_TIMESTAMP(post_date) * 1000 AS post_date_epoch, 
title, 
title, 
author, 
REGEXP_EXTRACT(generated_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) , 
post_type 
FROM   posts 
WHERE  post_date IS NOT NULL
;

To check the data’s made it across we can do a count from Hive:

0: jdbc:hive2://bigdatalite:10000> select count(*) from all_blog_posts_es;
+------+
| _c0  |
+------+
| 2257 |
+------+
1 row selected (27.005 seconds)

But this requires a MapReduce job to run and is fairly slow. Much faster is direct from the horse’s mouth – from Elasticsearch itself where the data is. Just as we called a REST API to get and set the mapping, Elasticsearch can also give us statistics back this way too:

[oracle@bigdatalite ~]$ curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/_stats/docs' | jq '.indices[].total.docs'
{
  "deleted": 0,
  "count": 2257
}

Here I’ve used a bit more jq to parse down the stats in JSON that Elasticsearch sends back. If you want to explore more of what jq can do, you’ll find https://jqplay.org/ useful.

Code

For reference, here is the set of three curl/DDL/DML that I used:

  • Elasticsearch index mappings

    # For reruns, remove and recreate index
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog' && curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog'
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts'
    # Create posts mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts' -d '
    {
    "posts" : {
    "properties": {
    "ts_epoch": {"type": "date"},
    "post_author": {"type": "string", "index" : "not_analyzed"},
    "post_title": {"type": "string", "index" : "not_analyzed"},
    "post_title_a": {"type": "string", "index" : "analyzed"},
    "post_type": {"type": "string", "index" : "not_analyzed"},
    "url": {"type": "string", "index" : "not_analyzed"}
    }}}
    '
    
    # For partial rerun, remove mapping
    # Create tweets mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/tweets'
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/tweets' -d '
    {"tweets": {
    "properties": {
    "tweet_url": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_type": {
    "type": "string"
    },
    "ts_epoch": {
    "type": "date"
    },
    "tweet_author": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_author_followers": {
    "type": "string"
    },
    "tweet_author_friends": {
    "type": "string"
    },
    "tweet_author_handle": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_analysed": {     "type": "string"      }
    ,"post_author": {       "index": "not_analyzed","type": "string"      }
    ,"post_title": {       "index": "not_analyzed", "type": "string"      }
    ,"post_title_a": {    "type": "string"    }
    
    }
    }
    }'
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/apache'
    # Create apachelog mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/apache' -d '
    {
      "apache": {
        "properties": {
          "user":         {"type": "string"},
          "url":          {"index": "not_analyzed", "type": "string"},
          "status":       {"type": "string"},
          "agent":        {"index": "not_analyzed", "type": "string"},
          "host":         {"type": "string"},
          "http_call":    {"type": "string"},
          "http_status":  {"type": "string"},
          "identity":     {"type": "string"},
          "referer":      {"index": "not_analyzed", "type": "string"},
          "ts_epoch":     {"type": "date"},
          "size":         {"type": "string"},
          "post_author":  {"index": "not_analyzed", "type": "string"},
          "post_title":   {"index": "not_analyzed", "type": "string"},
          "post_title_a": {"type": "string"}
        }
      }
    }'

  • Hive table DDL

    drop table all_blog_posts_es;
    CREATE external TABLE all_blog_posts_es(
    ts_epoch bigint ,
    post_title string ,
    post_title_a string ,
    post_author string ,
    url string ,
    post_type string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/posts')
    ;
    
    drop table all_blog_tweets_es;
    CREATE EXTERNAL TABLE all_blog_tweets_es(
    tweet_type string ,
    tweet_url string ,
    tweet_author string ,
    tweet string ,
    tweet_analysed string ,
    ts_epoch bigint ,
    tweet_author_handle string ,
    tweet_author_followers string ,
    tweet_author_friends string ,
    url string ,
    post_author string ,
    post_title string ,
    post_title_a string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/tweets')
    ;
    
    drop table all_blog_apache_es;
    CREATE EXTERNAL TABLE all_blog_apache_es(
    host string ,
    identity string ,
    user string ,
    ts_epoch bigint ,
    http_call string ,
    url string ,
    http_status string ,
    status string ,
    size string ,
    referer string ,
    agent string ,
    post_author string ,
    post_title string ,
    post_title_a string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/apache');

  • Hive DML – load data to Elasticsearch

    insert into table all_blog_posts_es
    select unix_timestamp(post_date) * 1000 as post_date_epoch,title,title,author,
    regexp_extract(generated_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)
    ,post_type
    from posts
    where post_date is not null
    ;
    
    insert overwrite table all_blog_tweets_es
    select x.*,p.author,p.title
    from (
    select 'tweets'
    ,t.url as tweet_url
    ,t.author
    ,t.content as tweet
    ,t.content as tweet_analyzed
    ,unix_timestamp(concat(concat(concat(concat(regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-'),case regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} ',''),' .*$','') when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-'),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))) * 1000 as ts_epoch
    ,t.author_handle
    ,t.author_followers
    ,t.author_friends
    ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
    from tweets t lateral view explode (referenced_urls) refs as ref_url
    where t.author_followers is not null
    and ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*'
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
    
    insert overwrite table all_blog_tweets_es
    select x.*,p.author,p.title
    from (
    select 'retweets'
    ,t.url as tweet_url
    ,t.author
    ,t.content as tweet
    ,t.content as tweet_analyzed
    ,unix_timestamp(concat(concat(concat(concat(regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-'),case regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} ',''),' .*$','') when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-'),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))) * 1000 as ts_epoch
    ,t.author_handle
    ,t.author_followers
    ,t.author_friends
    ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
    from retweets t lateral view explode (referenced_urls) refs as ref_url
    where t.author_followers is not null
    and ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*'
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
    
    insert into table all_blog_apache_es
    select x.*,p.author,p.title,p.title
    from (
    select
    a.host,a.identity,a.user
    ,unix_timestamp(concat(concat(concat(concat(concat(concat(
    a.time_yearyyyy,'-')
    ,case a.time_monthmmm when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
    ,a.time_daydd,' ')
    ,a.time_hourhh,':')
    ,a.time_minmm,':')
    ,a.time_secss,'')
    ) * 1000 as ts_epoch
    ,a.http_call ,regexp_extract(a.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url,a.http_status ,a.status ,a.size ,a.referer ,a.agent
    from apachelog a
    where a.url regexp "^\\/\\d{4}\\/\\d{2}\\/.*"
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;

Summary

With the data loaded into Elasticsearch we’re now ready to start our analysis against it. Stay tuned for the final part in this short blog series to see how we use Kibana to do this.

Categories: BI & Warehousing

How to disable all database links

Yann Neuhaus - Tue, 2014-11-04 13:36

A frequent scenario: you refresh test from production with an RMAN duplicate. Once the duplicate is done, you probably change the dblinks so that they address the test environment instead of the production one. But are you sure that nobody will connect in between and risk accessing production from the test environment? You want to disable all db links until you have finished your post-duplicate tasks.

I know two solutions for that. The first one is for 12c only: you can add the NOOPEN clause to the DUPLICATE statement. The duplicate then leaves the database in MOUNT, and you can open it in restricted mode and do anything you want before opening it to your users.

But if you're still in 11g you want to be able to disable all database links before the open. That can be done at the instance level, by setting the open_links parameter to zero in your spfile.

Let's see an example:

SQL> alter system set open_links=0 scope=spfile;
System altered.

I restart my instance:

startup force
ORACLE instance started.
Total System Global Area  943718400 bytes
Fixed Size                  2931136 bytes
Variable Size             641730112 bytes
Database Buffers          188743680 bytes
Redo Buffers                5455872 bytes
In-Memory Area            104857600 bytes
Database mounted.
Database opened.

And here is the result:

SQL> select * from dual@LOOPBACK_DB_LINK;
select * from dual@LOOPBACK_DB_LINK
                   *
ERROR at line 1:
ORA-02020: too many database links in use

With that you prevent any connection through database links until you change them to address the test environment. Then:

SQL> alter system reset open_links;
System altered.

SQL> shutdown immediate;
SQL> startup

and then:

SQL> show parameter open_links

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
open_links                           integer     4
open_links_per_instance              integer     4

SQL> set autotrace on explain
SQL> select * from dual@LOOPBACK_DB_LINK;

D
-
X


Execution Plan
----------------------------------------------------------
Plan hash value: 272002086

----------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT REMOTE|      |     1 |     2 |     2   (0)| 00:00:01 |        |
|   1 |  TABLE ACCESS FULL     | DUAL |     1 |     2 |     2   (0)| 00:00:01 |    DB1 |
----------------------------------------------------------------------------------------

Note
-----
   - fully remote statement

SQL> set autotrace off
SQL> select * from V$DBLINK;

DB_LINK
----------------------------------------------------------------------------------------------------
  OWNER_ID LOG HET PROTOC OPEN_CURSORS IN_ UPD COMMIT_POINT_STRENGTH     CON_ID
---------- --- --- ------ ------------ --- --- --------------------- ----------
LOOPBACK_DB_LINK
         0 YES YES UNKN              0 YES YES                     1          0


Yes, that was also a good opportunity to see how to check dblink usage from the execution plan and from V$DBLINK.

Webcast: WebCenter Content & Imaging for Oracle Application Customers

WebCenter Team - Tue, 2014-11-04 11:15

WebCenter Content & Imaging for Oracle Application Customers

Thursday, November 13, 2014 
11:00 AM - 12:00 PM EST

Register here - http://www.sofbang.com/webcenter.aspx

Learn from industry experts how Oracle WebCenter extends and streamlines document management.

Join Oracle & Sofbang on November 13th for an informative free webinar on WebCenter Content and Imaging.

WebCenter Content is Oracle’s best-of-breed, comprehensive Enterprise Content Management (ECM) system. It gives users everything they need to create and manage a wide range of content across the enterprise. It also offers an enterprise-class solution for centralized imaging and capture.

Key Benefits of WebCenter
  • Provides fast ROI by eliminating paper, automating business processes & reducing time and labor costs associated with manual data entry
  • Provides ongoing ROI by enabling wide scale enterprise imaging use and increasing delivery of data to back-office applications
  • Maximizes existing investments in Oracle FMW & applications
  • Safe, strategic investment as the default imaging solution for the next generation of Oracle applications
Register here - http://www.sofbang.com/webcenter.aspx

Oracle DBA job in Tempe, Arizona

Bobby Durrett's DBA Blog - Tue, 2014-11-04 10:16

We still have a position open on our Oracle database team here in Tempe, Arizona.  Here is the link with an updated job description: url

We have a great team and would love to have a new member join us.

-Bobby

Categories: DBA Blogs

Git for PL/SQL

Gerger Consulting - Tue, 2014-11-04 07:42
Gitora, the free version control system for PL/SQL, is launching in early December '14. Gitora hooks Git up to the Oracle database and helps you manage your PL/SQL code easily. Sign up at http://www.gitora.com to get notified when the product launches.

Please share this news in your social networks and help us spread the word.

Thank you.
Categories: Development

Three Level Master Detail with Formspider and PL/SQL

Gerger Consulting - Tue, 2014-11-04 00:28
TANI, a subsidiary of the Koç Holding, provides value-added integrated marketing solutions for offline, digital and mobile platforms. TANI chose Formspider, the application development tool for PL/SQL developers, to implement the application that manages its core business.

Business Need

TANI wanted to improve the efficiency of their business unit and help them make better decisions. As part of this goal TANI decided to upgrade the current campaign management application which is used to manage the online banner ad  campaigns of their customers.

Specifically TANI’s goals in this upgrade were:

  • Increase the data entry/modification speed in the application
  • Improve the reporting capabilities in the application
  • Improve the application’s UI with a fresh and modern look.


The Challenge

The core campaign information in the database spans three tables which are tied to each other in a master-detail-detail relationship. For any campaign, the data in the master row, the detail rows and the detail-detail rows must be validated, committed or rolled back in the same logical transaction.

The current application did not support batch validation and commit of the entire campaign and therefore was prone to human errors.

The Solution

Since Formspider has an integrated model layer that supports transactions, building a master-detail-detail screen which enforces data validation over three tables was a breeze. The Formspider application easily validates and commits updates to a campaign in the same logical transaction preventing data entry errors.

The Campaign Edit Screen
The master-detail-detail screen also greatly improved the data entry speed of the application because the user could edit the entire campaign information in one screen.

Reporting capabilities of the application also increased significantly thanks to the Formspider grid and its built-in features such as ordering, hiding and filtering of columns.

The Campaign Search Screen with Enhanced Reporting Capabilities
The new application featured a brand new look & feel in harmony with TANI’s corporate colors. As with every Formspider application, the new campaign management application is a single-page application functioning 100% with AJAX, giving it the modern feel TANI desired.
New Fresh Look that matches TANI's Corporate Guidelines
Conclusion

Formspider enabled us to deliver TANI a high quality application that features a master-detail-detail data entry screen, with validations spanning multiple tables, at a fraction of the cost it would take using other technologies.
The application enabled TANI business units to work more efficiently and helped them make better decisions while serving their customers.
Categories: Development

PeopleSoft's paths to the Cloud - Part II

Javier Delgado - Tue, 2014-11-04 00:01
In my previous post, I've covered some ways in which cloud computing features could be used with PeopleSoft, particularly around Infrastructure as a Service (IaaS) and non-Production environments. Now, I'm going to discuss how cloud technologies bring value to PeopleSoft Production environments.

Gain Flexibility



Some of the advantages of hosting PeopleSoft Production environments using an IaaS provider were also mentioned in my previous article, as they are also valid for Non-Production environments:

  • Ability to adjust processing power (CPU) and memory according to peak usage.
  • Storage may be enlarged at any time to cope with increasing requirements.
  • Possibility of replicating the existing servers for contingency purposes.

In terms of cost, hosting the Production environment in IaaS may not always be cheaper than the on-premise alternative (this needs to be analyzed on a case by case basis). However, the possibility of adding more CPU, memory and storage on the fly gives IaaS solutions unprecedented flexibility. It is true that you can obtain similar flexibility with in-house virtualized environments, but not many in-house data centers have the horsepower available to Amazon, IBM or Oracle, to name a few.

Be Elastic



Adding additional power to the existing servers may not be the best way to scale up. An alternative is to add a new server to the PeopleSoft architecture. This type of architecture is called elastic (Amazon EC2 actually stands for Elastic Compute Cloud), as the architecture can elastically grow or shrink in order to adapt to the user load.

Many PeopleSoft customers use Production environments with multiple servers for high availability purposes. You may have two web servers, two application servers, two process schedulers, and so on. This architecture guarantees better system availability in case one of the nodes fails. Using an elastic architecture means that we can add, for instance, a third application server not only to increase redundancy, but also to improve application performance.

In order to implement an elastic architecture, you need to fulfill two requirements:

  1. You should be able to quickly deploy an additional instance of any part of the architecture. 
  2. Once the instance is created, it should be plugged in the rest of the components, without disrupting the system availability.

The first point is easily covered by creating an Amazon AMI which can be instantiated at any moment. I've discussed the basics about AMIs in my previous post, but there is plenty of information from Amazon.

The second point is a bit trickier. Let's assume we are adding a new application server instance. If you do not declare this application server in the web server's configuration.properties file, it will not be used.

Of course you can do this manually, but my suggestion is that you try to automate these tasks, as it is this automation which will eventually bring elasticity to your architecture. You need to plan the automation not only for growing the architecture, but also for shrinking it (in case you covered a usage peak by adding instances and then want to go back to the original situation).

At BNB we have built a generic elastic architecture, covering all layers of a normal PeopleSoft architecture. If you are planning to move to a cloud infrastructure and you need assistance, we would be happy to help.

Coming Next...

In my next post on this topic, I will cover how Database as a Service could be used to host PeopleSoft databases and what value it brings to PeopleSoft customers.

Starting a Pivotal GemFireXD server from Java

Pas Apicella - Mon, 2014-11-03 21:09
The FabricServer interface provides an easy way to start an embedded GemFire XD server process in an existing Java application.

In short, the code below will get you started. Use this in DEV/TEST scenarios, not for production use.
  
package pivotal.au.gemfirexd.demos.startup;

import com.pivotal.gemfirexd.FabricServer;
import com.pivotal.gemfirexd.FabricServiceManager;

import java.sql.SQLException;
import java.util.Properties;

public class StartServer1
{
    public static void main(String[] args) throws SQLException, InterruptedException {
        // Obtain the singleton FabricServer instance for this JVM
        FabricServer server = FabricServiceManager.getFabricServerInstance();

        // Boot properties: server group membership, no persistent data dictionary,
        // working directory for this member, and host data on this node
        Properties serverProps = new Properties();
        serverProps.setProperty("server-groups", "mygroup");
        serverProps.setProperty("persist-dd", "false");
        serverProps.setProperty("sys-disk-dir", "./gfxd/server1");
        serverProps.setProperty("host-data", "true");

        // Start the embedded server and join the distributed system
        server.start(serverProps);

        // Listen for thin-client (JDBC/ODBC) connections on port 1527
        server.startNetworkServer("127.0.0.1", 1527, null);

        // Keep the JVM alive so the server stays up
        Object lock = new Object();
        synchronized (lock) {
            while (true) {
                lock.wait();
            }
        }
    }
}
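
Once the network server is listening you can attach a thin client over JDBC. The sketch below is not from the original post: it assumes the GemFire XD client jar is on the classpath, that the thin-client URL format jdbc:gemfirexd://host:port/ applies, and it uses a query against the SYS.MEMBERS system table purely as an illustrative way to prove the connection works; the class name is made up.

package pivotal.au.gemfirexd.demos.startup;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryServer1
{
    public static void main(String[] args) throws Exception {
        // Thin-client connection to the network server started above
        try (Connection conn = DriverManager.getConnection("jdbc:gemfirexd://127.0.0.1:1527/");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select id from sys.members")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one row per cluster member
            }
        }
    }
}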

More Information

http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#developers_guide/topics/server-side/fabricserver.html
Categories: Fusion Middleware

RDX Services: Optimization [VIDEO]

Chris Foot - Mon, 2014-11-03 15:57

Transcript

Hi, welcome to RDX. When searching for a database administration service, it's important to look for a company that prioritizes performance, security and availability.

How does RDX deliver such a service? First, we assess all vulnerabilities and drawbacks that are preventing your environments from operating efficiently. Second, we make any applicable changes that will ensure your business software is running optimally. From there, we regularly conduct quality assurance audits to prevent any performance discrepancies from arising. 

In addition, we offer 24/7 support for every day of the year. We recognize that systems need to remain online on a continuous basis, and we're committed to making sure they remain accessible. 

Thanks for watching!

The post RDX Services: Optimization [VIDEO] appeared first on Remote DBA Experts.

Analytics with Kibana and Elasticsearch through Hadoop – part 1 – Introduction

Rittman Mead Consulting - Mon, 2014-11-03 15:21
Introduction

I’ve recently started learning more about the tools and technologies that fall under the loose umbrella term of Big Data, following a lot of the blogs that Mark Rittman has written, including getting Apache log data into Hadoop, and bringing Twitter data into Hadoop via MongoDB.

What I wanted to do was visualise the data I’d brought in, looking for patterns and correlations. Obviously the de facto choice at our shop would be Oracle BI, with which Mark previously demonstrated reporting on data in Hadoop through Hive and Impala. But this was more at the “Data Discovery” phase that is discussed in the new Information Management and Big Data Reference Architecture that Rittman Mead helped write with Oracle. I basically wanted a quick and dirty way to start chucking around columns of data without yet being ready to impose the structure of the OBIEE metadata model on it. One of the tools I’ve worked with recently is a visualisation tool called Kibana, which is part of the ELK stack (that I wrote about previously for use in building a monitoring solution for OBIEE). In this article we’ll take a look at making data available to Kibana and then the kind of analytics and visualisations you can do with it. In addition, we’ll see how loading the data into Elasticsearch has the benefit of extremely fast query times compared to going through Hive alone.

The Data

I’ve got three sources of data I’m going to work with, all related to the Rittman Mead website:

  • Website logs, from Apache webserver
  • Tweets about Rittman Mead blog articles, via Datasift
  • Metadata about blog posts, extracted from the WordPress MySQL database

At the moment I’ve focussed on just getting the data in, so it’s mostly coming from static files, with the exception of the tweets which are held in a noSQL database (MongoDB).

The Tools

This is where ‘big data’ gets fun, because instead of “Acme DI” and “Acme Database” and “Acme BI”, we have much more interesting – if somewhat silly – naming conventions, where the whackier the better. Here I’m using:

  • Kibana – data visualisation tool for Elasticsearch
  • Elasticsearch – data store & analytics / search engine
  • HDFS – Hadoop’s distributed file system
  • MongoDB – NoSQL database
  • Hive – enables querying data held in various places including HDFS (and Elasticsearch, and MongoDB) with a SQL-like query language
  • Beeline – Hive command line interface
  • Datasift – online service that streams tweets matching a given pattern to a nominated datastore (such as MongoDB)
  • mongo-hadoop – a connector for MongoDB to Hadoop including Hive
  • elasticsearch-hadoop – a connector for Elasticsearch to Hadoop including Hive

Kibana only queries data held in Elasticsearch, which acts as both the data store and the analytics engine. There are various ways to get data into Elasticsearch directly from source but I’ve opted not to do that here, instead bringing it all in via HDFS and Hive. I’ve done that because my – albeit fairly limited – experience is that Elasticsearch is great once you’ve settled on your data and schema, but in the same way I’m not building a full OBIEE metadata model (RPD) yet, nor did I want to design my Elasticsearch schema up front and have to reload from source if it changed. Options for reprocessing and wrangling data once in Elasticsearch seem limited and complex, and by making all my data available through Hive first I could supplement it and mash it up as I wanted, loading it into Elasticsearch only when I had a chunk of data to explore. Another approach that I haven’t tried but could be useful if the requirement fits it would be to load the individual data elements directly into their own Elasticsearch area and then using the elasticsearch-hadoop connector run the required mashups with other data through Hive, loading the results back into Elasticsearch. It all depends on where you’re coming from with the data.

Overview

Here’s a diagram of what I’m building:

I’ll explain it in steps as follows:

  1. Loading the data and making it accessible through Hive
  2. Loading data from Hive to Elasticsearch
  3. Visualising and analysing data in Kibana
Getting the data into Hive

Strictly speaking we’re not getting the data into Hive, so much as making it available through Hive. Hive simply enables you to define and query tables sitting on top of data held in places including HDFS. The beauty of the Hadoop ecosystem is that you can physicalise data in a bunch of tools and the components will most often support interoperability with each other. It’s only when you get started playing with it that you realise how powerful this is.

The Apache log files and WordPress metadata suit themselves fairly well to a traditional RDBMS format of [de]normalised tables, so we can store them in HDFS with simple RDBMS tables defined on top through Hive. But the twitter data comes in JSON format (like this), and if we were going to store the Twitter data in a traditional RDBMS we’d have to work out how to explode the document into a normalised schema, catering for varying structures depending on the type of tweet and data payload within it. At the moment we just want to collect all the data that looks useful, and then look at different ways to analyse it afterwards. Instead of having to compromise one way (force a structure over the variable JSON) or another (not put a relational schema over obviously relational data) we can do both, and decide at run-time how to best use it. From there, we can identify important bits of data and refactor our design as necessary. This “schema on read” approach is one of the real essences of Hadoop and ‘big data’ in general.

So with that said, let’s see how we get the data in. This bit is the easy part of the article to write, because a lot of it is pretty much what Mark Rittman has already written up in his articles, so I’ll refer to those rather than duplicate here.

Apache log data

References:

I’ve used a variation on the standard Apache log SerDe that the interwebs offers, because I’m going to need to work with the timestamp quite closely (we’ll see why later) so I’ve burst it out into individual fields.

The DDL is:

CREATE EXTERNAL TABLE apachelog (
host STRING,    identity STRING,    user STRING,
time_dayDD STRING,  time_monthMMM STRING,   time_yearYYYY STRING,
time_hourHH STRING, time_minmm STRING,  time_secss STRING,  time_tzZ STRING,
http_call STRING,   url STRING, http_status STRING, status STRING,  size STRING,    referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(\\d{2})\\/(\\w{3})\\/(\\d{4}):(\\d{2}):(\\d{2}):(\\d{2}) (.*?)\\] \\\"(\\w*) ([^ ]*?)(?:\\/)? ([^ \\\"]*)\\\" (\\d*) (\\d*) \\\"(.*?)\\\" \\\"(.*?)\\\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s")
STORED AS TEXTFILE LOCATION '/user/oracle/apache_logs';

The EXTERNAL is important on the table definition as it stops Hive moving the HDFS files into its own area on HDFS. If Hive does move the files it is annoying if you want to also access them through another program (or Hive table), and downright destructive if you DROP the table since it’ll delete the HDFS files too – unless it’s EXTERNAL. Note the LOCATION must be an HDFS folder, even if it just holds one file.
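
If you want to double-check how a table has been defined, DESCRIBE FORMATTED reports – amongst plenty of other detail – a Table Type of MANAGED_TABLE or EXTERNAL_TABLE, along with the Location it points at:

0: jdbc:hive2://bigdatalite:10000> describe formatted apachelog;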

For building and testing the SerDe regex Rubular is most excellent, but note that it’s Java regex you’re specifying in the SerDe which has its differences from Python or Ruby regex that Rubular (and most other online regex testers) support. For the final validation of Java regex I use the slightly ugly but still useful regexplanet, which also gives you the fully escaped version of your regex which you’ll need to use for the actual Hive DDL/DML.

A sample row from the apache log on disk looks like this:

74.208.161.70 - - [12/Oct/2014:03:47:43 +0000] "GET /2014/09/sunday-times-tech-track-100/ HTTP/1.0" 301 247 "-" "-"

and now in Hive:

0: jdbc:hive2://bigdatalite:10000> !outputformat vertical
0: jdbc:hive2://bigdatalite:10000> select * from apachelog limit 1;
host           74.208.161.70
identity       -
user           -
time_daydd     12
time_monthmmm  Oct
time_yearyyyy  2014
time_hourhh    03
time_minmm     47
time_secss     43
time_tzz       +0000
http_call      GET
url            /2014/09/sunday-times-tech-track-100/
http_status    HTTP/1.0
status         301
size           247
referer        -
agent          -

Twitter data

Reference:

The twitter data we’ve got includes the Hive ARRAY datatype for the collections of hashtag(s) and referenced url(s) from within a tweet. A point to note here is that the author_followers data appears in different locations of the JSON document depending on whether it’s a retweet or not. I ended up with two variations of this table and a UNION on top.
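
As a sketch of that union (the retweets table is assumed to expose the same column list as tweets, and older Hive versions need the UNION ALL wrapped in a subquery; the view name here is just for illustration):

CREATE VIEW all_tweets AS
SELECT u.* FROM (
    SELECT id, url, author, content, created_at, hashtags, referenced_urls,
           author_handle, author_followers, author_friends
    FROM   tweets
    UNION ALL
    SELECT id, url, author, content, created_at, hashtags, referenced_urls,
           author_handle, author_followers, author_friends
    FROM   retweets
) u;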

The table is mapped on data held in MongoDB and as with the HDFS data above the EXTERNAL is crucial to ensure you don’t trash your data when you drop your table.

CREATE EXTERNAL TABLE tweets
(
id string,
url string,
author string,
content string,
created_at string,
hashtags ARRAY<string>,
referenced_urls ARRAY<string>,
sentiment STRING,
author_handle string,
author_id string,
author_followers string,
author_friends string
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","url":"interaction.interaction.link","author":"interaction.interaction.author.name","content":"interaction.interaction.content","created_at":"interaction.interaction.created_at","hashtags":"interaction.interaction.hashtags","referenced_urls":"interaction.links.url","sentiment":"interaction.salience.content.sentiment","author_handle":"interaction.interaction.author.username","author_id":"interaction.interaction.author.id","author_followers":"interaction.twitter.user.followers_count","author_friends":"interaction.twitter.user.friends_count"}')
TBLPROPERTIES('mongo.uri'='mongodb://bigdatalite.localdomain:27017/rm_tweets.rm_tweets')
;

The other point to note is that we’re now using mongo-hadoop for Hive to connect to MongoDB. I found that I had to first build the full set of jar files by running ./gradlew jar -PclusterVersion='cdh5', and also download the MongoDB java driver, before copying the whole lot into /usr/lib/hadoop/lib. This is what I had by the end of it:

[oracle@bigdatalite mongo-hadoop-r1.3.0]$ ls -l /usr/lib/hadoop/lib/mongo-*
-rw-r--r--. 1 root root 105446 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-core-1.3.0.jar
-rw-r--r--. 1 root root  21259 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-hive-1.3.0.jar
-rw-r--r--. 1 root root 723219 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-pig-1.3.0.jar
-rw-r--r--. 1 root root    261 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-r1.3.0.jar
-rw-r--r--. 1 root root 697644 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-streaming-1.3.0.jar
-rw-r--r--. 1 root root 591189 Oct 24 00:44 /usr/lib/hadoop/lib/mongo-java-driver-2.12.4.jar
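
For reference, the build-and-copy steps described above look roughly like this (treat it as a sketch – the jar locations under build/libs and where you downloaded the driver to will depend on your environment):

# Build the mongo-hadoop connector jars for CDH5
cd mongo-hadoop-r1.3.0
./gradlew jar -PclusterVersion='cdh5'

# Copy the connector jars plus the MongoDB java driver into the Hadoop lib folder
sudo cp $(find . -name 'mongo-hadoop-*.jar') /usr/lib/hadoop/lib/
sudo cp ~/Downloads/mongo-java-driver-2.12.4.jar /usr/lib/hadoop/lib/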

After all that, the data as it appears in Hive looks like this:

id                5441097d591f90cf2c8b45a1
url               https://twitter.com/rmoff/status/523085961681317889
author            Robin Moffatt
content           Blogged: Using #rlwrap with Apache #Hive #beeline for improved readline functionality http://t.co/IoMML2UDxp
created_at        Fri, 17 Oct 2014 12:19:46 +0000
hashtags          ["rlwrap","Hive","beeline"]
referenced_urls   ["http://www.rittmanmead.com/2014/10/using-rlwrap-with-apache-hive-beeline-for-improved-readline-functionality/"]
sentiment         4
author_handle     rmoff
author_id         82564066
author_followers  790
author_friends    375

For reference, without the mongo-hadoop connectors I was getting the error

Error in loading storage handler.com.mongodb.hadoop.hive.MongoStorageHandler

and with them installed but without the MongoDB java driver I got:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON (state=08S01,code=1)
Caused by: java.lang.ClassNotFoundException: com.mongodb.util.JSON

WordPress metadata

WordPress holds its metadata in a MySQL database, so it’s easy to extract out:

  1. Run a query in MySQL to generate the CSV export files, such as:

    SELECT p.ID, p.POST_TITLE,p.POST_DATE_GMT,
           p.POST_TYPE,a.DISPLAY_NAME,p.POST_NAME,
           CONCAT('/', DATE_FORMAT(POST_DATE_GMT, '%Y'), '/', LPAD(
           DATE_FORMAT(POST_DATE_GMT, '%c'), 2, '0'), '/', p.POST_NAME) AS
           generated_url
    FROM   posts p
           INNER JOIN users a
                   ON p.POST_AUTHOR = a.ID
    WHERE  p.POST_TYPE IN ( 'page', 'post' )
           AND p.POST_STATUS = 'publish' 
    into outfile '/tmp/posts.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';

  2. Copy the CSV file to your Hadoop machine, and copy it onto HDFS. Make sure each type of data goes in its own HDFS folder:

    hadoop fs -mkdir posts
    hadoop fs -copyFromLocal /tmp/posts.csv posts

  3. Define the Hive table on top of it:

    CREATE EXTERNAL TABLE posts 
    ( post_id string,title string,post_date string,post_type string,author string,url string ,generated_url string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
    "input.regex" = "^(\\d*),\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\"",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s")
    location '/user/oracle/posts'
    ;

Rinse & repeat for the category data, and post->category relationships.
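
For completeness, the category tables might be defined along these lines – a sketch based on the columns shown below; the input.regex would need adjusting to however your CSV export quotes its fields:

CREATE EXTERNAL TABLE categories
( category_id string,cat2_id string,category_name string,category_code string,catslug string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\d*),(\\d*),\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s")
location '/user/oracle/categories'
;

CREATE EXTERNAL TABLE post_cats
( post_id string,category_id string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\d*),(\\d*)",
"output.format.string" = "%1$s %2$s")
location '/user/oracle/post_cats'
;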

The data once modelled in Hive looks like this:

0: jdbc:hive2://bigdatalite:10000> select * from posts limit 1;
post_id        788
title          Blog
post_date      2007-03-07 17:45:07
post_type      page
author         Mark Rittman
url            blog
generated_url  /2007/03/blog

0: jdbc:hive2://bigdatalite:10000> select * from categories limit 1;
category_id    5
cat2_id        5
category_name  category
category_code  BI (General)
catslug        bi

0: jdbc:hive2://bigdatalite:10000> select * from post_cats limit 5;
post_id      8046
category_id  1

The WordPress metadata quite obviously joins together, as it is already from the relational schema in which it was held on MySQL. Here is an example of where “schema on read” comes into play, because you could look at the above three tables (posts / post_cats / categories) and conclude it was redundant to export all three from WordPress and instead a single query listing posts and their respective category would be sufficient. But some posts have more than one category, which then leads to a design/requirements decision. Either we retain one row per post – and collapse down the categories, but in doing so lose the ability to easily treat categories as individual data – or we have one row per post/category, and end up with multiple rows per post, which complicates matters if we’re doing a simple count of posts. So we bring it all in raw from source, and then decide how we’re going to use it afterwards.
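
For example, if we do want a single row per post with its categories collapsed down, we can do that at query time with collect_set() rather than baking it into the export – a sketch using the three tables above:

SELECT p.post_id, p.title, collect_set(c.category_code) AS post_categories
FROM   posts p
       INNER JOIN post_cats  pc ON p.post_id      = pc.post_id
       INNER JOIN categories c  ON pc.category_id = c.category_id
GROUP BY p.post_id, p.title;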

Bringing the data together

At this point I have six tables in Hive that I can query (albeit slowly) with HiveQL – a close relation to SQL with a few interesting differences – running through the Hive client Beeline. The data is tweets, website visits, and details about the blog posts themselves.

0: jdbc:hive2://bigdatalite:10000> show tables;
+------------------------+
|        tab_name        |
+------------------------+
| apachelog              |
| categories             |
| post_cats              |
| posts                  |
| retweets               |
| tweets                 |
+------------------------+
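
For reference, the Beeline sessions shown in these listings can be started with something like this (the JDBC URL is the one shown in the prompt; the username is an assumption for the BigDataLite VM):

[oracle@bigdatalite ~]$ beeline -u jdbc:hive2://bigdatalite:10000 -n oracle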

As well as time, the other common element running throughout all the data is the blog article URL, whether it is a post, a visit to the website, or a tweet about it. But to join on it is not quite as simple as you’d hope, because all the following are examples of recorded instances of the data for the same blog post:

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So whether it’s querying the data within Hive, or loading it joined together to another platform, we need to be able to unify the values of this field.

Tangent: RegEx

And now it’s time, if you hadn’t already for your SerDe against the Apache log file, to really immerse yourself in Regular Expressions (RegEx). Part of the “schema on read” approach is that it can get messy. You need to juggle and wrangle and munge data in ways that it really might not want to go, and RegEx is an essential tool with which to do this. Regex isn’t specific to Hadoop – it’s used throughout the computing world.

My journey with regex over quite a few years in computing has gone in stages something like this:

  1. To be a fully rounded geek, I should learn regex. Looks up regex. Hmm, looks complicated….Squirrel!
    1. To be a fully round (geddit?!) geek, I should keep eating these big breakfasts
  2. I’ve got a problem, I’ve got a feeling regex will help me. But my word it looks complicated … I’ll just do it by hand.
  3. I’ve got another problem, I need to find this text in a file but with certain patterns around it. Here’s a regex I found on google. Neat!
  4. Hmmm another text matching problem, maybe I should really learn regex instead of googling it to death each time
  5. Mastered the basic concepts of regex
  6. Still a long way to go…

If you think you’ll nail RegEx overnight, you won’t (or at least, you’re a better geek than me). It’s one of those techniques, maybe a bit like SQL, that to fully grok takes a period of exposure and gradually increasing usage, before you have an “ah hah!” moment. There’s a great site explaining regex here: www.regular-expressions.info. My best advice is to take a real example text that you want to work with (match on, replace bits of, etc), and stick it in one of these parsers and experiment with the code:

Oh and finally, watch out for variations in regex – what works in a Java-based program (most of the Hadoop world) may not in Python, and vice versa. The same goes for PHP, Ruby, and so on – they all have different regex engines that may or may not behave as you’d expect.

Back on track : joining data on non-matching columns

So to recap, we want to be able to analyse our blog data across tweets, site hits and postings, using the common field of the post URL, which from the various sources can look like any of the following (and more):

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So out comes the RegEx. First off, we’ll do the easy one – strip the http:// and server bit. Using the Hive function REGEXP_REPLACE, we can do this in the query:

regexp_replace(ref_url,'http:\\/\\/www.rittmanmead.com','')

This means: take the ref_url column and if you find http://www.rittmanmead.com then replace it with nothing, i.e. delete it. The two backslashes before each forward slash simply escape them, since a forward slash on its own has a special meaning in regex. Just to keep you on your toes – Java regex requires double backslash escaping, but all other regex (including the online parser I link to below) uses a single one.

So now our list of possible join candidates has shrunk by one, to look like this:

/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

The variation, as you can see, is whether there is a trailing forward slash (/) after the post ‘slug’, and whether there is additional cruft after that too (feed, foobar+foobar, etc). So let’s build it up a piece at a time. For each one, I’ve linked to an online parser that you can use to see it in action.

  1. We’ll match on the year and month (/2014/01/) because they’re a fixed pattern, so using \d to match digits and {x} to match x repetitions: (see example on Rubular.com)

    \/\d{4}\/\d{2}\/

    This will match /2014/01/.

  2. Now we need to match the slug, but we’re going to ditch the forward slash suffix if there is one. This is done with two steps.

    First, we define a “match anything except x” group, which is what the square brackets (group) and the caret ^ (negate) do, and in this case x is the forward slash character, escaped.

    Secondly, the plus symbol + tells regex to match at least one repetition of the preceding group – i.e. any character that is not a forward slash. (example)

    [^\/]+

    Combined with the above regex from the first step we will now match /2014/01/automated-regression-testing-for-obiee.

  3. The final step is to turn the previous REGEXP_REPLACE on its head: instead of replacing content that we don’t want from the string, we’ll extract the content that we do want, using a regex capture group, which is defined by round brackets (parentheses, just like these). We’ve now brought in a couple of extra bits to make it hang together, seen in the completed regex here:

    \S*(\/\d{4}\/\d{2}\/[^\/]+).*$

    1. The \S* at the beginning matches any run of non-whitespace characters, which takes the place of the previous regex replace we were doing to strip out the http://www.rittmanmead.com
    2. After the capture group, which is the content from steps one and two above, surrounded by parentheses (\/\d{4}\/\d{2}\/[^\/]+), there is a final .* to match anything else that might be present (e.g. trailing forward slash, foobar, etc)

    Now all we need to do is escape it for Java regex, and stick it in the Hive REGEXP_EXTRACT function, specifying 1 as the capture group number to extract: (example)

    regexp_extract(url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)

So now all our URLs will look like this, regardless of whether they’re from tweet data, website hits, or wordpress:

/2014/01/automated-regression-testing-for-obiee

Which is nice, because it means we can use it as the common join in our queries. For example, to look up the title of the blog post that someone has tweeted about, and who wrote the post:

SELECT
x.author AS tweet_author, 
x.tweet ,
x.tweet_url, 
x.created_at, 
p.author as post_author, 
p.title as post_title
FROM            ( 
SELECT 'tweets' , 
t.url AS tweet_url , 
t.author , 
t.content AS tweet , 
t.created_at ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
FROM   tweets t 
LATERAL VIEW EXPLODE (referenced_urls) refs as ref_url 
WHERE  t.author_followers IS NOT NULL 
AND    ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*' ) x 
INNER JOIN posts p 
ON regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url ;

[...]
tweet_author  Dain Hansen
tweet         Like a Big Data kid in a Hadoop candy store: Presos on #bigdata for BI, DW, Data Integration http://t.co/06DLnvxINx via @markrittman
tweet_url     https://twitter.com/dainsworld/status/520463199447961600
created_at    Fri, 10 Oct 2014 06:37:51 +0000
post_author   Mark Rittman
post_title    Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations

tweet_author  Robin Moffatt
tweet         Analyzing Twitter Data using Datasift, MongoDB and Pig http://t.co/h67cd4kJo2 via @rittmanmead
tweet_url     https://twitter.com/rmoff/status/524197131276406785
created_at    Mon, 20 Oct 2014 13:55:09 +0000
post_author   Mark Rittman
post_title    Analyzing Twitter Data using Datasift, MongoDB and Pig
[...]

Note here also the use of LATERAL VIEW EXPLODE () as a way of denormalising out the Hive ARRAY of referenced url(s) in the tweet so there is one row returned per value.
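
As a minimal illustration of the explode, this returns one row per referenced URL for each tweet:

SELECT t.id, ref_url
FROM   tweets t
       LATERAL VIEW EXPLODE (referenced_urls) refs AS ref_url;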

Summary

We’ve got our three sources of data available to us in Hive, and can query across them. Next we’ll take a look at loading the data into Elasticsearch, taking advantage of our conformed url column to join data that we load. Stay tuned!

Categories: BI & Warehousing

Filtering PeopleTools SQL from Performance Monitor Traces

David Kurtz - Mon, 2014-11-03 15:01

I have been doing some on-line performance tuning on a PeopleSoft Financials system using PeopleSoft Performance Monitor (PPM).  End-users have collected verbose PPM traces. Usually, when I use PPM in a production system, all the components are fully cached by the normal activity of the users (except when the application server caches have recently been cleared).  However, when working in a user test environment it is common to find that the components are not fully cached. This presents two problems.
  • The application servers spend quite a lot of time executing queries on the PeopleTools tables to load the components, pages and PeopleCode into their caches. We can see in the screenshot of the component trace that there is a warning message that component objects are not fully cached, and that these  cache misses skew timings.
  • In verbose mode, the PPM traces collect a lot of additional transactions capturing executions and fetches against PeopleTools tables. The PPM analytic components cannot always manage the resultant volume of transactions.
    Figure 1: Component trace as collected by PPM

    If I go further down the same page and look in the SQL Summary, I can see SQL operations against PeopleTools tables (they are easily identifiable in that they generally do not have an underscore in the third character). Not only are 5 of the top 8 SQL operations related to PeopleTools tables, we can also see that they account for over 13000 executions, which means there are at least 13000 rows of additional data to be read from PSPMTRANSHIST.
    Figure 2: SQL Summary of PPM trace with PeopleTools SQL

    When I open the longest running server round trip (this is also referred to as a Performance Monitoring Unit or PMU), I can only load 1001 rows before I get a message warning that the maximum row limit has been reached. The duration summary and the number of executions and fetches cannot be calculated and hence 0 is displayed.
    Figure 3: Details of longest PMU with PeopleTools SQL
    Another consequence of the PeopleTools data is that it can take a long time to open the PMU tree. There is no screenshot of the PMU tree here because in this case I had so much data that I couldn't open it before the transaction timed out!
    Solution

    My solution to this problem is to delete the transactions that relate to PeopleTools SQL and correct the durations, and the number of executions and fetches, held in summary transactions. The rationale is that these transactions would not normally occur in significant quantities in a real production system, and there is not much I can do about them when they do.
    The first step is to clone the trace. I could work on the trace directly, but I want to preserve the original data.
    PPM transactions are held in the table PSPMTRANSHIST. They have a unique identifier PM_INSTANCE_ID. A single server round trip, also called a Performance Monitoring Unit (PMU), will consist of many transactions. They can be shown as a tree and each transaction has another field PM_PARENT_INST_ID which holds the instance of the parent. This links the data together and we can use hierarchical queries in Oracle SQL to walk the tree. Another field PM_TOP_INST_ID identifies the root transaction in the tree.
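    For example, the whole tree under one server round trip can be walked with a hierarchical query such as this (a sketch – :top_inst_id would be the PM_TOP_INST_ID of the PMU of interest):

    SELECT LEVEL, h.pm_instance_id, h.pm_parent_inst_id, h.pm_trans_defn_id, h.pm_trans_duration
    FROM   pspmtranshist h
    START WITH h.pm_instance_id = :top_inst_id
    CONNECT BY PRIOR h.pm_instance_id = h.pm_parent_inst_id
    ORDER SIBLINGS BY h.pm_instance_id;
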
    Cloning a PPM trace is simply a matter of inserting data into PSPMTRANSHIST. However, when I clone a PPM trace I have to make sure that the instance numbers are distinct but still link correctly. In my system I can take a very simple approach: all the instance numbers actually collected by PPM are greater than 1E16, so I will simply use the modulus function to consistently alter the instances to be different. This approach may break down in future, but it will do for now.
    On an Oracle database, PL/SQL is a simple and effective way to write this kind of procedural process. I have written two anonymous blocks of code.
    Note that the cloned trace will be purged from PPM like any other data by the delivered PPM archive process.

    REM xPT.sql
    BEGIN --duplicate PPM traces
      FOR i IN (
        SELECT h.*
        FROM   pspmtranshist h
        WHERE  pm_perf_trace != ' ' /*rows must have a trace name*/
        -- AND  pm_perf_trace = '9b. XXXXXXXXXX' /*I could specify a specific trace by name*/
        AND    pm_instance_id > 1E16 /*only look at instance > 1e16 so I do not clone cloned traces*/
      ) LOOP
        INSERT INTO pspmtranshist
        (PM_INSTANCE_ID, PM_TRANS_DEFN_SET, PM_TRANS_DEFN_ID, PM_AGENTID, PM_TRANS_STATUS,
         OPRID, PM_PERF_TRACE, PM_CONTEXT_VALUE1, PM_CONTEXT_VALUE2, PM_CONTEXT_VALUE3,
         PM_CONTEXTID_1, PM_CONTEXTID_2, PM_CONTEXTID_3, PM_PROCESS_ID, PM_AGENT_STRT_DTTM,
         PM_MON_STRT_DTTM, PM_TRANS_DURATION, PM_PARENT_INST_ID, PM_TOP_INST_ID, PM_METRIC_VALUE1,
         PM_METRIC_VALUE2, PM_METRIC_VALUE3, PM_METRIC_VALUE4, PM_METRIC_VALUE5, PM_METRIC_VALUE6,
         PM_METRIC_VALUE7, PM_ADDTNL_DESCR)
        VALUES
        (MOD(i.PM_INSTANCE_ID,1E16) /*apply modulus to instance number*/
        ,i.PM_TRANS_DEFN_SET, i.PM_TRANS_DEFN_ID, i.PM_AGENTID, i.PM_TRANS_STATUS,
         i.OPRID,
         SUBSTR('xPT'||i.PM_PERF_TRACE,1,30) /*adjust trace name*/,
         i.PM_CONTEXT_VALUE1, i.PM_CONTEXT_VALUE2, i.PM_CONTEXT_VALUE3,
         i.PM_CONTEXTID_1, i.PM_CONTEXTID_2, i.PM_CONTEXTID_3, i.PM_PROCESS_ID, i.PM_AGENT_STRT_DTTM,
         i.PM_MON_STRT_DTTM, i.PM_TRANS_DURATION,
         MOD(i.PM_PARENT_INST_ID,1E16), MOD(i.PM_TOP_INST_ID,1E16), /*apply modulus to parent and top instance number*/
         i.PM_METRIC_VALUE1, i.PM_METRIC_VALUE2, i.PM_METRIC_VALUE3, i.PM_METRIC_VALUE4, i.PM_METRIC_VALUE5,
         i.PM_METRIC_VALUE6, i.PM_METRIC_VALUE7, i.PM_ADDTNL_DESCR);
      END LOOP;
      COMMIT;
    END;
    /
    Now I will work on the cloned trace. I want to remove certain transactions.
    • PeopleTools SQL. Metric value 7 reports the SQL operation and SQL table name. So if the first word is SELECT and the second word is a PeopleTools table name then it is a PeopleTools SQL operation. A list of PeopleTools tables can be obtained from the object security table PSOBJGROUP.
    • Implicit Commit transactions. This is easy - it is just transaction type 425. 
    Having deleted the PeopleTools transactions, I must also
    • Correct the transaction duration for any parents of the transaction. I work up the hierarchy of transactions and deduct the duration of the transaction that I am deleting from each of its parents.
    • Transaction types 400, 427 and 428 all record PeopleTools SQL time (metric 66). As I work up the hierarchy, whenever I reach a parent of one of these types I also deduct the duration of the deleted transaction from its PeopleTools SQL time metric.
    • Delete any children of the transactions that I delete. 
    • I must also count each PeopleTools SQL Execution transaction (type 408) and each PeopleTools SQL Fetch transaction (type 414) that I delete. These counts are also deducted from the summaries on the parent transaction 400. 
    The summaries in transaction 400 are used on the 'Round Trip Details' components, and if they are not adjusted you can get misleading results. Without the adjustments, I have encountered PMUs where more than 100% of the total duration is spent in SQL - which is obviously impossible.
    Although this technique of first cloning the whole trace and then deleting the PeopleTools operations can be quite slow, it is not something that you are going to do very often. 
    REM xPT.sql
    REM (c)Go-Faster Consultancy Ltd. 2014
    set serveroutput on echo on
    DECLARE
    l_pm_instance_id_m4 INTEGER;
    l_fetch_count INTEGER;
    l_exec_count INTEGER;
    BEGIN /*now remove PeopleTools SQL transaction and any children and adjust trans durations*/
    FOR i IN (
    WITH x AS ( /*returns PeopleTools tables as defined in Object security*/
    SELECT o.entname recname
    FROM psobjgroup o
    WHERE o.objgroupid = 'PEOPLETOOLS'
    AND o.enttype = 'R'
    )
    SELECT h.pm_instance_id, h.pm_parent_inst_id, h.pm_trans_duration, h.pm_trans_defn_id
    FROM pspmtranshist h
    LEFT OUTER JOIN x
    ON h.pm_metric_value7 LIKE 'SELECT '||x.recname||'%'
    AND x.recname = upper(regexp_substr(pm_metric_value7,'[^ ,]+',8,1)) /*first word after select*/
    WHERE pm_perf_trace like 'xPT%' /*restrict to cloned traces*/
    -- AND pm_perf_trace = 'xPT9b. XXXXXXXXXX' /*work on a specific trace*/
    AND pm_instance_id < 1E16 /*restrict to cloned traces*/
    AND ( x.recname IS NOT NULL
    OR h.pm_trans_defn_id IN(425 /*Implicit Commit*/))
    ORDER BY pm_instance_id DESC
    ) LOOP
    l_pm_instance_id_m4 := TO_NUMBER(NULL);
     
        IF i.pm_parent_inst_id>0 AND i.pm_trans_duration>0 THEN
    FOR j IN(
    SELECT h.pm_instance_id, h.pm_parent_inst_id, h.pm_top_inst_id, h.pm_trans_defn_id
    , d.pm_metricid_3, d.pm_metricid_4
    FROM pspmtranshist h
    INNER JOIN pspmtransdefn d
    ON d.pm_trans_defn_set = h.pm_trans_defn_set
    AND d.pm_trans_defn_id = h.pm_trans_Defn_id
    START WITH h.pm_instance_id = i.pm_parent_inst_id
    CONNECT BY prior h.pm_parent_inst_id = h.pm_instance_id
    ) LOOP
    /*decrement parent transaction times*/
    IF j.pm_metricid_4 = 66 /*PeopleTools SQL Time (ms)*/ THEN --decrement metric 4 on transaction 400
    --dbms_output.put_line('ID:'||i.pm_instance_id||' Type:'||i.pm_trans_defn_id||' decrement metric_value4 by '||i.pm_trans_duration);
    UPDATE pspmtranshist
    SET pm_metric_value4 = pm_metric_value4 - i.pm_trans_duration
    WHERE pm_instance_id = j.pm_instance_id
    AND pm_trans_Defn_id = j.pm_trans_defn_id
    AND pm_metric_value4 >= i.pm_trans_duration
    RETURNING pm_instance_id INTO l_pm_instance_id_m4;
    ELSIF j.pm_metricid_3 = 66 /*PeopleTools SQL Time (ms)*/ THEN --SQL time on serialisation
    --dbms_output.put_line('ID:'||i.pm_instance_id||' Type:'||i.pm_trans_defn_id||' decrement metric_value3 by '||i.pm_trans_duration);
    UPDATE pspmtranshist
    SET pm_metric_value3 = pm_metric_value3 - i.pm_trans_duration
    WHERE pm_instance_id = j.pm_instance_id
    AND pm_trans_Defn_id = j.pm_trans_defn_id
    AND pm_metric_value3 >= i.pm_trans_duration;
    END IF;

    UPDATE pspmtranshist
    SET pm_trans_duration = pm_trans_duration - i.pm_trans_duration
    WHERE pm_instance_id = j.pm_instance_id
    AND pm_trans_duration >= i.pm_trans_duration;
    END LOOP;
    END IF;

    l_fetch_count := 0;
    l_exec_count := 0;
    FOR j IN( /*identify transaction to be deleted and any children*/
    SELECT pm_instance_id, pm_parent_inst_id, pm_top_inst_id, pm_trans_defn_id, pm_metric_value3
    FROM pspmtranshist
    START WITH pm_instance_id = i.pm_instance_id
    CONNECT BY PRIOR pm_instance_id = pm_parent_inst_id
    ) LOOP
    IF j.pm_trans_defn_id = 408 THEN /*if PeopleTools SQL*/
    l_exec_count := l_exec_count + 1;
    ELSIF j.pm_trans_defn_id = 414 THEN /*if PeopleTools SQL Fetch*/
    l_fetch_count := l_fetch_count + j.pm_metric_value3;
    END IF;
    DELETE FROM pspmtranshist h /*delete tools transaction*/
    WHERE h.pm_instance_id = j.pm_instance_id;
    END LOOP;

    IF l_pm_instance_id_m4 > 0 THEN
    --dbms_output.put_line('ID:'||l_pm_instance_id_m4||' Decrement '||l_exec_Count||' executions, '||l_fetch_count||' fetches');
    UPDATE pspmtranshist
    SET pm_metric_value5 = pm_metric_value5 - l_exec_count
    , pm_metric_value6 = pm_metric_value6 - l_fetch_count
    WHERE pm_instance_id = l_pm_instance_id_m4;
    l_fetch_count := 0;
    l_exec_count := 0;
    END IF;

    END LOOP;
    END;
    /
    Now, I have a second PPM trace that I can open in the analytic component.

    Figure 4: Original and Cloned PPM traces

    When I open the cloned trace, both timings in the duration summary have reduced as have the number of executions and fetches.  The durations of the individual server round trips have also reduced.
    Figure 5: Component Trace without PeopleTools transactions
    All of the PeopleTools SQL operations have disappeared from the SQL summary.
    Figure 6: SQL Summary of PPM trace after removing PeopleTools SQL transactions
    The SQL summary now only has 125 rows of data.
    Figure 7: SQL Summary of PMU without PeopleTools SQL
    Now, the PPM tree component opens quickly and without error.
    Figure 8: PMU Tree after removing PeopleTools SQL
    There may still be more transactions in a PMU than I can show in a screenshot, but I can now find the statement that took the most time quite quickly.

    Figure 9: Long SQL transaction further down same PMU tree
    Conclusions
    I think that it is reasonable and useful to remove PeopleTools SQL operations from a PPM trace.
    In normal production operation, components will mostly be cached, and this approach renders traces collected in non-production environments both usable in the PPM analytic components and more realistic for performance tuning. However, when deleting transactions from a PMU, it is essential that the summary data held in other transactions in the same PMU are also corrected, so that the metrics remain consistent.
    ©David Kurtz, Go-Faster Consultancy Ltd.

    Upgrades

    Jonathan Lewis - Mon, 2014-11-03 12:31

    One of the worst problems with upgrades is that things sometimes stop working. A particular nuisance is the execution plan that suddenly stops appearing, to be replaced by an alternative plan that is much less efficient.

    Apart from the nuisance of the time spent trying to force the old plan to re-appear, plus the time spent working out a way of rewriting the query when you finally decide the old plan simply isn’t going to re-appear, there’s also the worry about WHY the old plan won’t appear. Is it some sort of bug, is it that some new optimizer feature has disabled some older optimizer feature, or is it that someone in the optimizer group realised that the old plan was capable of producing the wrong results in some circumstances … it’s that last possibility that I find most worrying.

    Here’s an example that appeared recently on OTN that’s still got me wondering about the possibility of wrong results (in the general case). We start with a couple of tables, a view, and a pipelined function. This example is a simple model of the problem that showed up on OTN; it’s based on generated data so that anyone who wants to can play around with it to see if they can bypass the problem without making any significant changes to the shape of the code:

    
    create table t1
    as
    with generator as (
    	select	--+ materialize
    		rownum id
    	from dual
    	connect by
    		level <= 1e4
    )
    select
    	rownum			id,
    	rownum			n1,
    	mod(rownum,100)		n_100,
    	rpad('x',100)		padding
    from
    	generator	v1
    ;
    
    create table t2
    as
    with generator as (
    	select	--+ materialize
    		rownum id
    	from dual
    	connect by
    		level <= 1e4
    )
    select
    	rownum			id,
    	rownum			n1,
    	mod(rownum,100)		n_100,
    	rpad('x',100)		padding
    from
    	generator	v1
    ;
    
    alter table t2 add constraint t2_pk primary key(id);
    
    begin
    	dbms_stats.gather_table_stats(
    		ownname		 => user,
    		tabname		 =>'T1',
    		method_opt	 => 'for all columns size 1'
    	);
    
    	dbms_stats.gather_table_stats(
    		ownname		 => user,
    		tabname		 =>'T2',
    		method_opt	 => 'for all columns size 1'
    	);
    
    end;
    /
    
    create or replace type myScalarType as object (
            x int,
            y varchar2(15),
            d date
    )
    /
    
    create or replace type myArrayType as table of myScalarType
    /
    
    create or replace function t_fun1(i_in number)
    return myArrayType
    pipelined
    as
    begin
    	pipe row (myscalartype(i_in,     lpad(i_in,15),     trunc(sysdate) + i_in    ));
    	pipe row (myscalartype(i_in + 1, lpad(i_in + 1,15), trunc(sysdate) + i_in + 1));
    	return;
    end;
    /
    
    create or replace view v1
    as
    select
    	--+ leading(t2 x) index(t2)
    	x.x, x.y, x.d,
    	t2.id, t2.n1
    from
    	t2,
    	table(t_fun1(t2.n_100)) x
    where
    	mod(t2.n1,3) = 1
    union all
    select
    	--+ leading(t2 x) index(t2)
    	x.x, x.y, x.d,
    	t2.id, t2.n1
    from
    	t2,
    	table(t_fun1(t2.n_100)) x
    where
    	mod(t2.n1,3) = 2
    ;
    
    

    A key part of the problem is the UNION ALL view, where each subquery holds a join to a pipeline function. We’re about to write a query that joins to this view, and wants to push a join predicate into the view. Here’s the SQL:

    
    select
    	/*+ leading(t1 v1) use_nl(v1) */
    	v1.x, v1.y, v1.d,
    	v1.n1,
    	t1.n1
    from
    	t1,
    	v1
    where
    	t1.n_100 = 0
    and	v1.id = t1.n1
    ;
    
    

    You’ll notice that the join v1.id = t1.n1 could (in principle) be pushed inside the view to become t2.id = t1.n1 in the two branches of the UNION ALL; this would make it possible for the nested loop that I’ve hinted between t1 and v1 to operate efficiently – and in 11.1.0.7 this is exactly what happens:

    
    ------------------------------------------------------------------------------------------------
    | Id  | Operation                             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                      |        | 16336 |   733K|   123   (1)| 00:00:01 |
    |   1 |  NESTED LOOPS                         |        | 16336 |   733K|   123   (1)| 00:00:01 |
    |*  2 |   TABLE ACCESS FULL                   | T1     |   100 |   700 |    23   (5)| 00:00:01 |
    |   3 |   VIEW                                | V1     |   163 |  6357 |     1   (0)| 00:00:01 |
    |   4 |    UNION-ALL PARTITION                |        |       |       |            |          |
    |   5 |     NESTED LOOPS                      |        |  8168 |   103K|    16   (0)| 00:00:01 |
    |*  6 |      TABLE ACCESS BY INDEX ROWID      | T2     |     1 |    11 |     2   (0)| 00:00:01 |
    |*  7 |       INDEX UNIQUE SCAN               | T2_PK  |     1 |       |     1   (0)| 00:00:01 |
    |   8 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |       |       |            |          |
    |   9 |     NESTED LOOPS                      |        |  8168 |   103K|    16   (0)| 00:00:01 |
    |* 10 |      TABLE ACCESS BY INDEX ROWID      | T2     |     1 |    11 |     2   (0)| 00:00:01 |
    |* 11 |       INDEX UNIQUE SCAN               | T2_PK  |     1 |       |     1   (0)| 00:00:01 |
    |  12 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |       |       |            |          |
    ------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - filter("T1"."N_100"=0)
       6 - filter(MOD("T2"."N1",3)=1)
       7 - access("T2"."ID"="T1"."N1")
      10 - filter(MOD("T2"."N1",3)=2)
      11 - access("T2"."ID"="T1"."N1")
    
    

    For each row returned by the tablescan at line 2 we call the view operator at line 3 to generate a rowsource, but we can see in the predicate sections for lines 7 and 11 that the join value has been pushed inside the view, allowing us to access t2 through its primary key index. Depending on the data definitions, constraints, view definition, and version of Oracle, you might see the UNION ALL operator displaying the PARTITION option or the PUSHED PREDICATE option in cases of this type.

    So now we upgrade to 11.2.0.4 (probably any 11.2.x.x version) and get the following plan:

    
    ------------------------------------------------------------------------------------------------
    | Id  | Operation                             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                      |        |  1633K|    99M|   296K  (4)| 00:24:43 |
    |   1 |  NESTED LOOPS                         |        |  1633K|    99M|   296K  (4)| 00:24:43 |
    |*  2 |   TABLE ACCESS FULL                   | T1     |   100 |   700 |    23   (5)| 00:00:01 |
    |*  3 |   VIEW                                | V1     | 16336 |   909K|  2966   (4)| 00:00:15 |
    |   4 |    UNION-ALL                          |        |       |       |            |          |
    |   5 |     NESTED LOOPS                      |        |   816K|    10M|  1483   (4)| 00:00:08 |
    |*  6 |      TABLE ACCESS BY INDEX ROWID      | T2     |   100 |  1100 |   187   (2)| 00:00:01 |
    |   7 |       INDEX FULL SCAN                 | T2_PK  | 10000 |       |    21   (0)| 00:00:01 |
    |   8 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |  8168 | 16336 |    13   (0)| 00:00:01 |
    |   9 |     NESTED LOOPS                      |        |   816K|    10M|  1483   (4)| 00:00:08 |
    |* 10 |      TABLE ACCESS BY INDEX ROWID      | T2     |   100 |  1100 |   187   (2)| 00:00:01 |
    |  11 |       INDEX FULL SCAN                 | T2_PK  | 10000 |       |    21   (0)| 00:00:01 |
    |  12 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |  8168 | 16336 |    13   (0)| 00:00:01 |
    ------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - filter("T1"."N_100"=0)
       3 - filter("V1"."ID"="T1"."N1")
       6 - filter(MOD("T2"."N1",3)=1)
      10 - filter(MOD("T2"."N1",3)=2)
    
    

    In this plan the critical join predicate appears at line 3; the predicate hasn’t been pushed. On the other hand the index() hints in the view have, inevitably, been obeyed (resulting in index full scans), as has the use_nl() hint in the main query – leading to a rather more expensive and time-consuming execution plan.

    The first, quick, debugging step is simply to set the optimizer_features_enable back to 11.1.0.7 – with no effect; the second is to try adding the push_pred() hint to the query – with no effect; the third is to generate the outline section of the execution plans and copy the entire set of hints from the good plan into the bad plan, noting as we do so that the good plan actually uses the hint OLD_PUSH_PRED(@"SEL$1" "V1"@"SEL$1" ("T2"."ID")) – still no effect.
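
    For reference, here is a minimal sketch of the OFE test and of one way to pull the outline section; ADVANCED is simply the dbms_xplan format I would use to get the Outline Data section, and with null arguments display_cursor reports the last statement executed by the session:

    alter session set optimizer_features_enable = '11.1.0.7';

    select * from table(dbms_xplan.display_cursor(null, null, 'ADVANCED'));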

    Since I happen to know a few things about what is likely to appear in the 10053 (optimizer) trace file, my next step would be to flush the shared pool, enable the trace, and then check the trace file (using grep or find depending on whether I was running UNIX or Windows) for the phrase “JPPD bypassed”; this is what I got:

    
    test_ora_9897.trc:OJPPD:     OJPPD bypassed: View contains TABLE expression.
    test_ora_9897.trc:JPPD:     JPPD bypassed: View not on right-side of outer-join.
    test_ora_9897.trc:JPPD:     JPPD bypassed: View not on right-side of outer-join.
    
    

    So 11.1.0.7 had a plan that used the old_push_pred() hint, but 11.2.0.4 explicitly bypassed the option (the rubric near the top of the trace file translates OJPPD to “old-style (non-cost-based) JPPD”, where JPPD translates to “join predicate push-down”). It looks like the plan we got from 11.1.0.7 has been deliberately blocked in 11.2.0.4. So now it’s time to worry whether or not that means I could have been getting wrong results from 11.1.0.7.
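
    (For anyone who wants to repeat the check on their own system, here is a minimal sketch of the sequence; the tracefile_identifier is just a convenience I have added to make the trace file easy to find.)

    alter system flush shared_pool;
    alter session set tracefile_identifier = 'jppd_check';
    alter session set events '10053 trace name context forever, level 1';

    select
    	/*+ leading(t1 v1) use_nl(v1) */
    	v1.x, v1.y, v1.d,
    	v1.n1,
    	t1.n1
    from
    	t1,
    	v1
    where
    	t1.n_100 = 0
    and	v1.id = t1.n1
    ;

    alter session set events '10053 trace name context off';

    -- then, at the operating system prompt, something like:
    -- grep "JPPD bypassed" *jppd_check*.trc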

    In my test case, of course, I can bypass the problem by explicitly rewriting the query – but I’ll have to move the join with t1 inside the view for both subqueries; alternatively, given the trivial nature of the pipeline function, I could replace the table() operator with a join to another union all view. In real life such changes are not always so easy to implement.
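
    For what it's worth, here is a sketch of what the first of those rewrites might look like in this test case (hints omitted, and you would obviously want to confirm that it returns the same results and a sensible plan before relying on it):

    select
    	v.x, v.y, v.d, v.n1, v.t1_n1
    from	(
    	select	x.x, x.y, x.d, t2.id, t2.n1, t1.n1 t1_n1
    	from	t1, t2, table(t_fun1(t2.n_100)) x
    	where	t1.n_100 = 0
    	and	t2.id = t1.n1
    	and	mod(t2.n1,3) = 1
    	union all
    	select	x.x, x.y, x.d, t2.id, t2.n1, t1.n1 t1_n1
    	from	t1, t2, table(t_fun1(t2.n_100)) x
    	where	t1.n_100 = 0
    	and	t2.id = t1.n1
    	and	mod(t2.n1,3) = 2
    	) v
    ;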

    Footnote: the restriction is still in place on 12.1.0.2.

    Footnote 2: somewhere I’ve probably published a short note explaining that one of my standard pre-emptive strikes on an upgrade is to run the following command to extract useful information from the executable: "strings -a oracle | grep bypass": it can be very helpful to have a list of situations in which some query transformation is bypassed.

     


    Oracle Roundtables: Next Gen Digital Experience & Engagement (Dallas & Chicago)

    WebCenter Team - Mon, 2014-11-03 10:34
    Next Gen Digital Experience & Engagement

    Connecting Experiences to Outcomes

    The world has changed to one that’s always on, always-engaged, requiring organizations to rapidly become “digital businesses.” In order to thrive and survive in this new economy, having the right digital experience and engagement strategy and speed of execution is crucial. 

    But where do you start? How do you accelerate this transformation? 

    Attend this roundtable to hear directly from leading industry analysts from Forrester Research, Inc., Blast Radius, client companies, and solution experts as they outline the best practice strategies to seize the full potential of a digital experience and engagement platform. Gain insights on how your business can deliver exceptional and engaging digital experiences and drive the next wave of revenue growth, service excellence and business efficiency. 

    We look forward to your participation at the Solution Roundtable. 

    Register now for the November 12 event or call 1.800.820.5592 ext. 12830.

    Register now for the November 13 event or call 1.800.820.5592 ext. 12864.

    Dallas: November 12, 2014, 10:30 a.m. - 11:45 a.m.
    Renaissance Dallas, 2222 N. Stemmons Fwy., Dallas, TX 75207

    Chicago: November 13, 2014, 10:30 a.m. - 11:45 a.m.
    The Westin O'Hare, 6100 N River Rd, Rosemont, IL 60018

    Featuring:

    James L. McQuivey, Ph.D.
    Vice President, Principal Analyst serving CMO Professionals, Forrester

