Skip navigation.

BI & Warehousing

dbms_output.put_line

Chet Justice - Tue, 2013-03-19 20:08
I've been scratching my eyes out lately trying to reverse engineer some lots of PL/SQL.

One thing I've seen a lot of is calls to dbms_output.put_line. Fortunately, I've seen some dbms_application_info.set_module and other system calls too. But back to that first one.

1. When I used dbms_output, I would typically only use it in development. Once done, I would remove all calls to it, test and promote to QA. It would never survive the trip to production.
2. Typically, when I used it in development, I would tire of typing out d b m s _ o u t p u t . p u t _ l i n e so I would either a, create a standalone procedure or create a private procedure inside the package, something like this (standalone version).
CREATE OR REPLACE
PROCEDURE p( p_text IN VARCHAR2 )
IS
BEGIN
dbms_output.put_line( p_text );
END p;
Easy. Then, in the code, I would simply use the procedure p all over the place...like this:
  l_start_time date;
l_end_time date;
begin
l_start_time := sysdate;
p( 'l_start_time: ' || l_start_time );

--do some stuff here
--maybe add some more calls to p

l_end_time := sysdate;
p( 'l_end_time: ' || l_start_time );

end;
Since the procedure is 84 characters long, I only have to use the p function 4 times to get the benefit. Yay for me...I think. Wait, I like typing.
Categories: BI & Warehousing

Performance and OBIEE – Summary and FAQ

Rittman Mead Consulting - Mon, 2013-03-18 15:00

This article is the final one in a series about OBIEE and performance. You can find the previous posts here:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
Summary
  • The key to long term, repeatable, successful performance optimisation of a system is KNOWING where problems lie and then resolving them accordingly. This means investing time up front to develop an understanding of the system and its instrumentations and metrics. Once you have this, you can apply the same method over and over to consistently and successfully resolve performance problems.
  • GUESSING what the problem is and throwing best practice checklists at it will only get you so far. If you hit a lucky streak it may get you far enough to convince yourself that it is enough alone. But sooner or later you will hit a dead-end with your guesswork and have to instead start truly diagnosing problems. When this happens, you are starting from scratch in learning and developing your method for doing this.
 Know; Dont Guess

Getting the best performance out of OBIEE is all about good design and empirical validation.

To be able to improve performance you must first identify the cause of problem. If you start ‘tuning’ without identifying the actual cause you risk making things much worse.

The slightly acerbic tone at times of these articles may betray my frustration with the incorrect approach that I see people take all too often. A methodical approach is the correct one, and I am somewhat passionate about this, because:

  1. It works! You gather the data to diagnose the issue, and then you fix the issue. There is no guessing and there is no luck
  2. By approaching it methodically, you learn so much more about how the OBIEE stack works, which aside from being useful in itself means that you will design better OBIEE systems and troubleshoot performance more easily. You actually progress in your understanding, rather than remaining in the dark about how OBIEE works and throwing fixes at it to hope one works.
FAQ

Q: How do I improve the performance of my OBIEE dashboards/reports?

A: Start here. Use the method described to help understand where your performance is slow, and why. Then you set to resolving it as described in this series of blog articles.

Q: No seriously, I don’t have time to read that stuff… how do I fix the performance? My boss is mad and I must fix it urgently!

A: You can either randomly try changing things, in which case Google will turn up several lists of settings to play with, or you can diagnose the real cause of the performance problem. If you’ve run a test then see here for how to analyse the performance and understand where the problem lies

Q: Why are you being such a bore? Can’t you just tell me the setting to change?

A: I’m the guy putting £0.50 bets on horses because I don’t want to really risk my money with big bets. In my view, changing settings to fix performance without knowing which setting actually needs changing is a gamble. Sometimes the gamble pays off, but in the end the house always wins.

Q: Performance is bad. When I run the report SQL against the database manually it is fast. Why is OBIEE slow?

A1: You can see from nqquery.log on the BI Server how long the SQL takes on the database, so you don’t necessarily need to run it manually. Bear in mind the network between your BI Server and the database, and also the user ID that the query is being executed as. Finally, the database may have cached the query so could be giving a better impression of the speed.

A2: If the query really does run faster manually against the database then look at nqquery.log to see where the rest of the query time is being spent. It could be the BI Server is having to do additional work on the data before it is returned up to the user. For more on this, see response time profiling.

Q: This all sounds like overkill to me. In my day we just created indexes and were done.
A: I’ve tried to make my method as comprehensive as possible, and usable in both large-scale performance tests but also isolated performance issues. If a particular report is slow for one use, then the test define, design and build is pretty much done already – you know the report, and to start with running it manually is probably fine. Then you analyse why the performance isn’t as good as you want and based on that, you optimise it.

Q: Should I disable query logging for performance? I have read it is a best practice.

A: Query logging is a good thing and shouldn’t be disabled, although shouldn’t be set too detailed either.

Reading & References

The bulk of my inspiration and passion for trying to understand more about how to ‘do’ performance properly has come from three key places:

plus the OBIEE gurus at rittmanmead including markrittman and krisvenkat, along with lots of random twitter conversations and stalking of neilkod, martinberx, alexgorbachev, kevinclosson, nialllitchfield, orcldoug, martindba, jamesmorle, and more.

Comments

I’d love your feedback on this. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

I’ve enabled comments on this article in the series only, to keep the discussion in one place. You can tell me what you think on twitter too, @rmoff

Categories: BI & Warehousing

Performance and OBIEE – part VII – Optimising OBIEE performance (“Tuning”)

Rittman Mead Consulting - Mon, 2013-03-18 15:00
Potato potato?

When I first drafted this blog post, I gave it the title “Tuning”, rather than “Optimising”. Tuning is the word used so often in the context of performance, but it can actually be a bit misleading. A system that has fundamental design flaws doesn’t need “tuning” – it needs redesigning. Tuning can end up being a catch-all phrase meaning “fix performance ”, regardless of the scale of the task.

Why does this matter? It matters, because if you are in the performance test phase of a project with a deadline to keep to it’s useful to be honest with yourself and the project manager in describing the work that needs to be undertaken to improve performance. If your analysis has shown that the design is sound but you’re hitting bottlenecks in performance somewhere along the way, then it is “tuning” you’re doing. However, if your analysis is showing that the metadata model you’ve built in the RPD stinks, the data model on the database sucks, and the dashboard design is suspect – it’s not going to take a bit of “tuning” to fix, it’s going to take some serious redevelopment. From a point of view of time estimation and expectations management, this is important.

A rough analogy would be with piano tuning. A structurally sound piano needs tuning periodically to keep it from being out of tune. However, a piano with woodworm needs a lot more work doing to it than simple tuning – it needs taking to the workshop.

The piano tuning analogy serves one more purpose too: asking a piano tuner to give you a step by step guide to tuning a piano is almost as nonsensical as it is to expect a simple checklist to provide an way to comprehensively fix (“tune”) the performance of OBIEE.

<rant>

This section is deliberately at the end of a long and detailed series of articles describing how to test and analyse the performance of OBIEE. ‘Tuning’ can be a bit of a weasel word in IT, implying light-touch, silver bullet changes that magically improve a system’s performance, when actually it can mean randomly changing a bunch of things in the blind and desperate hope of making bad performance better but with no way of knowing if it worked or what ‘it’ was.

</rant>

Tuning should be undertaken only as part of a performance cycle. It should be approached from one of two ways:

  1. A performance problem has been identified, and you need to resolve it. You should have test data showing where the problem lies, and the analysis you’ve done should have pointed to the root cause. In this case, the tuning you do ought to be nothing more than fixing the problem. It might take lots of diagnosis and investigation to establish the resolution of a diagnosed cause, but crucially all the work you do will be focussed on the root cause that you’ve identified.
  2. You have a performance baseline, and you want to see if you can improve performance or capacity in general. There is no specific problem to resolve, but reducing the runtimes of queries would be nice. To be precise, we’re looking to optimise a system.
Anti-patterns in performance optimisation

There are anti-patterns (“Bad Practices”, if you will) to good performance troubleshooting; here are some of them. You should have your guard up if you see these, and challenge them!

  • Sacred Cows
  • Hero worship
  • Best practice!
  • Tweaking & Tinkering
  • Silver bullets
  • Golden rules
  • Cast iron guarantees
  • Never
  • Always
  • Take my word for it
  • Long bullet point lists

The only “best practice” you should be using all the time is “Use Your Brain”. – Tom Kyte

Generally good design principles to observe in OBIEE

These are not Best Practices! These are things that can work well, but have to be done with brain engaged!

They’re not bad practices either, they’re just good practices.

  • Ensure that the RPD has a correct dimensional Star model in the Logical (BMM) layer
  • Push the dimensional Star schema into the database; don’t build against a normalised data model if performance is key requirement
  • Create aggregate tables to support user queries
    • Use Usage Tracking to spot dimension levels that would benefit, or Summary Advisor on Exalytics
    • Aggregate Persistence Wizard can do the initial hard work in plumbing the metadata into the RPD for you
      • You still need to maintain the aggregate, unless you are happy to just drop/recreate each time the data changes
    • Don’t forget to create associated aggregated dimension tables. These are particularly useful for good performance of prompts where a distinct list of values at a level in a dimension are returned.
  • Make sure that aggregates get used when appropriate. Check the SQL that OBIEE is generating isn’t using a less-efficient source table.
  • OBIEE is not an extraction tool, especially not in the front-end.
    • If users really want a data dump, consider doing that for them outside of the tool, for example with sql*plus.
    • To still make use of the metadata model in the RPD, but without causing big problems in Presentation Services, use an ODBC or JDBC call into the BI Server directly to get the data dump out. Using this method, you could hook in Excel directly to the BI Server.
  • The fastest query is one that never runs – challenge dashboard & reports designs. Don’t just copy what an existing system does. Analyse the user’s workflow, to see if the reports you build can support and make more efficient what the user does.
  • Generally you should avoid building the RPD against database views, as they can hinder OBIEE’s SQL generation with the result of sub-optimal queries. Database views can also hide inefficient or unnecessary joins and logic. Air your dirty washing in public, and put the underlying tables into the Physical layer of the RPD instead and let OBIEE work with them.
    • This is not a hard and fast rule, and it is not a “Best Practice” (sigh). There will be some genuine cases where a database view is a pragmatic solution to a particular data model issue.
  • Minimise the work being done by the BI Server. When using federation to join data across databases it is unavoidable but generally it is to be frowned upon if within the same database. Wherever possible, all work should be seen to be pushed down to the database.
    • Check how much data the BI Server pulls back from the database as a percentage of rows returned to the user (low % is bad).
    • Monitor the BI Server’s temp directory – if this is filling up with large files it means that the BI Server is having to crunch lots of data
    • How many database queries does one report trigger? (higher is generally less efficient).
    • This SQL will help identify reports for investigation, using existing Usage Tracking data:
      SELECT SAW_SRC_PATH, 
             ROW_COUNT, 
             CUM_NUM_DB_ROW, 
            ( ROW_COUNT / CUM_NUM_DB_ROW ) * 100 AS ROWS_PCT, 
             TOTAL_TIME_SEC, 
             NUM_DB_QUERY 
      FROM   S_NQ_ACCT 
      WHERE  ROW_COUNT > 0 
             AND CUM_NUM_DB_ROW > 0 
          -- Update these predicates to tailor your results as required
          --   AND TOTAL_TIME_SEC > 10 
          --   AND CUM_NUM_DB_ROW > 10000 
      ORDER  BY 4 DESC 
      
  • Size hardware correctly to support the BI Server and Presentation Server (based on past experience and/or Oracle’s documentation)
  • Make sure that there a balanced hardware configuration throughout the stack (c.f. Greg Rahn and Oracle documentation)
Optimising OBIEE further

The above section outlines some of the principles you should always be aiming to follow, or have a clear reason why you’re not. There are some other techniques that can be worth investigating when you’re looking to optimise the performance of OBIEE further, discussed below.

BI Server Caching

How have I got this far and still not mentioned caching? To horribly mix two food metaphors, caching is the icing on the cake, it is not the bread and butter of good performance. If you are using caching as your sole means of ensuring good performance then you are skating on thin ice.

That is not to say caching is bad. Caching is good, when its use is thought through and evaluated carefully. Caching has an overhead in terms of management, so you cannot just chuck it in to the mix and forget about it. You need to manage the cache to make sure you’re not serving up stale data to your end users. It might be fast, but it’ll be wrong.

Caching can improve performance for several reasons:

  • It is typically faster to return the results of a query already run and stored in the cache than it is to re-execute the necessary Physical SQL against the source database(s)
  • By not executing the Physical SQL on the database(s), we reduce both the load on the database, and the amount of network traffic, freeing up these resources for non-cached queries
  • The cache stores the results of a Logical SQL query sent to the BI Server, not the data that the database returns. If the BI Server is doing lots of work, for example, federating and aggregating lots of data across sources, then by caching the result post-processing, all of this work can be avoided by subsequent cache hits.

The BI Server cache is more advanced than a ‘dumb’ cache where only a direct match on a previous request will result in a hit. The BI Server will parse a Logical SQL query and recognise if it is either a direct match, a subset , or an aggregate of an existing cache entry. So a cache entry for sales by day could well satisfy a subsequent Logical SQL query for sales by year.

Pre-seeding the cache is a good idea, so that users all benefit from the cache, not just those who come along after the first user has run a report that gets stored in the cache. There are a couple of ways to pre-seed the cache:

  1. A BI Delivers Agent with the Destination set to System Services : Oracle BI Server Cache. This agent could optionally be set to run straight after your ETL batch has loaded the data.
  2. Directly from an ETL tool via ODBC/JDBC using the SASeedQuery statement.

If you don’t pre-seed the cache then only users running queries based on queries already run by others users will benefit from the cache.

The flip-side of pre-seeding the cache is purging it, and there are two sensible ways to do this :

  1. Event Polling Table
  2. ODBC/JDBC command to the BI Server, triggered by the completion of a data load (ETL)

Watch out for the Cache Persistence time in the Physical Table – this defines how long an entry remains in the cache, rather than how frequently to purge it. If you have a daily data load, setting the cache persistence time to 24 hours will not do what you may think. If your data is loaded at 0300, the first user queries it and creates a cache entry at 0900, that cache entry will remain until 0900 the following day, even though the cached data would have been stale for six hours (since the subsequent data load at 0300).

Where Cache persistence time can be useful is in systems with frequent changes in the source data and you want to deliberately introduce a lag in the data the user sees for the benefit of generally faster response times for end-users. For example, you may have a trickle-fed ODS from which you are running OBIEE reports. If the data is being loaded in near-real-time, and the users want to see it 100% current, then evidently you cannot use caching. However, if the users would be happy with a lag in the data, for example ten minutes, then you could enable caching and set the cache persistence time for the relevant physical table to 10 minutes. For the ten minutes that the data is in the cache, the users get fast response times. This could be a pragmatic balance between freshness of data and response times to get the data. Bear in mind that a query taking 2 minutes to run is going to be reporting on data that is 2 minutes out of date already.

Use fast disk for cache and/or temporary files

OBIEE writes various temporary files, including cache data and work files, to disk. By default, these reside in $FMW_HOME/instances/instance1/tmp. You may find that using fast disk (e.g. SSD) or even RAM disk to hold some or all of these temporary files instead could improve performance. Be aware that some of the work files that OBIEE writes can be very big (as big as the data being pulled back from the database, so in the order of gigabytes if you’re unlucky).

Web Tier for static content caching

In a standard OBIEE installation, WebLogic acts as both the application server (running java code, communicating with Presentation Services etc) as well as the HTTP server, handling inbound connections from the web browser, serving up static files such as CSS stylesheets.

It can sometimes be beneficial to introduce a separate HTTP server such as Oracle HTTP Server (OHS), leaving WebLogic to just act as the application server. A dedicated HTTP server such as OHS can be configured to cache and compress static files which can improve the response time for users especially if the network is not a fast one.

For more information, see Venkat’s article Anatomy of BI EE Page Rendering – Use of HTTP/Web Server & Compression

Scale out / Scale up

Increasing the number of instances of the system components can help maximise the capacity of the overall BI Domain and enable it to use the full resource of the hardware on which it is running.

Scaling out is to add additional physical servers and extend the BI Domain onto them. Scaling up is to just increase the number of one or more of the components that are running on an existing server.

An example of where this can be useful is the Javahost component. It is configured with a default maximum number of jobs that it can handle. Certain workload types and volumes can hit this maximum with relative ease, so increasing the number of Javahosts can improve the performance by reducing contention for the process.

In general, if you don’t have any evidence of a capacity limit being reached or in danger of being reached, I would be reluctant to ‘inoculate’ a system by scaling (adding additional component instances) ‘just in case’. You will only add to the number of moving parts to keep track of (and increase complexity of configuration such as shared presentation catalog folder), and without really a reassurance that the change you’ve made will help. It gives a false sense of security, since you’re just presuming, guessing, that the first bottleneck your system will reach is one which is resolved by scaling out/up.

Mark Rittman wrote a post recently in which he discussed the idea of scaling the BI Server (nqsserver) component in order to take advantage of multiple CPU cores, and whether this was in fact necessary. You can read his post here : Does the BI Server System Component Make Full Use of My Multi-Socket, Multi-Core Server?

Optimising the database

This section is most definitely not a comprehensive study; it is a set of a few pointers that I would be looking for before speaking to my honourable DBA colleagues who have longer beards than I and know this stuff inside out.

  • DBAs :
    • you cannot tune the SQL that OBIEE sends to the database; it is generated by OBIEE. If there is a better way to write the SQL query then you need to get the BI Server to generate it in that way by amending the RPD. Missing joins etc indicate problem with the RPD
    • you can try optimising the physical implementation of the underlying database objects to make a query that OBIEE generates run faster

Techniques to evaluate in your physical implementation of the data model include:

  • Appropriate use of Parallel query, including Auto DOP (as written about by my colleague Pete Scott recently)
  • Partitioning
  • Appropriate – but not excessive – indexing
  • Correct data types, particularly on join columns
  • Materialized Views for holding pre-built aggregations of your fact data
    • (including PCT to incrementally refresh)
  • Statistics
    • Make sure that they’re representative/accurate
    • Have a proactive statistics management strategy. Don’t just rely on the auto stats jobs.
    • Incremental statistics can be useful
  • Resource Manager is useful for granular control of resources such as parallelism, optionally between different groups of users. For example, power users could be given a greater DOP than normal users.
Oracle’s tuning document

You can get the Oracle official OBIEE tuning document from here: Best Practices Guide for Tuning Oracle® Business Intelligence Enterprise Edition.

When Oracle released this document in December 2012 it caused a mini-twitter-storm amongst some OBIEE professionals. On the one hand, publishing a list of settings to evaluate is of great help. On the other, publishing a list of settings to evaluate with no context or method with which to validate them is no help whatsoever. In my [not very humble] opinion, a supplemental list of configuration parameters – especially from the software vendor themselves – should only go hand-in-hand with details of how to properly evaluate them. Here is why I think that:

  1. Turn the dial to 42. If Oracle says so, then that’s what we’ll do. Time is wasted changing configuration without understanding why.
  2. Each configuration change is a move away from default, and thus increases chances of being missed in environment migrations and product upgrades
  3. If there is a suggested default, with no context or caveat, then why isn’t it an instruction in the installation guide? (“For a stable / an even performance over time, you should at least set two Presentation Services per server and two Javahost per server.”)
  4. The document suggests enabling BI Server caching, with no observation that this could lead to users getting wrong (stale) results
  5. It propagates the idea that performance is only a case of setting the correct values in configuration files. In the tuning guide there is a generic mention of “Application Design”, but in a document that discusses parameter changes throughout the OBIEE stack where is the lengthy discussion of underlying data model, appropriate RPD model, query push-down? These fundamental design principles count a thousand times over against how many tcp_listeners an OS is configured for
  6. No explanation of (a) why a setting should be changed and (b) in what situation. For example, changing tcp settings will make no difference on a low-concurrency system horrendously bottlenecked on poor database queries resulting from a bad data model. If a tcp setting needs changing, which corresponding OS network stat will show a bottleneck? Or response time profile to show excessive wait times at the network layer?
  7. A shopping list of parameters to change should be the last step of performance optimisation, but by being the only tuning document available, gives the impression that this is all there is to performance – turn some knobs and you’re done.
  8. A little knowledge is a dangerous thing. By putting these low-level configuration options out as a generally available document it increases the chances of more harm than good being done by people. If someone doesn’t understand a system then it is best to leave it alone rather than tinker with some changes that have an Oracle stamp of authority on.
    If my car is running badly, I won’t start trying to remap the engine. I’ll take it to the garage. When they tell me what the problem was and what they did to fix it, I won’t take that recommendation and tell my friend whose car is running slowly to do the same.

This isn’t to say the document isn’t useful. It is very useful. It’s just a shame that it is the only OBIEE tuning document from Oracle. It’s really useful to have at the end of a performance test once test results are in and diagnosis and resolution of the problems start. But it’s necessary to have the diagnosis and means to validate a change first, before throwing changes at a system.

Summary

This is the penultimate article in this series about OBIEE and performance. For an overview and conclusion of my method for improving performance in OBIEE, see the final post in this series, here.

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ
Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to retain the thread of comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Categories: BI & Warehousing

Performance and OBIEE – part VI – Analysing results

Rittman Mead Consulting - Mon, 2013-03-18 15:00

This part of the OBIEE performance cycle is the one which arguably matters most. Having defined what we’re going to test, built a means by which to test it, and executed that test, we now need to sift through the tealeaves and work out what the data we collected is telling us. Except we’re not going to use hocus-pocus like tea leaf reading, gut feeling or best practice checklists, we’re going to use cold hard data and analysis.

Analysing the data breaks down into several stages, and is often an iterative process:

  1. Analyse the net response time. Is it as fast as it needs to be, at the required level of user concurrency?
  2. If the response time is too slow (“too slow” being defined by you or your users, in advance of the test), then diagnose to determine why. This is another phase of analysis, breaking down the net response time into its constituent parts, analysing system and OBI metrics for signs of a bottleneck. The output of this phase will be an hypothesis as to the cause of the performance problem
  3. Based on the diagnosis of the issue, apply one change to the system to improve it, that is, resolve the performance issue. Having made one change (and one change only), the original test should be repeated and the analysis cycle repeated to determine the impact of the tuning.
Analysis

How you analyse your data determines whether you will be accurately and fairly representing the results of the test in your diagnoses and conclusions.

Avoid Averages

From your test execution you will have a series of response times. You need to summarise, that is, aggregate these into a single figure to give as a headline figure in your test report. If you take only one thing away from reading this, let it be the following point: don’t use average figures! I’ll say it again for emphasis : Averages are not a good way to represent your test data. What I am saying here is nothing that you won’t read in every other decent article written on performance testing and analysis.

When you average out a series of data, you mask and muddy your data by inadvertently hiding extreme values in the series. A much better summary to use is the percentile.

Consider a performance test of a single dashboard for a single user. It is run ten times, so as to get a representative set of data for the response time. Here are two series of data, both with an average response time of five seconds. If we look at the 90th Percentile for the same two series of data, we can see that series ‘A’ has a response time of ten seconds, whilst series ‘B’ has a response time of six seconds.

As a user, if you run this dashboard, which behaviour would you prefer? Series ‘A’, where it might take a second to run or it might take ten seconds, or Series ‘B’ where it is going to be five seconds, give or take one second either side? As human beings we like consistency and certainty. Sure, it’d be nice if the dashboard ran in a second, but most people would rather know that it’s definitely going to run within six seconds and not almost double that. That uncertainty can also be seen in the standard deviation figure in the two series. The lower the standard deviation, the more consistent the response times are.

For more detail and clear statistical explanations, read “Averages Only” in Zed Shaw’s Programmers Need To Learn Statistics and “Percentile Specifications” in Cary Millsap’s Thinking Clearly about Performance.

Throw away your test data

Well, not literally. But, if you are testing user concurrency, make sure that when you calculate your percentile (eg 90th percentile) response time, do it for a given number of users. Otherwise you are distorting the figure. Typically a test will have a ‘ramp up’ period where the concurrent users are gradually introduced onto the system, until the target number is active, at which point the system is in ‘steady state’. It is from this point, the steady state, that you should be deriving your response time calculation. It is useful to look at how the response time might vary as the workload is increased, but for an accurate figure of a response time at a given number of users, you should be ignoring the data except where the full number of users was running.

Analysis summary

The output of this phase of the analysis should be very simple:

The 90th Percentile response time for dashboard <xx> is <yy> at a user concurrency of <zz>

And this should then satisfy a pass/fail criterion that was specified when you defined the test.

  • If the test passes, great. Record all your test parameters and data, and move on to the next test.
  • If the test doesn’t pass, then you need to work out why, and for that, see below.

I’m oversimplifying, since there is other data (eg standard deviation) that you might want to include in your summary, along with some commentary around variances observed and so on.

Diagnosing poor OBIEE performance Get to the root of the problem

So, the test results showed that the dashboard(s) run too slowly. Now what? Now, you need to work out why there is a performance problem. I am deliberately spelling this out, because too many people jump forward to attempting to fix a performance problem without actually understanding exactly what the problem is. They whip out their six-shooters loaded with silver bullets and start blasting, which is a bad idea for two reasons:

  1. You may never know what the problem was – so you won’t be able to avoid doing it again! Everyone makes mistakes; the mark of a good programmer is one who learns from them.
    If I run a dashboard on a 2 CPU 4GB server and find it’s slow, one option could be to run it on a 8 CPU 32GB server. Tada! It’s faster. But, does that mean that every report now needs to be run on the big server? Well, yes it’d be nice – but how do we know that the original performance problem wasn’t down to machine capacity but perhaps a missing filter in the report? Or a wrong join in the RPD? It could be an expensive assumption to make that the problem’s root cause was lack of hardware capacity.
  2. In determining the root cause, you will learn more about OBIEE. This better understanding of OBIEE will mean you are less likely to make performance errors in the future. You will also become better at performance diagnostics, making solving live problems in Production as well as future performance tests easier and faster to resolve.

“I broke things, so now I will jiggle things randomly until they unbreak” is not acceptable Linus Torvalds

There are always exceptions, but exceptions can be justified and supported with data. Just beware of the the silver bullet syndrome…The unfortunate part […] is that rarely anyone goes back and does the root cause analysis. It tends to fall into the bucket of “problem…solved”. Greg Rahn

Performance vs Capacity

I always try to split it into #performance tuning (response time) and capacity tuning (throughput/scalability) – Alex Gorbachev

Performance issues can be local to a report, or global to a system implementation and exacerbated by a particular report or set of reports – or both.

If an individual dashboard doesn’t perform with a single user running it, then it certainly isn’t going to with a 100, and there is clearly a performance problem in the design (of the dashboard, RPD, or physical data model design or implementation).

However, if an individual dashboard runs fine with a single user but performance gets worse and worse the more users that run it concurrently, this would indicate a capacity problem in the configuration or physical capacity of your system.

So which is which? An easy way to shortcut it is this: before you launch into your mega-multi-user-concurrency tests, test the dashboard with a single user. Is the response time acceptable? If not, then you have a performance problem. You’ve eliminated user concurrency from the equation entirely. If the response time is acceptable, then you can move onto your user concurrency tests.

If you have already run a big user concurrency test and are trying to identify whether the issue is performance or capacity, then look at what happens to your response time compared to the number of users running. If the response time is constant throughout then it indicates a performance problem; if it is increasing as more users are added it shows a capacity (which can include configuration) problem. Being able to identify this difference is why I’d never run a user concurrency test without a ramp-up period, since you don’t get to observe the behaviour of the system as users are added.

Response time vs active users

In the above graph there are two points evident:

  1. Up to ten users the response time is consistent, around 30 seconds. If the response time needs to be faster than this then there is a performance problem
  2. If 30 seconds is the upper limit of an acceptable response time then we can say that the system has a capacity of 10 concurrent users, and if the user concurrency needs to be greater than this then there is a capacity problem
Errors

Don’t overlook analysing the errors that may come out during your testing. Particularly as you start to hit limits within the stock OBIEE configuration, you might start to see things like:

  • Too many running queries. Server is too busy to process any more queries at this time.
  • com.siebel.analytics.javahost.standalone.SAJobManagerImpl$JobQueueFullException
  • Graph server does not appear to be responding in a timely fashion. It may be under heavy load or unavailable.
  • The queue for the thread pool ChartThreadPool is at it's maximum capacity of 512 jobs.

If you see errors such as these then they will often explain response time variances and problems that you observe in your test data, and should be top of your list for investigating further to resolve or explain.

Response time profiling

A good way to get started with identifying the root cause(s) of a problem is to build a time profile of the overall response time. This is something that I learnt from reading about Method R, and is just as applicable to OBIEE as it is to the Oracle RDBMS about which it was originally written. This link gives a good explanation of what Method R is.

You can improve a system without profiling, and maybe you can even optimize one without profiling. But you can’t know whether a system is optimal without knowing whether its tasks are efficient, and you can’t know whether a given task is efficient without profiling it. Cary Millsap

Given the number of moving parts in any complex software stack there’s often more than one imperfection. The trick is to find the most significant that will yield the best response time improvement when resolved. It also lets you identify which will give the “biggest bang for your buck” – maybe there are several problems, but the top one requires a complete redesign whilst the second one is an easy resolution and will improve response times sufficiently.

So in the context of OBIEE, what does a response time profile look like? If you hark back to the OBIEE stack that I described previously, a simple example profile could look something like this:

OBIEE response time profile

Here we can see that whatever we might do the speed up the chart rendering (5 seconds) the focus of our investigation should really be on the 20 second query run on the database, as well as the 10 seconds it takes BI Server to join the results together. Can we eliminate the need for two queries, and can we do something on the database to improve the query run time?

When building a time profile, start at the highest level, and break down the steps based on the data you have. For example, to determine the time it takes Presentation Services to send a query to BI Server is quite a complex process involving low-level log files. Yet, it probably isn’t a significant line entry on the profile, so by all means mark it down but spend the time on the bigger steps – which is usually the fetching and processing of the report data.

OBIEE response time profile

A more complicated profile might be something like this:

OBIEE response time profile

Graphing a response time profile can also help us comprehend at a glance what’s happening, and also gives a ‘template’ to hold up to profiles that are created. In general you would want to see the split of a time profile heavily weighted to the database:

OBIEE response time profileIf the response time profile shows that just as much of the total response time is happening on the BI Server then I would want to see what could be done to shift the weight of the work back to the database:

OBIEE response time profile
For more on this subject of where work should ideally be occurring, see the section below “Make sure that the database is sweating”.

Here are the sources you can look for response time profile data, starting at the user interface and going down to the database

  • Rendering time – Web browser profiler such as Chrome Developer Tools, YSlow, Google Page Speed
  • WebLogic – access.log will show the HTTP requests coming in
  • Presentation Services – sawlog0.log, but may require custom log levels to get at low-level information
  • BI Server
    • nqquery.log
      • Time to create physical SQL(s), i.e. compile time
      • Time to connect to DB
      • Time to execute on DB
      • Time to process on BI server and return to PS
    • Usage Tracking
      • S_NQ_ACCT
      • S_NQ_DB_ACCT
  • Database – whilst profiling can be extended down to the DB (for example, using an 10046 trace in Oracle), it makes more sense to do as a standalone analysis piece on an individual query where necessary. In extreme examples the profiling could actually go beyond the database down into the SAN, for example.
Diagnosing capacity problems

If a dashboard is performing acceptably under a single user load, but performance deteriorates unacceptably as the user currency increases, then you have a capacity issue. This capacity could be hardware, for example, you have exhausted your CPUs or saturated your I/O pipe. Capacity can also refer to the application and how it is configured. OBIEE is a powerful piece of software but to make it so flexible there are by definition a lot of ways in which is can be configured – including badly! Particularly as user concurrency (as in, concurrent executing reports) increases into three figures and above it may be the default configuration options are not sufficient. Note that this “three figures and above” should be taken with a large pinch of salt, since it could be lower for very ‘heavy’ dashboards, or much higher for ‘light’ dashboards. By ‘heavy’ and ‘light’ I am primarily referring to the amount of work they cause on the BI Server (e.g. federation of large datasets), Presentation Services (e.g. large pivot tables) and Javahost (e.g. lots of chart requests such as you’d see with trellis views).

To diagnose a capacity problem, you need data. You need to analyse the response time over time against the measures of how the system was performing over time, and then investigate any apparent correlations in detail to ascertain if there is causation.

ObXKCD

ObXKCD

This is where you may need to re-run your performance test if you didn’t collect this data the first time around. See the System Metrics section above for details on how and what. The easy stuff to collect is OS metrics, including CPU, Memory, Disk IO, and Network IO. You should include both the OBI and Database server(s) in this. Look at how this behaves over time compared to the performance test response times. Using a relatively gradual user ramp-up is a good idea to pinpoint where things might start to get unstuck, rather than just plain break.

Network bottleneck observed as load increases beyond c.9 active users

If the OS metrics are unremarkable – that is, there is plenty of capacity left in all of the areas but response times are still suffering as user concurrency increases – then you need to start digging deeper. This could include:

  • OBI Metrics
  • Analysis of the performance of the database against which queries are running
  • End-to-end stack capacity, eg Network, SAN, etc.

OBI Metrics can be particularly enlightening in diagnosing configuration issues. For example, an undersized connection pool or saturated javahost.

Don’t forget to also include the OBI logs in your analysis, since they may also point to any issues you’re seeing in the errors or warnings that they record.

Additional diagnosis tips

Having profiled the response time you should hopefully have pinpointed an area for investigation for coming up with your diagnosis. The additional analysis that you may need to do to determine root cause is very dependent upon the area you have identified. Below are some pointers to help you.

Make sure that the database is sweating

As mentioned above, a healthy OBI system will wherever possible generally push all of the ‘heavy lifting’ work such as filtering, calculations, and aggregations down to the database. You want to see as little difference between the data volume returned from the database to the BI Server, and that returned to the user.

Use nqquery.log to look at the bytes and rows that OBIEE is pulling back from the database. For example, you don’t want to see entries such as this:

Rows 13894550, bytes 3260497648 retrieved from database query id: xxxx

(13.8 million rows / 3GB of data!)

If you return lots of of data from the database to the BI server, performance suffers because:

  • You’re shifting lots of data across the network, each time the query runs
  • As well as the database processing lots of data, the BI Server now has to process the same volume of data to pare it down to the results that the user wants
  • If the data volumes are large the BI Server will start having to write .TMP files to disk, which can have its own overhead and implications in terms of available disk space

You can read more on this topic here.

N.B. If you’re using cross-database federation then this processing of data by the BI Server can be unavoidable, and is of course a powerful feature of OBIEE to take advantage of when needed.

A related point to this is the general principle of Filter Early. If dashboards are pulling back data for all months and all product areas, but the user is only looking at last month and their own product area then change the dashboard to filter it so. And if you use dashboard prompts but have left them unset by default then every time the user initially navigates to the dashboard they’ll be pulling back all data, so set defaults or a filter in the constituent reports.
As a last point on this particular subject – what if there are 13 million rows pulled back from the database because the user wants 13 million rows in their report? Well, other than this:
shudder

I would say: use the right tool for the right job. Would the user’s workflow be better served by an exception-based report rather than a vast grid of data just ‘because we’ve always done it that way’? If they really need all the data, then it’s clear that the user is not going to analyse 13 million rows of data in OBIEE, they’re probably going to dump it into Excel, or some other tool – and if so, then write a data extract to do it more efficiently and leave OBIEE out of the equation. If you want to make use of the metadata model you’ve built in the RPD, you could always use an ODBC or JDBC connection directly into the BI Server to get the data out. Just don’t try and do it through Answers/Dashboards.

Instrumenting connection pools

For a detailed understanding of how the database behaves under load as the result of BI queries, consider using instrumentation in your connection pools as way of correlating [problematic] workload on the database with originating OBIEE queries and users.

I have written previously about how to this, here

Why’s it doing what it’s doing

If a report ‘ought’ to be running well, but isn’t, there are two optimisers involved to investigate to see why it is running how it is. When the inbound Logical SQL is received by the BI Server from Presentation Services, it is parsed (‘compiled’) by the BI Server through the RPD to generate the Physical SQL statement(s) to run against the database.

To see how OBIEE analyses the Logical SQL and decides how to run it, use a LOGLEVEL setting of 4 or greater. This writes the execution plan to nqquery.log, but be aware, it’s low-level stuff and typically for Oracle support use only. To read more about log levels, see here. The execution plan is based entirely upon the contents of the RPD, so if you want different Physical SQL generated, you need to influence it through the RPD.

The second optimiser is the database optimiser, which will take the Physical SQL OBIEE is generating and decide how best to execute it on the database. On Oracle this is the Cost-Based Optimiser (CBO), about which there is plenty written already and your local friendly DBA will be able to help with.

Footnote: Hypotheses

Finally, in analysing your data to come up with a diagnosis or hypothesis as to the root cause of the problem, bear this quotation in mind:

If you take a skeptical attitude toward your analysis you’ll look just as hard for data that refutes your hypothesis as you will for data that confirms it. A skeptic attacks the same question from many different angles and dramatically increases their confidence in the results. John Rauser

What next?

If your testing has shown a performance problem then you should by now have a hypothesis or diagnosis of the root cause. Read all about optimisation here. If your testing has shown performance is just fine, you might want to read it anyway …

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ
Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to retain the comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Categories: BI & Warehousing

Performance and OBIEE – part V – Execute and Measure

Rittman Mead Consulting - Mon, 2013-03-18 15:00

Having designed and built our tests, we now move on to looking at the real nitty-gritty – how we run them and collect data. The data that we collect is absolutely crucial in getting comprehensible test results and as a consequence ensuring valid test conclusions.

There are several broad elements to the data collected for a test:

  • Response times
  • System behaviour
  • Test details

The last one is very important, because without it you just have some numbers. If someone wants to reproduce the test, or if you want to rerun it to check a result or test a change, you’ve got to be able to run it as it was done originally. This is the cardinal sin that too many performance tests I’ve seen commit. A set of response times in isolation is interesting, sure, but unless I can trace back exactly how they were obtained so that I can:

  • Ensure or challenge their validity
  • Rerun the test myself

then they’re just numbers on a piece of paper.
The other common mistake committed in performance test execution is measuring the wrong thing. It might be a wrong metric, or the right metric but in the wrong context. For example, if I build a test that runs through multiple dashboards, I could get a response time for “Go to Dashboard” transaction:

But what does this tell me? All it tells me really is that some of my dashboard transactions take longer than others to run. Sure, we can start aggregating and analysing the data, talking about percentile response times – but by only measuring the transaction generically, rather than per dashboard or dashboard type, I’m already clouding the data. Much better to accurately identify each transaction and easily see the clear difference in performance behaviour:

 

How about this enticing looking metric:

We could use that to record the report response times for our test, yes? Well, honestly, I have no idea. That’s because a “Request” in the context of the OBIEE stack could be any number of things. I’d need to check the documentation to find out what this number was actually showing and how it was summarised. Don’t just pick a metric because it’s the first one you find that looks about right. Make sure it actually represents what you think it does.

As Zed Shaw puts it:

It’s pretty simple: If you want to measure something, then don’t measure other shit.

Please consider the environment before running this test

Your test should be done on as ‘clean’ an environment as possible. The more contaminating factors there are, the less confidence you can have in your test results, to the point of them becoming worthless.

  • Work with a fixed code version. This can be difficult to do during a project particularly, but there is little point testing the performance of code release 1.00 if when you come to test your tuning changes 1.50 is in use. Who knows what the developers changed between 1.00 and 1.50? It whips the rug from out under your original test results. By fixed code, I mean:
    • Database, including:
      • DDL
      • Object statistics
    • BI Server Repository (RPD)
    • Dashboard and report definitions (Webcat)
      If you can’t insist on this – and sometimes pragmatism dictates so – then at least have a very clear understanding of the OBIEE stack. This way you can understand the potential impact of an external code change and caveat your test results accordingly. For example, if a change was made to the RPD but in a different Business Model from the one you are testing then it may not matter. If, however, they have partitioned an underlying fact table, then this could drastically change your results to the extent you should be discarding your first results and retesting.

    In an ideal world, all the above code artefacts will be under source control, and you can quote the revision/commit number in your test log.

  • Make sure the data in the tables from which you are reporting is both unchanging and representative of Production. Unchanging is hopefully obvious, but representative may benefit from elaboration. If you are going live with 10M rows of data then you’d be pretty wise to do your utmost to run your performance test against 10M rows of data. Different types of reports might behave differently, and this is where judgement comes in. For example, a weekly report that is based on a fact table partitioned by week might perform roughly the same whether all or just one partition is loaded. However, the same fact table as the source for a historical query going back months, or a query cutting across partitions, is going need more data in to be representative. Finally, don’t neglect future growth in your testing. If it’s a brand new system with brand new data, you’ll be starting with zero rows on day one, but if there’ll be millions of rows within a month or so you need to be testing against a million rows or so in your performance tests.
  • The configuration of the software should be constant. This means obvious configuration such as BI Server caching, but also things like version numbers and patch levels of the OBIEE stack. Consider taking a snapshot of all main configuration files (NQSConfig.INIinstanceconfig.xml, etc) to store alongside your test data.
    • You should aim to turn off BI Server caching for your initial tests, and then re-enable it if required as a properly tested optimisation step. The appropriate use and implementation of BI Server caching is discussed in the optimisation article of this series.

Measure

Before you execute your performance test, work out what data you want to collect and how you will collect it. The reason that it is worth thinking about in advance is that once you’ve run your test you can’t usually retrospectively collect additional data.

The data you should collect for your test itself includes:

  • Response times at the lowest grain possible/practical – see Dashboard example above. Response times should be 2-dimensional; transaction name, plus time offset from beginning of test.
  • Number of test users running over time (i.e. offset from the start of your test)
  • Environment details – a diagram of the top-to-bottom stack, software versions, code levels, and data volumes. Anything that is going to be relevant in assessing the validity of the test results, or rerunning the test in the future.
  • System metrics – if response times and user numbers are the eye-catching numbers in a test, system metrics are the oft-missed but vital numbers that give real substance to a test and make it useful. If response times are bad, we need to know why. If they’re good, we need to know how good. Both these things come from the system metrics.
  • Query Metrics – depending on the level at which the testing is being done, collecting metrics for individual query executions can also be vital for aiding performance analysis. Consider this more of a second round, drill down, layer of metrics rather than one to always collect in a large test since it can be a large volume of data.
Response times

Depending on your testing method, how you capture response time will be different. Always capture the raw response time and test duration offset – don’t just record one aggregate figure for the whole test. Working in BI you hopefully are clear on the fact that you can always aggregate data up, but can’t break a figure down if you don’t have the base data.

JMeter has “Sample time” or “Response Time”. Use the setSampleLabel trick to make sure you get a response time per specific dashboard, not just per dashboard refresh call. Some useful listeners to try out include:

  • jp@gc - Response Times Over Time
  • jp@gc - Response Times vs Threads (although be aware that this shows an average response time, which is not the best summary of the metric to use)
  • View Results in Table

JMeter can also write data to csv file, which can be a very good starting point for your own analysis of the data,

If you are doing a test for specific tuning, you might well want to capture response times at other points in the stack too; for example from the database, BI Server, and Presentation Server.

Whichever time you capture, make sure you record what that time represents – is it response time of the query at the BI Server, response time back to the HTTP client, is it response time including rendering – and so on. Don’t forget, standard load test tools such as JMeter don’t include page render time.

System metrics illustration

There are two key areas of system metrics:

  • Environment metrics – everything outside OBI – the host OS, the database, the database OS.
  • OBI metrics – internal metrics that let us understand how OBI is ticking along and where any problems may lie

One of the really important things about system metrics is that they are useful come rain or shine. If the performance is not as desired, we use them to analyse where the bottlenecks are. If performance is good, we use them to understand system behaviour at a “known-good” point in time, as reference for if things do go bad, and also as a long-term capacity planning device.

Some of the tools described below for capturing system metrics could be considered for putting in place as standard monitoring for OBIEE, whilst others are a bit more detailed than you’d want to be collecting all the time.

OS metrics

OS stats should be collected on every server involved in the end-to-end serving of OBIEE dashboards. If you have a separate web tier, a 2-node OBIEE cluster and a database, monitor them all. The workload can manifest itself in multiple places and the more “eyes-on” you have the easier it is to spot the anomalies in behaviour and not just be stuck with a “it was slow” response time summary.

As well as the whole stack, you should also be monitoring the server(s) generating your load test. If you’re testing large numbers of concurrent users the overhead on the generator can be very high, so you need to be monitoring to ensure you’re not hitting a ceiling there, rather than in what is being monitored.

On Windows, I would use the built-in Performance Monitor (perfmon) tool to collect and analyse data. You can capture CPU, IO, Memory, Network and process-specific data to file, and analyse it to your heart’s content afterwards within the tool or exported to CSV.

Windows perfmon

On Linux and other *nix systems there is a swathe of tools available, my tool of choice being collectl, optionally visualised through graphite or graphiti. There are plenty of alternatives, including sar, glance, and so on
collectl data rendered in graphite

collectl data rendered in graphiti

Finally, it’s worth noting that JMeter also offers collection of OS stats through the JMeter plugins project.

OBI metrics

The OBIEE performance counters are a goldmine of valuable information, and one it’s well worth the time mining for nuggets. The counters give you both a better picture of how different workloads are executed within OBIEE but also where any bottlenecks may arise.

RittmanMead OBIEE monitoring tool RittmanMead OBIEE monitoring tool

The counters can be accessed in several ways:

  1. Through Fusion Middleware Control, under Capacity Management -> Metrics -> View the full set of system metrics. This gives a point-in-time view of the data for the last 15 minutes, and does not store history.
    Performance Metrics in EM FMC
  2. Presentation Services includes its own Performance Monitor which can be accessed at http://<yourserver>:<analytics port>/saw.dll?perfmon. In OBIEE 10g it showed BI Server metrics too, but in 11g seems to only show Presentation Services (Oracle BI PS) metrics. It is point in time with no history.
    OBIEE Performance Monitor
  3. Similar to Performance Monitor but with a lot more metrics available, DMS Spy is a java deployment hosted on the Admin Server by default, available at http://<yourserver>:<adminserver port>/dms/Spy
    DMS Spy
  4. Through a manual call to opmn on the commandline. For example:
    
        [oracle@obieesampleapp bin]$ ./opmnctl metric op=query COMPONENT_TYPE=OracleBIServerComponent
        HTTP/1.1 200 OK
        Connection: close
        Content-Type: text/html
    
        <?xml version='1.0'?>
        <!DOCTYPE pdml>
        <pdml version='11.0' name='Oracle BI Server' host='obieesampleapp' id='4399' timestamp='1359703796346'>
        <statistics>
        <noun name="Oracle BI Server" type="Oracle_BI_Applications">
        <noun name="Usage_Tracking" type="Oracle_BI_Thread_Pool">
        <metric name="Peak_Thread_Count.value">
        <value type="integer"><![CDATA[5]]></value>
        </metric>
        <metric name="Current_Thread_Count.value">
        <value type="integer"><![CDATA[2]]></value>
        </metric>
        <metric name="Lowest_Queued_Time_(milliseconds).value">
        <value type="integer"><![CDATA[0]]></value>
        </metric>
        […]
    

    See the documentation for details

  5. DMS through WLST
  6. BI Server diagnostics through ODBC/JDBC
  7. It is worth noting that one place you cannot get the OBI metrics from any more is through JMX. In 10g this was an option and interfaced very well with industry-standard monitoring tools. In 11g, JMX is available for metrics outside core OBI, but not the core metrics themselves.

In addition to the out-of-the-box options above, here at RittmanMead we have developed our own OBIEE monitoring tool. DMS metrics are stored directly on disk or through a database, enabling both immediate and retrospective analysis. Custom dashboards enable the display of both OBIEE and OS data side-by-side for ease of analysis. Integration with third-party tools is also an option.

RittmanMead OBIEE monitoring tool Query Metrics

If you are running a test for a specific performance issue then capturing query metrics is important as these will feed into the diagnosis that you do including building a time profile.

A Logical SQL query has two elements to it for which we capture information:

  • BI Server (Logical SQL)
    • Query runtime
    • Total Rows returned from the database
    • Total Bytes returned from the database
    • Number of database queries
  • Database (x n Physical SQL queries)
    • Response time
    • Rows returned
    • Bytes returned

All of the above information can be obtained from the BI Server’s nqquery.log, or most of it from Usage Tracking tables S_NQ_ACCT and S_NQ_DB_ACCT.

For really detailed analysis you may want to capture additional information from the database about how a query ran such as its execution plan. On Oracle, Real Time SQL Monitoring is a very useful tool, along with several others. For further details, speak to a DBA … this is an OBIEE blog ;-)

Execute

And now the moment you’ve all been waiting for … it’s party time!

Here is a checklist to work through for executing your test:

  • Clear down the logs of any component you’re going to be analysing (eg nqquery.log, sawlog.log)
  • Record any configuration/environment changes from the original baseline
  • Record test start time
  • Restart the OBIEE stack (i.e. WLS, OPMN)
  • Start OS metric collection
  • Start OBIEE metric collection (if used)
  • Run the test!
  • Monitor for errors and excessively bad response times
  • Record test end time
  • Record any errors observed during test
  • Copy all logs, metric outputs, etc to a named/timestamped folder

Monitoring the test as it executes is important. If something goes wrong then time can be saved by abandoning the test rather than let the test script run to completion. There’s no point letting a big test run for hours if the results are going to be useless. Some of the things that could go wrong include:

  1. Your test script is duff. For example, there’s a typo in the login script and no users are logging in let alone executing dashboards. All your test will tell you is how fast OBIEE can reject user logins.
  2. Your test has hit a performance bottleneck in the stack. If you leave your test running beyond a certain point, all you’re doing is collecting data to show how bad things still are. If response times are flat on their back at 50 concurrent users, what’s the point leaving a test to run all the way up to 200? It’s best to curtail it and move swiftly on with the analysis and tuning stage
  3. Your test framework has hit a bottleneck in itself. For example, the host machine cannot sustain the CPU or network traffic required. If this happens then your test data is worthless because all you’re now measuring is the capacity of the host machine, not the OBIEE stack.

Monitoring for errors is also vital for picking up messages that OBIEE might start to produce if it is hitting an internal threshold that could constrain performance.

Don’t fiddle the books!

If your test execution doesn’t work, or you spot an improvement or fix – resist the biggest temptation which is to ‘just fix it’. Hours become days with this approach and you lose complete track of what you changed.

Take a step back, make a note of what needs fixing or changing, and document it as part of the full cycle.

There is nothing wrong whatsoever with aborting a test for the reason that “I defined it incorrectly” or “I forgot to change a config setting”. Better to have a half dozen aborted tests lying around showing that you got your hands dirty than a suspiciously pristine set of perfectly executed tests.

Don’t forget that pesky documentation

Always document your testing, including method, definition, and results.

  • You will not remember precisely how you ran the test, even a few days later
  • How can you identify possible confounding of results, without recording a clear picture of the environment?
  • If you find something unexpected, you can quickly seek a second opinion
  • Without things written down, people will not be able to reproduce your testing
  • Test results on their own are worthless; they are just a set of numbers.
  • If it’s not written down, it didn’t happen
What next?

With a test run completed and a set of data collected, it’s time to make sense of the numbers and understand what they can tell us by analysing the results

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ
Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to keep comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Categories: BI & Warehousing

Performance and OBIEE – part I – Introduction

Rittman Mead Consulting - Mon, 2013-03-18 15:00

Performance matters. Performance really matters. And performance can actually be easy, but it takes some thinking about. It can’t be brute-forced, or learnt by rote, or solved in a list of Best Practices, Silver Bullets and fairy dust.

The problem with performance is that it is too easy to guess and sometimes strike lucky, to pick at a “Best Practice Tuning” setting that by chance matches an issue on your system. This leads people down the path of thinking that performance is just about tweaking parameters, tuning settings, and twiddling knobs. The trouble with trusting this magic beans approach is that down this path leads wasted time, system instability, uncertainty, and insanity. Your fix that worked on another system might work this time, or a setting you find in a “Best Practice” document might work. But would it not be better to know that it would?

I wanted to write this series of posts as a way of getting onto paper how I think analysing and improving performance in OBIEE should be done and why. It is intended to address the very basic question of how do we improve the performance of OBIEE. Lots of people work with OBIEE, and many of them will have lots of ideas about performance, but not all have a clear picture of how to empirically test and improve performance.

Why does performance matter?

Last and only Stopwatch graphic, promise! (image src)

Why does performance matter? Why are some people (me) so obsessed with testing and timing and tuning things? Can’t we just put the system live and see how it goes, since it seems fast enough in Dev?…

Why performance matters to a project’s success
  • Slow systems upset users. No-one likes to be kept waiting. If you’re withdrawing cash from an ATM, you’re going to be quite cross if it takes five minutes. In fact, a pause of five seconds will probably get you fidgeting.
    Once users dislike a system, regaining their favour is an uphill battle. “Trust is hard to win and easily lost”. One of the things about performance is perception of speed, and if a user has decided a system is slow you will have to work twice as hard to get them to simply recognise a small improvement. You not only have to fix the performance problem, you also have to win round the user again and prove that it is genuinely faster.
  • From a cost point of view, poorly performing systems are inefficient:
    • They waste hardware resource, increasing the machine capacity required, decreasing the time between hardware upgrades
    • They cost more to support, particularly as performance bottlenecks can cause unpredictable stability issues
    • They cost more to maintain, in two ways. Firstly, each quick-win used in an attempt to resolve the problem will probably add to the complexity or maintenance overhead of the system. Secondly, a proper resolution of the problem may involve a redesign on such a scale that it can become a rewrite of the entire system in all but name.
    • They cost more to use. User waiting = user distracted = less efficient at his job. Eventually, User waiting = disgruntled user = poor system usage and support from the business.
Why performance matters to the techie

Performance is not a fire-and-forget task, and box on a checklist. It has many facets and places in a project’s life cycle.

Red Herring

Done properly, you will have confidence in the performance of your system, knowledge of the limits of its capacity, a better understanding of the workings of it, and a repeatable process for validating any issues that arise or prospective configuration changes.

Performance Goblin

Done badly, or not at all, you might hit lucky and not have any performance problems for a while. But when they do happen, you’ll be starting from a position of ignorance, trying to learn at speed and under pressure how to diagnose and resolve the problems. Silver bullets appear enticing and get fired at the problem in the vain hope that one will work. Time will be wasted chasing red herrings. You have no real handle on how much capacity your server has for an increasing user base. Version upgrades fill you with fear of the unknown. You don’t dare change your system for fear of upsetting the performance goblin lest he wreak havoc.

Building a good system is not just about one which cranks out the correct numbers. A good system is one which not only cranks out the good numbers, but performs well when it does so. Performance is a key component of any system design.

OBIEE and Performance

dotmatrix report

Gone are the days of paper reports, when a user couldn’t judge the performance of a computer system except by whether the paper reports were on their desk by 0800 on Monday morning. Now, users are more and more technologically aware. They are more aware of the concept and power of data. Most will have smartphones and be used to having their email, music and personal life at the finger-swipe of a screen. They know how fast computers can work.

One of the many strengths of OBIEE is that it enables “self-service” BI. The challenge that this gives us is that users will typically expect their dashboards and analyses to run as fast as all their other interactions with technology. A slow system risks being an unsuccessful system, as users will be impatient, frustrated, even angry with it.

Below I propose an approach, a method, which will support the testing and tuning of the performance of OBIEE during all phases of a project. Every method must have a silly TLA or catchy name, and this one is no different….

Fancy a brew? Introducing T.E.A., the OBIEE Performance Method

In working with performance one of the most important things is to retain a structured and logical approach to it. Here is mine:

  1. Test creation
    • A predefined, repeatable, workload
  2. Execute and Measure
    • Run the test and collect data
  3. Analyse
    • Analyse the test results, and if necessary apply a change to the system which is then validated through a repeat of the cycle

T.E.A.

The emphasis is on this method being applicable at any time in a system’s lifecycle, not just the “Performance Test” phase. Here are a few examples to put it in context:

  1. Formal performance test stage of a project
    1. Test : define and build a set of tests simulating users, including at high concurrency
    2. Execute and Measure: run test and collect detailed statistics about system profile
    3. Analyse : check for bottlenecks, diagnose, redesign or reconfigure system and retest
  2. Continual Monitoring of performance
    1. Test could be a standard prebuilt report with known run time (i.e. a baseline)
    2. Execute could be just when the report gets run on demand, or a scheduled version of the report for monitoring purposes. Measure just the response time, alongside standard OS metrics
    3. Analyse – collect response times to track trends, identify problems before they escalate. Provides a baseline against which to test changes
  3. Troubleshooting a performance problem
    1. Test could be existing reports with known performance times taken from OBIEE’s Usage Tracking data
    2. Execute Rerun reports and measure response times and detailed system metrics
    3. Analyse Diagnose root cause, fix and retest
Re-inventing the wheel

T.E.A. is nothing new in the overall context of Performance. It is almost certainly in existence elsewhere under another name or guise. I have deliberately split it into three separate parts to make it easier to work with in the context of OBIEE. The OBIEE stack is relatively complex and teasing apart the various parts for consideration has to be done carefully. For example, designing how we generate the test against OBIEE should be done in isolation from how we are going to monitor it. Both have numerous ways of doing so, and in several places can interlink. The most important thing is that they’re initially considered separately.

The other reason for defining my own method is that I wanted to get something in writing on which I can then hang my various OBIEE-specific performance rants without being constrained by the terminology of another method.

obXKCD

What’s to come

This series of articles is split into the following :

  1. Introduction (this page)
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ

I’m tempted to hyperlink these in the fashion of Choose Your Own Adventure and if you click straight from here onto the last section, Optimise, without having read the other parts first, it will redirect you back to them ;-)

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to keep the thread of comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Categories: BI & Warehousing

Performance and OBIEE – part IV – Test – Build

Rittman Mead Consulting - Mon, 2013-03-18 15:00

This article is the fourth in a series about performance and OBIEE. In previously articles I described how test design should be approached, and in this article I explain in detail how to use some of the tools available to build the tests.

JMeter

JMeter is open-source software maintained by the Apache Software Foundation. It is a flexible and powerful tool that has a GUI design interface that can run on any java-supporting desktop (e.g. Windows/Linux/Mac). The test execution can be run from the GUI, or “headless” from the commandline.

To get started with JMeter, simply download it, and uncompress the archive (tgz / zip). Go to the bin folder and doubleclick on jmeter.sh or jmeter.bat. This will launch JMeter.

There are plenty of JMeter tutorials on the web and I am not going to replicate them here. Any tutorial that demonstrates how to record a web browsing session into a JMeter script should suffice for creating an initial OBIEE load test. Below I will detail some specifics that it is useful to be aware of.

  • When a user clicks on a dashboard in OBIEE, the server returns shortly after the text “Loading…”. You need to make sure that your test doesn’t incorrectly accept this as the completed report. There are a couple of ways to do this:
    • Design your test so that it waits until the text isn’t “Loading…”
    • Set NewCursorWaitSeconds in instanceconfig.xml so that the “Loading…” text isn’t displayed before the query results are returned. See this blog post for more details
  • The JMeter Plugins are very useful for additional graphing options and better user control (eg jp@gc – Stepping Thread Group)
  • Try to use variables throughput to increase code reusability and reduce maintenance. It’s worth taking time to refactor a test that has evolved into something complex.
  • Use the Constant Timer object to add think time
  • Response Assertion steps are a very good way of ensuring that your test is getting the result it should at each stage. For example, to check the OBIEE login page is loading, check for Enter your user id and password..
  • Use the Not option in a Response Assertion to check for things that definitely shouldn’t be there, such as Odbc driver returned an error or Loading…
  • For a flexible test, parameterise the dashboard pages fetched. This is done in several stages:
    1. Add a CSV Data Set Config step, configured to read a TSV (Tab-Separated) file from a path you specify. NB relative paths in JMeter are relative to the folder that the JMeter script (JMX) resides

      If you want to end the test once each dashboard page has been run once, set Stop thread on EOF to False. To control the longevity of the test elsewhere, set Recycle on EOF to True

    2. Load OBI’s Catalog Manager and navigate to the part of the catalog for which you want to generate a list of dashboards. Go to Tools -> Create Report. Set the Type to Dashboard and columns: Folder and Dashboard Page Name.


      Save the resulting TSV file local to the JMeter script you have built.

    3. The TSV file is written with a header, and this acts as a source for the variable names, so in this case Folder and Dashboard Page Name. Use these in the Dashboard HTTP request
    4. Finally, use the following trick to ensure that each dashboard is recorded as a separately labelled transaction. Insert a BeanShell PostProcessor as a child of the HTTP Request with the following code:
      prev.setSampleLabel("Get Dashboard: " + vars.get("Dashboard Page Name"));
      
  • Validate your JMeter test response times by actually using the system yourself, both at single user and when the test is ramped up. If JMeter says response times are subsecond but you’re waiting 30 seconds to run a report, it’s possible that your JMeter test is not measuring the correct timings.
  • If a system under test is failing to cope, an increased response time and/or increase error rate is usually seen. To avoid unnecessarily excessive test durations, a test can be set to abort at a certain threshold, using the jp@gc – AutoStop Listener step.
  • A simple OBIEE JMeter script may have the following steps:
    A set of JMeter steps in a typical OBIEE script
  • A sample OBIEE JMeter jmx script can be downloaded from here

The other thing to remember when you are building your JMeter tests is where and how you will run them:

  • Be very concious of where you run your JMeter script from. If you run it over a poor wifi connection from a 3 year old laptop, don’t be surprised if your test apparently tells you that your 40-core OBIEE server only supports a handful of users. Rather than observing bottlenecks on OBIEE, you’ll actually just be measuring how easy it is to saturate your local machine’s CPU and/or network connection. Particularly around network, be aware that each dashboard request will pull back the complete dashboard content, which can be 100s of KB. Multiply that by the number of simultaneous users you plan to simulate, and you’ll quickly see that your JMeter script needs to run from a machine with a nice fat network pipe
  • JMeter can be run distributed across servers, controlled from a central one. This is a good way to simulate many users if you find a single machine cannot keep up with the load.
  • JMeter can be run from the command line if you are not able (or don’t want) to use a GUI.
BI Server + Database

The BI Server uses ODBC or JDBC for communication with clients (such as Presentation Services). We can use a client to send Logical SQL requests to it and record the response time :

nqcmd

nqcmd is the ODBC client that is provided with OBIEE. You can wrap it in a series of shell scripts to create a complete test

This method is as simple as extracting the Logical SQL for a report, putting it in a file, and referencing this file in invoking nqcmd.

Extracting the Logical SQL can be done from several places including:

  • Usage Tracking
    select query_blob from s_nq_acct;
    
  • nqquery.log
    You’ll find this file in the path $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIServer/coreapplication_obis1/ (it will be different for deployments using scaleout).
    Logical SQL in nqquery.log

    You can also view and download nqquery.log through Enterprise Manager

  • Answers editor
    This is particularly useful for extracting Logical SQL in a report you’re working on or investigating.
    Go to Edit Report and then the Advanced tab:
    Logical SQL in the Advanced tab of Answers

Having extracted the Logical SQL, write it to a file, for example test_report.lsql. I use the suffix lsql (Logical SQL) to differentiate it from normal SQL (‘Physical’ SQL).

Now you can send the Logical SQL to the BI Server from an ODBC/JDBC client, of which the most commonly used with OBIEE is nqcmd.

To use nqcmd, you first need to set up the OBI environment variables. On Linux/Unix this is done by “dot-sourcing” a file which sets the environment for the current shell. The first character to type is dot, then space, then the path to the bi-init.sh script:

 cd $FMW_HOME
 . instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

On Windows, you can just run bi-init.cmd and it will open a new command window for you to use, with the environment correctly set.

Once the environment is set correctly, call nqcmd, supplying:

  • The DSN of the BI Server
    • on Linux/Unix as defined in $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/odbc.ini
    • In Windows, it’s defined in Administrative Tools -> Data Sources
  • Your OBIEE username and password
  • The path of the Logical SQL to run:
$ nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s test_report.lsql

This will execute the Logical SQL and return the results to your screen. With a few more tweaks we can easily get some performance data:

  • Unless you want to see the data, use the -o flag to route it to a file, or -q for no output. Remember if there is lots of data from the query there will be a time overhead in sending it all to your console (or to file with -o)
  • Use the undocumented but very useful -T flag to return a detailed timings breakdown.
  • On Linux/unix prefix the nqcmd call with time to give a report of the time taken by the whole command:
    $ time nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -T -s test_report.lsql -q
    
    -------------------------------------------------------------------------------
                  Oracle BI ODBC Client
                  Copyright (c) 1997-2011 Oracle Corporation, All rights reserved
    -------------------------------------------------------------------------------
    
    Connection open with info:
    [0][State: 01000] [DataDirect][ODBC lib] Application's WCHAR type must be UTF16, because odbc driver's unicode type is UTF16
    
    […]
    
    Clock time: batch start: 08:08:07.000 Query from: 08:08:07.000 to: 08:08:07.000 Row count: 36
     total: 0.046 prepare: 0.003 execute: 0.034 fetch: 0.009
    Cumulative time(seconds): Batch elapsed: 0 Query total: 0.046 prepare: 0.003, execute: 0.034, fetch: 0.009, query count:  1, cumulative rows:  0
    
    ----------------------------------------------
    
    Processed: 1 queries
    
    real    0m1.783s
    user    0m0.060s
    sys     0m0.017s
Running Logical SQL through JMeter

You can build a more flexible and scalable test rig based on the same design as above using JMeter with the OBIEE JDBC driver.

Take the OBI JDBC driver file bijdbc.jar from $FMW_HOME/Oracle_BI1/bifoundation/jdbc/ and put it in the JMeter /lib folder, or specify it in JMeter’s root test step under Add directory or jar to classpath.

In JMeter add a User Group and within it:

  • JDBC Connection Configuration
    • Variable Name : eg, OBI
    • Validation Query : blank
    • Database URL: jdbc:oraclebi://<host>:9703/ (see the jdbc README.txt for full syntax of the URL including clustering)
    • JDBC Driver class: oracle.bi.jdbc.AnaJdbcDriver
    • Username + Password : Valid OBIEE logins
  • JDBC Request
    • Variable Name : OBI (or whatever you specified in the config above)
    • Query Type : Select Statement
    • SQL Query: <Your Logical SQL>

Add a listener to visualise the results, such as jp@gc – Response Times Over Time (part of JMeter Plugins)

Since this is JMeter, you can easily parameterise this test, running a predefined set of Logical SQL extracted from Usage Tracking or nqquery.log.

Download the JMeter JMX script here.

Database

If nothing in your OBI configuration has changed and you just want to test the performance of the database, then you can just use standard database performance test tools. Take the Physical SQL from either the nqquery.log or (in 11.1.1.6+) table S_NQ_DB_ACCT and run it against the database.

For example, using sqlplus, the input file would look like this, with a set termout off to supress the rows, and exit to quit sqlplus after execution:

$ cat test_report.sql 
set termout off

select sum(T233609.DEPARTURES_PERFORMED) as c1,
     T233732.Cal_Month as c2,
     T233804.ROUTE as c3
from      BI_AIRLINES.OBIEE_ROUTE T233804 /* 18 ROUTE */ ,     BI_AIRLINES.OBIEE_TIME_MTH_D T233732 /* 41 TIME MONTH */ ,     BI_AIRLINES.DOMESTIC_SEGMENT T233609 /* 01 DOMESTIC Segment */
where  ( T233609.DEST = T233804.DEST and T233609.MONTH = T233732.Cal_Month and T233609.ORIGIN = T233804.ORIGIN and T233609.QUARTER = T233732.Cal_Qtr and T233609.YEAR = T233732.Cal_Year )
group by T233732.Cal_Month, T233804.ROUTE;

exit

Run it with time prefix to see how long it takes:

$ time sqlplus BISAMPLE/BISAMPLE @test_report

SQL*Plus: Release 11.2.0.3.0 Production on Wed Feb 6 22:32:41 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

real    0m0.669s
user    0m0.028s
sys     0m0.083s
LoadTest.jar

Snuck in with the other goodies of SampleApp v207 is LoadTest.jar. You’ll find it in /home/oracle/scripts/loadtest, along with a readme file detailing how to use it. It takes as input a single Dashboard path, and runs concurrent user requests against it

[oracle@obieesampleapp loadtest]$ ./runtest 
Start time: Wed Feb  6 22:42:39 GMT 2013
Load Test Starting...

----------------------------------------------
Creating User Sessions for Concurrency Test..
Total active sessions: 54

Initiating Queries..
Total queries initiated: 54

Cleaning up User Sessions created for Concurrency Test..
        - Remaining Active Sessions: 54
Completed User Sessions Cleanup
----------------------------------------------

Load Test Completed...
End time: Wed Feb  6 22:42:45 GMT 2013

It requires a list of usernames (all having the same password) and these users to have been created in the security directory beforehand.

What next?

With a test designed and built we’re now ready for launch … read on to see how best to make sure all the important metrics are captured with your test when you execute it.

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ
Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to the comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Categories: BI & Warehousing

Performance and OBIEE – part III – Test – Design

Rittman Mead Consulting - Mon, 2013-03-18 15:00

Having defined our test, we can now design it. Wordplay? Not really. Regardless of how we run our tests, we’ve got to define up front what it is we are testing. Once we’ve got that, we can then work out how to actually create a test that simulates what we’ve defined.

Pretty boring huh? Can’t we just, y’know, like, run LoadRunner a bunch of times?

Sure, go ahead. But unless everything is hunky-dory and passes with flying colours, you’ll only end up retrofitting the definition to your test in order to properly analyse the problem and create further tests to get to root-cause. Therein lies the path to madness, and more importantly, wasted time and poor quality tests.

Don’t jump the gun and wade into the ‘fun’ part, otherwise you’ll spend too much time trying to figure out what to build into your test on the hoof instead of carefully working it out up front.

The OBI stack

The key to designing a good performance test, and comprehending the results from it, is understanding the OBI stack. Doing so enables us to inject tests into the OBI stack at various points to focus our testing and reduce the time it takes to build a test. Being able to pinpoint where the work happens is also essential to focussing troubleshooting.

At a very high level, the stack looks like this:

High level view of the OBI stack
(NB this is not an exhaustive diagram, for example, it doesn’t include any alternative security providers)

The user accesses OBI through their web browser, which connects to WebLogic. Weblogic is a web and application server. It connects to OBI which in turn generates queries to run on the database

Examining the stack in a bit more detail, it looks like this:

OBI stack

When the user interacts with OBI all the web browser calls are done through WebLogic. WebLogic runs Java deployments including one called analytics which communicates with the Presentation Services OBI component. When the user requests a dashboard, the following happens:

  1. Presentation Services takes all of the constituent analyses in the dashboard and sends the Logical SQL for each one to BI Server
  2. BI Server runs each Logical SQL through the RPD and generates one or more Physical SQL queries to run on the Database
  3. Once the Database has executed all of the queries required for one of the Logical SQL statements, BI Server processes the results and does any further calculations, federation, and so on. A single resultset per Logical SQL request is passed back to Presentation Services.
  4. Presentation Services receives the data from BI Server, sends any chart requests to Javahost, and does additional processing on the data for things like pivot tables. It then sends the rendered data for each analysis back to the web browser.
  5. In the web browser each analysis in the dashboard can receive its data independently. As the data is received, the browser will render it.

Logical SQL in OBIEE

For each Logical SQL request from Presentation Services, there will be one or more Physical SQL queries, but just one resultset passed back to Presentation Services. For each Dashboard there may be many Logical SQL queries run.

OBI stack

£5 for the hammer, £95 for knowing where to hit it

Building an accurate, flexible performance test rig takes time and therefore it is wise to test only what needs testing. For example:

  • If our performance test is for a greenfield OBI deployment then we’d be pretty foolish not to conduct an end-to-end test. We need to make sure that all the components are performing together at the required workload.
  • However, what about if we’re testing a existing mature deployment where there’s a specific problem in the database? We could do an end to end test, but if all that’s changing is the database (for example, new indexes) then we should probably just focus our test on the database.
  • What about if the RPD is changing in an existing deployment and nothing else? Is there really any point performance testing the response time all the way to the user? If the data returned is the same then we can presume that the performance in Presentation Services and back to the user is going to be the same. Therefore we can build a more streamlined (and thus efficient) test against just the BI Server (because the RPD has changed) and the database (because a changed RPD could generate different Physical SQL).

The flipside of not testing what doesn’t need testing is to test what does need testing and may get overlooked. For example, are your users connecting over 3G yet you’re conducting your performance tests on a LAN?

Tools & Designs for OBI performance testing

toolshed

There are several different ways to build a performance test for OBIEE. I have described them here as well as suggested where each one is most appropriate to use.

End-to-end (almost)

Traditional load testing tools such as JMeter, Oracle Application Testing Suite (OATS), and HP LoadRunner can be used to simulate a user clicking on a report in a web browser and recording how long it takes for the report to run. They can have multiple report definitions, and run many concurrently. This is usually sufficient for a comprehensive performance test. One should be aware when using these that there is still an element that has not been tested:

  1. Outbound. A user interacting with OBI uses a modern web browser with lots of asynchronous traffic (AJAX). For example, clicking on a prompt to get a list of values, clicking on the dashboard menu to see a list of available dashboards. A standard load test will typically only send the HTTP request necessary to trigger the dashboard refresh itself.
  2. Inbound. A real user’s web browser will receive the data from a dashboard as it is returned from the OBI server and render it. Once it’s finished rendering, it is viewable to the user. If it is a large pivot table with hundreds of thousands of cells (“because that’s how the Excel report we’re duplicating does it”) then this might take many seconds or even minutes – or even not work at all. But a load test tool will parse the data that comes back from the server, and once all the data’s received, it will mark that as the completed response time – the time to render is not included.

How traditional load testing tools differ from real users

This is not to dismiss this approach of test design – generally it is a very good one. These are the subtleties that a standard load test can overlook. The AJAX dashboard menu call might not take long, but the list of values might be a hidden bottleneck if it hits an unoptimised data source (for example, SELECT DISTINCT DIM_VALUE FROM 60M_ROW_UNAGGREGATED_DIM_TABLE).

It is important that the above limitations are noted, and the test design adjusted to compensate where these weaknesses are observed (for example, large pivot tables, list of values, etc).

Presentation Services via Web Services

Snuck in with the other goodies of SampleApp v207 is LoadTest.jar. You’ll find it in /home/oracle/scripts/loadtest, along with a readme file detailing how to use it. It takes as input a single Dashboard path, and runs concurrent user requests against it. Since it is using a dashboard path we can presume it is interfacing with Presentation Services, and my guess is that it utilises the OBIEE web services to generate the load.

This utility will presumably have the same limitation as JMeter, LoadRunner et al in that page render times are not included in the response time. LoadTest.jar also only returns timings at a total level for a batch of concurrent users – how long does it take, in total, for x users to run a given dashboard. Individual response times from the BI Server for the constituent queries can be seen in Usage Tracking. LoadTest.jar could also be invoked multiple times to simulate multiple users/dashboards and the timing for each invocation be recorded to get a more granular set of response times. Finally, LoadTest.jar can only fetch a dashboard in its default state – there is no apparent way to change prompt values or interact with hierarchical columns, drill paths, etc.

An alternative to LoadTest.jar for this type of testing could be using JMeter to invoke the web services, which could give more flexibility (at the expense of time to set it up and complexity to maintain).

BI Server + Database

The BI Server supports ODBC or JDBC for communication with clients (such as Presentation Services). We can use a client to send Logical SQL requests to it and record the response time. Because Presentation Services also sends Logical SQL to BI Server this test approach is a good one because from the BI Server + Database point of view, the behaviour is exactly the same whether the Logical SQL comes from Presentation Services or another client (eg injected by us for performance testing).

nqcmd is the ODBC client that is provided with OBIEE. You can wrap this in a series of shell scripts to create a complete test, or you could build a test using JMeter with the OBIEE JDBC driver for a more flexible and scalable test rig.

What this method doesn’t test is Presentation Services and upstream, so the additional processing of resultsets, passing the data back to the web browser, and the rendering of it. But, if you are targeting your testing at just the RPD and Database then this is a very good way to avoid unnecessary complexity.

Database

If nothing in your OBI configuration has changed and you just want to test the performance of the database, then you can just use standard database performance test tools. Take the Physical SQL from the report being tested and run it against the database using your database’s standard tools.
This is useful if you are doing tuning work at the database level (partitioning, indexing, etc), but only use this approach if nothing is changing in your RPD. As soon as the RPD changes, different physical SQL could potentially result.

Remember also that the response time from a physical SQL executed on the database is not the same as the response time of a report – it still has to go back up through the stack back to the user’s web browser. For a complex report or a large number of rows this can add significant time.

The only genuine end-to-end test – a stopwatch! Monkey

I kid you not – before spending weeks and £££ building complex LoadRunner scripts, stop and think about the testing you are doing. If you want to test all the dashboards with a single user concurrency – find a willing monkey to sit there for 20 minutes and run the dashboards. Maybe you need to test multiuser concurrency but can make do with fewer dashboards tested this way than for a single user.

This is also a very easy way to determine page render times. As I discussed in the paragraph above, traditional load test tools such as LoadRunner and JMeter only measure the time taken for an OBI dashboard to return data and not how long it takes to render it. For the user obviously the only time they are going to be interested in is the complete time including rendering. So using a stopwatch (or more scientifically, inbuilt browser tools such as Chrome’s Developer Tools) you can measure the complete time as the user will experience it.

Which tool?

In an OBI performance test lifecycle I would expect to use several of the different tools described above. Typically a greenfield test will look at the performance of the stack end-to-end. If there are problems with the end-to-end, or if the testing is in response to a problem in an existing system, I would use a response time profile (of which more later) to understand where the bottleneck lay and focus the testing (and tuning) there. For example, if the bottleneck appeared to be the SQL that the RPD was generating I might use nqcmd to validate changes I made to the RPD and/or the database model

Test Design tips

Whichever method you use, bear the following in mind:

  1. Generating a big load is easy; generating a big load that provides useful data is more difficult. For example, when a test shows 100 concurrent users running 10 different dashboards doesn’t work, how do you break it down into components to see where the bottleneck is? A good test design will take this into account up front and include granular response times and instrumentation.
  2. K.I.S.S. : Sometimes a set of shell scripts run concurrently will do just as good a job of applying a high load to OBI as a “proper” load testing tool can. The time command will always give you the duration of an execution, and you can modularise sets of shell scripts to your heart’s content.
    1. If you use shell scripting, don’t forget to KISS – a complex mess of shell scripts isn’t Simple either.
  3. You may see a conflict between the above points. How do you build a test that is granular and well instrumented, yet isn’t overly complex? Well, therein lies the gold ;-) Because all performance tests are different, I can only outline here good approaches to take, but you have to walk that line between complexity and flexibility in your design.
Heuristic and Holistic performance testing

There is more than one way to evaluate the performance of OBIEE. Tools like those I have described above are great for simulating known workloads with a fixed definition. Unfortunately, all of these assume a known set of reports and predicates, and user behaviour. What about testing a system where there is ad hoc capability and/or a great number of possible predicate values and thus execution plans?

Here I would recommend a thorough review of the system’s implementation, with the view to determining possible bottlenecks before they occur. It is more difficult to prescribe this area as it comes down to a combination of personal experience of good design, careful evaluation of right practices, and the ability to isolate and test particular configurations.

The difficulty with this type of testing is that is produces less tangible data with which to say yea or nay on a system’s ability to perform. It means that the audience for the performance test report must be receptive to “It Depends”. Just because a table doesn’t have a particular index doesn’t mean in itself that the performance test is a failure and the index must be defined before go live. What it does mean is that the missing index may cause a worse execution plan which may require additional database resources which may limit how many queries can run concurrently before a bottleneck is reached.

For specific suggestions of what to examine in this type of evaluation, see here. I am well aware of the apparent contradiction in advocating a heuristic approach here yet criticising “best practices”. My problem with best practices is when they are treated as absolutes and without context; they normally originate from good intentions and are subsequently abused. In the context of evaluating a system’s performance that is impossible to comprehensively test (eg ad hoc report workload) then looking at how past implementations of the technology is the only sensible option. The success of these past implementations may or may not be encoded as ‘best practices’; the point is that they are evaluated in a conscious manner and not followed blindly.

Resource management

To reduce the risk to a Production system of running ad hoc or untested reports, resource management at the database level could be used. On Oracle, Database Resource Manager (DBRM) enables queries to be allocated a greater or lower priority for CPU on the database, or limit the execution time. Fixed reports which have been through a strict performance QA process could be given greater resource since the effect has been quantified and analysed in advance, whilst ad hoc or unproven reports could be more restricted.

To implement this, pass through the report or user name through in the connection pool, and then define a Resource Manager rule to assign the resource management policy accordingly.

Resources can also be controlled at the BI Server level by setting restrictions on query runtimes and rows returned on a per-application role basis.

What next?

Now that we’ve defined the test we want to perform, and have designed how it will be implemented, we move onto actually building it, here

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ
Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to retain the discussion in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Categories: BI & Warehousing

Performance and OBIEE – part II – Test – Define

Rittman Mead Consulting - Mon, 2013-03-18 15:00

Take the first tentative steps along the yellow brick road of investigating performance and OBIEE with this jump into the nitty gritty of how we should define a performance test.

An overview of defining a performance test

The first step in assessing the performance of OBIEE is defining the test we are going to use. For troubleshooting a performance problem in Production, this might be a single report. For performance testing a new system it might be more complex. The process to define it needs to be the same though.

Let us consider for a moment what it is we want to achieve through a performance test. It should be a validation – or possibly determination – of some or all of the following, which may or may not be formally documented in a project as ‘Non-Functional Requirements’ (NFRs):

  • Report response times
  • User concurrency levels
  • Hardware utilisation

It doesn’t sound too difficult does it? We can run a report and time how long it takes, and we can run the same report multiple times in parallel and see how long it takes then, and around all of that we can gather system OS stats.

This seems to be the level at which some people perceive the complexity of performance testing; that is, not at all.

But, to put any weight in the above results would presume that the system under test had :

  • Only one report
  • No parameters or other interactivity on the report
  • Every user runs the report at the exact same time

How many OBIEE implementations do you know that match this description?

As soon as you accept this premise, then you have to accept the need to design your performance test properly if you want any kind of confidence in the results that you get.

Of course, there is the chance – [sarcasm]vastly improbable and unheard of I know[/sarcasm] – that Performance Testing is a tick box on a project plan (normally about a week before Go Live), from which there is the expectation that it can deliver the confidence of an accurate test with the simplicity and brevity of the one-report test I describe above. If that is the case then I hope this article will explain why it is a fallacy.

Walk before you run

There are three key things to aim for in the design of a performance test:

  1. Repeatability. You must be able to rerun the test, so that you can (a) prove bugs/performance problems have been fixed, and (b) re-use the test as a baseline for future measurements
  2. Accuracy. This can be difficult with a stack as complex as the OBIEE one, but you must do all you can to avoid conflation and misinterpretation of data.
  3. Ease of Analysis. If the result of your test is unsuccessful, you will have to start analysing the issue. If your test is unnecessarily big or complex then it will be very difficult and time consuming to pare it back to find root cause.

Blunderbuss Avoid the blunderbus approach. It’s tempting (and fun!) to get carried away building a huge test and run it with a big number of users – but if it brings the system down then the only immediately conclusion you can draw is that you can crash the system. To get to any kind of root cause of the limitation you will have to run more tests, and you will have to simplify the test that you did run to pinpoint the actual issue(s).

Build up small tests with single users and check things work at this level first. You can then combine these together into larger tests for load testing. By working this way you get the added benefit of a better understanding of the workings of your load test and system being tested, as you will have gained this experience from the smaller tests.

Think of it as assembling a car’s engine for yourself versus buying a car out of the showroom. If it breaks (or you want to tune it), you’ll be in a better place to start if you have been under the bonnet [hood] already.

Define a User

The first step in defining a good test is to define extremely precisely and clearly what it is we are testing. This section’s heading I take from Zed Shaw’s superb article “Programmers need to learn statistics…”. For me it encapsulates in three words the challenge that we face in designing a good test. We know we want our system to support more than one user, and we probably have a user concurrency figure (of which more below) that the system must support. So, how do we define ‘a user’? To quote from the aforementioned article, do we just “click around a bit, you know, like a user”? Maybe we run the OBIEE-equivilent of SELECT 1 FROM DUAL a thousand times simultaneously, and then declare our system can support a thousand concurrent users?

The challenge is that good test or not, test results and system capacity will often be boiled down to a discussion of simple “user” numbers. Will the system support 100 users? 200 users? And so on.

Modelling the workload of a transactional (OLTP) system is relatively straightforward in that there is a finite number of transactions, even if it is a large number. When we consider how to model an OBIEE user in a performance test, we have additional challenges:

  • Navigation paths
    • In an OLTP system such as an online store once a user clicks on ‘add to cart’ we can be pretty sure they will then either (a) carry on shopping or (b) go to checkout. What about a BI user once they’ve run their first report, which will they run then? Maybe a different report, but quite likely it will be navigation within the existing report (eg drill down, prompt changes, etc)
  • Ad hoc
    • Whilst some reporting will be fixed reports, much business value from BI comes in the analysis and exploration of data in an ad hoc manner to discover trends etc to give competitive advantage. These ad hoc queries cannot, by definition, be known in advance, so the performance test definition must find a way to address this.

Think about how many types of users you have on your system, and how you would characterise and group them. There will be different sets of data they access, and varying degrees of interactivity with which they use the tool. Some will be using pre-built reports with fixed filters, other may be doing extensive ad-hoc analyses, and somewhere in the middle will be pre-built reports but with a fair amount of interactivity through functionality such as prompts and selection steps.

Define a Report

So for each of our User types that we define, we need to precisely define the workload for them. This will be made up of the execution of one or more dashboards/reports. Defining the report that our model users are going to run is not as simple as it may appear. As well as the distinction to make between individual reports (Analyses/Answers) and Dashboards, we also need to consider:

  • Filters
    • How do we model a report which has six possible filters? We can’t just call that ‘report A’, because each filter permutation could cause different SQL to be generated by the BI reporting tool and consequently vastly different execution plan(s) on the database where the query runs.
  • Aggregate navigation
    • One of OBIEE’s many strengths is its aggregate navigation capability, enabling the user to drill down into data whilst in the background the tool switches between pre-defined aggregate tables on the database to pick the fastest one to use. For performance testing we need to make sure that within a report each possible aggregate is considered separately.

As an example, let us examine a dashboard from SampleApp.

SampleApp Dashboard 3.10

At an initial glance, we could specify our “Report” as simply being “Dashboard 3.10”. But…the dashboard is made up of three analyses (plus some static text/image objects)

  • “Product Line Analysis”
  • “Ragged Hierarchy”
  • “Value Based (Parent Child) Hierarchy”

SampleApp Dashboard 3.10 breakdown

We need to consider each of these reports individually. In addition, there are hierarchical columns which, like Prompts, must be included in our test specification. Here’s one of the reports from the dashboard in two different states of hierarchical navigation.

Collapsed - 3 DB queries
Expanded - 21 DB queries

The first version, mostly collapsed, requires 3 queries against the database. The second version requires 21 queries! Clearly the performance of the two different versions of the report is going to differ.

Depending on the hierarchical column states, depending on dashboard prompt values, depending on aggregate drill paths followed, the same individual report has the potential to behave quite differently.

Any report or dashboard that’s using prompts needs to have those prompt values stated in the test definition. The execution of a query on the database can vary considerably depending on the predicates supplied, and it is in part from the report/dashboard prompts that these predicates come.

Ad-hoc reports

Conspicuous by its absence from the above is how we deal with ad-hoc reporting with OBIEE in our performance tests. This is the elephant in the room when it comes to performance testing OBIEE, because we cannot go through the same methodical process of cataloging the workload – because it is ad hoc. This is discussed in detail later.

User workload

Having precisely defined the reports which are going to be run, we can sequence them together to represent different types of users. Clearly this is going to be a generalisation, an estimation, of how a particular group of users is going to tend to use the reports. Unlike an OLTP transaction, where the steps are much more predefined and free from possible deviations, users of OBIEE have the flexibility to run any number of reports, drill within existing reports, and so on.

Taking the above SampleApp dashboard as an example, one user (let’s call them the “Product Analyst”) might have the following test sequence:

  1. View Dashboard 3.10 in default state
  2. Drill into hierarchical columns of Product Line Analysis
  3. View 1.3 Sample Sales - Product Details

Some user types may just run one report and exit; they don’t have to have a complex sequence.

You don’t need to restrict yourself to one user type per actual group of users. You could model several different instances of each user group in order or increase the test coverage and realism of different reports running concurrently. Be aware that the more user types and tests you have, the longer it will take to build, and the more complex the troubleshooting could be.

Think Time

Users are not computers, which is a shame because it would make reasoning with them a lot easier. It also means that they do not just one run report after another. They run a report, look at the numbers, get a cup of tea, change a prompt value, have a chat with a colleague, and so on.

As part of your user workload, plan in some think times. Don’t spend too long thinking about it (geddit), because remember this user workload is an approximation of what a certain type of user is going to be doing.

User concurrency


Once we have designed a set of individual user workloads, we can bring these together into a larger test to assess the performance of the system as a whole under load. The aim of load testing is to assess what happens to the response time of reports, and hardware utilisation, as the number of concurrent users on the system increases. There are three key things to define in a load test:

  • the workload of each user
  • the different types of user
  • the number of concurrent users

The first two of these are discussed above. When it comes to user concurrency, we need to be very careful how we derive our figure, because it has a big impact on the scale of the tests that we run and the discussion of the outcomes.

First, make sure that you are clear in the definition of concurrency that you use. These include:

  • Concurrent user sessions logged into OBIEE, viewing dashboards, formatting reports, and so on.
  • Concurrently executing queries at the BI Server level, which could be a single user running one report or one dashboard.
  • Concurrently executing queries on the database, which could be a single user in OBIEE running a single report or dashboard requiring more than one SQL request to get its data.

Each of these have their own particular load on the system, but can easily differ by orders of magnitude. Typically a user concurrency figure will be derived from “bums on seats”, i.e. number of users. From this, a reckoning is made on how many will be using the system at one time. So unless it’s qualified otherwise, I would work with the first definition – concurrent user sessions logged into OBIEE. But if this is your definition, remember it when we come to simulating users later on …

For some reason, user concurrency is often a contentious number to define. With the greatest of apologies to Monty Python, I offer this paraphasing of a familiar conversation:

“yah, my system needs to support 100 concurrent users”

“100? I’d take that as a QUIET DAY for my system, bwahaha, I need to test at least 1,000 concurrent users!”

“1,000? You were LUCKY! My system would run a gazillion concurrent users at three o’clock in the morning, work twenty hours down t’pit and still run 5,000 concurrent users before having a handful of cold gravel for its tea”

There does seem to be BSD syndrome around user concurrency figures, with people always pitching high numbers. This is counterproductive. If you run a performance test using a completely simulated workload with an excessively high number of users as the target then the margin of ‘error’ (how accurately your test represents reality) inherent in the simulation approach will be magnified as the user concurrency increases. The net effect of this is that you cannot be confident that any problems that do happen will actually happen in real life. Conversly, if you are taking real workload and running it for many users, you can honestly say if it doesn’t work in testing, it won’t work in real life.

If you are worried about pitching your user estimate too low and so double it or stick a zero on the end just in case, consider the following: If you define your test as being a fail if it does not support your high figure of concurrent users – what then? You have to ask for money, either literally to buy bigger hardware, or in kind, to support the time to optimise the system you do have. It’s not a great place to be in, and one which may be unnecessary. A sensible approach is the following:

  1. Load test to confirm whether the system can support the realistic number of users in the short-term, e.g. go-live and a few months beyond
  2. Defer part of the load testing until after go-live, when you can run a test for medium/long-term user numbers using real user workload for your tests. The benefit of this is that your second test will be far more accurate because your simulation will not be based on how you think you system will be used but how you know is it being used.

If this approach doesn’t suit, then your alternative should be to

  1. take a realistic figure to start with for your first test
  2. run a second performance test to validate a future growth figure for your user base. If you don’t have a future growth figure, then you could consider the second test being to simply determine how many users your system can support by ramping numbers up until things start to creak.

The point of doing it this way is that the first test – assuming it gives a positive result – gives clarity and focus to your overall results. If the second test fails, then at least you have the confidence from the first with which to proceed with a decision. If you just run one big test then it’s an “all or nothing” approach, and if the test fails you have the option of either delaying launch until a successful test, or just crossing your fingers and hoping.

User concurrency numbers for an existing system

If you have an existing system then it is easy to determine your actual user concurrency through either Usage Tracking (you do have it enabled, don’t you?) or the OBI DMS metrics.

  • If your user concurrency figure is based on users logged into OBIEE, not necessarily running reports, then the metric Oracle BI PS Sessions : Sessions Logged In should be appropriate.
  • For the number of queries (logical SQL) concurrently running on the BI Server, use Usage Tracking (see below), or DMS metric t.b.c!
  • The number of concurrent queries on the database can either be obtained from your database’s monitoring tool, or through the Oracle BI DB Connection Pool DMS metrics.

To find out more about viewing and capturing OBI metrics over time, see here.

To estimate user concurrency from Usage Tracking, you can use the following query. It will analyse Usage Tracking to give an approximate number of concurrent users, but this is based on individual queries run on the BI Server, so it not the same as users simply logged into OBIEE, nor the same as dashboard refreshes (since one dashboard could be many queries):

-- Usage Tracking Concurrency usage estimate
-- Per minute granularity

SELECT TO_CHAR(start_ts, 'YYYY-MM-DD HH24:MI') AS start_ts_min, 
       COUNT(DISTINCT user_name)               AS distinct_users 
FROM   s_nq_acct 
GROUP  BY TO_CHAR(start_ts, 'YYYY-MM-DD HH24:MI') 
ORDER  BY 1 DESC    
Estimating user concurrency numbers

If you don’t have empirical numbers to use then 10% of your total user base is the rule of thumb often worked to.

User concurrency is multi dimensional

Just as there is no such thing as “A User”, to ensure an accurate performance test user concurrency figures should also be specified in terms of :

  • Types of concurrent user, for example:
    • “10 Concurrent Finance Power Users”
    • “2 Concurrent Finance Ad-Hoc Users”
    • “60 Concurrent HR Report Consumers”
  • Periods of system usage, for example:
    • Monday morning peak
    • EOD peak
    • ‘Average’

You may only actually test the peak user concurrency model, but it’s a good idea to have a picture of the system profile, particularly when analysing the impact and capacity for ad-hoc usage.

Summary : User and Report Definition

Remember:

Before you can measure something you really need to lay down a very concrete definition of what you’re measuring.Zed Shaw

Whatever kind of testing you are doing, aim to make it modular wherever possible. It will make both analysis and reuse of test code easier.
Almost all test definitions should be made up of one or more of the following levels:

  1. Specific Report or dashboard along with all prompt values, hierarchical column settings, etc. This is the base unit of your performance test.
  2. Then take these individual tests, and use them together to create an individual User Workload model for each user.
  3. Finally, combine the User models to produce an overall Concurrent User workload simulation.

See below for a worked example of a sample test definition.

Points to remember in your test definition:

  1. Whilst each dashboard does have its own name, a good idea is to use your own ID system to make it easy to refer to a specific dashboard state. For example, when using prompts there might be two or more versions of the dashboard to test (for different prompt values).
    It’s also easier to refer to something like “ID01” in conversation and notes than it is it “Finance Dashboard – Profitability EMEA blah blah blah”.
  2. Don’t forget default values in your navigation. If you are testing specific dashboard prompt values, you’ll need to also test the default values first – because that is what will get executed when the user first clicks onto the dashboard.
  3. As well as testing dashboard response times, factor in the usability of any prompts that the user will interact with. A fast dashboard is not quite so impressive if the user had to wait a long time for the dashboard prompts to display the available values.
  4. You need to be writing down nuts and bolts detail here. Resist the temptation to skimp on the detail or assume you’ll remember – you won’t, and certainly your colleague who can reuse your testing won’t appreciate having to fill in the blanks to make use of it.
What do you mean, not all of it has been tested?

ScreamBefore you run off to write down all your dashboards and types of user, consider the following. Unless you have :

  1. A very small system
  2. A lot of time, and/or an infinite number of monkeys

you are not going to be able to actually test all of your OBIEE system. Fact.

There are too many factors (prompt values, aggregate paths, ad hoc usage) to be able to model all plausible (let alone possible) permutations of these.

Designing the scope of a performance test needs to take into account the acceptable risk of things going wrong vs time/budget available vs functionality required. By functionality, I mean factors such as the breadth of dashboards delivered, or data made available.

  • In some systems the stability is paramount and therefore the relative risk must be minimised. If this is the case and your time/budget is constrained then therefore the functionality will have to be constrained.
  • If you need to deliver a lot of functionality and assuming that time/budget is fixed, then you will need to accept an increased risk to the stability of your system.
  • If you have limitless time/budget then you can deliver lots of functionality and do large and wide-ranging performance tests to ensure minimal risk.

The point here is an important one because it is one that needs discussing with the project owner, or more specifically, whoever will be accountable if things don’t go right in Production. For a standard OLTP system it is much easier (relatively) to design a performance test with a high degree of confidence that most [common] workloads have been simulated and tested. However, for a reporting system such as OBIEE there are two reasons why it isn’t so easily determined, as discussed above and re-iterated here:

  1. Permutations of fixed reporting – each dashboard can have multiple prompt dimensions, hierarchical column expansions. Users can follow many different drill and navigation paths from any one dashboard.
  2. Ad hoc reporting – by definition, the report is not known until the user builds it, so we cannot test it beforehand

But, all is not lost. Even if we accept that to test every permutation of report that might run is impossible, it is still perfectly realistic to perform a thorough performance assessment of a system prior to go-live. The most sensible option is a hybrid approach of:

  • Traditional performance testing of fixed reports as described above
  • Testing predefined ad-hoc reports (if that isn’t a paradox)
  • Holistic and heuristic evaluation of the system design and data model
  • Resource management to curtail the impact of bad queries
  • Active monitoring to identify bad queries and deal with them (using the T.E.A. method)
  • Feedback loop : as system usage becomes clearer, refine original performance testing using a more accurately informed model (using the T.E.A. method)

This may seem a bit of an excuse for a cop-out. If we can’t performance test all our system and be confident in the results, what’s the point? Here, I defer to Cary Millsap, a great proponent of the rigorous, scientific, and rational approach to performance testing:

How can you possibly test a new application enough to be confident that you’re not going to wreck your production implementation with performance problems?

You can model. And you can test. However, nothing you do will be perfect. It is extremely difficult to create models and tests in which you’ll foresee all your production problems in advance of actually encountering those problems in production.

Some people allow the apparent futility of this observation to justify not testing at all. Don’t get trapped in that mentality. The following points are certain:

  • You’ll catch a lot more problems if you try to catch them prior to production than if you don’t even try.
  • You’ll never catch all your problems in pre-production testing. That’s why you need a reliable and efficient method for solving the problems that leak through your pre-production testing processes.

Somewhere in the middle between “no testing” and “complete production emulation” is the right amount of testing. The right amount of testing for aircraft manufacturers is probably more than the right amount of testing for companies that sell baseball caps. But don’t skip performance testing altogether. At the very least, your performance test plan will make you a more competent diagnostician (and clearer thinker) when it comes time to fix the performance problems that will inevitably occur during production operation.

An example OBIEE test definition

Here’s an example test definition, put together from SampleAppv207.

What we are going to model is two groups of users. One likes pretty pictures and just runs two chart-based dashboards and then logs out. The other is interested in numbers and does some filtering on an analysis dashboard.

First, we’ll define the dashboards in use. For brevity, I’ve omitted the constituent analysis details, but in normal practice it’s a good idea to include them (for future reference, ease of understanding, clarity, etc).

Dashboards Dashboard : ID01
  • Dashboard : 3.20 Vanilla Visuals
  • Page : Standard Visuals
  • Path : /shared/3. Analysis and Dashboards/_portal/3.20 Vanilla Visuals
  • Dashboard Prompts: None
  • Analyses :
    • /shared/3. Analysis and Dashboards/Report Views/Gallery/Examples of Standard Visualizations
Dashboard : ID02
  • Dashboard : 3.20 Vanilla Visuals
    • Page : Trellis-Product Sales
    • Path : /shared/3. Analysis and Dashboards/_portal/3.20 Vanilla Visuals Trellis-Product Sales
    • Dashboard Prompts: None
    • Drill state : Default (NB each chart can be clicked on to drill into detail)
    • Analyses :
      • /shared/3. Analysis and Dashboards/Report Views/Trellis/Revenue Breakup
      • /shared/3. Analysis and Dashboards/Report Views/Trellis/Product Cost and Unit Price Trend
      • /shared/3. Analysis and Dashboards/Report Views/Trellis/Brand Cost and Unit Price Trend
        • Offices Region Hier : All collapsed, except AMERICAS expanded one level
Dashboard : ID03
  • Dashboard : 1.3 Sample Sales
  • Page : Brand Analysis
  • Path : /shared/1. Quick Demos/_portal/1.3 Sample Sales Brand Analysis
  • Dashboard Prompts:
    • T05 Per Name Year : Default (2008, 2009, 2010, 2011)
    • D4 Company : Default (none)
    • T31 Cal Week : Default (between 0 and 53)
    • “Products”.“Products Hierarchy” : Default (none)
  • Analyses :
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Product Line Analysis
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Trending Analysis
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Gauges
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Top Customers Bookings
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Daily Revenue Timeline
Dashboard : ID04
  • Dashboard : 1.3 Sample Sales
  • Page : Brand Analysis
  • Path : /shared/1. Quick Demos/_portal/1.3 Sample Sales Brand Analysis
  • Dashboard Prompts:
    • T05 Per Name Year : 2010, 2011
    • D4 Company : Stockplus Inc.
    • T31 Cal Week : between 1 and 27
    • “Products”.“Products Hierarchy” : Default (none)
  • Analyses :
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Product Line Analysis
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Trending Analysis
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Gauges
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Top Customers Bookings
    • /shared/1. Quick Demos/Simple Demo Dashboard/Brand Analysis/Daily Revenue Timeline

Tip: Use Catalog Manager to generate a report of dashboards and their constituent analyses to make the creation of the above specification easier. :
Catalog manager
Catalog manager
Catalog manager

User Workload

Having defined the dashboards (ID01 – ID04), we define the user types and workloads:

User A
  • Description: Department X, executive
  • Workload model:
    1. ID01
    2. Think time: 60 seconds
    3. ID02
    4. Think time: 60 seconds
    5. Logout
User B
  • Description: Department Y, analyst
  • Workload model:
    1. ID03
    2. Think time: 120 seconds
    3. ID04
    4. Think time: 90 seconds
    5. Logout
Concurrency

Our concurrent user model looks like this. I am defining concurrent users as active users, that is, executing dashboards or simulated thinking.

Peak usage
  • User A : 5 concurrent users
  • User B : 2 concurrent users
Light usage
  • User A : 3 concurrent users
  • User B : 1 concurrent users
Types of testing

Before we can consider the types of test, we need to define what it is we are testing, hence the long section above. Once you’ve identified the types of user and specific reports they will run, you should also look at the type of test you want to run. The terminology is frequently used interchangably, so here’s my version of it:

  • Performance testing
    • With a single user, is the response time acceptable?
  • Load testing
    • With concurrent users, is the response time acceptable and stable?
  • Capacity testing
    • How does the system behave when concurrent users are simulated? do any areas show signs of reaching capacity of resource?
  • Soak testing
    • How does the system behave if the concurrent user workload is left to run for many hours? For example, are there any memory leaks lurking, or connections which are not closed down properly and will eventually run out?
  • Stress testing
    • As we increase the number of concurrent users beyond that required (and already tested successfully), how far will the system go until something breaks or response times suffer?
    • In a high-availability deployment, what happens to response times if we simulate partial stack failures? Does the system handle this gracefully under load?

These types of test do not have to be mututally exclusive in execution, but in defining your testing make sure you have covered them all, or explained why they are not necessary. An example would be performance testing a single report change on an existing system would probably not warrant a full scale soak test.

In summary

The output of this phase of the method should typically be a set of test definitions, not one single one. As already discussed, keeping tests modular and simple is a much better approach than trying to kill all the birds with one boulder.

Footnote: The soft squidgy side of performance testing Tangent: Enterprise change control processes

Taking the premise discussed above – that it is impossible to test every possible report execution – raises an interesting question around change control and the stability of a Production environment. If your company is one where releases must be triple tested and signed off in the blood of your firstborn the you might want to have this conversation sooner rather than later. Why? Well if your system has any element of ad-hoc usage then there is a chance that a user will run a query that disrupts other users on the system. There is always the potential for the perfect storm of it being a critical business period, high user traffic, and a rogue ad-hoc query. Without proper understanding of the nature of ad-hoc queries, an almighty witch hunt can ensue, looking for who broke Production.
If this kind of risk is unacceptable then you will have to look into mitigating it with things like resource management and environment partitioning.

Estimating performance test for project planning

The challenge that the complexity and multiple facets of performance testing gives is finding a way to communicate it accurately to project managers and business sponsors.

Unlike an estimate for code delivery, which will be <x> days and after which the code will either function or not, performance testing can pretty much take as long as you like, and the kicker is that the tests will only give varying degrees of confidence that things should be OK in production, but with no guarantees.

Doing the initial analysis on the system is a fixed time, as is building the test rig(s) with which to test it. Once you have done this, then defining the number of dashboards to test can be done based on the time available. The more dashboards are tested, the greater certainty can be had that any problems will be uncovered. Other fixed tasks should be a review of the overall design, as well as time to properly document the tests undertaken and an analysis of the results obtained. It is a false economy to skimp on the documentation, as to do so renders almost useless the results obtained for future work. Good documentation of testing undertaken and results found can greatly shorten the time necessary in the future when revisiting the work either for a new phase of development or to troubleshoot performance problems that have arisen.

What next?

Next up is designing the test that we’ve just defined … read on!

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ
Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to keep the discussion in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Categories: BI & Warehousing

Does the BI Server System Component Make Full Use of My Multi-Socket, Multi-Core Server?

Rittman Mead Consulting - Thu, 2013-03-14 16:06

An interesting question came up during a client visit the other week, where their system had some performance issues and I was asked to take a look at it for them. One of their concerns was whether OBIEE 11g, in particular the BI Server system component within it, was making full use of the CPUs and cores on their server – in particular, they’d noticed just one or two CPUs at near 100% utilisation when certain queries were running, even though there were many, many more CPUs available that the BI Server could use. What was going on, and was there an in-built limitation with the BI Server where it serialised everything through a single CPU when many more were sitting idle?

Thinking the answer through, it’s quite an interesting topic and there’s actually several aspects to how the BI Server works in a multi-socket, multi-core environment. To take an example; consider a situation where OBIEE 11g is running on a server that’s got four sockets, each with four cores in it, for a total of sixteen cores. Most modern OSs will consider this as sixteen CPUs, which the processor count that the Oracle database, OBIEE and other applications will consider as available.

These CPUs also generally support something called multi-threading; threads are sequences of operations owned by a process (such as nqsserver, the BI Server OS executable) that an operating system executes via a scheduler and some sort of time-sharing/multi-plexing arrangement, depending on the OS and hardware. All applications and services that run on a server that uses these features will benefit from multi-threading in that they can appear to run at the same time as other applications, but those that benefit the most are ones that themselves internally support multi-threading,  and can therefore initiate several tasks at once – and one such application is the BI Server system component.

What this means in practice is that when the BI Server component starts up, it creates and reserves a number of threads in advance, determined by a number of parameters including SERVER_THREAD_RANGE. As queries come in from the BI Presentation Server, the BI Server component uses these threads to issue queries to the underlying source database, manage client connections, communicate with database servers and so forth. You can see these threads running and ready to perform tasks for the BI Server component by using a tool such as Process Explorer for Windows; in the screenshot below, you can see the single BI Server process (nqsserver.exe) running as a child process of opmn.exe, the Oracle Process Manager and Notification Server that manages OBIEE’s legacy non-Java components.

Sshot 1

Double-clicking on that process shows you the threads associated with it; in this case, even for a largely unused system there are over a hundred threads open, ready to handle queries and perform other management and communication tasks.

Sshot 2

So going back to the original questions; for a given end-user query, will the BI Server execute everything on a single CPU, or will it make use of all of the CPUs on the server? Well, given that the BI Server is multi-threaded, what this question is really asking is whether all of the BI Server’s threads run on a single CPU, or whether all of the threads associated with a given query run on a single CPU, effectively ignoring much of the CPU capacity on the server.

Thinking it through a bit, any given single query is, to a certain extent, only really going to use a small part of the total amount of CPUs available on a server, because it’s not the BI Server that runs queries in parallel, it’s the underlying database. For example, a single analysis against a single Oracle Database datasource would only really need a single BI Server thread to handle the query request, but when the underlying database receives the query, it might use a large number of its CPUs to process the query, returning results back to the BI Server to then pass back to the Presentation Server for display to the user. The BI Server wouldn’t have any use for any more query threads, as it can’t really do anything with them – the exception to this being queries that generate multiple physical SQLs, for example to join data from multiple sources together and return a single set of data to the user, for which the BI Server could benefit from a higher CPU count if each of these queries in turn led to lots of threads being used – but two queries, in themselves, don’t neccessarily require two CPUs, because of course the BI Server, and the underlying CPUs, are themselves multi-threaded.

So let’s go back to Windows and see what happens when individual queries come in. The Threads view in Process Explorer actually displays, for each thread, the CPU (processor) that it is most likely to next run on, as you can see from the screenshot below:

Sshot 3

Looking at the threads at the top of the list (those that have run most recently), see how the Ideal Processor value is now 6 – this this an eight-processor Windows Server 2008 64-bit server it’s running on.

Sshot 4

So why does it use the term Ideal Processor, rather than just Processor? In fact, which CPU a thread runs on (or, technically, is scheduled to run on) is actually determined by the OS and is effectively “abstracted” away from individual processes, as long as are designed for multi-threading and don’t have any serialised single-threaded limitations. This is how the BI Server is written, with no explicit code to spread threads over more than one CPU, but with the OS doing this for it instead based on a number of factors. It’s also where there’s that annoyingly vague set of guidelines around whether it’s best to vertically-scale the BI Server component on a multi-processor 64-bit system – some OSs and hardware platforms might, in the background, limit processes to a subset of the available CPUs, some might not, so therefore the vague guidelines that it “might” be worth adding more BI Server components into an Oracle Instance, just as it “might” make better use of overall memory – you’ll have to try it on your own system and see.

To conclude then – all things begin equal, the BI Server should make use of all of the CPUs that the underlying operating system presents to it, with the OS itself deciding what threads are scheduled against which CPUs. In-theory, all CPUs on the server are available to each BI Server component, but each OS is different and it might be worth experimenting if you’re sure that certain CPUs aren’t being used – but this is most probably unlikely and the main reason you’d really consider vertical scale-out of BI Server components is for fault-tolerance, or if you’re using a 32-bit OS and each process can only see a subset of the total overall memory. And, bear in mind that however many CPUs the BI Server has available to it, for queries that send just a single SQL statement down to the underlying database server, adding more CPUs or faster CPUs isn’t going to help as only a single (or so) thread will be needed to send the query from the BI Server to the database, and it’s the database that’s doing all of the work – all that this would help with is compilation and post-aggregation work, and enabling the server to handle a higher number of concurrent users. Invest in a better underlying database instead, sort out your data model, and make sure your data source back-end is as optimised as possible.

Categories: BI & Warehousing

#kscope13

Chet Justice - Tue, 2013-03-12 21:35
Back in September, I was asked, and agreed, to become to Content Chair for "The Traditional" track at Kscope 13. Like I mentioned there, I had been involved for the past couple of years and it seemed like a natural fit. Plus, I get to play with some really fun people. If you are ready to take advantage of Early Bird Registration, go here. (save $300)

Over the past few weeks we've finalized (mostly) the Sunday Symposium schedule. We're currently working on finalizing Hands-on-Labs (HOL).

Beginning last year, we've had the Oracle product teams running the Sunday Symposia. This gives them an opportunity to showcase their wares and (hopefully) provide a bit of a road map for the future of said wares. This year, we have three symposia: APEX, ADF and Fusion Development and The Database and Developer's Toolbox.

ADF and Fusion Development

- Oracle Development Tools – Where are We and What’s Next - Bill Patakay, Oracle
- How to Get Started with Oracle ADF – What Resources are Out There? - Shay Shmeltzer and Lynn Munsinger, Oracle
- The Cloud and What it Means to Oracle ADF and Java Developers - Dana Singleterry, Oracle
- Going Mobile – What to Consider Before Starting a Mobile Project - Joe Huang, Oracle
- Understanding Fusion Middleware and ADF Integration - Frederic Desbiens, Lynn Munsinger, and Shay Shmeltzer, Oracle
- Open Q&A with the ADF Product Management

I love that they are opening up the floor to questions from their users. I wish more product teams would do that.

Application Express

- Oracle Database Tools - Mike Hichwa, Oracle
- Technology for the Database Cloud - Rick Greenwald, Oracle
- Developing Great User Interfaces with Application Express - Shakeeb Rahman, Oracle
- How Do We Build the APEX Builder? - Vlad Uvarov, Oracle
- How to Fully Utilize RESTful Web Services with Application Express - John Snyders, Oracle
- Update from APEX Development - Joel Kallman, Oracle

(If you see Joel Kallman out and about, make sure you you mispronounce APEX). This is a fantastic group of people (minus Joel of course). Not mentioned above is the affable David Peake who helps put all this together. The community surrounding APEX is second-to-none.

Finally, The Database and Developer's Toolkit. I'm partial to this one because I've been involved in the database track for the past couple of years. Like last year, this one is being put together by Kris Rice of Oracle. There are no session or abstract details for this one as it will be based mainly on the upcoming 12c release of the database. However, we do have the list of speakers lined up. If you could only come for one day of this conference, Sunday would be the day and this symposium would be the one you would attend.

This symposium will start off with Mike Hichwa (above) and then transition to the aforementioned (too many big words tonight) Mr. Rice. He'll be accompanied by Jeff Smith of SQL Developer fame, Maria Colgan from the Optimzer team and Tom Kyte.

How'd we do? I think pretty darn good.

Don't forget to sign up. Early Bird Registration ends on March 25, 2013. Save $300.
Categories: BI & Warehousing

Starbucks 1TB cube in production

Keith Laker - Tue, 2013-03-12 13:41
Check out the customer snapshot Oracle has published which describes the success Starbucks Coffee has achieved by moving their data warehouse to the Exadata platform, leveraging the Oracle Database OLAP Option and Oracle BIEE at the front end.    10,000 users in HQ and across thousands of store locations now have timely accurate and calculation rich information at their fingertips.


Starbucks Coffee Company Delivers Daily, Actionable Information to Store Managers, Improves Business Insight with High Performance Data Warehouse
( http://www.oracle.com/us/corporate/customers/customersearch/starbucks-coffee-co-1-exadata-ss-1907993.html )

By delivering extreme performance combined with the architectural simplicity and sophisticated multidimensional calculation power of the in-database analytics of the Database, Starbucks use of OLAP has enabled some outstanding results. Together with the power of other Oracle Database and Exadata benefits such as Partitioning, Hybrid Columnar Compression, Storage Indexes and Flash Memory, Starbucks is able to handle the constant growth in data volumes and end-user demands with ease.

A great example of the power of the "Disk To Dashboard" capability of Oracle Business Analytics.
Categories: BI & Warehousing

Report Flexibility, or Split-Second Performance : Pick One or the Other

Rittman Mead Consulting - Mon, 2013-03-11 15:02

Kevin McGinley wrote an interesting post the other week reflecting on Oracle Endeca Information Discovery (OEID), and the role that it’s come to play within Oracle’s BI tools strategy. Kevin’s view was that whilst Oracle positioned OEID originally as an “unstructured data”, or even “big data” tool, where it’s got the most take-up in Kevin’s view was around fast, ad-hoc analysis with minimal up-front data modelling, something that traditionally OBIEE has been a bit weak at. I was reminded of Kevin’s post last week whilst on-site with a customer’s IT department, who were struggling with that age-old problem: how do we provide a reporting system to our users that puts no restriction on what they can do, but that also returns the results of all queries within seconds. Why is this so hard then?

OBIEE, and most of the rest of the “enterprise” BI platforms in my experience, are typically selected and then implemented by the IT department, because it ticks lots of boxes around integration with ERP systems, corporate middleware and security, and they can usually get a good deal license-wise if it’s part of a wider EBS upgrade, for example. IT departments then sell OBIEE to the business as offering flexibility and self-service reporting, all without having to involve the IT department when you need a new report creating. Which is true of course, but there’s also the pleasant side-effect for IT in that users are, in fact, quite severely constrained on the data they can use in their reports, and the way in which they can combine it, and it usually does involve IT when changes to the RPD are made, for example to bring in a new data source or add a new hierarchy.

The reason for which, of course, is because the types of dataset typically reported against by OBIEE – large data warehouses, federated data marts, even transactional databases – will typically return results very slowly to users unless they’ve been indexed, summarised and otherwise optimized by the DBAs beforehand. Some of this is just basic physics – relational databases and disk-based storage is optimized for storing lots of data, in-detail, very safely, but you can’t just throw any query at it and expect it to consistently return results in a split-second – not unless you bake-in some fairly constrained access paths, pre-compute and summarise in advance, or even use technologies such as Oracle Exalytics and TimesTen to replicate “hot spots” into an in-memory store.

NewImage

So there you have it – you can either have flexibility, or speed, but not both. But should users accept this restriction?

I first got into the BI game back in the late 90′s, and back then there were systems you could deploy to users, that were both infinitely flexible, and fast; for example, the old Oracle Express Server and its “blue-screen” interface returned data in seconds albeit with fairly primitive client-server, or even DOS-based user interfaces as shown in the screenshot screenshot below (courtesy of the Independent Oracle OLAP Blog – brings back a few memories).

NewImage

Even now if you go to certain client sites where they’re major users of Excel, you’ll see this sort of response time when they have Analysis Services providing the data, or more recently PowerPivot, Microsoft’s new in-memory, column-store database.

So is it unreasonable for users to ask for both speed and flexibility, especially when they’ve spent millions on license fees for OBIEE, and they’ve got an Exadata server running in the background? I know Express Server and other OLAP tools have their own restrictions, but for sales analysis and other desktop BI applications, from the users’ perspective have we really come all that far in the last ten to twenty years, or is it all excuses now?

Kevin makes the point in his post that perhaps Endeca Information Discovery fills this need now, with its “schema-less” data model and in-memory analysis, and we’ve certainly seen results that would support that – particularly when running Endeca on our Exalytics server, where the Endeca Server engine can leverage the entire 40 cores to massively-parellel query the in-memory data store. But Endeca though comes with its own limitations – there’s no metadata layer, for example, and no obvious support for hierarchies or other “drill to detail” structures, though it’ll be interesting to see how this pans out as Endeca gets integrated into the wider Oracle BI stack, perhaps working together with OBIEE’s BI Server and using the same visualisation layer as the Presentation Server. Essbase of course could also meet this requirement too, but I’m not sure its integration with the OBIEE is quite at the point yet where end-users don’t need to be aware that there’s two different analysis engines running in the background.

So, do you think that it’s fair to say “you can have report flexibility or performance, but not both”, or is that just a limitation in our thinking, or the OBIEE toolset? Do you think having Endeca, or Essbase, in the background now makes this combination possible, or do you feel that, with the right amount of tuning, optimising and preparation, a decently set-up Exadata + Exalytics combination should be able to return results for users within a split-second, regardless of what they throw at it? I’d be interested in readers’ opinions.

Categories: BI & Warehousing

You Probably Need Parallel Except When You Don’t

Rittman Mead Consulting - Sun, 2013-03-10 15:04

My last post looked at some aspects of performance tuning that would make traditional DBAs blanche with horror, not once did I mention indexing, partitioning, PCTFREE or parallel. This was not because I don’t consider them important, but more that the simple test of “am I doing the right thing” is often overlooked. To appease the DBA in me I will write occasional blog posts on some database aspects and performance. Here’s the first.

If you are running a large Oracle data warehouse you should be using parallel. There I said it! I think this fundamentally to be true, especially so with engineered systems such as Exadata. As we know, Exadata is a “balanced system”; it is designed to perform well if all components of the system are used in balance, and parallel can certainly exercise hardware resources which are otherwise idle in a large serial query.

Like all tools you have to use parallel correctly; no more would we think of using a wrench to hammer a nail then should you think parallel is the answer to all performance problems. Sometimes parallel will make things worse, sometimes parallel will make performance less predictable.

Parallel introduces additional work to a query, simplistically we need to: split the query into multiple parallel processes, execute them, wait for the processes to complete and finally coordinate the results. This all takes time to do. Our time saving comes from being able to process multiple smaller chunks of data simultaneously. If the time to execute the step in parallel is not significantly faster than doing it without parallel then the additional overhead may make parallel processing a slower option; this is typically the case with small tables where a full tablescan or an indexed access is fast. Use too few parallel processes and we will not gain much in performance, too many and we risk starving the database of resource for other work or even slow our own process as it waits for resource. If you have implemented some form of CPU resource management on your system you may find that you experience delays as your parallel slaves ‘wait their turn’

Having worked for a computer services company in the past (I managed support and development for several large data warehouses) I got to know that “predictability = good” and “unpredictability = bad”. People expect that a report takes 20 seconds to run or the overnight ETL processes will complete by 07:30 every morning. People don’t like that 20 second report taking between 2 seconds or an hour depending on what else is going on; of course they always will want the report taking 2 seconds, but that is another problem. Adaptive parallel is one of those features that sounds a good idea until you try to implement a production service with service level or operational agreements with the users; the idea that the database allocates parallel process based on available processing capacity would be good except for two flaws: once a query starts it can’t make use of additional parallel resources released when other parallel queries complete and the query will execute even if there is not enough parallel resource to do so – in the worst case this means it will run serially and likely very slowly. 

Oracle improved on this a lot in 11gR2 with Automatic Degree of Parallism, AUTO DOP. Gwen Shapira wrote a nice piece on AUTO DOP a while back. This feature also introduces parallel statement queuing, that is, if a query can’t access the full degree of parallel it needs it waits until it can obtain the required resource. This does not completely solve the problem of variable response time but it should avoid queries degrading to a lower degree of parallel because of insufficient available parallel resource; waiting five minutes and then running in 10 seconds is much better to my thinking than running in two hours. It is possible to avoid being queued by using a query hint (NO_STMT_QUEUING) , however we then run the risk of having reduced resource for execution unless we reserve sufficient headway in the system by setting parallel_server_target to an appropriate value.

From my experience of AUTO DOP, DOP can be too aggressive for ETL DML processes, of course there are a lot of things we can do to reduce the amount of parallelism requested, but perhaps the best approach is to reserve AUTO DOP for user queries. This is amazingly simple to do as AUTO DOP can be set at SESSION LEVEL. In my ideal world, ETL runs with traditional hand-tuned set-based SQL using explicit parallel and the reporting users get AUTO DOP with its query queuing to give the most throughput. If ETL runs at the same time as reporting (more and more common these days) we should lower the parallel_server_target to ensure we have enough resource for the ETL to run without degradation.

 

 

Categories: BI & Warehousing

Introducing some Short Tutorials for Hadoop

Rittman Mead Consulting - Fri, 2013-03-08 04:23

HadoopMarpreduceServerClusterI’ve recently joined Rittman Mead and part of my job will be looking at ‘Big Data’ technologies. This includes looking at how we can apply technologies to manage big data sets whether it be lightweight (but large) key-value stores, capturing and moving data or running batch jobs. My background is primarily in Java development and I’ve spent a lot of time working with many open source tools and open standards that make development easier.

The open source tools that are symbolic of the term ‘Big Data’ are constantly evolving, providing better features and performance. They are unstable, not in terms of quality but in how their APIs and general best practices change so quickly. Fortunately the dust is starting to settle on the core projects so if you’ve not had the time to work with the tools it’s only been getting easier and you’re in a good position.

Over the next few posts I’ll introduce Hadoop along with a few other open source tools that can be used together to quickly develop applications. Hadoop appears to have a steep learning curve and even installing it can look tricky. It’s actually quite easy to start a development environment and with packages like those from Cloudera it’s becoming much easier and quicker to set up production clusters.

Our example project will stream data from twitter using their API that will then be stored in raw form in Hadoop. After we can look a the ways we can transform and process that data using other tools like Hive, Pig and HBase and Oracle NoSQL. We’ll also be using some other open source tools to help so even if you never use Hadoop I hope they might be interesting.

In the first post we’ll only start up half of Hadoop – this will include two daemons that will provide the distributed file store. Another two daemons are needed for the Map-Reduce framework which is used for running batch processes and we’ll look at these in a later post.

To add data to the file store we’ll use the popular server side javascript engine node.js. This isn’t related to Hadoop but it demonstrates how we can move data between two web services with a dozen lines of code.

I’ll be writing the steps with the following configuration listed below. Of course most of the examples I create will work with most versions of Hadoop and the other big data tools, but for reference, the versions I’ll be using will be as follows:

That’s it for now – back soon with the first example.

Categories: BI & Warehousing

BI Forum 2013 Agenda and Registration Details – Now Live!

Rittman Mead Consulting - Thu, 2013-03-07 02:24

I’m pleased to announce that registration for the Rittman Mead BI Forum 2013 is now live, with all details on the BI Forum 2013 web page. This year’s BI Forum takes place on the first two weeks of May, in Brighton and then Atlanta, with numbers limited to around 70 at each venue.

This year, sessions at the two events include such topics as:

  • Jeff McQuigg, KPI Partners Inc, “Performance Tuning the BI Apps with a Performance Layer”
  • Antony Heljula, Peak Indicators Ltd, ”OBI Performance Tuning – Real Customer Success Stories”
  • Florian Schouten, Oracle and Kevin McGinley, Accenture – Oracle BI Applications 11g and ODI 11g Support
  • Mike Durran, Oracle, “Virtualizing Exalytics”
  • Christian Screen, Capgemini, “How to Create a Plug-In for Oracle BI 11g”
  • Edelweiss Kammermann, Awen Consulting, “Integrating Oracle BI, BPM and BAM 11g: The complete cycle of information”
  • Adam Bloom, Oracle, “New Developments in BI Multi-tenancy and Cloud”
  • Rene Kuipers, VX Company, “It’s all in the genes – The power of Oracle Exadata and the Oracle Database”

and many more others covering OBIEE, ODI, Endeca, Big Data and the BI Applications, from key speakers from the Oracle BI community, and from Oracle’s own product development team.

In addition to the regular presentations, our masterclass this year is on Oracle Data Integration, covering the key techniques and “tips from the field” around tools such as Oracle Data Integrator, Oracle GoldenGate, the Big Data Adapters, together with the recently-updated Oracle Reference Architecture for Information Management and Big Data. Our own Stewart Bryson, Michael Rainey and myself will provide the masterclass, which will run on the day before each event both in Atlanta and Brighton.

cary2Finally, I’m delighted to announce some special guests who’ll be at either Atlanta or Brighton. Over in Atlanta, we’re pleased to have join us none other than Cary Millsap, co-author with Jeff Holt of the seminal Oracle performance book “Optimizing Oracle Performance”, who’ll be speaking about performance, response-time, and how this applies to our world of BI and analytics. We’ll also be joined in Atlanta by Gwen (Chen) Shapira, like myself and Cary an Oracle ACE Director, and who specialises in large data systems, the Oracle database and more recently, big data technologies such as Hadoop, NoSQL and Oracle’s Big Data Appliance. Over in Brighton, we’re delighted to be joined by Datasift’s Toby Potter, a recent speaker at the O’Reilly Strata Conference, and who’ll be speaking to us about including social media data in with our BI analyses.

As with previous BI Forums, we’ve deliberately set up the events to promote networking, discussion and opportunities to meet your peers in the industry. We keep the numbers at each event deliberately low, and we encourage delegates to stay in the hotel that’s also the conference venue, maximising opportunities to meet new people and ask those important questions.

Oracle are again providing a development-focused keynote on the Wednesday evening at both events, and we’ve got a number of meals, social events and other “fun” things planned over duration of the event. So it’s not all technical content – but when it is, it’ll be the best Oracle BI, DW and analytics content you’ll hear this year at any conference, guaranteed.

Details of how to register, dates, the full agenda and so on are at the Rittman Mead BI Forum 2013 web page, but book soon – the event typically sells out well in advance.

Categories: BI & Warehousing

We Don't Make Essbase, We Make Essbase Better!

Tim Tow - Wed, 2013-03-06 22:56

We have always claimed that Essbase was better with Dodeca, but until now we had only anecdotal evidence about how our customers get more return on their Essbase investment using Dodeca.  Now we have something a bit more concrete.  Recently, the annual BI Survey, which bills itself as the “world's largest vendor-independent survey of Business Intelligence and Performance Management users”, published the results of their annual survey, The BI Survey 12.  As in past years, we asked our customers to participate in the survey and, as a result, we earned a mention in the survey.  More importantly, we are thrilled by how we were mentioned.

One particular question in the survey asks customers how well they achieved their goals with their selected product.  In the past, Essbase has not scored well on this metric but, this year, for the first time ever, Essbase attained the highest score in goal achievement among all the products in the BI Survey.
  • Better than Microsoft Analysis Services?  Yes!  
  • Better than SAP BW?  Yes!  
  • Better than Cognos?  Yes!   
  • Better than Microstrategy?  Yes!  
  • Better than TM1 / Tableau / QlikTech???   Yes! Yes! Yes!   
  • Even better than OBIEE?   Yep!  
The analysis provided in the survey provides clues as to why the Goal Achievement Index for Essbase was the highest of all 25 products reviewed:

"This year for the first time we have a BI Giant with the highest goal achievement level. Oracle Essbase which has had quite low goal achievement scores in previous years, performs very well this year.  The sample includes more answers this year from companies that use Essbase in conjunction with front-end tools like Dodeca. These seem to increase the ratings for Essbase compared to ratings from users that only use the Excel Add-in."

That’s right.  Dodeca appears to be the reason that more customers achieve their goals with Essbase than with any other product.  If you are still using one of the Excel add-ins as your interface to Essbase, what are you missing?  Contact us today to learn more about how you can make Essbase better at your company using Dodeca and don't forget to ask about our no-cost evaluation program!

We have just released some exciting innovations in our most recent release.  Click here to read about the newest innovations inside the Dodeca Spreadsheet Management System.

Categories: BI & Warehousing

Little Changed

David Aldridge - Tue, 2013-03-05 04:05
Incredible and depressing to see people still getting Oracle internals as wrong as this.
Categories: BI & Warehousing

The Changing World of Business Intelligence

Rittman Mead Consulting - Fri, 2013-03-01 09:09

Over the last 18 months we have seen a huge amount of changes in the world of Oracle Business Intelligence and Analytics.

Everyone wants to talk about Big Data, however doesn’t know where to start in a corporate world, Oracle bought Endeca, which is a reporting/information discovery tool, Cloud delivered services are moving closer and closer and core reporting systems are still fundamental to an organisation’s existence. To make sense of this I wanted to try and relate it the Oracle Business Intelligence world when I first started working in this industry, and to try and look at how the design process and its resultant models are changing.

There used to be a fairly straightforward graphic that was used to explain Business Intelligence.

The bottom of the pyramid was the data in the organisation, there was then a transformation layer that turned it into information, it was then consumed by end users as knowledge which could be used in their decision making process.

For me this has now been replaced by the following graphic.

This model still starts with data, however the source, volume, variety and velocity (deliberately borrowed from standard Big Data definitions) of this data has increased. The organisation now looks at more internal sources of data, plus external sources of data as well, such as social media and third party market research.

The biggest change is the transformation layer. Depending on the source of data and the questions that are being asked of the organisation there are now two different approaches: schema on write and schema on read.

Schema on write

This is the traditional approach for Business Intelligence. A model, often dimensional, is built as part of the design process. This model is an abstraction of the complexity of the underlying systems, put in business terms. The purpose of the model is to allow the business users to interrogate the data in a way they understand.

The model is instantiated through physical database tables and the date is loaded through an ETL (extract, transform and load) process that takes data from one or more source systems and transforms it to fit the model, then loads it into the model.

The key thing is that the model is determined before the data is finally written and the users are very much guided or driven by the model in how they query the data and what results they can get from the system. The designer must anticipate the queries and requests in advance of the user asking the questions.

Schema on read

Schema on read works on a different principle and is more common in the Big Data world. The data is not transformed in any way when it is stored, the data store acts as a big bucket.

The modelling of the data only occurs when the data is read. Map/Reduce is the clearest example, the mapping is the understanding of the data structure. Hadoop is a large distributed file system, which is very good at storing large volumes of data, this is potential. It is only the mapping of this data that provides value, this is done when the data is read, not written.

New World Order

So whereas Business Intelligence used to always be driven by the model, the ETL process to populate the model and the reporting tool to query the model, there is now an approach where the data is collected its raw form, and advanced statistical or analytical tools are used to interrogate the data. An example of one such tool is R.

The driver for which approach to use is often driven by what the user wants to find out. If the question is clearly formed and the sources of data that are required to answer it well understood, for example how many units of a product have we sold, then the traditional schema on write approach is best.

If the question is more open, for example what is causing our sales of a product to drop, why are customers churning or even the clichéd unknown unknowns, then the schema on read is most appropriate.

Decisions, Decisions, Decisions

Whichever approach is taken the end result is that the user, or business, wants to make a decision, or take some action based on making sense of some data. Organisations are becoming increasingly data driven, and despite the evolution of Business Intelligence, the ability of an organisation to derive value from data will be key to its success.

Categories: BI & Warehousing

Tuning Philosophy – Tuning the Right Thing

Rittman Mead Consulting - Fri, 2013-03-01 04:00

My second presentation at this year’s RMOUG Training Days was on tuning “realtime data warehouses”; as usual, this paper is now on the Rittman Mead Articles page.  Perhaps more accurately my talk was more about my tuning philosophy rather than a cookbook of tuning “rules” to give optimal performance. I don’t think there is a single recipe for guaranteed tuning success; the best we can come up with is set of principles for getting things as right as possible and to keep in our heads that each system has its own unique combinations of hardware, software and data and this interaction modifies the steps we should evaluate in our tuning process. The sub-title of my talk came from an early slide: “Making the Arrow as short as possible”

NewImage

Another name for this arrow is “latency”, the thing that stops “realtime” from actually being “same time”. We will never have no arrow as the act of observing the event at source and reacting to output at target will always add some amount of delay to the data flow. I discussed this in a paper for the Evaluation Centre.

Rather than present my RMOUG talk here I will take a step back and write about how I go about improving performance in a data warehouse ETL.

Firstly, if we set out to improve performance we need to measure it, or else how can we be confident that we have “improved” things. This measurement can be as crude as clock time to run an ETL step, a throughput measure such as rows inserted per minute or we can really delve into performance and look fine detail such as number and type of database waits. I tend to start with coarse measurements. I do this for two simple reasons: execution time or throughput is often the business visible metric that is the basis of the problem perceived by the customer; and, from my experience of many data warehouses, the code as implemented may not be doing what the designers wanted, and there is little merit in tuning a process to do the “wrong thing” more quickly.  I therefore take as my starting point what the business wants to achieve in the ETL step and not the query being run. Here I see four kinds of problems:

  • The code used does not answer the business requirement.
  • The query has a flaw that causes it to process too much data.
  • The code uses inappropriate techniques.
  • The process has redundant steps.

Fortunately, the first cause is rare, probably because it is usually spotted and resolved long before moving from developement to to production. Processing too much data should be easy to detect if the ETL process is adequately instrumented; If a business has 4000 customers who each make two transactions a day and if we are loading 14 million customer transactions per day there is something very wrong in the process. Just doing simple calculations of expected data volume and comparing that with actual loads can readily spot this type of thing – we can then dive down to isolate the cause, which is often incorrect joins or a missing source data filter. As I found at one customer site, it is quite possible to load far too much data without affecting the values reported in the user query tools; ETL logic flaws do not always lead to obvious data aggregation problems.

Too many times I have seen code created by developers that have insufficient understanding of what a relational database can do or what a particular vendor’s database can do. I have seen people calculating the first of the month by taking the date, converting it to a string, then concatenate the “month and year” substring on to the literal “01 before converting it back to a date” because they don’t know you can do this in a single function (TRUNC) in Oracle . I have seen developers re-inventing database joins as procedural steps (often they hand code nested loops!) rather than letting the database do it. I have seen others look for change in a row by computing the MD5 hash of the concatenated columns of the new row and comparing it to the previous MD5 hash for the original row. Don’t get me wrong, MD5 hashing can work but it is so compute intensive that it can starve the database of vital CPU resource; For change detection I much prefer to use the mainstays of set based ETL: MINUS operators or simple outer joins between new and old data and looking for inequality between columns.

Once I am sure I am looking at the right problem I can go about optimising performance. I tend to start with OEM12c and the SQL Monitoring reports or good old fashioned EXPLAIN. Just getting the cardinalities right can help the CBO to do wonders. But accurate CBO costings are not always the solution, it is this stage that I start to look at the more physical aspects of design such as partitioning, parallel (assuming these are licensed), memory, concurrent activity, indexing and the way the tables are defined. More on this in another blog.

Categories: BI & Warehousing