BI & Warehousing
I will be in Seattle on Thursday, January 9th for the Meet the Oracle ACE Directors Panel. It is at the Sheraton Seattle from 4 - 6 pm and will feature several other ACE Directors including Martin D'Souza, Kellyn Pot'Vin, Tim Gorman, and my longtime friend and collaborator, Cameron Lackpour.
Come see and the panel and stay for the Happy Hour; the beer will be on me!
As well as offering consulting and training around Oracle’s BI, DW and EPM products, Rittman Mead also provide managed services and support around these products for customers around the world. We’re just about to move over to a new service desk system based on Atlassian Jira, and as part of our final “soak test” we’re going to open up the service publicly for the next 24 hours, taking on public OBIEE support requests on a “best endeavour” basis.
From mid-day today (Wednesday, 18th December 2014, UK time) through to mid-day tomorrow, you can register here and then raise a support ticket, and we’ll endeavour to answer your query over the next few days. As you’d expect, there are a few conditions; you must register with your work email address, and not one from Gmail, Yahoo, Outlook.com etc, and whilst we’ll try and answer your query, there are no promises and we may have to limit replies and accepted tickets. Full details of the offer, and for our global managed services, are on our support page here including terms and conditions, how to register and how to raise a ticket.
UPDATE: Thanks for your interest in our support system. We have now closed registration…
Keep checking back here for future support offerings! Thanks and have a great Christmas and New Year.
It’s that time of year again when we start planning out next year’s BI Forum, which like this year’s event will be running in May 2014 in Brighton and Atlanta. This will be our sixth annual event, and as with previous year’s the most important part is the content – and as such I’m pleased to announce that the Call for Papers for BI Forum 2014 is now open, running through to January 31st 2014.
If you’ve not been to one of our BI Forum events in past years, the Rittman Mead BI Forum is all about Oracle Business Intelligence, and the technologies and techniques that surround it – data warehousing, data analysis, big data, unstructured data analysis, OLAP analysis and this year – in-memory analytics. Each year we select around ten speakers for Brighton, and ten for Atlanta, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead and Stewart Bryson.
Last year we had sessions on OBIEE internals and new features, OBIEE visualisations and data analysis, OBIEE and “big data”, along with sessions on Endeca, Exalytics, Exadata, Essbase and anything else that starts with an “E”. This year we’re continuing the theme, but are particularly looking for sessions on what’s hot this year and next – integration with unstructured and big data sources, use of engineered systems and in-memory analysis, advanced and innovative data visualisations, cloud deployment and analytics, and anything that “pushes the envelope” around Oracle BI, data warehousing and analytics.
The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues. We’re also looking for presenters for ten-minute “TED”-style sessions, and any ideas you might have for keynote speakers, send them directly to me at firstname.lastname@example.org. Other than that – have a think about abstract ideas now, and make sure you get them in by January 31st 2014.
But first, a mea culpa:
In 2008, I Was An IdiotBack in early 2008, I wrote a blog entry comparing Collaborate, Kaleidoscope, and OpenWorld. In this entry, I said that Collaborate was the obvious successor to the Hyperion Solutions conference and I wasn't terribly nice to Kaleidoscope. Here's me answering which of the three conferences I think the Hyperion community should attend (I dare you to hold in the laughter):
Now which one would I attend if I could only go to one?Collaborate. Without reservation. If I'm going to a conference, it's primarily to learn. As such, content is key.I actually got asked a very similar question on Network 54's Essbase discussion board just yesterday (apparently, it's a popular question these days). To parrot what I said there, OpenWorld was very, very marketing-oriented. 80% of the fewer than 100 presentations in the Hyperion track were delivered by Oracle (in some cases, with clients/partners as co-speakers). COLLABORATE is supposed to have 100-150 presentations with 100+ of those delivered by clients and partners.In the interest of full-disclosure, my company, interRel, is paying to be a 4-star partner of COLLABORATE. Why? Because we're hoping that COLLABORATE becomes the successor to the Solutions conference. Solutions was a great opportunity to learn (partying was always secondary) and I refuse to believe it's dead with nothing to take it's mantle. We're investing a great deal of money with the assumption that something has to take the place of Hyperion Solutions conference, and it certainly isn't OpenWorld.Is OpenWorld completely bad? Absolutely not. In addition to the great bribes, it's a much larger conference than COLLABORATE or ODTUG's Kaleidoscope, so if your thing is networking, by all means, go to OpenWorld. OpenWorld is the best place to get the official Oracle party line on upcoming releases and what not. OpenWorld is also the place to hear better keynotes (well, at least by More Famous People like Larry Ellison, himself). OpenWorld has better parties too. OpenWorld is also in San Francisco which is just a generally cooler town. In short, OpenWorld was very well organized, but since it's being put on by Oracle, it's about them getting out their message to their existing and prospective client base.So why aren't I recommending Kaleidoscope (since I haven't been to that either)? Size, mostly. Their entire conference will have around 100 presentations, so their Hyperion track will most likely be fewer than 10 presentations. I've been to regional Hyperion User Group meetings that have more than that (well, the one interRel hosted in August of 2007 had 9, but close enough). While Kaleidoscope may one day grow their Hyperion track, it's going to be a long time until they equal the 100-150 presentations that COLLABORATE is supposed to have on Hyperion alone.If you're only going to one Hyperion-oriented conference this year, register for COLLABORATE. If you've got money in the budget for two conferences, also go to OpenWorld. If you're a developer that finds both COLLABORATE and OpenWorld to be too much high-level fluff, then go to Kaleidoscope.
In 2008, Mike Riley Luckily Wasn't An Idiot
It’s true that ODTUG is a smaller conference, however that is by choice. At every ODTUG conference, the majority of the content is by a user, not by Oracle or even another vendor. And even though Collaborate might seem like the better buy because of its scale, for developers and true technologists ODTUG offers a much more targeted and efficient conference experience. Relevant tracks in your experience level are typically consecutive, rather than side-by-side so you don’t miss sessions you want to attend. The networking is also one of the most valuable pieces. The people that come to ODTUG are the doers, so everyone you meet will be a valuable contact in the future.
It’s true, COLLABORATE will have many presentations with a number of those delivered by clients and partners, but what difference does that make? You can’t attend all of them. ODTUG’s Kaleidoscope will have 17 Hyperion sessions that are all technical.
In the interest of full disclosure, I have been a member of ODTUG for eight years and this is my second year as a board member. What attracted me to ODTUG from the start was the quality of the content delivered, and the networking opportunities. This remains true today.
I won’t censor or disparage any of the other conferences. We are lucky to have so many choices available to us. My personal choice and my highest recommendation goes to Kaleidoscope for all the reasons I mentioned above (and I have attended all three of the above mentioned conferences).
One last thing; New Orleans holds its own against San Francisco or Denver. All of the cities are wonderful, but when it comes to food, fun, and great entertainment there’s nothing like the Big Easy. Mike was only in his second year as a board member of ODTUG, but he was willing to put himself out there, so I wrote him an e-mail back. In that e-mail, dated February 10, 2008, I said that for Kaleidoscope to become a conference that Hyperion users would love, it would require a few key components: keynote(s) by headliner(s), panels of experts, high-quality presentations, a narrow focus that wasn't all things to all people, and a critical mass of attendees.
Mike Helped Us, Let's Help Him
Note that the file was saved in the Excel 2007 and higher format (with an .xlsx file extension). Did you know that the xlsx format is really just a specialized zip file? Seriously. It is a zip file containing various files that are primarily in xml format. I saved the workbook, added the .zip extension to the filename, and opened it in WinRar. Here is what I found.
I opened the xl folder to find a series of files and folders.
Next, I opened the worksheets folder to see what was in there. Of course, it is a directory of xml files containing the contents of the worksheets.
My Essbase retrieves were on the sheet named Sheet1, so let’s take a look at what is in the sheet1.xml file. The xml is quite large, so I can’t show all of it here, but needless to say, there is a bunch of information in the file. The cell contents are only one of the things in the file. Here is an excerpt that shows the contents of row 5 of the spreadsheet.
This is interesting as it shows the numbers but not the member name. What is the deal with that? I noticed there is an attribute, ‘t’, on that node. I am guessing that the attribute t=”s” means the cell type is a string. I had noticed that in one of the zip file screenshots, there was a file named sharedStrings.xml. Hmm... I took a look at that file and guess what I found?
That’s right! The 5th item, assuming you start counting at zero like all good programmers do, is Profit. That number corresponds perfectly with the value specified in the xml for cell B5, which was five (circled in blue in the xml file above). OK, so when are we going to get to Smart View stuff? The answer is pretty quick. I continued looking at sheet1.xml and found these nodes near the bottom.
Hmm, custom properties that contain the name Hyperion? Bingo! There were a number of custom property files in the xml file. Let’s focus on those.
Custom property #1 is identified by the name CellIDs. The corresponding file, customProperty1.bin, contained only the empty xml node <root />. Apparently there aren’t any CellIDs in this workbook.
Custom property #2 is identified by the name ConnName. The file customProperty2.bin contains the string ‘Sample Basic’ which is the name of my connection.
Custom property #3 is named ConnPOV but it appears to contain the connection details in xml format. Here is an excerpt of the xml.
Custom property #4 is named HyperionPOVXML and the corresponding file contains xml which lines up with the page fields I have in my worksheet.
What is interesting about the POV xml is that I have two different retrieves that both have working POV selectors which are both implemented as list-type data validations in Excel. I don’t know what happens internally if I save different values for the POV.
Custom property #5 is labeled HyperionXML. It appears to contain the information about the Essbase retrieval, but it doesn't appear to be the actual retrieval xml because it doesn't contain the numeric data. My guess is that this xml is used to track what is on the worksheet from a Hyperion standpoint.
There is a lot of information in this simple xml stream, but the most interesting information is contained in the slice element. Below is a close-up of contents in the slice.
The slice covers 6 rows and 7 columns for a total of 42 cells. It is interesting that the Smart View team chose to serialize their XML in this manner for a couple of reasons. First, the pipe delimited format means that every cell must be represented regardless of whether it has a value or not. This really isn’t too much of a problem unless your spreadsheet range is pretty sparse. The second thing about this format is that the xml itself is easy and fast to parse, but the resulting strings need to be parsed again to be usable. For example, the vals node will get split into an array containing 42 elements. The code must then loop the 42 elements and process them individually. The other nodes, such as the status, contain other pieces of information about the grid. The status codes appear to be cell attributes returned by Essbase; these attributes are used to apply formatting to cells in the same way the Excel add-in UseStyles would apply formatting. There are a couple of things to take away:
- In addition to the data on the worksheet itself, there is potentially *a lot* of information stored under the covers in a Smart View file.
- String parsing is a computation-intensive operation and can hurt performance. Multiply that workload by 8 because, depending on the operation and perhaps the provider, all 8 xml nodes above may need to be parsed.
In addition, the number of rows and columns shown in the slice may be important when you are looking at performance. Smart View must look at the worksheet to determine the size of the range to read in order to send it to Essbase. In the case of a non-multi-grid retrieve, the range may not be known and, as a result, the grid may be sized based on the UsedRange of the worksheet. In our work with Dodeca, we have found that workbooks converted from the older xls format to the newer xlsx format, which support a larger number of cells, may have the UsedRange flagged internally to be 65,536 rows by 256 columns. One culprit appears to be formatting applied to the sheet in a haphazard fashion. In Dodeca, this resulted in a minor issue which resulted in a larger memory allocation on the server. Based on the format of the Smart View xml, as compared to the more efficient design of the Dodeca xml format, if this were to happen in Smart View it may cause a larger issue due to the number of cells that would need to be parsed and processed. Disclaimer: I did not attempt to replicate this issue in Smart View but rather is an educated guess based on my experience with spreadsheet behavior.
Note: The Dodeca xml format does not need to contain information for cells that are blank. This format reduces the size and the processing cycles necessary to complete the task. In addition, when we originally designed Dodeca, we tested a format similar to the one used today by Smart View and found it to be slower and less efficient.
Considering all of this information, I believe the xml format would be difficult for the Smart View team to change at this point as it would cause compatibility issues with previously created workbooks. Further, this discussion should give some visibility to the fact that the Smart View team faces an on-going challenge to maintain compatibility between different versions of Smart View considering that different versions distributed on desktops and different versions of the internal formats that customers may have stored in their existing Excel files. I don’t envy their job there.
After looking at all of this, I was curious to see what the xml string would look like on a large retrieve, so I opened up Smart View, connected to Sample Basic and drilled to the bottom of the 4 largest dimensions. The resulting sheet contained nearly 159,000 rows of data. Interestingly enough, when I looked at the contents of customProperty5.bin inside that xlsx file, the contents were compressed. It occurred to be a bit strange to me as the xlsx file format is already compressed, but after thinking about it for a minute it makes sense as the old xls file format probably did not automatically compress content, so compression was there primarily to compress the content when saved in the xls file format.
Custom property #6 is labeled NameConnectionMap. The corresponding property file contains xml that appears to map the range names in the workbook to the actual grid and the connection.
Custom property #7 is labeled POVPosition. The file customProperty7.bin contains the number 4 followed by a NUL character. Frankly, I have no idea what position 4 means.
Moving on to custom property #8 which is labeled SheetHasParityContent. This file contains the number 1 followed by a NUL character. This is obviously a boolean flag that tells the Smart View code that new features, such as support for multiple grids, are present in this file.
Custom property #9 is labeled SheetOptions. The corresponding file, customProperty9.bin, contains an xml stream that (obviously) contains the Hyperion options for the sheet.
Custom property #10 is labeled ShowPOV and appears to contain a simple Boolean flag much like that in custom property #8.
Finally, custom property #11 is labeled USER_FORMATTING and may not be related to Smart View.
I did look through some of the other files in the .zip and found a few other references to Smart View, but I did not see anything significant.
So, now that we have completed an overview of what is contained in one, very simple, multi-grid file, what have we learned?
- There is a bunch of stuff stored under the covers when you save a Smart View retrieve as an Excel file.
- With the reported performance issues in certain situations with Smart View, you should now have an idea of where to look to resolve Smart View issues in your environment.
There are a number of files I did not cover in this overview that could also cause performance issues. For example, Oracle support handled one case where they found over 60,000 Excel styles in the file. Smart View uses Excel Styles when it applies automatic formatting to member and data cells. When there are that many styles in the workbook, however, it is logical that Excel would have a lot of overhead searching through its internal list of Style objects to find the right one. Accordingly, there is a styles.xml file that contains custom styles. If you have a bunch of Style objects, you could delete the internal styles.xml file.
Note: Be sure to make a copy of your original workbook before you mess with the internal structures. There is a possibility that you may mess it up and lose everything you have in the workbook. Further, Oracle does not support people going under-the-covers and messing with the workbook, so don’t even bring it up to support if you mess something up.
Wow, that should give you some idea of what may be going on behind the scenes with Smart View. Even with the experience I have designing and writing the Dodeca web services that talk to Essbase, I wouldn't say that I have a deep understanding of how the information in a Smart View workbook really works. However, one thing is for certain; Dodeca does not put stuff like this in your Excel files. It may be interesting to hear what you find when you explore the internals of your workbooks.
This year, Rittman Mead were the Analytics Sponsor for the UKOUG Tech13 conference in Manchester, and for those that visited the UKOUG stand during their time there will have noticed the Rittman Mead-sponsored Analytics Dashboard on display. In this blog post I will cover how it was put together, the “Blue Peter Way” !
For those of you not familiar with Blue Peter it is a long running children’s TV show here in the UK that started way back in 1958, possibly it’s most infamous moment came in 1969 with lulu the defecating elephant. As a child the highlight of the show was the “How to make” something section where they would always use the phrase “Sticky-backed-plastic” instead of “Sellotape” due to a policy against using commercial terms on air. The presenters would create something from scratch with bits and bobs you could find around the house, cereal boxes, egg cartons, washing up bottles, Sellotape etc ( sorry, “sticky-backed-plastic” ). That’s exactly what I’m going to be doing here but instead of making a sock monster I’m going to show you how the analytics dashboard was created from scratch with bits you can find on the internet. So kids, without further delay, let’s begin…
(remember to ask your parents permission before downloading any of the following items)
You will need :
- A Linux server.
- A Web server
- The Redis key/value store
- A DataSift account
- The Webdis HTTP server
- The tagcanvas.js jQuery plugin
- The vTicker jQuery plugin
- The flot plotting library
- Some Sticky-backed-Plastic (jQuery)
- Lots of coffee
The Linux server I’m using is Red Hat Enterprise Server 6.4 and the very first thing we’ll need to do is install a web server.
Done. The Web server is now installed, configured to start on server-boot and up and running ready to service our http requests.
Next up is the Redis key/value datastore. I’ll be using Redis to store all incoming tweets from our datasource ( more on that in a bit )
Now that we have Redis up and running let’s perform a couple of tests, first a benchmark using the Redis-benchmark command
This command throws out a lot more output than this but it is the LRANGE command we are particularly interested in as we’ll be using it to retrieve our tweets later on. 3593 requests per second seems reasonable to me, there were around a 1000 registrations for the UKOUG TECH13 conference, the likelihood of each of them making 3 concurrent dashboard requests all within the same second is slim to say the least – regardless, I’m willing to live with the risk. Now for a quick SET/GET test using the Redis-cli command.
Ok so that’s our datastore sorted, but what about our datasource ? Clearly we will be using Twitter as our primary data source but mere mortals like myself don’t have access to Twitter’s entire data stream, for that we need to turn to specialist companies that do. DataSift, amongst other companies like Gnip and Topsy ( which interestingly was bought by Apple earlier this week ) can offer this service and will add extra features into the mix such as sentiment analysis and stream subscription services that push data to you. I’ll be using DataSift, as here at Rittman Mead we already have an account with them. The service itself charges you by DPU ( Data Processing Units ) the cost of which depends upon the computational complexity of the stream your running, suffice to say that running the stream to feed the analytics dashboard for a few days was pretty cheap.
To setup a stream you simply express what you want included from the Twitter firehose using their own Curated Stream Definition Language, CSDL. The stream used for the dashboard was very simple and the CSDL looked like this :-
This is simply searching for tweets containing the string “ukoug” from the Twitter stream. DataSift supports all kinds of other social data, for a full list head over to their website.
Now that we have our data stream setup, how do we actually populate our Redis data store with it ? Simple – get DataSift to push it to us. Using their own PUSH API you can setup a subscription in which you can specify the following output parameters: “output_type”, “host”, “port”, “delivery_frequency” and “list” amongst many others. The output type was set to “Redis”, the host, well, our hostname and the port set to the port upon which Redis is listening, by default 6379. The delivery_frequency was set to 60 seconds and the list is the name of the Redis list key you want the data pushed to. With all that setup and the subscription active, tweets will automagically start arriving in our Redis datastore in the JSON format – happy days !
The next step is to get the data from Redis to the browser. I could install PHP and configure it to work with apache and then install one of the several Redis-PHP libraries and write some backend code to serve data up to the browser, but time is limited and I don’t want to faff around with all that. Besides this is Blue Peter not a BBC4 Documentary. Here’s where we use the next cool piece of open source software, Webdis. Webdis acts as a HTTP interface to Redis which means I can make HTTP calls directly from the browser like this :-
Yep, you guessed it, we’ve just set a key in Redis directly from the browser. To retrieve it we simply do this.
Security can be handled in the Webdis config file by setting IP based rules to stop any Tom, Dick or Harry from modifying your Redis keys. So to install Webdis we do the following :-
An that’s it, we now have tweet to browser in 60 seconds and without writing a single line of code ! Here’s an overview of the architecture we have :-
Would I do this in a production environment with larger data set, probably not, it wouldn’t scale, I’d instead write server-side code to handle the data processing, test performance and push only the required data to the client. The right tools for the right job.
The function getData() is called within the setInterval() function, the second parameter to the setInterval() sets the frequency of the call, in this case every 5 mins (30000 milliseconds). The getData function performs an Ajax get request on the url that points to our Webdis server listening on port 7379 which then performs an LRANGE command on the Redis data store, it’s simply asking Redis to return all list items between 0 and 100000, each item being a single tweet. Once the Ajax request has successfully completed the “success” callback is called and within this callback I am pushing each tweet into an array so we end up with an array of tweet objects. We now have all the data in a format we can manipulate to our hearts content.
Now onto the Graphs and Visualisations.The Globe
The tweet ticker was built by grabbing the latest 30 tweets from the data array and assigning them to html <li> tags, the vTicker jQuery plugin was then applied to the containing <ul> html tag. Various plugin options allow you to control things like the delay and scroll speed.Tweet Velocity
This one proved quite popular – these speaker types are a competitive bunch ! Having obtained a list of speakers from the UKOUG Tech13 website I was able to search all the twitter content for each speaker and aggregate up to get the total twitter mentions for each speaker, again the graph was rendered using the flot plotting library. As the graph updated during each day speakers were swapping places with our own Mark Rittman tweeting out the “The Scores on the doors” at regular intervals. When the stats can me manipulated though there’s always someone willing to take advantage !
tut, tut Mark.Twitter Avatars
The twitter Avatars used the tagcavas.js library but instead of populating it with words from the Twitter content the Tweet avatars were used. A few changes to the plugin options were made to display the results as a horizontally scrolling cylinder instead of a globe.Twitter Sentiment
The Twitter sentiment graph again used flot. Tweet sentiment was graphed over time for the duration of the conference. The sentiment score was provided by DataSift as part of the Twitter payload. The scores we received as part of the 3500 tweets ranged between -15 and 15 each score reflected either a positive or negative tweet. Asking a computer to infer a human emotion from 140 characters of text is a tough ask. Having looked at the data in detail a fair few of the tweets that received negative scores weren’t negative in nature, for example tweets with the content “RAC attack” and “Dangerous anti patterns” generated a cluster of negative scores. As we know computers are not as clever as humans, how can they be aware of the context of a tweet? detect sarcasm or differentiate between banter and plain old insults? Not all the negatively scored tweets where false-negatives, some were genuine, a few regarding the food seemed to ring true.
Perhaps the data you’re analyzing needs to be taken into context as a whole. You’d expect a 1000 Techies running around a tech conference to be happy, the sentiment analysis seemed to do a better job at ranking how positive tweets were than how negative they were. Perhaps from a visualisation point of view, a logarithmic scale along with a multiplier to raise the lowest score would have worked better in this case to reflect how positive overall the event was. One thing is clear though and that is that further statistical analysis would be needed over a much larger data set to really gain insight into how positive or negative your data set is.
So that’s it kids, I hope you’ve enjoyed this episode of Blue Peter, until next time….
Edit – 6-Dec-2013 19.07:
Mark has asked me the following question over twitter:
“any chance you could let us know why these particular tools / components were chosen?”.
I thought I’d give my answer here.
All the tools were also open source, easy to install/configure and are well documented, I had also used them all previously – again, a time saver. Redis was chosen for 2 reasons, it was supported as a subscription destination by DataSift ( along with many others ) and I’m currently using it in another development project I’m working on so I was up to speed with it. Although in this solution I’m not really taking advantage of Redis’s access speed it worked well as somewhere to persist the data.
If I was coding a more permanent solution with more development time then I’d add a relational database into the mix and use it to perform all of the data aggregation and analysis, this would make the solution more scalable. I’d then either feed the browser via ajax calls directly to the database via backend code or populate a cache layer from the database and make ajax called directly on the cache, similar to what I did in this solution. It would have been nice to use D3 to develop a more elaborate visualisation instead of the canned flot charts but again this would have taken more time to develop.
Next week Rittman Mead is the analytics sponsor for the UKOUG Tech13 conference up in Manchester.
We’ve got a great line up of talks, covering everything from BI Apps, Endeca, Hadoop, OBIEE and mobile – Stewart and Charles are both flying in from the US to present alongside Mark, Adam and myself. The full list is as follows:
- Deep dive into Oracle BI Applications using Oracle Data Integrator [Mark Rittman] – Sunday 12.30-2pm
- Leveraging Hadoop / Map Reduce OBIEE 11g and ODI 11g [Mark Rittman] – Sunday 3pm – 4.30pm
- BI Across Any Data Source with OBIEE and Oracle Endeca Discovery [Mark Rittman & Adam Seed] – Monday 5.05-5.50pm
- The Changing World of Business Intelligence [Jon Mead] – Tuesday 11.20 – 12.20
- Oracle BI Multi User Development (MDS XML vs MUDE) [Stewart Bryson] – Tuesday 11.30 -12.30
- Designing for a mobile World using OBIEE [Charles Elliot] – Wednesday 1.45 – 2.45
On Monday night we are teaming up with our friends at Pythian to host some drinks and nibbles at Taps Bar, which is just around the corner from the conference venue. We’ll be there from 6.30-9.00pm, so come and join us for a free drink.Analytics Sponsor
As part of being the analytics sponsor for the event we are looking to collate as much real time, social media and demographic information about the event as possible. We will be displaying an analytics dashboard in the conference venue detailing statistics from this data. To help us, could you:
- Complete the form here to give use some background infromation about why you are going; and
- Use the official hashtag ukoug_tech13 when tweeting anything about the event.
It’s looking like it will be a great few days, so look forward to seeing you up there.
More Excel SupportDodeca has always been strong on Excel version support and this version delivers even more Excel functionality. Internally, we use the SpreadsheetGear control, which does a very good job with Excel compatibility. This version of Dodeca integrates a new version of SpreadsheetGear that now has support for 398 Excel functions including the new SUMIFS, COUNTIFS, and CELL functions.Excel Page Setup DialogThe new version of Dodeca includes our implementation of the Excel Page Setup Dialog which makes it easy for users to customize the printing of Dodeca views that are based on Excel templates. Note that for report developers, the Excel Page Setup has also been included in the Dodeca Template Designer.
New PDF View TypeCustomers who use PDF files in their environments will like the new PDF View Type. In previous releases of Dodeca, PDF documents displayed in Dodeca opened in an embedded web browser control. Beginning in this version, Dodeca includes a dedicated PDF View type that uses a specialized PDF control.
View Selector TooltipsFinally, users will like the new View Selector tooltips which optionally display the name and the description of a report as a tooltip.
PerformancePerformance is one of those things that users always appreciate, so we have added a new setting that can significantly improve performance in some circumstances. Dodeca has a well-defined set of configuration objects that are stored on the server and we were even awarded a patent recently for the unique aspects of our metadata design. That being said, depending on how you implement reports and templates, there is the possibility of having many queries issued to the server to check for configuration updates. In a few instances, we saw that optimizing the query traffic could be beneficial, so we have implemented the new CheckForMetadataUpdatesFrequencyPolicy property. This property, which is controlled by the Dodeca administrator, tells Dodeca whether we should check the server for updates before any object is used, as was previously the case, only when a view opens, or only when the Dodeca session begins. We believe the latter case will be very useful when Dodeca is deployed in production as objects configured in production often do not change during the workday and, thus, network traffic can be optimized using this setting. The screenshot below shows where the administrator can control the update frequency.
Though users will like these features, we have put a lot of new things in for the people who create Dodeca views and those who administer the system. Let’s start with something that we think all Dodeca admins will use frequently.Metadata Property Search UtilityAs our customers continue to expand their use of Dodeca, the number of objects they create in the Dodeca environment continues to grow. In fact, we now have customers who have thousands of different objects that they manage in their Dodeca environments. The Metadata Property Search Utility will help these users tremendously.
This utility allows the administrator to enter a search string and locate every object in our system that contains that string. Once a property is located, there is a hyperlink that will navigate to the given object and automatically select the relevant property. This dialog is modeless, which means you can navigate to any of the located items without closing the dialog.
Note: this version does not search the contents of Excel files in the system.Essbase Authentication ServicesIn the past, when administrators wished to use an Essbase Authentication service to validate a login against Essbase and automatically obtain Dodeca roles based on the Essbase user’s group memberships, they had to use an Essbase connection where all users had access to the Essbase application and database. The new ValidateCredentialsOnly property on both of the built-in Essbase Authentication services now flags the service to check login credentials at the server-level only, eliminating the need for users to have access to a specific Essbase database.New Template Designer ToolsPrior to Dodeca 6.x, all template editing was performed directly in Excel. Since that time, however, most template design functionality has been replicated in the Dodeca Template Designer, and we think it is preferable due to the speed and ease of use with which users can update templates stored in the Dodeca repository. We have added a couple of new features to the Template Designer in this version. The first tool is the Group/Ungroup tool that allows designers to easily apply Excel grouping to rows and/or columns within the template. The second new tool is the Freeze/Unfreeze tool that is used to freeze rows and/or columns in place for scrolling.Parameterized SQL Select StatementsSince we introduced the SQLPassthroughDataSet object in the Dodeca 5.x series, we have always supported the idea of tokenized select statements. In other words, the SQL could be written so that point-of-view selections made by users could be used directly in the select statement. In a related fashion, we introduced the concept of parameterized insert, update, and delete statements in the same version. While parameterized statements are similar in concept to tokenized statements, there is one important distinction under the covers. In Dodeca, parameterized statements are parsed and converted into prepared statements that can be used multiple times and results in more efficient use of server resources. The parameterized select statement was introduced in this version of Dodeca in order for customers using certain databases that cache the prepared statement to realize improved server efficiency on their select statements.Workbook Script Formula Editor ImprovementsWe have also been working hard to improve extensibility for developers using Workbook Scripts within Dodeca. In this release, our work focused on the Workbook Script Formula Editor. The first thing we added here is color coding that automatically detects and distinguishes Excel functions, Workbook Script functions, and Dodeca tokens. In the new version, Excel functions are displayed in green, Dodeca functions and parentheses are displayed in blue, and tokens are displayed in ochre. Here is an example.
In addition, we have implemented auto-complete for both Excel and Dodeca functions.
New SQLException EventVersion 6.6 of Dodeca introduces a new SQLException event that provides the ability for application developers to customize the behavior when a SQL Exception is encountered.XCopy Release DirectoryBeginning in version 6.6, the Dodeca Framework installation includes a pre-configured directory intended for customers who prefer to distribute their client via XCopy deployment instead using Microsoft ClickOnce distribution. The XCopy deployment directory is also for use by those customers who use Citrix for deployment.Mac OS X Release DirectoryThe Dodeca Framework installation now includes a pre-compiled Dodeca.app deployment for customers who wish to run the Dodeca Smart Client on Mac OS X operating systems. What that means is that Dodeca now runs on a Mac without the need for any special Windows emulators. Dodeca does not require Excel to run on the Mac (nor does it require Excel to run on Windows for that matter), so you can certainly save your company significant licensing fees by choosing Dodeca for your solution.
In short, you can see we continue to work hard to deliver functionality for Dodeca customers. As always, the Dodeca Release Notes provide detailed explanations of all new and updated Dodeca features. As of today, we have decided to make the Release Notes and other technical documents available for download to non-Dodeca customers. If you are curious about all of the things Dodeca can do, and if you aren't afraid to dig into the details, you can now download our 389 page cumulative Release Notes document from the Dodeca Technical Documents section of our website.
How can you conditionally turn cells borders on and off in Publishers RTF/XSLFO templates? With a little digging you'll find what appears to be the appropriate attributes to update in your template. You would logically come up with using the various border styling options:
border-top|bottom|left|right-width border-top|bottom|left|right-style border-top|bottom|left|right-color
Buuuut, that doesnt work. Updating them individually does not make a difference to the output. Not sure why and I will ask but for now here's the solution. Use the compound border formatter border-top|bottom|left|right. This takes the form ' border-bottom="0.5pt solid #000000". You set all three options at once rather than individually. In a BIP template you use:
<?if:DEPT='Accounting'?> <?attribute@incontext:border-bottom;'3.0pt solid #000000'?> <?attribute@incontext:border-top;'3.0pt solid #000000'?> <?attribute@incontext:border-left;'3.0pt solid #000000'?> <?attribute@incontext:border-right;'3.0pt solid #000000'?> <?end if?>
3pt borders is a little excessive but you get the idea. This approach can be used with the if@row option too to get the complete row borders to update. If your template will need to be run in left to right languages e.g. Arabic or Hebrew, then you will need to use start and end in place of left and right.
For the inquisitive reader, you're maybe wondering how, did this guy know that? And why the heck is this not in the user docs?
Other than my all knowing BIP guru status ;0) I hit the web for info on XSLFO cell border attributes and then the Template Builder for Word. Particularly the export option; I generated the XSLFO output from a test RTF template and took a look at the attributes. Then I started trying stuff out, Im a hacker and proud me! For the users doc updates, I'll log a request for an update.
If you’re a user of Oracle’s data integration products, you’re probably aware that ODI12c came out last month, with some of the new features covered in posts on the blog here, here and here. Rittman Mead were actually on the beta program for 12c, with several of our team attending preview events in the UK and USA at the start of the year, and then running preview releases back in our development labs as the product neared GA.
As CTO I sponged our involvement in the ODI12c beta program, and was invited down to Oracle’s offices in Reading to take part in an interview with Kulvinder Hari, Director Product Management – Fusion Middleware, to talk about our involvement in the program, and what we saw as the most interesting new features in this latest release. You can access the full video here, and I also talk about the wider data integration market, products such as GoldenGate and EDQ, and the benefits OWB customers will get when migrating to, and interoperating with, Oracle Data Integrator.
The interview was actually part of a wider set of activities around the ODI12c launch, and you read a recap of the Oracle GoldenGate 12c and Oracle Data Integration 12c Webcast on the Oracle website, as well as download resources on ODI12c from an oracle.com microsite. Keep an eye on the blog as well over the next few months, as we post more content on ODI12c along with the other new Oracle Data Integration 12c releases.odi
In yesterday’s post on running OBIEE in the cloud, I looked at a number of options for hosting the actual OBIEE element; hosting it in a public cloud service such as Amazon EC2, using Oracle’s upcoming BI-as-a-Service offering, or partner offerings such as our upcoming ExtremeBI in the Cloud service. But the more you think about this sort of thing, the more you realise that the OBIEE element is actually the easy part – it’s what you do about data storage, security, LDAP directories and ETL that makes things more complicated.
Take the example I gave yesterday where OBIEE was run in the cloud, with the multi-tenancy option enabled, the main data warehouse in the cloud, and data sourced from cloud and on-premise sources.
In this type of setup, there’s a number of things you need to consider beyond how OBIEE is hosted. For example:
- If your corporate LDAP directory is on-premise, how do we link OBIEE to it? Or does the LDAP server also need to be in the cloud?
- What sort of database do we use if we’re hosting it in the cloud. Oracle? If so, self-hosted in a public cloud, or through one of the Oracle DB-in-the-cloud offerings?
- If not Oracle database, what other options are available?
- And how do we ETL data into the cloud-based data warehouse? Do we continue to use a tool like ODI, or use a cloud-based option – or even a service such as Amazon’s AWS Data Pipeline?
What complicates things at this stage in the development of “cloud”, is that most companies won’t move 100% to cloud in one go; more likely, individual application and systems might migrate to the cloud, but for a long time we’ll be left with a “hybrid” architecture where some infrastructure stays on premise, some might sit in a public cloud, others might be hosted on third-party private clouds. So again, what are the options?
Well Oracle’s upcoming BI-as-a-service offering works at one extreme end-of-the-spectrum; the only data source it’ll initially work with is Oracle’s own database-as-a-service, which in its initial incarnation provides a single schema, with no SQL*Net access and with data instead uploaded via a web interface (this may well change when Oracle launch their database instance-as-a-service later in 2014). No SQL*Net access means no ETL tool access though, in practice, as they all use SQL*Net or ODBC to connect to the database, so this offer to my mind is aimed at either (a) small BI applications where it’s practical to upload the data via Excel files etc, or (b) wider Oracle Cloud-based systems that might use database-as-a-service to hold their data, Java-as-a-service for the application and so forth. What this service does promise though is new capabilities within OBIEE where users can upload their own data, again via spreadsheets, to the cloud OBIEE system, and have that “mashed-up” with the existing corporate data – the aim being to avoid data being downloaded into Excel to do this type of work, and with user metrics clearly marked in the catalog so they’re distinct from the corporate ones.
But assuming you’re not going for the Oracle cloud offer, what are the other options over data? Well hosting OBIEE in the cloud is conceptually no different from hosting anywhere else, in that it can connect to various data sources via the various connection methods, so in-principle you’ve got just the same options open to you if running on premise. But the driver for moving OBIEE into the cloud might be that your applications, data etc are already in the cloud, and you might also be looking to take advantage of cloud features in your database such as dynamic provisioning and scaling, or indeed use one of the new cloud-native databases such as Amazon Redshift.
I covered alternative databases for use with OBIEE a few months ago in a blog post, and Amazon Redshift at the time looked like an interesting option; based on ParAccel, a mature analytic database offering, column-store and tightly integrated in with Amazon’s other offerings, a few customers have asked us about this as an option. And they’re certainly interesting – in practice, not all that different in pricing to Oracle database as a source but with some interesting analytic features – but they all suffer from the common issue that they’re not officially supported as data sources. Amazon Redshift, for example, uses Postgres-derviced ODBC drivers to connect to it, but Postgres itself isn’t officially supported as a source, which means you could well get sub-optimal queries and you certainly won’t get specific support from Oracle for that source. But if it works for you – then this could be an option, along with more left-field data source such as Hadoop.
But to my mind, it’s the ETL element that’s the most interesting, and most challenging, part of the equation. Going back to Openworld, Oracle made a few mentions of ETL in their general Cloud Analytics talks, including talk about an upcoming data source adapter for the BI Apps that’ll enable loading from Fusion Apps in the cloud, like this:
There were also a number of other deployment options discussed, including hybrid architectures where some sources were in the cloud, some on-premise, but all involved running the ETL elements of the BI Apps – ODI or Informatica – on-premise, the same way as installs today. And to my mind, this is where the Oracle cloud offering is the weakest, around cloud-based and cloud-native ETL and data integration – the only real option at the moment is to run ODI agents in the cloud and connect them back to an on-premise ODI install, or move the whole thing into the cloud in what could be quiet a heavyweight data integration architecture.
Other vendors are, in my opinion, quite a way further forward with their cloud data integration tools strategy than Oracle, who instead seem to be still focused on on-premise (mainly), database-to-database (mostly) ETL. To take two examples; Informatica have an Informatica Cloud service which appears to be a platform-as-a-service play, with customers presumably signing-up for the service, designing their ETL flows and then paying for what they use, with a focus on cloud APIs as well as database connectivity, and full services around data quality, MDM and so forth.
Another vendor in this space is SnapLogic, a pure-play cloud ETL vendor selling a component-based product with a big focus on cloud, application and big data sources. What’s interesting about this and other similar vendor’s approaches though are they they appear to be “cloud-first” – written for the cloud, sold as a service, as much focused on APIs as database connectivity – a contrast to Oracle’s current data integration tools strategy which to my mind still assumes an on-premise architecture. What’s more concerning is the lack of any announcement around ETL-in-the-cloud at the last Openworld – if you look at the platform-as-a-service products announced at the event, whilst database, BI, documents, BPM and so forth-as-a-service were announced, there was no mention of data integration:
What I’d like to see added to this platform in terms of data integration would be something like:
- On-demand data integration, sold as a service, available as a package along with database, Java and BI
- Support for Oracle and non-Oracle application APIs – for example Salesforce.com, Workday and SAP – see for example what SnapLogic support in this area.
- No need for an install – it’s already installed and it’s a shared platform, as they’re doing with OBIEE
- Good support for big data, unstructured and social data sources
I think it’s pretty likely this will happen – whilst products such as the BI Apps can have their ETL in the cloud, via ODI in the BI Apps 11g version for example, these are inherently single-tenant, and I’d fully expect Oracle plan at some time to offer BI Apps-as-a-service, with a corresponding data integration element designed from the ground-up to work cloud-native and integrate with the rest of Oracle’s platform-as-a-service offering.
So there we have it – some thoughts on the database and ETL elements in an OBIEE-in-the-cloud offering. Keep an eye on the blog over the next few months as I built-out a few examples, and I’ll be presenting on the topic at the upcoming BIWA 2014 event in San Francisco in January – watch this space as they say.
One of the major announcements at this year’s Oracle Openworld in San Francisco was around “OBIEE in the cloud”. Part of a wide-ranging set of announcements around services and infrastructure in the cloud from Oracle, the idea with this service is that you’ll be able to access an Oracle-hosted future version of OBIEE running in Oracle’s public cloud service, so that you can create dashboards and reports without having to perpetually-license lots of software, or stand-up lots of infrastructure in your data centre. But of course “cloud” is a hot topic at the moment, and lots of our customers are asking us about options to run OBIEE in the cloud, what’s involved in it, and how we deal with the stuff that surrounds an OBIEE installation – the data, the ETL, the security, and so forth. So I thought I’d put a couple of blog posts together to go through what options are available to OBIEE customers looking to deploy in the cloud, and see what the state of the industry is around this requirement.
I’ll start-off then by looking at the most obvious element – the BI platform itself. If you want to run OBIEE “in the cloud”, what are your basic options?
Probably conceptually the simplest, is just to run the OBIEE server applications in a cloud-based, hosted environment, for example Amazon EC2. In this instance, OBIEE runs as it would do normally, but instead of the host server being in your datacenter, it’s in a public cloud. This type of setup has been available for some time in the form of services such as Oracle On-Demand, but the difference here is that you’re using a public cloud service, there’s no service contracts to set up, usually no minimum contract size, and everything to do with security, failover, maintenance and so on is down to you. We’ve been doing this sort of thing for a long-time, typically for our cloud-based training environments, or for ad-hoc development work where it makes more sense for a server to be centrally available on the internet, rather than sitting on someone’s laptop or on an internal server. Once potential complication here is licensing – for products such as the Oracle Database, there are formal schemes such as Amazon RDS where either the license is included, or there are set metrics or arrangements to either bring your own license, or buy them based on virtual CPUs. If you’re prepared to take care of the licensing, and all of the security and maintenance aspects, this is an interesting approach.
In practice though, any setup like this is going to be what’s termed a “hybrid” cloud deployment; at least the administration side of OBIEE (the BI Administration Tool) is still going to be “on-premise”, as there’s no way you can deploy that “thin-client”, unless you create a Windows VM in the cloud too, and run the BI Administration tool from there. Moreover, your data sources are more than likely to still be on-premise, with just a few at this point hosted in the cloud, so most likely your OBIEE-in-the-cloud architecture will look like this:
There’s other things to think about too; how do you do your ETL when your target data warehouse might be in the cloud, or your data sources might be cloud based; how do you connect to your LDAP server, and so forth – I’ll cover these in the next postings in the series. But for now, this is conceptually simple, and its main benefit is avoiding the need to host your own servers, buy your own hardware, and so forth.
So what if you want to use OBIEE, but you don’t want the hassle even of setting up your own servers in a public cloud? This is where Oracle’s new “BI in the Cloud” service comes in – the idea here is that you swipe your credit card, fill in your details, self-provision your system, and then start loading data into it.
Whilst the underlying technology is core OBIEE, it’s designed for departmental, power user-type scenarios where the scope of data is limited, and the focus really is on ease-of-use, self-service and “software-as-a-service”. Looking at Oracle’s slides from Openworld, you can see the focus is quite different to on-premise OBIEE – the competition here for Oracle is the various internet startups, and products like Tableau, that make it easy to get started, provide consumer-class user interfaces, and focus on the single user rather than enterprise-type requirements.
But this focus on simplicity means a limitation in functionality as well. At the start, at least, Oracle’s BI in the Cloud will only offer Answers and Dashboards; no KPIs, BI Publisher, Maps or Delivers. Moreover, at least in its first iteration, it’ll only support Oracle’s own database schema-as-a-service as the single, sole datasource, so no Hybrid cloud/on-premise federated queries, no Essbase, and no ETL tools – hence the focus on single-user, departmental use-cases.
What you do get with this setup though is a cloud-native, fully-managed service where patching is taken care of for you, all the new feature appear first, and the administration element has been re-thought to be more appropriate for deployment to the cloud – in particular, a new version of the BI Administration tool that’s cloud-native, focused on simple use-cases, and doesn’t require desktop client installs or VPN connections through to the OBIEE server. The diagram bellow shows the architecture for this option, and you can see it’s all pretty self-contained; great for simplicity, but maybe a bit limiting at least in the initial iteration.
The third major variant that I can see around OBIEE in the cloud, is where partners (such as Rittman Mead) offer shared access to their cloud OBIEE installations, either as part of a development package or through some sort of multi-tenant reporting product. For example, we’re about to launch a service we’re calling “ExtremeBI in the Cloud”, where we combine our “ExtremeBI” agile development method with cloud-based development server hosting, with the cloud element there to make it easier for clients to start on a project *today*. Endeca Information Discovery is another product that could benefit from this approach, as the infrastructure behind an Endeca install can be fairly substantial, but users are typically more-focused on the data modelling and report-building element. In our case, the back-end cloud hosting will typically be done on a service like Amazon EC2 or VMWare’s Hybrid Cloud product, but the focus is more on the development piece – cloud is there to provide business agility.
The other third-party approach to this centres around multi-tenancy; typically, a partner or company will provide a reporting or analytics element to a wider product set, and use OBIEE as its embedded dashboarding tool as part of this. In the past, setting up multi-tenancy for OBIEE was fairly difficult, to the point where most companies set up separate installs for each end-customer, but 126.96.36.199 brought a number of multi-tenant features that presumably were put there to support Oracle’s own cloud product. The product docs describe the new multi-tenant features well, but the basic premise is that a single RPD and catalog are still set up, but OBIEE can then run in a “multi-tenant” mode where GUIDs demarcate each tenant, a new BIGlobalAdministrator role is created that assumes what were the old BIAdministrator role’s privileges, with other administrator roles then set up with limited privileges and no access to Fusion Middleware Control, for example.
What this does give you though is the framework to set up tenant groupings of users, separate areas of the catalog for each tenant, additional identity store attributes to hold tenant details, and a security framework that enables more limited forms of administrator account more suited to tenant-type situations. Again, not all OBIEE features are available when running in multi-tenant mode, and presumably we’ll see this feature evolve over time as more requirements come in from Oracle’s own cloud BI product, but it’s a feature you can use now if you’re looking to set up a similar type of environment.
So that’s the three basic options if you want to run OBIEE in the cloud; host it in a public cloud service like Amazon EC2 but then just run it like regular OBIEE; go for Oracle’s upcoming BI-in-the-cloud service, if the use-case suits you and you can live with the limitations, or consider one of the services from partners such as ourselves where we bundle cloud hosting of OBIEE up with a larger service offering like our “ExtremeBI in the Cloud” offer. But of course OBIEE is only one part of the overall platform – where do you store the data, and how do you get the data from wherever it normally is up into the cloud – in other words, how do we do ETL in the cloud? Check back tomorrow for the follow-up post to this one: Part 2 – Data Warehouse and ETL.
Bit of a corner case this week but I wanted to park this as much for my reference as yours. Need to be able to test a pure XSL template against some sample data? Thats an XSL template that is going to generate HTML, Text or HTML. The Template Viewer app in the BI Publisher Desktop group does not offer that as an option. It does offer XSL-FO proccesing thou.
A few minutes digging around in the java libraries and I came up with a command line solution that is easy to set up and use.
1. Place your sample XML data and the XSL template in a directory
2. Open the lib directory where the TemplateViewer is installed. On my machine that is d:\Oracle\BIPDesktop\TemplateViewer\lib
3. Copy the xmlparserv2.jar file into the directory created in step 1.
4. Use the following command in a DOS/Shell window to process the XSL template against the XML data.
java -cp ./xmlparserv2.jar oracle.xml.parser.v2.oraxsl fileX.xml fileY.xsl > fileX.xls
The file generated will depend on your XSL. For an Excel output, you would instruct the process to generate fileX.xls in the same folder. You can then test the file with Excel, a browser or a text editor. Now you can test on the desktop until you get it right without the overhead of having to load it to the server each time.
To be completely clear, this approach is for pure XSL templates that are designed to generate text, html or xml. Its not for the XSLFO templates that might be used at runtime to generate PDF, PPT, etc. For those you should use the Template Viewer application, it supports the XSLFO templates but not the pure XSL templates.
If your template still falls into the pure XSL template category. This will be down to you using some BIP functionality in the templates. To get it to work you'll need to add in the Publisher libraries that contain the function e.g. xdo-core.jar, i18nAPI_v3.jar, etc to the classpath argument (-cp.)
So a new command including the required libraries might look like:
java -cp ./xmlparserv2.jar;./xdo-core.jar;./i18nAPI_v3.jar oracle.xml.parser.v2.oraxsl fileX.xml fileY.xsl > fileX.xls
You will need to either move the libraries to the local directory, my assumption above or include the full path to them. More info here on setting the -cp attribute.
There are two steps you have to take to "save big". First, become a full member of ODTUG for $99 and enjoy all of the benefits, including access to a members-only presentations library, throughout the year. Next, register for Kscope14 and you are eligible for the members-only price of $1500 for a savings of $150. While you are registering, simply use the code AOLAP to get an additional $100 discount!
My company, Applied OLAP, is one of top-tier Platinum Sponsors of Kscope14 and I will be there. I hope to see you at the conference and, if you were able to save some money by using our exclusive AOLAP code, be sure to stop our booth, say hello, and learn how the Dodeca Spreadsheet Management System can help your company reduce spreadsheet risk, increase spreadsheet accuracy, and reduce costs.
We all love a good commandline utility. It gives us that warm feeling of control and puts hairs on our chests. Either that, or it means we can script the heck out of a system, automate many processes, and concentrate on more important matters.
However, some of the OBIEE commandline utilities can’t be used in Production environments at many sites because they need the credentials for OBIEE stored in a plain-text file. Passwords in plain-text are bad, mmmm’kay?
Two of the utilities in particular that it is a shame that can’t be scripted up and deployed in Production because of this limitation are the Presentation Services Catalog Manager, and the Presentation Services Replication Agent. Both these perform very useful purposes, and what I want to share here is a way of invoking them more securely.Caveat
IANAC : I Am Not A Cryptographer! Nor am I a trained security professional. Always consult a security expert for the final word on security matters.
The rationale behind developing the method described below is that some some sites will have a “No Plaintext Passwords” policy which flatout prevents the use of these OBIEE utilities. However, at the same sites the use of SSH keys to enable one server to connect to another automatically is permitted. On that basis, the key-based encryption for the OBIEE credentials may therefore be considered an acceptable risk. As per Culp’s 9th law of security administration, it’s all about striking the balance between enabling functionality and mitigating risk.
The method described below I believe is a bit more secure that plaintext credentials, but it is not totally secure. It uses key based encryption to secure the previously-plaintext credentials that the OBI utility requires. This is one step better than plaintext alone, but is still not perfect. If an attacker gained access to the machine they could still decrypt the file, because the key is held on the machine without a passphrase to protect it. The risk here is that we are using security by obscurity (because the OBIEE credentials are in an encrypted file it appears secure, even though the key is held locally), and like the emperor’s new clothes, if someone takes the time to look closely enough there is still a security vulnerability.
My final point on this caveat is that you should always bear in mind that if an attacker gains access to your OBIEE machine then they will almost certainly be able to do whatever they want regardless, including decrypting the weblogic superuser credentials or reseting it to a password of their own choosing.Overview
Two new shiny tools I’ve acquired recently and am going to put to use here are GnuPG (
mkfifo. GPG provides key-based encryption and decryption and is available by default on common Linux distributions including Oracle Linux.
mkfifo is also commonly available and is a utility that creates named pipes, enabling two unreleated processes to communicate. For a detailed description and advanced usage of named pipes, see here
This is a one-time set up activity. We create a key in
gpg, and then encrypt the plain text credentials file using it.
The first step is to create a gpg key, using
gpg --gen-key. You need to specify a “Real name” to associate with the key, I just used “obiee”. Make sure you don’t specify a passphrase (otherwise you’ll be back in the position of passing plain text credentials around when you use this script).
$ gpg --gen-key gpg (GnuPG) 1.4.5; Copyright (C) 2006 Free Software Foundation, Inc. This program comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions. See the file COPYING for details. Please select what kind of key you want: (1) DSA and Elgamal (default) (2) DSA (sign only) (5) RSA (sign only) Your selection? DSA keypair will have 1024 bits. ELG-E keys may be between 1024 and 4096 bits long. What keysize do you want? (2048) 2048 [...] Real name: obiee Email address: Comment: [...] You don't want a passphrase - this is probably a *bad* idea! I will do it anyway. You can change your passphrase at any time, using this program with the option "--edit-key". [...] gpg: key 94DF4ABA marked as ultimately trusted public and secret key created and signed.
Once this is done, you can encrypt the credentials file you need for the utility. For example, the Catalog Manager credentials file has the format:
To encrypt it use
gpg --recipient obiee --output saw_creds.gpg --encrypt saw_creds.txt
Now remove the plaintext password file
Using the secure credentials file
Once we have our encrypted credentials file we need a way of using it with the utility it is intended for. The main thing we’re doing is making sure we don’t expose the plaintext contents. We do this using the named pipes method:
In this example I am going to show how to use the secure credentials file with
runcat.sh, the Catalog Manager utility, to purge the Presentation Services cache. However it should work absolutely fine with any utility that expects credentials passed to it in a file (or stdin).
There is a three step process:
Create a named pipe with
mkfifo. This appears on a disk listing with the
pbit to indicate that it is a pipe. Access to it can be controlled by the same
chmodprocess as a regular file. With a pipe, a process can request to consume from it, and anything that is passed to it by another process will go straight to the consuming process, in a FIFO fashion. What we’re doing through the use of a named pipe is ensuring that the plain text credentials are not visible in a plain text file on the disk.
Invoke the OBIEE utility that we want to run. Where it expects the plaintext credentials file, we pass it the named pipe. The important bit here is that the utility will wait until it receives the input from the named pipe – so we call the utility with an ampersand so that it returns control whilst still running in the background
gpgto decrypt the credentials file, and pass the decrypted contents to the named pipe. The OBIEE utility is already running and listening on the named pipe, so will receive (and remove from the pipe) the credentials as soon as they are passed from
The script that will do this is as follows:
# Change folder to where we're invoking the utility from cd $FMW_HOME/instances/instance1/bifoundation/OracleBIPresentationServicesComponent/coreapplication_obips1/catalogmanager # Create a named pipe mkfifo cred_pipe # Let's make sure only we can access it chmod 600 cred_pipe # Invoke Catalog Manager. Because we're using a named pipe, it's actually going to sit and wait until it gets input on the pipe, so we need to put the ampersand in there so that it returns control to the script ./runcat.sh -cmd clearQueryCache -online http://localhost:9704/analytics/saw.dll -credentials cred_pipe & # Decrypt the credentials and send them to the named pipe gpg --quiet --recipient obiee --decrypt saw_creds.gpg > cred_pipe # Remove the named pipe rm cred_pipe
Depending on the utility that you are invoking, you may need to customise this script. For example, if the utility reads the credentials file multiple times then using the named pipes method it will fail after the first read. Your option would be to read the credentials into the pipe multiple times (possibly a bit hacky), or land the plaintext credentials to disk and delete them after the utility complete (could be less secure if the delete doesn’t get invoked)Using a secure credentials file for command line arguments
Whilst the sticking point that triggered this article was around utilities requiring whole files with credentials in, it is also common to see command line utilities that want a password passed as an argument to them. For example, nqcmd :
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s myscript.lsql
Let’s assume we’ve created an encrypted file containing “Password01” (using the
gpg --encrypt method shown above) and saved it as password.gpg.
To invoke the utility and pass across the decrypted password, there’s no need for named pipes. Instead we can just use a normal (“unnamed”) pipe to send the output straight from gpg to the target utility (nqcmd in this example), via xargs:
gpg --batch --quiet --recipient obiee --decrypt ~/password.gpg | xargs -I GPGOUT nqcmd -d AnalyticsWeb -u weblogic -p GPGOUT -s input.lsql
xargs has a
--interactive option that makes it a lot easier when developing piped commands such as the above
Because there is no passphrase on the gpg key, a user who obtained access to the server would still be able to decrypt the credentials file. In many ways this is the same situation that would arise if a server was configured to use ssh-key authentication to carry out tasks or transfer files on another server.Uses
Here are some of the utilities that the above now enables us to run more securely:
- nqcmd is a mainstay of my OBIEE toolkit, being useful for performance testing, regression testing, aggregate building, and more. Using the method above, it’s now easy to avoid storing a plaintext password in a script that calls it.
- Keeping the Presentation Catalog in sync on an OBIEE warm standby server, using Presentation Services Replication
- Purging the Presentation Services Cache from the command line (with Catalog Manager, per the above example)
- SampleApp comes with four excellent utilities that Oracle have provided, however all but one by default requires plaintext credentials. If you’ve not looked at the utilities closely yet, you should! You can see them in action in SampleApp itself, or get an idea of what they do looking at the SampleApp User Guide pages 14–17 or watching the YouTube video.
Over the past few months I’ve been posting a number of articles about Hadoop, and how you can connect to it from ODI and OBIEE. From an ODI perspective, I covered Hadoop as one of a number of new data sources ODI11g could connect to, then looked at how it leveraged Hive to issue SQL-like data extraction commands to Hadoop, and how it used Oracle Hadoop connector tools to transfer Hadoop data into the Oracle Database, and directly work with data in HDFS files. For OBIEE, I went through the background to Hadoop, Hive and the other “big data” technologies, stepped through a typical Hive query session, then showed how OBIEE 188.8.131.52 could connect to Hadoop through its newly-added Hive adaptor, then finally built a proof-of-concept OBIEE connection through to Cloudera Impala, then extended that to a multi-node Hadoop cluster.
But why all this interest in Hadoop – what’s it really got to do with OBIEE and ODI, and why should you as developers be interested in what’s probably yet another niche BI/DW datasource? Well in my opinion, Hadoop is the classic disruptive technology – cheap, and starting-off with far less functionality than regular, relational databases – but it’s improving fast, and as BI&DW developers it offers the potential of both massive benefits – significantly lower TCO for basic DW work, and support for lots of modern, internet-scale use-cases – and threats – in that if we don’t understand it and see how it can benefit our customers and end-users, we risk being left-behind as technology moves on.
To my mind, there are two main ways in which Hadoop, Hive, HDFS and the other big-data ecosystem technologies are used, in the BI/DW context:
1. Standalone, with their own query tools, database tools, query languages and so forth – your typical “data scientist” use case, originating from customers such as Facebook, LinkedIn etc. In this context, there’s typically no Oracle footprint, users are pretty self-sufficient, any output we see is in the form of “insights”, marketing campaigns etc.
2. Alongside more mainstream, for example Oracle, technologies. In this instance, Hadoop, Hive, HDFS, NoSQL etc are used as complementary, and supporting, technologies to enhance existing Oracle-based data warehouses, capture processes, BI systems. In some cases, Hadoop-type technologies can replace more traditional relational ones, but mostly they’re used to make BI&DW systems more scaleable, cheaper to run, able to work with a wider range of data sources and so forth. This is the context in which Hadoop can be relevant to more traditional Oracle BI, ETL and DW developers.
To understand how this happened, let’s go through a bit of a history lesson. Five years ago or so, your typical DW+BI architecture looked like this:
The data warehouse was typically made-up of three layers – staging, foundation/ODS and performance/dimensional, with data stored in relational databases with some use made of OLAP servers, or some of the newer in-memory databases like Qlikview. But over the intervening years, the scale and types of data sources have increased, with customers now looking to store data from unstructured and semi-structured sources in their data warehouse, take in feeds from social media and other “streaming” sources, and access data in cloud systems typically via APIs, rather than traditional ETL loading. So now we end up with a data warehouse architecture that looks like this:
But this poses challenges for us. From an ETL perspective, how do we access these non-traditional sources, and once we’ve accessed them – how do we efficiently process them? The scale and “velocity” of some of these sources can be challenging for traditional ETL processes that expect to log every transformation in a database with transactional integrity and multi-version concurrency control, whilst in some cases it doesn’t make sense to try and impose a formal data structure on incoming data as you’re capturing it, instead giving it the structure when we finally need it, or when we choose to access it in a query.
And then came “Hadoop”, and its platform and tool ecosystem. At its core, Hadoop is a framework for processing, in a massively-parallel and low-cost fashion, large amounts of data using simple transformation building blocks – filtering (mapping) and aggregating (reducing). Hadoop and MapReduce came out of the US West Coast Internet scene as a way of processing web and behavioural data in the same massively-distributed way that companies provided web search and other web 2.0 activities, and a core part of it was that it was (a) open-source, like Linux and (b) cheap, both in being open-source but also because it was designed from the outset to run on low-cost, commodity hardware that’s expected to fail. Pretty much the opposite of Oracle’s business model, but also obviously very attractive to anyone looking to lower the TCO of their data warehouse system.
So as I said – the Hadoop pioneers went-out and built their systems without much reference to vendors such as Oracle, IBM, Microsoft and the like, and being blunt, they won’t have much time for traditional Oracle BI&DW developers like ourselves. But those customers who are largely invested in Oracle technology, but see advantages in deploying Hadoop and big data technologies to make their systems more flexible, scaleable and cheaper to run – that’s where ODI and OBIEE’s connectivity to these technologies becomes interesting.
To take the example of customers who are looking to deploy Hadoop technologies to enhance their Oracle data warehouse – a typical architecture going down this route would look like this:
In this example, we’re using HDFS – Hadoop Distributed File System – as a pre-staging area for the data warehouse, storing incoming files cheaply, and with build-in fault tolerance, to the point where storage is so cheap that you might as well keep stuff you’re not interested in now, but you think might be interesting in the future. Using Oracle Direct Connector for HDFS, you can set up Oracle Database external tables that map onto HDFS just like any other file system, so you can extract from and otherwise work with these files without worrying about writing MapReduce jobs; ODI, through Oracle Data Integration Adaptor for Hadoop, you can connect ODI to these table sources as well, and work with them just like any other topology source, as I show in the slide below from my upcoming UKOUG Tech’13 session on ODI, OBIEE and Hadoop that’s running in a couple of week’s time in Manchester:
As well as storing data, you can also do simple filtering and transformation on that data, using the Hadoop framework. Most upfront data processing you do as part of an ETL process involves filtering out data you’re not interested in, joining data sets, grouping and aggregating data, and other large-scale data transformation tasks, before you then load it into the foundation/ODS layer and do more complex work. And this simple filtering and transformation is what Hadoop does best, on cheap hardware or even in the cloud – and if your customer is already invested in ODI and runs the rest of their ETL process using it, its relatively simple to add Hadoop capabilities to it, using ODI to orchestrate the data processing steps but using Hadoop to do the heavy lifting, as my slide below shows:
Now some customers, and of course Hadoop vendors, say that in reality you don’t even need the Oracle database if you’re going to build a data warehouse, or more realistically a data mart. Now that’s a bigger question and probably one that depends on the particular customer and circumstances, but a typical architecture that takes this approach might look like this:
In this case, ODI again has capabilities to transform data entirely within Hadoop – with ODI acting as the ETL framework and co-ordinator, but Hadoop doing the heavy-lifting – and there’s always the ability to get the data of Hadoop and into a main Oracle data warehouse, if the Hadoop system is more of a data mart or deparment-specific analysis. But whichever way – in most cases the customer is going tho want to continue to use their existing BI tool, particularly if their BI strategy involves bringing together data from lots of different systems, as you can do with OBIEE’s federated query capability – giving you an overall architecture that looks like this:
So it’s this context that makes OBIEE’s connectivity to Hadoop so important; I’m not saying that someone creating a Hadoop system from scratch is going to go out and buy OBIEE as their query tool – more typically, they’ll use other open-source tools or create models in tools like R; or they might go out and buy a lightweight data visualisation tool like Tableau and use that to connect solely to their Hadoop source. But the customers we work with have typically got much wider requirements for BI, have a need for an enterprise metadata model, recognise the value of data and report governance, and (at least at present) access most of their data from traditional relational and OLAP sources. But they will still be interested in accessing data from Hadoop sources, and OBIEE’s new capability to connect to this type of data, together with closer integration with Endeca and its unstructured and semi-structured sources, addresses this need.
So there you have it – that’s why I think OBIEE and ODI’s ability to connect to Hadoop is a big deal, and it’s why I think developers using those tools should be interested in how it works, and should try and set up their own Hadoop systems and see how it all works. As I said, I’ll be covering this topic in some detail at the UKOUG Tech’13 Conference in Birmingham in a couple of weeks time, so if you’re there on the Sunday come along and I’ll try and explain how I think it all fits together.
The other day I posted an article on the blog about connecting OBIEE 184.108.40.206 to Cloudera Impala, a new “in-memory” SQL engine for Hadoop that’s much faster than Hive for interactive queries. In this example, I connected OBIEE 220.127.116.11 to the Cloudera Quickstart CDH4 VM, which comes with all the Hadoop and Cloudera tools pre-installed and configured, making it easy to get going with the Hadoop platform.
Whilst the example worked though, I couldn’t help thinking that using Impala against a single node Hadoop install isn’t really how it’d be used in real-life; in reality, if you used OBIEE in this way, you’re much more likely to be connecting to a full Hadoop cluster, with multiple server nodes handling the incoming queries and potentially gigabytes, terabytes or petabytes of data being processed. So it it possible to set up a Hadoop cluster that gets a bit nearer to this multi-node architecture, so we can practice connecting to a cluster and not a single server, and we can see Hadoop process our queries across all of the nodes – as we’d see in real life, given that this low-cost MPP processing is the key benefit of Hadoop as a whole?
Hadoop, as you’re probably aware, was designed from the ground-up to run across multiple nodes, with those nodes typically either being small, low-cost servers, or in many cases servers running in the “cloud”. As such, you’re as likely to see Hadoop running on a cluster of Amazon EC2 server as running on physical servers in a datacenter, and in most cases the underlying OS running on those servers is Linux – most usually, Ubuntu 64-bit. So if we want to set up our own Hadoop cluster, there’s a few options open to us:
1. Get hold of a bunch of physical servers (maybe, old PCs or blade servers), install Linux and Hadoop on them, and then do the configuration and setup manually.
2. Buy a preconfigured solution – Oracle’s Big Data Appliance, for example, which has all the software pre-installed along with connectivity to ODI, Oracle Database etc
3. Spin-up a bunch of servers in the Cloud, or
4. Spin-up a bunch of Linux VMs, for example using VirtualBox or VMWare Fusion/Workstation
In the past I’ve done the Hadoop setup myself, manually, using the Hadoop distribution files available on the Hadoop website, but more recently vendors such as Hortonworks, MapR and Cloudera have put together their own Hadoop + added value tools distributions, and it’s Cloudera I’ve been looking at in most detail recently (if only because there’s a lot of ex-Oracle people there who I know, and it’s the bundled Hadoop distribution that comes with Oracle Big Data Appliance). What’s particularly good about Cloudera’s Hadoop offering is their “Cloudera Manager” utility – of which there’s a free version – and which simplifies the whole process of setting up a cluster by automating most of the process.
What’s also particularly interesting about Cloudera and Cloudera Manager, is that there are a number of solution available out there that automate the process of spinning-up clusters. One of them, described in this blog post on Cloudera’s website, involves using a built-in feature in Cloudera Manager to automatically create, provision and configure X number of Amazon EC2 virtual servers, with this servers then able to handle your queries in parallel and you just paying by the hour for the compute resource you need. So let’s give it a try.
I won’t go into a tutorial or explanation into Amazon Web Services and their EC2 (“Elastic Compute Cloud”) service here, suffice to say that you can create on-demand cloud-based VMs, paying by the hour and with pricing based on the size of instance, amount of memory needed, OS needed and so forth. We use Amazon AWS and EC2 extensively within Rittman Mead for development work, training environments and so forth, and what’s particularly interesting about AWS is the fact it’s all scriptable, there’s a public API and so forth. It’s this public API that Cloudera Manager uses to provision and spin-up the cluster VMs, something Cloudera Manager will automatically offer to do if it detects it’s running on Amazon EC2. Once you’ve provisioned the basic VMs, Cloudera Manager will automatically install the required Hadoop software on each of the VMs, meaning you can just sit back and watch the install, and then at the end, log in and check it’s all working.
And it did work – compared with problems I’d had with earlier versions of Cloudera Manager I’d had, where I’d set up the servers myself, installed Linux and Cloudera Manager myself, it all went amazingly well – to the point where I could upload some data into Impala itself, run some queries, and make use of my cloud-based Hadoop cluster.
And for one-off tasks, particularly where you need a very large amount of compute resource for a relatively small amount of time, Amazon AWS is great, but the cost soon starts to mount-up if you leave it running for too long – reckon on a cost of around $200-$400 for a reasonably-high specced instance for a month, multiplied by the amount of servers in your cluster.
So what are the alternatives? Well as I said before, you could set up a cluster yourself, installing the various bits of software, and potentially creating lots of VMs on a test server to host the Hadoop nodes. But another route you could take is to use one of the various “devops” tools out there to automate the build of a number of Hadoop nodes, using virtualisation tools such as VMWare or VirtualBox and a reasonably well-specced desktop or laptop. Back on the Cloudera website again, I saw a blog post and how-to just along these lines – one that used a devops tool called Vagrant to script and automate the build of the cluster, including setting up Ubuntu Linux on the VMs, and downloading and installing Cloudera Manager, just leaving the job of configuring the Hadoop cluster to us afterwards. This is the approach I finally went with in the end, and to do the same you’d need the following bits of kit and software:
- A desktop or server with a lot of RAM – I’ve used my iMac which has 32GB RAM; you could use a 16GB server or laptop but you’ll need to alter the Vagrant scripts to reflect this
- Either VirtualBox (supported by default by Vagrant) or VMWare Fusion / Workstation (which require an extra-cost plugin for Vagrant)
- Vagrant itself – a free download with install instructions here
Vagrant itself is an easy install and is available for OS X, Windows, Linux etc, and you can download Vagrant configuration flle for setting everything up from the Cloudera blog post. Then, it was a case of running the Vagrant script, and watching it create my VMs.
In the terminal screenshot above, you can see the VMs begin created (I started off using VirtualBox, later on I switched to VMWare Fusion), and in the screenshot below, you can see the various Cloudera packages being downloaded and installed.
Once the Vagrant set-up of the VMs was complete, I then logged into the Cloudera Manager website, and similar to how I’d done it with the EC2-based install, I just selected the other VMs to configure, chose the software components, and let the configuration complete.
At the end of the install process, I had six VMs running to provide by Hadoop cluster, each one using about 4GB of memory, and playing the following roles:
To be honest – you still need to know a bit about Hadoop, what the various bits do and so on to get it working – but then you also do to get Oracle installed, SQL Server, and so on. I guess the real barrier is having a machine big enough to run multiple Hadoop server nodes – too few and you don’t really see how the query processing works – so I guess this is why the cloud / EC2 route is so popular. But for me, I’ve got the six nodes working now, along with an OBIEE Windows VM with 18.104.22.168 installed to test out the connectivity. The screenshot below shows Cloudera Manager listing out the nodes in the cluster:
whilst the screenshot below this shows the various Hadoop platform elements listed out alongside the cluster nodes (a.k.a. VirtualBox/VMWare VMs) they’re running on.
and with OS X’s Activity Monitor showing they’re (just about) comfortably running within the overall 32GB RAM in the iMac.
So – the moment of truth – let’s try out some queries. I’ll start with Hive first of all, as Hue (Hive’s web-based UI) has some nice tools for uploading files and creating Hive tables out of them – or of course you can use ODI and it’s Hadoop Adapter and upload some data to the cluster as part of an ETL process. To use a more meaningfully-large dataset, I unloaded some of the tables from the full Airline Delays dataset to CSV files (around 180m rows of flight leg data), and then created Hive tables out of those – the screenshot below shows data from the main flight leg fact table.
In the background, two things happen when you upload new data into Hive; first, the file containing the data is stored in Hadoop’s filesystem, called HDFS (Hadoop Distributed File System), a unix-like distributed filesystem that breaks data down into blocks, and stores the blocks redundantly across the nodes in the cluster. If we take a look at the file I uploaded with the flight delays fact table data in it, you can see that it’s been broken down into blocks as shown at the bottom of the page:
If you click on an individual block, you can also see that the block is stored primarily on one node, and then redundantly on three other nodes in the cluster.
HDFS does this for two reasons; first, by spreading the data file over multiple servers, it can take advantage of the parallel processing provided by the Hadoop framework. Second, though, this redundancy means that if any node goes down, there’s copies of the data blocks elsewhere in the cluster, giving you the ability to use low-cost, commodity hardware (or cloud-based servers) whilst still protecting uptime, and your data.
So let’s run a query via the Hue UI, using Hive first of all. I put together a simple query that sums up flights, and averages distances, for all flights with California as the destination. As you can see from the Hue screenshot below, the query triggered two MapReduce jobs, one to find all flights with Californian destinations (the “map”), and one to aggregate the results (the “reduce”).
Looking at the MapReduce jobs being spun-up, run and then results gathered in, you can see that the MapReduce element (i.e., query time) took just under three minutes.
Going over the Job Tracker / MapReduce admin pages in Cloudera Manager, you can see the MapReduce jobs that were triggered by the Hive query – see how it handles the join, and how the filtering (mapping) is handled separately to the aggregating (reducing).
You can also bring up the Hadoop task tracker page, to see how the task track gave out chunks of the work to the various nodes in the cluster, and then got the results back in the end.
So – what about the Impala equivalent of the same query, then? Let’s give it a go. Well the query is more or less the same, but this time the results come back in around ten seconds, as we’d expect with Impala.
Looking inside Cloudera Manager, you can see the various Impala server processes working in the background, bypassing the need to generate MapReduce code and instead, using their own in-memory MPP framework to query the HDFS files and return the results.
And finally – the $64,000 question – can we connect OBIEE to the cluster? To do this, you’ll need to download the Cloudera Impala ODBC drivers, as I outlined in my previous blog post on the subject, but once you do, it should work – see the screenshot below where I’m querying the flight delays data using OBIEE 22.214.171.124.
So there you have it – a couple of ways you can spin-up your own multi-node Hadoop cluster, and confirmation that it should all still work with OBIEE once it’s put together.