- Partner Webcast - Oracle Data Integration for Big Data (Oracle Partner Hub: ISV Migration Center Team)
via Oracle Partner Hub: ISV Migration Center Team https://blogs.oracle.com/imc/
The theme was Internet of Things (IoT) and this marked the first time that they organized a hackathon specifically for an outside partner, eProseed. All the previous hackathons were internal Oracle events. Initially the plan was for us Oracle folks to go over and mentor teams as needed, but later on, the decision was made to place us as technical resources in a team and actually participate. After some initial hiccups with my own team, I ended up in a team with Lonneke Dikmans (@lonnekedikmans), Karel Bruckman, DJ Ursal and Antonio Aguilar. Here’s what happened next …
If you have ever been to the Netherlands, you probably noticed they like bikes … a lot! This is the first thing you see when you get off the train in Utrecht:
Not exactly organized.
Lonneke’s team’s idea was to solve this with some IoT ingenuity. We tried to solve the following issues:
- For the individual:
- Where can I park my bike, i.e. where are there free bike parking spots?
- Where did I park my bike?
- How do I pay?
- For the city/municipality/parking management company:
- Where are there free parking spots?
- What is the overall usage, for parking management/planning?
- How long has a bike occupied a parking spot, for payment?
The tools at our disposal were a Raspberry Pi with a fully loaded GrovePi kit and an Oracle Mobile Cloud Service (MCS) account. We were free to use any other tools as needed, but we decided to stick with these as they were freely available. Plus, we had direct access to domain experts on site.
We used sensors in the GrovePi kit to detect a bike’s presence in the bike rack. As soon as we detected a bike being put into the rack, we used a Raspberry Pi camera to take a picture of the (presumably) bike owner and identified the person using her/his own phone. Users of the parking system had to register themselves so we could identify and charge them, but this part we did not build as part of the hackathon. We then sent a notification to the person’s phone using MCS. This notification contained the picture we took, the location of the bike and the time it was parked.
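Strip away the hardware and the parking/removal detection is just edge detection on a stream of distance readings. Here is a small Python sketch of that logic; the function names and the 10 cm threshold are illustrative assumptions, not the code we actually ran on the Pi:

```python
def rack_occupied(distance_cm, threshold_cm=10.0):
    """A rack slot reads as occupied when the ranger sees something
    closer than the threshold (assumed value, for illustration)."""
    return distance_cm < threshold_cm

def detect_events(readings, threshold_cm=10.0):
    """Turn (timestamp, distance) readings into park/remove events."""
    events, occupied = [], False
    for ts, dist in readings:
        now_occupied = rack_occupied(dist, threshold_cm)
        if now_occupied and not occupied:
            events.append((ts, "bike_parked"))   # trigger camera + MCS notification
        elif not now_occupied and occupied:
            events.append((ts, "bike_removed"))  # trigger usage/charge notification
        occupied = now_occupied
    return events
```

A bike sliding into the rack flips the state to occupied, which is the moment we snapped the picture and pushed the notification.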
The location of the bike could be traced using a phone app and a web application. This app could also be used to keep track of how long the bike had been parked and how much this was going to cost the user.
As soon as the bike was removed from the bike rack, another notification would be sent to the bike’s owner through MCS informing her/him of the usage time and how much the charge would be, and the system would automatically charge the user.
Besides the app for end users, we also had a dashboard that could be used by the parking management company. This could be a municipality or a for-profit company. The dashboard web application gave an overview of bike distribution throughout the company’s territory, e.g. the city.
This would allow the company to direct cyclists to places where there were free bike racks. Over time, we would also collect bike rack usage data that could be used to enhance parking infrastructure and overall usage, e.g. with this data you can predict usage peaks and proactively redirect cyclists, plan where more parking is needed or inform city planners on how to avoid “parking hot spots.”
In the end, our entry was good for third place.
Mark.
by Anna Akhmatova
an interpretive translation by Greg Pavlik

The righteous Lot heard the voice of God
As if coming from the distant and black mountains.
But his wife,
She saw what was until yesterday her blessing.

Under the beautiful spires of Sodom,
Where she sang spinning cloth -
The empty window of the room,
In which her children were born.

She looked – and her pain died with her,
For she could look no more:
Her body translucent salt,
Her feet joined the earth.

Who will mourn for Marah?
An insignificant role in a grand saga -
Yet my conscience cannot forget
The one who gave her life for a fleeting glance.

-----
Original poem:

И праведник шел за посланником Бога,
Огромный и светлый, по черной горе.
Но громко жене говорила тревога:
Не поздно, ты можешь еще посмотреть
На красные башни родного Содома,
На площадь, где пела, на двор, где пряла,
На окна пустые высокого дома,
Где милому мужу детей родила.
Взглянула – и, скованы смертною болью,
Глаза ее больше смотреть не могли;
И сделалось тело прозрачною солью,
И быстрые ноги к земле приросли.
Кто женщину эту оплакивать будет?
Не меньшей ли мнится она из утрат?
Лишь сердце мое никогда не забудет
Отдавшую жизнь за единственный взгляд.
Ever look at a screen’s output and get that puckered feeling in the pit of your stomach? If you have been working in this profession for any amount of time, you know the feeling I’m talking about. The feeling that makes you think you would rather be living in Montana making woodcarvings at a roadside stand than being a DBA. I’ll be taking a somewhat lighthearted look at the perils of our profession and discuss ways to reduce problem occurrences.
The Perils of our Profession
One of the common challenges that all DBAs face, no matter what vendor’s database they work on, is the absolute attention to detail our profession demands. Switch a couple of characters in a script, forget to set your SID, set the wrong flag at the wrong time and the end result usually isn’t pretty. Many commands we issue on a regular basis are destructive by their very nature. This is the reason why I have a great respect for all technicians who have selected database administration as their chosen profession.
I know they have all experienced that uncontrolled “eye-twitching” at 2 AM when they are ready to hit the final enter key to execute the command. You know what command I’m talking about too. It’s that one command that you really, really, really hope is going to come back with a successful return code and ultimately end with a database that is finally usable. Whether it’s a recovery, a file fix or corrupt data is immaterial, it’s the wait that we are talking about.
No wait in the DBA profession is longer than the one for the message below after a database recovery:
Time always seems to stand still. The longer the recovery, the messier the recovery, the more critical the database – the longer you wait. You stare at the screen hoping beyond hope that the above message will appear. It’s the ritual cross your fingers, spin around three times, face towards Oracle headquarters and pray to everything that is Larry Ellison wait. I don’t care how sure you are of your capabilities, or how much of an Oracle “Ace” you are – you know the anticipation I’m talking about. You then either breathe a sigh of relief or you are in absolute disgust when you see an Oracle error message appear.
Not only must we try to prevent our own mistakes, we must safeguard our environments against the mistakes of others. Operating system administrators, disk storage technicians and application developers are just like us. We are all part of the human community that makes mistakes from time to time.
If you never make mistakes, send me a resume. I’m always looking for a “Patron Saint of Oracle” here at RDX. It will also save us on travel costs because I’m sure you’ll be able to spread your wings and fly here on your own.
As my old boss used to tell me (when I was a VERY junior DBA), “It really doesn’t make a difference who broke the database. You are the technician who is ultimately responsible for fixing it. The buck stops with you. If you can’t protect your environments, you aren’t doing your job.” We all know he’s absolutely correct.
Then there are the software glitches. The problems that pop up out of the blue and make you go:
“WHAT THE? – How did THAT happen? I’ve done this 317 times in a row and it worked every time.”
For you math majors, here’s my calculation for this:
CLOSER YOU ARE TO PRODUCTION TURNOVER
+ THE GREATER THE VISIBILITY OF THE PROJECT
= THE MORE LIKELY A PREVIOUSLY UNKNOWN SOFTWARE GLITCH WILL OCCUR
I don’t care what software you are using, you will run into the “only occurs on this release, on this version of the operating system, using this particular feature on the third Tuesday of the sixth month when it’s cloudy outside” BUG. Be sure to expect management to stop by and ask “Well, why didn’t you test this on the third Tuesday of the sixth month when it was cloudy outside?”
The more complex the database ecosystem, the more paranoid I become. Which is why I’m not a follower of “the database is getting so easy – we won’t need DBAs” mantra that mindless industry pundits profess on a seemingly endless basis.
So now we know that our jobs are somewhat unforgiving and that we all make mistakes from time to time. What can we do to reduce the chance of an error occurring?
Poka-Yoke for DBAs!
Poka-Yoke is a Japanese term that means “fail-safing” or “mistake-proofing.” Wikipedia’s definition of Poka-Yoke is: “Its purpose is to eliminate product defects by preventing, correcting or drawing attention to human errors as they occur.”
Since I’m a car nut, here are a couple of automotive Poka-Yoke examples. You can’t take the keys out of most modern cars until the car is in park. In addition, most cars won’t allow you to shift out of park until the key is in the “ON” position. How about gas caps that have the little tether that prevents us from driving off without the cap? Most gas caps are also attached using a ratchet assembly that ensures proper tightness and prevents over-tightening.
Take a look around you, you’ll see dozens of Poka-Yokes during your daily activities:
- The little holes in bathroom sinks that prevent overflows
- Microwaves will stop when the door is opened
- Dryers that also stop when the door is opened
- Lawn mowers that have a safety bar that must be depressed before they will run
- Disk brakes that begin to make a noise before they are completely ground down
- Rumble strips on roads
The list really is endless. We have applied the Poka-Yoke process to our daily activities here at RDX. We have checklists, process documentation, best practices, sign-off sheets – the works.
I’d be very interested to learn your Poka-Yoke ideas! If you have one, please respond and we’ll be glad to discuss it. Here are some general ones that I recommend.
The Second Set of Eyes
As I have stated in previous blogs, I have over 20 years of experience using Oracle and have done my fair share of database backups and recoveries. During my career as an Oracle instructor, I have assisted in hundreds of database recoveries in Oracle’s classroom environments. During later stages of my career, I still had others review my recovery strategy and recovery steps before I began the recovery process. I used backup and recovery just as an example. Whatever the process is you are performing, a second opinion may prevent you from making a mistake. A review from a fellow DBA has saved me more than once. I may be described as having an ego (I have no idea where they get that opinion), but it doesn’t prevent me from asking for help from others.
A while back, a few RDX DBAs were correcting a very poor third-party utility backup script that was created by a customer’s previous database support vendor. The third-party backup storage utility was overly complex, but it was the product the customer standardized on years ago. The customer described this particular environment as an “if it goes down, we lose our ability to make money” application. After the massive set of changes was complete, two DBAs went line-by-line verifying each line of the backup script. At the end of each script they asked each other, “Are you OK with this?” Only then did they move on to the next one. I don’t care how much time you have “in the seat” using Oracle, you need to put your ego aside at times and have someone check your work on critical activities. Our next step for this customer was, you guessed it, High Availability implementation.
I used to work for a shop that subscribed to the “everybody in one big room” philosophy. I guess it was supposed to allow everyone to work together as a team and become as “one with each other.” It may have achieved that purpose, but it sure didn’t allow you to concentrate on your work very well. You could hear so many different conversations that they had to pump in white noise. The constant “whhhsssssshhhssshhh” noise made me feel like I was a crew member of the Starship Enterprise.
Like all DBA units, our particular area was often populated with various developers and O/S technicians. Many different conversations were occurring, some that could be described as somewhat animated. The environment did not allow you to concentrate on the task at hand. We often had to go into small conference rooms to work on critical tasks.
The point I’m trying to make is that no matter what type of environment you work in: if you can concentrate, OK. However, if you are like me and you can’t, find a spot where you can. Block off some time, send questions to other DBAs and concentrate on the task at hand. Don’t attempt to answer questions and code a complex script at the same time. This may seem obvious, but throughout my career, I have personally watched numerous DBAs attempt to multitask when they are working on a critical process. It’s a recipe for a problem. Once you are done, follow rule number one and have someone review your work.
What Database Are You Working IN?
Ever work in the wrong database? Working in the wrong database is a common problem for database experts as well as their less experienced counterparts. How many times have YOU found yourself running statements in the wrong environment? Feel free to include me in that not so select group. The operating system command SET can be used in Windows systems to display environment variables. The ENV command can be used to display the environment variables in UNIX. Many seasoned database administrators change their UNIX shell prompt in their profile to display the current Oracle SID. Displaying the current Oracle SID in the shell’s prompt provides a continuous reminder to the DBA of the database they are working in. Google it – you’ll find dozens of scripts by your fellow DBAs.
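The prompt trick catches the eye; a script can enforce the same check outright. As a variation on the same Poka-Yoke, here is a hypothetical Python pre-flight guard (the function name, and using the ORACLE_SID environment variable as the check, are my assumptions for illustration):

```python
import os

def assert_target_sid(expected_sid):
    """Abort before doing any work if the session's ORACLE_SID is not
    the database this script was written for. Illustrative sketch only;
    how your shop sets environment variables will vary."""
    actual = os.environ.get("ORACLE_SID")
    if actual != expected_sid:
        raise RuntimeError(
            f"Refusing to run: ORACLE_SID is {actual!r}, expected {expected_sid!r}"
        )
```

Call it as the first line of any destructive script; the wrong environment fails loudly instead of quietly doing damage.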
Saving Time VS Creating a Problem
At a large manufacturing firm, I once watched a fellow DBA perform a rather complex set of administrative tasks to solve a problem. He was rapidly flipping back and forth between at least 15 active screens, copying and pasting and editing and copying and pasting and editing… I describe this particular activity as “Multiple Screen Syndrome.” He also had several other screens open that were connected to other databases. He was multi-tasking to its highest degree. Take a break, take a breath and look at what you are doing.
How about the rm -r /u0*/ora*/prod*/*/*.* command in UNIX? It’s the command that drops multiple databases in multiple directories – all in one painful swoop. How many times have you heard of a command like this causing mass mayhem? When you make a mistake like this, you become immortalized in conversations for years to come. Get a few technicians together after work and ultimately the conversation will include, “Remember when Bob so-and-so ran that big rm -r command by mistake and wiped out the entire O/S on our production web server?” You can’t tell me you haven’t heard stories like this.
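One Poka-Yoke for destructive wildcards is a dry run: expand the pattern and review the victim list before anything is removed. A minimal Python sketch, with hypothetical names:

```python
import glob

def dry_run_delete(pattern):
    """Expand a wildcard pattern and print what WOULD be removed,
    without deleting anything. Review the list, then (and only then)
    hand the same pattern to the real delete."""
    victims = sorted(glob.glob(pattern))
    for path in victims:
        print("would remove:", path)
    return victims
```

Only after the list looks right do you run the destructive command itself.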
My opinion is that I would rather you take your time than showcase your multi-tasking and time saving skills. The more complex and critical the activity, the more basic you should become in your plan of attack. Trust me when I say I won’t be impressed with your time savings “cut and paste” and wildcard expertise if I think it can even remotely be dangerous.
Safety First Mindset
You need to think “Safety First” when you are performing any particularly complex or critical activity. Take the time and put one or two safeguards in place.
Other DBAs may call you paranoid; I’ll call you an experienced DBA that would rather be safe than sorry.
The intent of this blog post was not to provide you with a laundry list of recommendations; it was intended to jump-start your creative juices and get you thinking about different methods to protect yourself against problems. If you have any helpful hints, please feel free to respond to this blog with your Safety First Tips and Tricks.
Thanks for reading,
The post The Art of Being a Successful DBA- Paranoid DBA Best Practices appeared first on Remote DBA Experts.
This is part of a three-post series on Kudu, a new data storage system from Cloudera.
- Part 1 is an overview of Kudu technology.
- Part 2 is a lengthy dive into how Kudu writes and reads data.
- Part 3 (this post) is a brief speculation as to Kudu’s eventual market significance.
Combined with Impala, Kudu is (among other things) an attempt to build a no-apologies analytic DBMS (DataBase Management System) into Hadoop. My reactions to that start:
- It’s plausible; just not soon. What I mean by that is:
- Success will, at best, be years away. Please keep that in mind as you read this otherwise optimistic post.
- Nothing jumps out at me to say “This will never work!”
- Unlike when it introduced Impala — or when I used to argue with Jeff Hammerbacher pre-Impala — this time Cloudera seems to have reasonable expectations as to how hard the project is.
- There’s huge opportunity if it works.
- The analytic RDBMS vendors are beatable. Teradata has a great track record of keeping its product state-of-the-art, but it likes high prices. Most other strong analytic RDBMS products were sold to (or originated by) behemoth companies that seem confused about how to proceed.
- RDBMS-first analytic platforms didn’t do as well as I hoped. That leaves a big gap for Hadoop.
I’ll expand on that last point. Analytics is no longer just about fast queries on raw or simply-aggregated data. Data transformation is getting ever more complex — that’s true in general, and it’s specifically true in the case of transformations that need to happen in human real time. Predictive models now often get rescored on every click. Sometimes, they even get retrained at short intervals. And while data reduction in the sense of “event extraction from high-volume streams” isn’t that big a deal yet in commercial apps featuring machine-generated data — if growth trends continue as many of us expect, it’s only a matter of time before that changes.
Of course, this is all a bullish argument for Spark (or Flink, if I’m wrong to dismiss its chances as a Spark competitor). But it also all requires strong low-latency analytic data underpinnings, and I suspect that several kinds of data subsystem will prosper. I expect Kudu-supported Hadoop/Spark to be a strong contender for that role, along with the best of the old-school analytic RDBMS, Tachyon-supported Spark, one or more contenders from the Hana/MemSQL crowd (i.e., memory-centric RDBMS that purport to be good at analytics and transactions alike), and of course also whatever Cloudera’s strongest competitor(s) choose to back.
This is part of a three-post series on Kudu, a new data storage system from Cloudera.
- Part 1 is an overview of Kudu technology.
- Part 2 (this post) is a lengthy dive into how Kudu writes and reads data.
- Part 3 is a brief speculation as to Kudu’s eventual market significance.
Let’s talk in more detail about how Kudu stores data.
- As previously noted, inserts land in an in-memory row store, which is periodically flushed to the column store on disk. Queries are federated between these two stores. Vertica taught us to call these the WOS (Write-Optimized Store) and ROS (Read-Optimized Store) respectively, and I’ll use that terminology here.
- Part of the ROS is actually another in-memory store, aka the DeltaMemStore, where updates and deletes land before being applied to the DiskRowSets. These stores are managed separately for each DiskRowSet. DeltaMemStores are checked at query time to confirm whether what’s in the persistent store is actually up to date.
- A major design goal for Kudu is that compaction should never block – nor greatly slow — other work. In support of that:
- Compaction is done, server-by-server, via a low-priority but otherwise always-on background process.
- There is a configurable maximum to how big a compaction process can be — more precisely, the limit is to how much data the process can work on at once. The current default figure = 128 MB, which is 4X the size of a DiskRowSet.
- When done, Kudu runs a little optimization to figure out which 128 MB to compact next.
- Every tablet has its own write-ahead log.
- This creates a practical limitation on the number of tablets …
- … because each tablet is causing its own stream of writes to “disk” …
- … but it’s only a limitation if your “disk” really is all spinning disk …
- … because multiple simultaneous streams work great with solid-state memory.
- Log retention is configurable, typically the greater of 5 minutes or 128 MB.
- Metadata is cached in RAM. Therefore:
- ALTER TABLE kinds of operations that can be done by metadata changes only — i.e. adding/dropping/renaming columns — can be instantaneous.
- To keep from being screwed up by this, the WOS maintains a column that labels rows by which schema version they were created under. I immediately called this MSCC — Multi-Schema Concurrency Control — and Todd Lipcon agreed.
- Durability, as usual, boils down to “Wait until a quorum has done the writes”, with a configurable option as to what constitutes a “write”.
- Servers write to their respective write-ahead logs, then acknowledge having done so.
- If it isn’t too much of a potential bottleneck — e.g. if persistence is on flash — the acknowledgements may wait until the log has been fsynced to persistent storage.
- There’s a “thick” client library which, among other things, knows enough about the partitioning scheme to go straight to the correct node(s) on a cluster.
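To make the bounded-compaction point concrete, here is a toy Python picker. The 128 MB budget figure comes from the discussion above, but the smallest-first selection policy is my own simplification for illustration, not Kudu’s actual heuristic:

```python
# Toy model of "never compact more than a bounded amount at once".
BUDGET_BYTES = 128 * 1024 * 1024  # default figure cited above

def pick_compaction_input(rowset_sizes, budget=BUDGET_BYTES):
    """Pick DiskRowSets (name -> size in bytes) whose combined size
    fits the budget, so a single compaction pass stays bounded.
    Greedy smallest-first is an assumed policy, for illustration."""
    chosen, total = [], 0
    for name, size in sorted(rowset_sizes.items(), key=lambda kv: kv[1]):
        if total + size <= budget:
            chosen.append(name)
            total += size
    return chosen, total
```

However the real optimization chooses its 128 MB, the key property is the same: the background process never signs up for more work than the cap allows.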
Leaving aside the ever-popular possibilities of:
- Cluster-wide (or larger) equipment outages
the main failure scenario for Kudu is:
- The leader version of a tablet (within its replica set) goes down.
- A new leader is elected.
- The workload is such that the client didn’t notice and adapt to the error on its own.
Todd says that Kudu’s MTTR (Mean Time To Recovery) for write availability tests internally at 1-2 seconds in such cases, and shouldn’t really depend upon cluster size.
Beyond that, I had some difficulties understanding details of the Kudu write path(s). An email exchange ensued, and Todd kindly permitted me to post some of his own words (edited by me for clarification and format).
Every tablet has its own in-memory store for inserts (MemRowSet). From a read/write path perspective, every tablet is an entirely independent entity, with its own MemRowSet, rowsets, etc. Basically the flow is:
- The client wants to make a write (i.e. an insert/update/delete), which has a primary key.
- The client applies the partitioning algorithm to determine which tablet that key belongs in.
- The information about which tablets cover which key ranges (or hash buckets) is held in the master. (But since it is cached by the clients, this is usually a local operation.)
- It sends the operation to the “leader” replica of the correct tablet (batched along with any other writes that are targeted to the same tablet).
- Once the write reaches the tablet leader:
- The leader enqueues the write to its own WAL (Write-Ahead Log) and also enqueues it to be sent to the “follower” replicas.
- Once it has reached a majority of the WALs (i.e. 2/3 when the replication factor = 3), the write is considered “replicated”. That is to say, it’s durable and would always be rolled forward, even if the leader crashed at this point.
- Only now do we enter the “storage” part of the system, where we start worrying about MemRowSets vs DeltaMemStores, etc.
Put another way, there is a fairly clean architectural separation into three main subsystems:
- Metadata and partitioning (map from a primary key to a tablet, figure out which servers host that tablet).
- Consensus replication (given a write operation, ensure that it is durably logged and replicated to a majority of nodes, so that even if we crash, everyone will agree whether it should be applied or not).
- Tablet storage (now that we’ve decided a write is agreed upon across replicas, actually apply it to the database storage).
These three areas of the code are separated as much as possible — for example, once we’re in the “tablet storage” code, it has no idea that there might be other tablets. Similarly, the replication and partitioning code don’t know much of anything about MemRowSets, etc. – that’s entirely within the tablet layer.
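Two of those subsystems reduce to a few lines in toy form: partitioning routes a key to a tablet, and consensus replication declares a write durable once a majority has logged it. This Python sketch uses hash buckets and SHA-1 purely as illustrative assumptions, not Kudu internals:

```python
import hashlib

def tablet_for_key(primary_key, num_tablets):
    """Hash-partition a key to a tablet, the way a thick client with
    cached partition metadata can route a write without asking the
    master. Bucketing scheme is assumed, for illustration."""
    digest = hashlib.sha1(primary_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_tablets

def is_replicated(acks, replication_factor=3):
    """A write counts as replicated once a majority of the replicas'
    write-ahead logs have it (2 of 3 at the default factor)."""
    return acks >= replication_factor // 2 + 1
```

Note how neither function needs to know anything about MemRowSets or delta stores; that separation is exactly the point of the architecture described above.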
As for reading — the challenge isn’t in the actual retrieval of the data so much as in figuring out where to retrieve it from. What I mean by that is:
- Data will always be either in memory or in a persistent column store. So I/O speed will rarely be a problem.
- Rather, the challenge to Kudu’s data retrieval architecture is finding the relevant record(s) in the first place, which is slightly more complicated than in some other systems. Upon being given the requested primary key, Kudu still has to:
- Find the correct tablet(s).
- Find the record(s) on the (rather large) tablet(s).
- Check various in-memory stores as well.
The “check in multiple places” problem doesn’t seem to be of much concern, because:
- All that needs to be checked is the primary key column.
- The on-disk data is front-ended by Bloom filters.
- The cases in which a Bloom filter returns a false positive are generally the same busy ones where the key column is likely to be cached in RAM.
- Cloudera just assumes that checking a few different stores in RAM isn’t going to be a major performance issue.
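For readers who haven’t met one, a Bloom filter answers “definitely absent” cheaply, at the cost of occasional false positives. A tiny sketch, where the sizing and hashing choices are my assumptions rather than Kudu’s tuned filters:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions set bits in a bitmap.
    A miss on any bit means the key is definitely absent; all bits
    set only means the key MIGHT be present."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bitmap packed into a Python int

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

Front-ending each DiskRowSet with such a filter lets most point lookups skip the on-disk key column entirely.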
When it comes to searching the tablets themselves:
- Kudu tablets feature data skipping among DiskRowSets, based on value ranges for the primary key.
- The whole point of compaction is to make the data skipping effective.
Finally, Kudu pays a write-time (or compaction-time) cost to boost retrieval speeds from inside a particular DiskRowSet, by creating something that Todd called an “ordinal index” but agreed with me would be better called something like “ordinal offset” or “offset index”. Whatever it’s called, it’s an index that tells you the number of rows you would need to scan before getting the one you want, thus allowing you to retrieve (except for the cost of an index probe) at array speeds.
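Both retrieval ideas fit in a few lines of toy Python: per-rowset min/max key ranges prune whole DiskRowSets, and a binary search over a sorted key column plays the role of the ordinal offset. The data layout here is an assumption for illustration, not Kudu’s on-disk format:

```python
import bisect

def find_row(disk_rowsets, key):
    """disk_rowsets: list of (min_key, max_key, sorted_keys, values).
    Range check = data skipping; bisect = ordinal offset lookup."""
    for lo, hi, keys, values in disk_rowsets:
        if not (lo <= key <= hi):
            continue                       # data skipping: range excludes key
        i = bisect.bisect_left(keys, key)  # ordinal offset within the rowset
        if i < len(keys) and keys[i] == key:
            return values[i]               # fetch at "array speed"
    return None
```

Compaction earns its keep here: the tighter and less overlapping the rowset ranges, the more often the range check alone settles the question.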
Following on from my earlier posting of problems with temporary tables and the PL/SQL result cache (a combination the manuals warn you against), here’s another problem – again, to a large degree, self-inflicted.
Imagine you have a complex report involving a large number of financial transactions with a need to include calculations about current exchange rates. Unfortunately the rules about calculating the appropriate exchange rate for any transaction are complex, and you find you have a choice: either add 6 tables with outer joins and a couple of aggregate (max) subqueries to the base query, or call a PL/SQL function to calculate the exchange rate for each row. I’m going to create an extremely simplified model of this requirement:
create table t1 nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from    dual
        connect by level <= 1e4
)
select
        rownum          id,
        case mod(rownum,2) when 0 then 'USD' else 'GBP' end     v1,
        case mod(rownum,2) when 0 then 'GBP' else 'USD' end     v2
from
        generator       v1
where
        rownum <= 1e3
;

create table t2 (
        v1      varchar2(3),
        v2      varchar2(3),
        cvt     number(10,6),
        constraint t2_pk primary key(v1,v2)
)
organization index
;

insert into t2 values('USD','GBP',0);
insert into t2 values('GBP','USD',1);
commit;

create or replace function plsql_ordinary(
        from_cur in varchar2,
        to_cur   in varchar2
)
return number
is
        m_ret number;
begin
        select  /*+ ordinary trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;
        return m_ret;
end plsql_ordinary;
/

execute dbms_stats.gather_table_stats(user,'t2')
My t1 table represents the set of transactions but only has to worry about two exchange rates, the dollar/sterling and its inverse. My t2 table is an exchange rate table and I’ve loaded it with the two exchange rates I’m interested in. My function plsql_ordinary() takes two currency codes as inputs and returns the exchange rate.
Here’s the modelling query, with a little infrastructure to examine the workload. Note that this will only run on 12c because of the inline PL/SQL function I’ve used for slowing the query down.
set pagesize 20
set arraysize 6
set serveroutput off

alter system flush shared_pool;
alter session set statistics_level = all;

with function wait_row_n (
        i_secs   number,
        i_return number default -1
) return number
is
begin
        dbms_lock.sleep(i_secs);
        return i_return;
end wait_row_n;
select
        /*+ driver trace_this */
        wait_row_n(0.3,id),
        plsql_ordinary(v1,v2),
        (select /*+ scalar trace_this */ t2.cvt
         from   t2
         where  t2.v1 = t1.v1
         and    t2.v2 = t1.v2
        )       scalar_sub
from
        t1
where
        rownum <= 50
;

set pagesize 40

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

select  sql_id, executions, fetches, rows_processed, sql_text
from    v$sql
where   lower(sql_text) like 'select%trace_this%'
and     lower(sql_text) not like '%v$sql%'
;
The query includes a scalar subquery in the select list to get the same data as the PL/SQL function, and you’ll see the point of that in a while. Because of the arraysize and input parameters to wait_row_n() the query will produce output in batches of 6 rows roughly every two seconds for a total of about 18 seconds – which will give me plenty of time to play around in another session. Before I try to do any damage, though, let’s check the execution plan of the report and the statistics of the queries with “trace_this” in their text:
--------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |     50 |00:00:00.01 |      11 |
|*  1 |  INDEX UNIQUE SCAN | T2_PK |      2 |      1 |      2 |00:00:00.01 |       2 |
|*  2 |  COUNT STOPKEY     |       |      1 |        |     50 |00:00:00.01 |      11 |
|   3 |   TABLE ACCESS FULL| T1    |      1 |     50 |     50 |00:00:00.01 |      11 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."V1"=:B1 AND "T2"."V2"=:B2)
   2 - filter(ROWNUM<=50)

SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------
f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1
Looking at operation 1 in the plan you can see that the scalar subquery has started just twice (once for each distinct combination of currency codes).
Looking at the results from v$sql you can see that the query in the PL/SQL function was executed 50 times – once for each row. Functions like this can be a massive drain of resources (typically CPU, but also through latch contention on buffer cache and shared pool latches).
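The behavior difference is easy to mimic outside the database: memoize the lookup, and 50 row-by-row calls collapse to one execution per distinct input pair, the same shape as the scalar subquery statistics above. This Python analogy illustrates caching by distinct inputs only; it says nothing about how Oracle’s caches are implemented, and all the names here are mine:

```python
from functools import lru_cache

CALLS = {"count": 0}                      # counts "executions" of the lookup
RATES = {("USD", "GBP"): 0.0, ("GBP", "USD"): 1.0}  # stands in for table t2

@lru_cache(maxsize=None)
def cvt(from_cur, to_cur):
    """Memoized exchange-rate lookup: the body runs once per distinct
    (from_cur, to_cur) pair; repeats are served from the cache."""
    CALLS["count"] += 1
    return RATES[(from_cur, to_cur)]
```

Run fifty alternating lookups and the backing “query” fires just twice. The catch, of course, is the same one the rest of this post explores: a cache is only as trustworthy as the staleness rules behind it.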
Conveniently we realise that in our system the derived exchange rates change very slowly – so how about telling Oracle that the exchange rate function is a deterministic function (it’s nearly true), or better still, perhaps, experiment with the PL/SQL Function Result Cache.
(Note very carefully, however, that the Database Administrators’ Manual for 11.2 makes the following comment about using the deterministic keyword with PL/SQL functions)
Tells the optimizer that the function returns the same value whenever it is invoked with the same parameter values (if this is not true, then specifying DETERMINISTIC causes unpredictable results).
Look carefully at that “unpredictable” – it’s true … but you might not realise it until too late. Our PL/SQL function is NOT deterministic – after all a function that queries the database to produce a result may produce a different result every time it executes if someone keeps changing the underlying data – but we might wave our hands a bit and say that the table isn’t going to change while we’re running our report so it’s okay to pretend it’s deterministic, we might even make it a read-only table for the duration. Similar thoughts should cross our minds about declaring a PL/SQL function to the result cache – even though the manual doesn’t say anything quite so explicitly threatening about the result cache. But what the heck, let’s just do it and see what happens:
create or replace function plsql_result_cache(
        from_cur in varchar2,
        to_cur   in varchar2
)
return number
result_cache
is
        m_ret number;
begin
        select  /*+ result cache trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;
        return m_ret;
end plsql_result_cache;
/

create or replace function plsql_deterministic(
        from_cur in varchar2,
        to_cur   in varchar2
)
return number
deterministic
is
        m_ret number;
begin
        select  /*+ det trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;
        return m_ret;
end plsql_deterministic;
/

...
select
        /*+ driver trace_this */
        wait_row_n(0.3,id),
        plsql_ordinary(v1,v2),
        plsql_deterministic(v1,v2),
        plsql_result_cache(v1,v2),
        (select /*+ scalar trace_this */ t2.cvt from t2 where t2.v1 = t1.v1 and t2.v2 = t1.v2) scalar_sub
from
...
All three functions returned the same set of results as the scalar subquery – and here’s the output from v$sql showing the impact of declaring a deterministic function and a result cache function (note that “result cache” is not a hint in the first statement, it’s just a convenient label):
SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg          2          2              2
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1
The simple function ran 50 times, the deterministic function ran 18 times, and the result cache function ran twice. It required just two executions to get the two distinct results needed, after which the session used the result cache rather than calling the function again.
The deterministic function only remembers its results for the duration of a single database call – in this case the fetch – so on each fetch the session has to re-populate the session’s “deterministic cache”, which takes 2 calls for each fetch, a total of 9 fetches * 2 calls = 18 calls.
Clearly, if I can get away with it safely, the PL/SQL function result cache looks like a real winner, with the deterministic option coming a close second (given that real life ought to be using a significantly larger fetch arraysize). So what could possibly go wrong? Two things – first, the results … and if the results can go wrong there’s not much point in talking about the second thing at the moment.
My query runs for 18 seconds, I’m going to start another session while it runs and update one of the rows in the t2 table a few seconds after my long-running query starts. Here’s the SQL I’ve run, an extract from the query output, and the results from v$sql:
update t2 set cvt = 2 where v1 = 'USD';
commit;

WAIT_ROW_N(0.3,ID) PLSQL_ORDINARY(V1,V2) PLSQL_DETERMINISTIC(V1,V2) PLSQL_RESULT_CACHE(V1,V2) SCALAR_SUB
------------------ --------------------- -------------------------- ------------------------- ----------
                 1                     1                          1                         1          1
                 2                     0                          0                         0          0
                 3                     1                          1                         1          1
                 4                     0                          0                         0          0
                 5                     1                          1                         1          1
                 6                     0                          0                         0          0
                 7                     1                          1                         1          1
                 8                     0                          0                         0          0
                 9                     1                          1                         1          1
                10                     2                          0                         2          0
                11                     1                          1                         1          1
                12                     2                          0                         2          0
                13                     1                          1                         1          1
                14                     2                          2                         2          0
                15                     1                          1                         1          1
                16                     2                          2                         2          0

SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg          4          4              4
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1
The most important point is that we’ve got results that are not self-consistent – except for the scalar subquery results.
The SQL statements that are executed inside the PL/SQL functions do not run at the SCN of the driving query, each individual statement executes at its own starting SCN. This is an important point that is often overlooked when people write PL/SQL functions that are then called from SQL. The inline scalar subquery, on the other hand, always runs as at the start SCN of the driving query no matter how many times or how late in the lifetime of the driving query it runs.
If we examine the results we can see that the ordinary PL/SQL function and the result cache PL/SQL function changed their output the moment the commit took place (you’ll have to take my word on that, of course), but the deterministic function seemed to delay slightly. We can also see that the number of executions for the ordinary and deterministic functions didn’t change, but the result cache function doubled its executions.
Because of the way I’ve created my data and defined the function, the ordinary function executes its SQL once for every row, while the deterministic function executes its SQL twice per fetch (once for each pair of input values, though the first fetch from SQL*Plus is a special case) and then remembers the outputs for the rest of the fetch. This is why there is a delay before the deterministic function changes its output and why it doesn’t introduce any extra calls to the SQL – it was going to re-execute for both values on the fetch starting at id 13 whatever went on around it. The result cache function, on the other hand, gets an invalidation warning the moment the other session commits, so it re-executes its SQL as many times as needed to repopulate the parts of the cache that have been discarded – and that’s why we see the number of SQL calls doubling: the relevant part of the cache was identified by some sort of hash value for the statement with SQL_ID = ’49buxp3gba3cg’, so both results were discarded and reloaded even though only one of them actually had to change.
Critically every execution of the recursive statements runs at the then-current SCN – so when the underlying data changes our report sees those changes, the report is not self-consistent.
Fortunately there’s something we can do about this – if we want the whole report to operate at the same SCN all we need to do is freeze our session at a specific point in time with the command “set transaction read only;”. This has to be executed as the first statement of a transaction but if we can slip it in just before we start running our report we can ensure that all our SQL statements (including the recursive ones) run as at the same SCN until we issue a commit. When I repeated the data-changing experiment after setting the transaction read only the report ran to completion showing the correct results.
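In outline the sequence looks like this (a sketch – the select list is abbreviated to the first two columns of the report, and note that any commit, including one issued implicitly by a client tool, ends the read-only state):

```sql
set transaction read only;

select
        /*+ driver trace_this */
        wait_row_n(0.3,id),
        plsql_ordinary(v1,v2)
        -- ... rest of the report's select list
from
        t1
where
        rownum <= 50
;

commit;        -- releases the read-only state once the report is complete
```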
But there was a significant change in the output from v$sql:
SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg         44         44             44
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1
I did my update just after the first batch of rows had appeared on screen – notice how the result cache SQL has executed 44 times instead of 2 (or 4) times. When I set my transaction to read only it looks as if my session stopped using the result cache the moment the other session committed – and that’s a GOOD THING. If my session were able to continue using the result cache that would mean one of two things: either I would be seeing a result created by another user – which would be too new for me – or every other session would be seeing the results from my session – which would (typically) be out of date for them. The session seems to have protected itself from the risk of a result cached PL/SQL function producing inconsistent results – but the workload changed the moment another session committed a change to the data we were interested in.
At that point I stopped investigating “set transaction read only” until a couple of days later when I realised that there was one more thing I had to test – when I changed the data from another session I didn’t check to see what that session would see when it executed the cached function, so I modified the code run by the other session to do the following:
update t2 set cvt = 2 where v1 = 'USD';
commit;

execute dbms_lock.sleep(6)

execute dbms_output.put_line(plsql_result_cache('USD','GBP'))
The other session goes to sleep for a while (to let the reporting session get through a little more work) and then calls the function. I was very pleased to see that the session returned the correct result – the value 2 that it had just written to the table. But what I got from the reporting session wasn’t so good:
WAIT_ROW_N(0.3,ID) PLSQL_ORDINARY(V1,V2) PLSQL_DETERMINISTIC(V1,V2) PLSQL_RESULT_CACHE(V1,V2) SCALAR_SUB
------------------ --------------------- -------------------------- ------------------------- ----------
                 1                     1                          1                         1          1
                 2                     0                          0                         0          0
                 3                     1                          1                         1          1
                 4                     0                          0                         0          0
                 5                     1                          1                         1          1
                 6                     0                          0                         0          0
                 7                     1                          1                         1          1
                 8                     0                          0                         0          0
...
                24                     0                          0                         0          0
                25                     1                          1                         1          1
                26                     0                          0                         0          0
                27                     1                          1                         1          1
                28                     0                          0                         0          0
                29                     1                          1                         1          1
                30                     0                          0                         2          0
                31                     1                          1                         1          1
                32                     0                          0                         2          0

SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED SQL_TEXT
------------- ---------- ---------- -------------- --------------------------------
49buxp3gba3cg         32         32             32 SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1
49buxp3gba3cg          1          1              1 SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1
2sh7bm59dkwhw         18         18             18 SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1
f1bz07bk5rbth         50         50             50 SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1
I changed t2 just after the first batch of rows had appeared (just after id 6), then called the function a few seconds later – and as I called the function from the other session it queried the data (the second child to 49buxp3gba3cg, executed just once above) and reloaded the result cache. At that moment (visible at id 30) the first session found data in the result cache and stopped re-executing its queries. When my session went read only it protected other sessions from the past by NOT re-populating the result cache as it ran its queries – but if it found something in the result cache it used it (notice how it has recorded 32 executions of the query rather than 44 – it found something in the result cache on the last 12 calls of the function). The protection only goes one way.

Observations
Using PL/SQL functions in the select list to encapsulate complex calculations that query the database is not something you can do casually. You have no guarantee that you will end up with a self-consistent result unless you take action to deal with the threats introduced by concurrent activity – ideally all tables involved should be set to read-only (which became possible from 11.1 onwards [Ed: see comment below], though you can set a whole tablespace read-only in earlier versions: neither strategy is commonly viable). If you decide that you can work around those threats you still have the performance threat implied by the need to do some complex work for each driving row of your result set. For a report the simple solution to consistency is to “freeze” the session SCN by setting your session (transaction) into read only mode.
Once you’ve dealt with the consistency problem you can start to address the performance problem by claiming that you were using deterministic functions. You might be tempted to use the PL/SQL Result Cache to give you an even larger performance boost, but if you do you really have to make the table (or tablespace) read-only to be protected from read-consistency problems. The deterministic strategy may not be as dramatic in its effects as the result cache strategy but, given a reasonably sized fetch array, the performance benefit you get may be all you really need.
Whatever else you do, there’s an important underlying threat to remember. The moment you create a PL/SQL function that uses the result cache or deterministic option you have to ensure that nobody uses that function without ensuring that their code has handled the consistency threat properly. It’s too easy to forget, with the passing of time, that certain procedures have to be adopted when particular coding strategies are used.

Left as Exercise
I was planning to write a further article going into some detail about using dbms_flashback.enable_at_time(systimestamp) instead of “set transaction read only” – a mechanism that might be used to achieve the same read-consistency requirement though, perhaps, used less frequently than the older, less flexible option. I was also planning to include notes in the same article about the effect of “alter session set isolation_level = serializable” which some systems probably use to get read-consistency across multiple statements while writing results back to the database.
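For reference, the two alternatives would be invoked along these lines (a sketch only – dbms_flashback requires the execute privilege on the package, and both calls must be issued before the report starts):

```sql
-- Alternative 1: fix the session's read-point at a given moment
execute dbms_flashback.enable_at_time(systimestamp)

-- ... run the report ...

execute dbms_flashback.disable

-- Alternative 2: serializable isolation for the transaction
alter session set isolation_level = serializable;

-- ... run the report ...

commit;
```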
Both strategies run into the same issue as “set transaction read only”, with the added effect that your session (the one that has tried to “fix” its SCN) will repopulate the cache, so not only could you see newer results from other sessions in the cache; other sessions could see older results because you put them into the cache.
I’m not going to write up these demonstrations (which require fairly simple modifications to the code supplied above) as all the phenomena have been recorded as bugs on MoS (along with the GTT problem from my previous post, and a documentation bug for the PL/SQL manual to clarify the various threats):
- Bug 21905592 : PL/SQL RESULT CACHE MUST BE BYPASSSED WHEN SESSION SCN NOT CURRENT
- Bug 21905695 : BAD X-SESSION RESULTS W/ RESULT-CACHED PL/SQL FN AND “SET TXN READ ONLY”
- Bug 21907155 : PL/SQL RC FN SHOWS ONE SESSION THE CONTENTS OF ANOTHER SESSION’S GTT
- Bug 21885173 : THE LNPLS SECTION “RESTRICTIONS ON RESULT-CACHED FUNCTIONS” NEEDS RE-WRITE
I haven’t mentioned it here, but another strategy for reducing the number of PL/SQL calls is simply to wrap the function call inside a scalar subquery, along the lines of:
select (select plsql_ordinary(v1, v2) from dual), ...
Provided you don’t have many combinations of (v1,v2) to handle, and with a little luck with Oracle’s internal hashing function, you could find that scalar subquery caching reduces your execution count from once per row to once per combination. Note that the function is the “ordinary” function, not the “fake deterministic” one; internally Oracle uses the same hashing strategy for remembering the results, but the duration of the scalar subquery cache is the statement rather than the fetch.
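Using the tables and functions from the earlier examples, the driving query could be rewritten along these lines (a sketch):

```sql
select
        wait_row_n(0.3,id),
        (select plsql_ordinary(t1.v1, t1.v2) from dual)  plsql_wrapped,
        (select /*+ scalar trace_this */ t2.cvt
         from   t2
         where  t2.v1 = t1.v1
         and    t2.v2 = t1.v2
        )                                                scalar_sub
from
        t1
where
        rownum <= 50
;
```

With only two distinct (v1,v2) combinations in the data, the wrapped call could drop from 50 executions to 2, subject to collisions in the internal hash table that implements the cache.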
This is part of a three-post series on Kudu, a new data storage system from Cloudera.
- Part 1 (this post) is an overview of Kudu technology.
- Part 2 is a lengthy dive into how Kudu writes and reads data.
- Part 3 is a brief speculation as to Kudu’s eventual market significance.
Cloudera is introducing a new open source project, Kudu,* which from Cloudera’s standpoint is meant to eventually become the single best underpinning for analytics on the Hadoop stack. I’ve spent multiple hours discussing Kudu with Cloudera, mainly with Todd Lipcon. Any errors are of course entirely mine.
*Like the impala, the kudu is a kind of antelope. I knew that, because I enjoy word games. What I didn’t know — and which is germane to the naming choice — is that the kudu has stripes.
- Kudu is an alternative to HDFS (Hadoop Distributed File System), or to HBase.
- Kudu is meant to be the underpinning for Impala, Spark and other analytic frameworks or engines.
- Kudu is not meant for OLTP (OnLine Transaction Processing), at least in any foreseeable release. For example:
- Kudu doesn’t support multi-row transactions.
- There are no active efforts to front-end Kudu with an engine that is fast at single-row queries.
- Kudu is rather columnar, except for transitory in-memory stores.
- Kudu’s core design points are that it should:
- Accept data very quickly.
- Immediately make that data available for analytics.
- More specifically, Kudu is meant to accept, along with slower forms of input:
- Lots of fast random writes, e.g. of web interactions.
- Streams, viewed as a succession of inserts.
- Updates and inserts alike.
- The core “real-time” use cases for which Kudu is designed are, unsurprisingly:
- Low-latency business intelligence.
- Predictive model scoring.
- Kudu is designed to work fine with spinning disk, and indeed has been tested to date mainly on disk-only nodes. Even so, Kudu’s architecture is optimized for the assumption that there will be at least some flash on the node.
- Kudu is designed primarily to support relational/SQL processing. However, Kudu also has a nested-data roadmap, which of course starts with supporting the analogous capabilities in Impala.
Also, it might help clarify Kudu’s status and positioning if I add:
- Kudu is in its early days — heading out to open source and beta now, with maturity still quite a way off. Many obviously important features haven’t been added yet.
- Kudu is expected to be run with a replication factor (tunable, usually =3). Replication is via the Raft protocol.
- Kudu and HDFS can run on the same nodes. If they do, they are almost entirely separate from each other, with the main exception being some primitive workload management to help them share resources.
- Permanent advantages of older alternatives over Kudu are expected to include:
- Legacy. Older, tuned systems may work better over some HDFS formats than over Kudu.
- Pure batch updates. Preparing data for immediate access has overhead.
- Ultra-high update volumes. Kudu doesn’t have a roadmap to completely catch up in write speeds with NoSQL or in-memory SQL DBMS.
Kudu’s data organization story starts:
- Storage is right on the server (this is of course also the usual case for HDFS).
- On any one server, Kudu data is broken up into a number of “tablets”, typically 10-100 tablets per node.
- Inserts arrive into something called a MemRowSet and are soon flushed to something called a DiskRowSet. Much as in Vertica:
- MemRowSets are managed by an in-memory row store.
- DiskRowSets are managed by a persistent column store.*
- In essence, queries are internally federated between the in-memory and persistent stores.
- Each DiskRowSet contains a separate file for each column in the table.
- DiskRowSets are tunable in size. 32 MB currently seems like the optimal figure.
- Page size default is 256K, but can be dropped as low as 4K.
- DiskRowSets feature columnar compression, with a variety of standard techniques.
- All compression choices are specific to a particular DiskRowSet.
- So, in the case of dictionary/token compression, is the dictionary.
- Thus, data is decompressed before being operated on by a query processor.
- Also, selected columns or an entire DiskRowSet can be block-compressed.
- Tables and DiskRowSets do not expose any kind of RowID. Rather, tables have primary keys in the usual RDBMS way.
- Kudu can partition data in the three usual ways: randomly, by range or by hash.
- Kudu does not (yet) have a slick and well-tested way to broadcast-replicate a small table across all nodes.
*I presume there are a few ways in which Kudu’s efficiency or overhead seem more row-store-like than columnar. Still, Kudu seems to meet the basic requirements to be called a columnar system.
You only need a bit of CSS to do the trick:
You can add the above CSS to your login page - Inline CSS in page attributes or you can add it to your page template or custom CSS in Universal Theme.
The result is this:
To differentiate the login page you can do a lot more; here are some things we show during our APEX UI training:
- add transparency to the login box
- add a background image to your entire page (blurred or not)
High Availability with Java EE Containers, JDBC, and Java Connection Pools [BOF7732]
Monday, Oct 26, 8:00 p.m. | Parc 55—Mission
Java Connection Pool Performance and Scalability with Wait-Free Programming [CON2158]
Wednesday, Oct 28, 4:30 p.m. | Hilton—Continental Ballroom 1/2/3
OOW SF 2015 - Session recommendations
Java Virtual Machine Cookbook [UGF2720]
Sunday, Oct 25, 9:00 a.m. | Moscone West—3011
Monday, Oct 26, 5:15 p.m. | Moscone South—308
Next-Generation Database: Java Connection Pool for Multitenant and Sharded Databases [CON8460]
Monday, Oct 26, 2:45 p.m. | Moscone South—308
Master Data Management Using Oracle Table Access for Hadoop and Spark [CON8459]
Wednesday, Oct 28, 3:00 p.m. | Moscone South—308
Market Basket Analysis Using Oracle In-Database Container for Hadoop [CON8462]
Thursday, Oct 29, 12:00 p.m. | Moscone South—307
Next-Gen Database Enhancements for Java Application Performance and Scalability [CON10310]
Thursday, Oct 29, 2:30 p.m. | Moscone South—307
Dialog with the Oracle Database Java Developers and Architects [MTE9501]
Tuesday, Oct 27, 7:15 p.m. | Moscone South—305
This enables tracing for sessions of a specific Service_Name and *optionally* Module Name and Action Name. This is useful where you define applications by Service Names and, optionally, use DBMS_APPLICATION_INFO to set Module and Action in a given session.
Thus, if in an HR session, I do :
Enter user-name: hr/hr@orcl
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> exec dbms_application_info.set_client_info('HR App Client');
PL/SQL procedure successfully completed.
SQL> exec dbms_application_info.set_module('Employee Module','Updating');
PL/SQL procedure successfully completed.
And, in another session (with the appropriate privileges), I do :
SQL> exec DBMS_MONITOR.SERV_MOD_ACT_TRACE_ENABLE(-
> module_name=>'Employee Module',-
PL/SQL procedure successfully completed.
the actions in the HR session(s) of that module and action are traced. (The additional parameters WAITS and BINDS can also be set to TRUE to enable tracing of Waits and Binds; WAITS is TRUE by default.)
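Spelled out in full, with the optional parameters included, the call would look something like this (a sketch – the service name 'orcl' is assumed from the connect string used above):

```sql
begin
        dbms_monitor.serv_mod_act_trace_enable(
                service_name => 'orcl',             -- assumed service name
                module_name  => 'Employee Module',
                action_name  => 'Updating',
                waits        => true,               -- the default
                binds        => true                -- default is false
        );
end;
/
```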
Note : If there are multiple sessions with the same combination of service_name, module_name, action_name, all the sessions are traced !
Tracing is disabled when the session itself uses DBMS_APPLICATION_INFO to change its Module / Action settings.
Tracing is also disabled when the session that initiated the tracing executes :
SQL> exec DBMS_MONITOR.SERV_MOD_ACT_TRACE_DISABLE(-
> module_name=>'Employee Module',-
PL/SQL procedure successfully completed.
Thus, this procedure allows tracing at the granularity of Service, Module and Action. Unfortunately, many custom applications do NOT use DBMS_APPLICATION_INFO to set Module and Action.
This is useful for tracing a single session and where Module / Action information are not populated by the client.
The call is simple :
SQL> exec DBMS_MONITOR.SESSION_TRACE_ENABLE(-
PL/SQL procedure successfully completed.
The disabling call is :
SQL> exec DBMS_MONITOR.SESSION_TRACE_DISABLE(-
PL/SQL procedure successfully completed.
Thus, this can be issued individually for each session.
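A complete call needs the target session's SID and SERIAL#, typically looked up in V$SESSION first – a sketch (the SID and SERIAL# values here are illustrative):

```sql
-- identify the target session
select sid, serial# from v$session where username = 'HR';

-- enable tracing for that session (waits are traced by default)
execute dbms_monitor.session_trace_enable(session_id => 131, serial_num => 65, waits => true, binds => false)

-- ... later, disable it
execute dbms_monitor.session_trace_disable(session_id => 131, serial_num => 65)
```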
Taking a Look at Oracle Big Data Preparation Cloud Service – Spark-Based Data Transformation in the Cloud
One of the sessions I’m delivering at the upcoming Oracle Openworld 2015 in San Francisco is entitled “Oracle Business Intelligence Cloud Service—Moving Your Complete BI Platform to the Cloud [UGF4906]”, and looks at how you can now migrate your entire OBIEE11g platform into Oracle Public Cloud including data warehouse and data integration routines. Using Oracle BI Cloud Services’ new on-premise RPD upload feature you can upload an existing RPD into BICS and run it from there, with the accompanying data warehouse database moving into Oracle’s Database Cloud Service (and with the restriction that you can’t then edit the repository within BICS, you need to do that on-premise and upload again). For ETL and data integration you can carry on using ODI12c which now has the ability to load data into Oracle Storage Cloud (for file sources) and BICS (via a REST API) as well as the full Oracle DBaaS, but another Oracle option for doing purely cloud-based data processing and enrichment has recently become available – Oracle Big Data Preparation Cloud Service. So what is it, how does it work and how is it different to ODI12c?
Oracle Big Data Preparation Cloud Service (“BDP”) is a thin-client application within Oracle Cloud for ingesting, preparing and enriching datasets that don’t have a predefined schema and may well need certain fields obfuscated or cleansed. Being integrated with Oracle Storage Cloud and other infrastructure and platform services within Oracle cloud it’s obviously aimed mainly at data transformation tasks within the Oracle Cloud environment, but you can upload and download datasets from your browser for use with on-premise applications. Unlike the more general-purpose Oracle Data Integrator it’s aimed instead at a particular use-case – non-technical information analysts who need to get data transformed, wrangled and enriched before they can make use of it in an environment like Hadoop. In fact the product name is a bit misleading – it runs on a big data platform within Oracle Cloud and like Oracle Big Data Discovery uses Apache Spark for its data processing – but it could potentially be useful for a business analyst to prepare data for loading into Oracle BI Cloud Service, and I’ll cover this angle when I talk about data loading options in my Oracle Openworld session.
Within a logical architecture for a typical big data DW and BI system, BDP sits alongside ODI within the Data Factory and provides self-service, agile transformation capabilities to more business-orientated users.
Oracle Big Data Preparation Cloud Service shares quite a bit of functionality and underlying technology with Oracle Big Data Discovery – both run on Hadoop, they both use Apache Spark for data processing and transformation, and both offer data transformation and “wrangling” features aimed at non-technical users. Oracle are positioning Big Data Preparation Service as something you’d use in the execution layer of the Oracle Information Management Reference Architecture whereas Big Data Discovery is associated more with the discovery layer – I’d mostly agree but I can see a role for BDD even within the execution layer, as a front-end to the data reservoir that typically now runs alongside relationally-stored data warehouses.
Looking back at the slides from one of the recent Strata conferences, for example, sees Oracle positioning BDP as the “operational data preparation” tool for structured and unstructured data – with no defined schema – coming into your information platform, with the enriched output then being used by BI tools, enterprise reporting and data discovery tools.
Apart from the scalability benefits of running BDP on Apache Spark, the other interesting feature in BDP is how it uses Spark’s machine learning capabilities to try to automate as much of the data preparation process as possible, for example detecting credit card numbers in data fields and recommending you obfuscate that column. Similar to BICS and how Oracle have tried to simplify the process of creating reports and dashboards for a small team, BDP runs in the cloud and tries to automate and simplify as much of the data preparation and enrichment process as possible, with ODI12c still available for ETL developers to develop more complex transformations.
The development lifecycle for BDP (from the Oracle Big Data Preparation Cloud Service e-book on Oracle’s website) uses a cycle of ingesting, cleaning, enriching and then publishing data using scripts authored using the tool and run on the Apache Spark platform. The diagram below shows the BDP development lifecycle from Oracle’s Big Data Preparation Cloud Service Handbook, and shows how ingestion, enrichment, publishing and governance go in a cycle with the common foundation of the transformation scripts that you build using BDP’s web interface.
So let’s walk through an example data preparation exercise using a file of data stored initially in Oracle Storage Cloud Service. After logging into BDP via Oracle Cloud you’re first presented with the Catalog view, listing out all your previous transformations and showing you when they were last used to process some data.
To create a transformation you first give it a name, then select the data source and then the file you’re interested in. In my environment I’ve got Oracle Storage Cloud and HDFS available as my main data sources, or I could upload a file from my desktop and start from there.
BDP then ingests the file and then uses its machine learning features to process and classify data in each column, recommending column names such as “gender”, “city” and cc_number based on (presumably) some sort of classification model. In the screenshot below you can see a set of these recommendations on the left-hand side of the screen, with the columns themselves listed centre and a summary of the file profiling on the right.
Taking a closer look at the profile results panel you can see two of the columns have alerts raised, in red. Clicking on the alert shows that the two columns have credit card data stored in clear text, with the recommendation being to obfuscate or otherwise secure these fields. Clicking on a field then shows the various transformation options, with the obvious choice here being to automatically obfuscate the data in those fields.
Once you’ve worked through all the recommendations and added any transformations you choose to add yourself, the final step is to publish your transformation to one of the available targets. In the example below we’ve got Oracle Storage Cloud and HDFS again as potential targets; I’d imagine Oracle will add a connector to BICS soon, for example, so that you can use BDP as a data prep tool for file data that will then be added to your dataset in BICS.
So … it’ll be interesting to see where this one goes. It’s interesting that Oracle have split out data preparation and data discovery into two tools whilst others are saying theirs can do both, and you’ll still need ODI for the more complex integration jobs. But I like the innovative use of machine learning to do away with much of the manual work required for classification of incoming data fields, and running the whole thing on Spark certainly gives it the potential of scale. A couple of years ago I was worried Oracle didn’t really have a strategy for data integration and ETL in the cloud, but we’re starting to see something happen now.
There’s a big push from the Oracle field at the moment to move customers into the cloud, and I can see BDP getting bundled in with Big Data Cloud Service and BICS as the accompanying cloud data preparation tool. The danger then of course is that Big Data Discovery starts to look less useful, especially with Visual Analyzer already available within BICS and coming soon on-premise with OBIEE12c. My guess is that what we’re seeing now with these initial releases of BDP and BDD is just the start, with BDP adding more automatic enrichment “smarts” and starting to cover integration use-cases too, whilst BDD will put more focus on data visualization and analytics on the data reservoir.
- EMEA Partners: Oracle Cloud Platform: Integration Workshop for Partners (Oracle Partner Hub: ISV Migration Center Team)
via Oracle Partner Hub: ISV Migration Center Team https://blogs.oracle.com/imc/
"Hi, I'm Chucky. Wanna play?" Fans of the horror film genre will surely recall these lines - innocent-sounding on their own, yet bone-chilling in the context of the scene in the movie - that Chucky, the possessed demonic doll, utters in the cult classic, "Child's Play". Called a "cheerfully energetic horror film" by Roger Ebert, the movie was released to more than a thousand screens on its debut in November 1988. It went on to spawn at least five sequels and developed a cult following of sorts over the next two decades.
Chucky the doll
(image credit: http://www.shocktillyoudrop.com/)
In "Child's Play", Chucky the killer doll stays quiet around the adults - at least initially - but carries on secret conversations with Andy, and is persuasive enough to convince him to skip school and travel to downtown Chicago. Chucky understands how children think, and can evidently manipulate - or convince, depending on how you frame it - Andy into doing little favours for him. A doll that could speak, hear, see, understand, and have a conversation with a human in the eighties was the stuff out of science fiction, or in the case of "Child's Play" - out of a horror movie.
Edison Talking Doll.
Image credit: www.davescooltoys.com
A realistic doll that could talk and converse was for long the "holy grail" of dollmakers. It will come as a huge surprise to many - at least it did to me - that within a few years of the invention of the phonograph by Thomas Edison in 1877, a doll with a pre-recorded voice had been developed and marketed in 1890! It didn't have a very happy debut, however. After "several years of experimentation and development", the Edison Talking Doll, when it launched in 1890, "was a dismal failure that was only marketed for a few short weeks." Talking dolls seem to have made their entry into mainstream retail only with the advent of "Chatty Cathy" - released by Mattel in the 1960s - which worked on a simple pull-string mechanism. The quest to make these dolls more interactive and more "intelligent" continued; "Amazing Amanda" was another milestone in this development; it incorporated "voice-recognition and memory chips, sensory technology and facial animatronics". It was touted as "an evolutionary leap from earlier talking dolls like Chatty Cathy of the 1960's" by some analysts. In some ways that assessment was not off-the-mark. After all, "Amazing Amanda" utilized RFID technology - among the hottest technology buzzwords a decade back. "Radio-frequency tags in Amanda's accessories - including toy food, potty and clothing - wirelessly inform the doll of what it is interacting with." This is what enabled "Amazing Amanda" to differentiate between "food" (pizza, or "cookies, pancakes and spaghetti") and "juice". However, even with all these developments and capabilities, the universe of what these toys could do was severely limited. At most they could recognize the voice of the child as its "mommy".
Amazing Amanda doll.
Image credit: amazing-amanda.fuzzup.net
They were constrained by both the high price of storage (Flash storage is much sturdier than spinning hard drives, but an order of magnitude costlier; this limits the amount of storage possible) and limited computational capability (putting a high-end microprocessor inside every doll would make them prohibitively expensive). The flip side was that what the toys spoke at home to the children stayed at home. These toys had a limited set of pre-programmed sentences and emotions they could convey, and if you wanted something different, you went out and bought a new toy, or in some cases, a different cartridge.
That's where things stood. Till now.
Screenshot of ToyFair website
Between February 14-17, 2015, the Jacob K. Javits Convention Center in New York saw "the Western Hemisphere’s largest and most important toy show" - the 2015 Toy Fair. This was a trade show, which meant that "Toy Fair is not open to the public. NO ONE under the age of 18, including infants, will be admitted." It featured a "record-breaking 422,000+ net square feet of exhibit space" and hundreds of thousands of toys. Yet no children were allowed. Be that as it may, there was no dearth of, let's say, "innovative" toys. Apart from an "ultra creepy mechanical doll, complete with dead eyes", a fake fish pet taken to a "whole new level of weird", and a "Doo Doo Head" doll shaped like you-guessed-it, of particular interest was a "Hello Barbie" doll, launched by the Fortune 500 behemoth, Mattel. This doll had several USPs to its credit. It featured voice-recognition software, voice-recording capabilities, the ability to upload recorded conversations over Wi-Fi - as a representative at the exhibition took pains to emphasize, repeatedly - to a server (presumably Mattel's or ToyTalk's) in the cloud, and give "chatty responses." This voice data would be processed and analyzed by the company's servers. The doll would learn the child's interests and be able to carry on a conversation on those topics - made possible by the fact that the entire computational and learning capability of a server farm in the cloud could be accessed by every such toy. That the Barbie franchise is vital to Mattel cannot be overstated. The Barbie brand netted Mattel $1.2 billion in FY 2013, but this represented a six per cent year-on-year decline. Mattel attributed this decline in Barbie sales in part to "product innovation not being strong enough to drive growth." The message was clear: something very "innovative" was needed to jump-start sales. To make that technological leap forward, Mattel decided to team up with ToyTalk.
ToyTalk is a San Francisco-based start-up, and its platform powered the voice-recognition software used by "Hello Barbie". ToyTalk is headed by "CEO Oren Jacob, Pixar's former CTO, who worked at the groundbreaking animation company for 20 years", and it claimed "$31M in funding from Greylock Partners, Charles River Ventures, Khosla Ventures, True Ventures and First Round Capital as well as a number of angel investors."
Cover of Misery, by Stephen King.
Published by Viking Press.
The voice-recognition software would allow Mattel and ToyTalk to learn the preferences of the child and, over time, refine the responses that Barbie would communicate back. As the Mattel representative put it, "She's going to get to know all my likes and all my dislikes..." - a statement that at one level reminds one of Annie Wilkes when she says, "I'm your number one fan." We certainly don't want to be in Paul Sheldon's shoes.
Hello Barbie's learning would start happening from the time the doll was switched on and connected to a Wi-Fi network. ToyTalk CEO Oren Jacob said, "we'll see week one what kids want to talk about or not". These recordings, once uploaded to the company's servers, would be used by "ToyTalk's speech recognition platform, currently powering the company's own interactive iPad apps including The Winston Show, SpeakaLegend, and SpeakaZoo", which then "allows writers to create branching dialogue based on what children will potentially actually say, and collects kids' replies in the cloud for the writers to study and use in an evolving environment of topics and responses." Some unknown set of people, sitting in some unknown location, would potentially get to hear and listen to entire conversations of a child before his parents would.
Screenshot of ToyTalk's Privacy page
- "When users interact with ToyTalk, we may capture photographs or audio or video recordings (the "Recordings") of such interactions, depending upon the particular application being used.
- We may use, transcribe and store such Recordings to provide and maintain the Service, to develop, test or improve speech recognition technology and artificial intelligence algorithms, and for other research and development or internal purposes."
Further reading revealed that what your child spoke to the doll in the confines of his home in, say, suburban Troy, Michigan, could end up travelling halfway across the world, to be stored on a server in a foreign country - "We may store and process personal information in the United States and other countries."
What information would ToyTalk share with "Third Parties" was equally disturbing, both for the amount of information that could potentially be shared as well as for the vagueness in defining who these third-parties could possibly be - "Personal information"; "in an aggregated or anonymized form that does not directly identify you or others;"; "in connection with, or during negotiations of, any merger, sale of company assets, financing or acquisition, or in any other situation where personal information may be disclosed or transferred as one of the business assets of ToyTalk"; "We may also share feature extracted data and transcripts that are created from such Recordings, but from which any personal information has been removed, with Service Providers or other third parties for their use in developing, testing and improving speech recognition technology and artificial intelligence algorithms and for research and development or other purposes." A child's speech, words, conversation, voice - as recorded by the doll - was the "business asset" of the company.
And lest the reader have any concerns about safety and security of the data on the company's servers, the following disclaimer put paid to any reassurances on that front also: "no security measures are perfect or impenetrable and no method of data transmission that can be guaranteed against any interception or other type of misuse." If the sound of hands being washed-off could be put down on paper, that sentence above is what it could conceivably look like.
Apart from the firestorm of criticism described above, the advocacy group "Campaign for a Commercial Free Childhood" started a campaign to petition Mattel CEO Christopher Sinclair to stop "Hello Barbie" immediately.
The brouhaha over "Hello Barbie" is however only symptomatic of several larger issues that have emerged and intersect each other in varying degrees, raising important questions about technology, including the cloud, big data, the Internet of Things, data mining, analytics; privacy in an increasingly digital world; advertising and the ethics of marketing to children; law and how it is able to or unable to cope with an increasingly digitized society; and the impact on children and teens - sociological as well as psychological. Technology and Moore's Law have combined with the convenience of broadband to make possible what would have been in the realm of science fiction even two decades ago.
The Internet, while opening up untold avenues of betterment for society at large, has however also revealed itself as not without a dark side - a dilemma universally common to almost every transformative change in society. From the possibly alienating effects of excessive addiction to the Internet to physiological changes that the very nature of the hyperlinked web engenders in humans - these are issues that are only recently beginning to attract the attention of academics and researchers. The basic and most fundamental notions of what people commonly understood as "privacy" are not only being challenged in today's digital world, but in most cases without even a modicum of understanding on the part of the affected party - you. In the nebulous space that hopefully still exists between those who believe in technology as the only solution capable of delivering a digital nirvana to all and every imaginable problem in society on the one hand and the Luddites who see every bit of technology as a rabid byte (that's a bad pun) against humanity lies a saner middle ground that seeks to understand and adapt technology for the betterment of humanity, society, and the world at large.
So what happened to Chucky? Well, as we know, "Child's Play" spawned a successful and profitable franchise of sequels and assorted merchandise. Which direction "Hello Barbie" takes is of less interest to me than the broader questions I raised in the previous paragraph.
"Child's Play" review, http://www.rogerebert.com/reviews/childs-play-1988
"A Brief History of Talking Dolls--From Bebe Phonographe to Amazing Amanda", http://collectdolls.about.com/od/dollsbymaterial/a/talkingdolls.htm
"Edison Talking Doll", http://www.edisontinfoil.com/doll.htm
Disclaimer: Views expressed are personal.
© 2015, Abhinav Agarwal. All rights reserved.
This doesn't work the way we expect. File a bug.
Over the years, my own experience tells me that two-thirds of the bugs filed aren't really bugs. What we really have is a user who fails to understand how the software works. And, yes, we can always respond with RTM (or worse), but the stream of bugs that aren't bugs continues to be filed. Stop for a second and imagine all the software development productivity lost to addressing bugs that aren't bugs. And we wonder why it takes so long to introduce new releases and features?
This doesn't work the way we expect. We'll have to customize the software.
We're not talking about extensions or bolt-ons. We're talking about changing the code delivered out of the box. Seems like around 75 percent of Oracle enterprise applications customers customize delivered code in one way or another. SaaS will cut this number way down, but it's still widely prevalent throughout the installed customer base of Oracle enterprise applications.
Why is customization bad? First, it means a customer must have a team of developers to build and maintain the customization through its lifecycle. It's that maintenance part that gets really costly... each new update, each new release, each new expansion requires testing and likely some change to the customized code. And here comes the incredibly crazy part of customization: I would confidently bet my lungs against yours that over two-thirds of those customizations are unnecessary to accomplish the purposes of the customer. The software out of the box already has the functionality to achieve the business goal in mind... but it's likely a new way of doing things, and many folks don't want to change. Even when the change might be better. So we customize the code.
What to do?
As a very young man, I spent some time as a Boy Scout. I was a really lousy Boy Scout. There was nothing I liked about the entire thing. Later on, after I grew up, I developed a great appreciation for Scouting as a Scout Master. Nevertheless, I was a lousy Scout and resented my folks for putting me through it.
My Scout Master was retired military. Lots of gruffness at a time when I just wasn't ready for that. Only one thing he ever taught us stuck with me: "when you realize you're lost, take a breath, know what ya got, and figure out how to use what you've got to get yourself unlost."
Years later, I got lost in the woods. The advice from that Scout Master saved my life. And it's been a great principle for me ever since, in many situations: "know what ya got."
Most enterprise customers today don't know what they've got. That knowledge gap leads to filing bugs that aren't really bugs and building customizations when existing functionality gets the job done. And telling customers to RTM adds almost no value (heck, even most of the people building the software dread reading those things). If those of us in the industry want our customers to succeed with our products, we have to help by showing and telling. Which also means we have to earn the trust of our customers, because showing and telling achieves nothing if the audience fails to engage.
So you want to reduce the bogus bug filings? Cut back on customizations that slow everyone down? Work with your customers. Customer success starts there.