Skip navigation.

Feed aggregator

e-Literate TV Preview: Essex County College and changing role of faculty

Michael Feldstein - Wed, 2015-02-25 17:58

By Phil HillMore Posts (289)

As we get closer to the release of the new e-Literate TV series on personalized learning, Michael and I will be posting previews highlighting some of the more interesting segments from the series. When we first talked about the series with its sponsors, the Bill & Melinda Gates Foundation, they agreed to give us the editorial independence to report what we find, whether it is good, bad, or indifferent.

In this video preview (about 4:18 in duration), we hear from two faculty members who have first-hand experience in using a personalized learning approach as well as a traditional approach to remedial math. We also hear from students on what they are learning about learning. In our case studies so far, the real faculty issue is not that software is being designed to replace faculty, but rather that successful implementation of personalized learning necessarily changes the role of faculty. One of our goals with e-Literate TV is to allow faculty, staff and students to describe direct experiences in their own words. Take a look.

Click here to view the embedded video.

Stay tuned for the full episodes to be released on the In The Telling platform[1]. You can follow me (@PhilOnEdTech), Michael (@mfeldstein67), or e-Literate TV (@eLiterateTV) to stay up to date. You can also follow the e-Literate TV YouTube channel. We will also announce the release here on e-Literate.

  1. ITT is our partner in developing this series, providing video production as well as the platform.

The post e-Literate TV Preview: Essex County College and changing role of faculty appeared first on e-Literate.

Understanding vs Resolution – Statistics History Cleanup

Dominic Brooks - Wed, 2015-02-25 13:26

Today I helped resolve a problem quickly but to the detriment of my understanding of exactly what was going on and why.

And as part of the quicker resolution, I had to destroy the evidence which would have helped understand better.

So… now need to go back and figure it out if I have time to model it properly, etc.

Here’s what little I know so far.

What happened was that there were a number of application sessions experiencing slowness and contention when executing a recursive stats history cleanup statement.

Verified via ASH that this recursive delete was somehow being called by app ETL code (TOP_LEVEL_SQL_ID, PLSQL_ENTRY_OBJECT_ID & USER_ID columns), four of them each running slow statement and also because of that nature of the statement below and being blocked by mode 6 TX locks from the leading execution.

Version is

Statement was sql id 9v9n97qj8z1dg:

delete /*+ dynamic_sampling(4) */ 
from sys.wri$_optstat_histhead_history 
where savtime < :1 
and rownum <= NVL(:2, rownum);

First up, resolution was quick and easy according to, by a quirk of coincidence, my last post:

We just got rid of all the statistics history using the magic PURGE_ALL truncate flag.
The history is of limited usefulness day-to-day anyway.

The slowness was holding up ETL jobs which were just trying to calculate stats for their own partition.

I was brought into the situation towards the end of the situation but here’s an illustration of the slowness:

SQL Text
delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_histhead_history where savtime < :1 and rownum <= NVL(:2, rownum)

Global Information
 Status              :  EXECUTING               
 Instance ID         :  1                       
 SQL ID              :  9v9n97qj8z1dg           
 SQL Execution ID    :  16777249                
 Execution Started   :  02/24/2015 19:11:25     
 First Refresh Time  :  02/24/2015 19:13:06     
 Last Refresh Time   :  02/25/2015 09:10:35     
 Duration            :  50351s                  
 Module/Action       :  JDBC Thin Client/-      
 Program             :  JDBC Thin Client        

| Name | Position |  Type  |                                           Value                                           |
| :2   |        2 | NUMBER | 10000                                                                                     |

Global Stats
| Elapsed |   Cpu   |    IO    | Application | Concurrency |  Other   | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Waits(s)   | Waits(s) |  Gets  | Reqs | Bytes |
|   50359 |   35199 |     0.16 |       14669 |         254 |      237 |     1G |   63 | 504KB |

SQL Plan Monitoring Details (Plan Hash Value=2348801730)
| Id   |            Operation             |             Name              |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity |            Activity Detail            |
|      |                                  |                               | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |              (# samples)              |
|    0 | DELETE STATEMENT                 |                               |         |       |     50262 |   +101 |  1423 |        1 |      |       |          |                                       |
|    1 |   DELETE                         | WRI$_OPTSTAT_HISTHEAD_HISTORY |         |       |     50361 |     +2 |  1423 |        1 |      |       |    29.74 | enq: TX - row lock contention (14630) |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | buffer busy waits (231)               |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (82)                              |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | resmgr:cpu quantum (5)                |
|    2 |    COUNT                         |                               |         |       |     50262 |   +101 |  1423 |      14M |      |       |     0.00 | Cpu (1)                               |
|    3 |     FILTER                       |                               |         |       |     50266 |    +97 |  1423 |      14M |      |       |    30.14 | Cpu (15146)                           |
|    4 |      TABLE ACCESS BY INDEX ROWID | WRI$_OPTSTAT_HISTHEAD_HISTORY |      6M | 23218 |     50253 |   +100 |  1423 |       4G |   22 | 176KB |    23.17 | buffer busy waits (18)                |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (11627)                           |
| -> 5 |       INDEX RANGE SCAN           | I_WRI$_OPTSTAT_HH_ST          |    990K |  5827 |     50264 |   +101 |  1423 |       4G |   41 | 328KB |    16.94 | buffer busy waits (3)                 |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (8332)                            |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | resmgr:cpu quantum (180)              |
   3 - filter(ROWNUM<=NVL(:2,ROWNUM))

So, first thought was that:

1. I thought that MMON was responsible for purging old stats… clearly not in this case as there were multiple concurrent application connections purging old data as part of their ETL process.

2. The DELETE is deleting any old history older than a parameterised date, the first 10000 rows thereof. There is no connection to the object on whose statistics the application ETL jobs were working on.

3. I would expect a COUNT STOPKEY operation but then the NVL predicate avoids this optimisation.

4. UPDATE: The real time sql monitoring output is also reporting the metrics for 1423 executions of this statement. The clue is in the 14M rows reported in the COUNT + FILTER operations. 1423 * 10000 rows = 14M. But I requested the output for what I thought was a single execution id – 16777249 – strange… bulk/batched delete ?

More questions than answers…

Forecast for SXSW: Mostly Cloudy with 100% Chance of Innovative Ideas

Linda Fishman Hoyle - Wed, 2015-02-25 12:23

South by Southwest (SXSW) is just around the corner, and we can’t wait for the five nerdiest days on the calendar. Thirty thousand of the top influencers from advertising, technology, marketing, and social media will converge on Austin, Texas, to swap ideas and build relationships.

Oracle’s lineup this year is all about the cloud. We have a lot of clouds at Oracle, and each one is designed to make our customers' lives easier. SXSW is a great opportunity to highlight our integrated solutions to these influencers. As Mark Hurd said, “We integrate the products for the customer so the customer doesn’t have to do it. We optimize the software for the solution, and it’s delivered us strong growth."

The best part? We're only getting better at what we do in the cloud. We have a great lineup of customer speakers sharing our stages to talk about their Oracle journeys plus a fun cocktail event again this year. So check out what we have planned for you at SXSW.

Friday, March 13th: A Full Day of Oracle Intelligence

Come by the Radisson Hotel at 111 Cesar Chavez and Congress to see a great day of presentations with Oracle customers. Also check out our Social Intelligence Center to follow all of the #SXSW online buzz.

Saturday, March 14 and Sunday, March 15:  The Accelerator at the SXSW Startup Village

For the second year in a row, we are the premier sponsor of the Accelerator, a competition of the top 48 most innovative ideas, technologies, products, and/or services. This competition will take place on Saturday and Sunday through a series of live demonstrations at the SXSW Startup Village. Don’t miss this!

Sunday, March 15th: More Great Sessions

Tuesday, March 17th: One Final Session Before Heading Home

If you aren’t able to go to Austin, follow the conversation on social:

  • Oracle (@Oracle)
  • Oracle Social Cloud (@OracleSocial)
  • Oracle Marketing Cloud (@OracleMktgCloud)
  • Oracle Data Cloud (@OracleDataCloud)
  • Oracle Customer Experience (@OracleCX)

Virtual CPUs with Google Compute Engine

Pythian Group - Wed, 2015-02-25 12:12

Continuing on my series of virtual CPU behavior in Amazon Web Services, Amazon Web Service HVM instances, and Microsoft Azure, I’m taking a look at Google Compute Engine (GCE). GCE is a relative newcomer to the public cloud world, become generally available in December 2013. It does have some interesting features, including transparent maintenance through live migration, and automatic sustained-use discounts without upfront commitments.

Unlike Amazon or Microsoft, Google is very upfront about their vCPU definition.

For the n1 series of machine types, a virtual CPU is implemented as a single hyperthread on a 2.6GHz Intel Sandy Bridge Xeon or Intel Ivy Bridge Xeon (or newer) processor. This means that the n1-standard-2 machine type will see a whole physical core.

I still believe calling such a hyperthread a “virtual CPU” is misleading. When creating a virtual machine in a non-cloud VM platform, 1 virtual CPU = 1 physical core. Plain and simple. But when using a cloud platform, I need 2 virtual CPUs to get that same physical core.


Anyways, off to run some CPU tests. n1-standard-4 is a close match to the m3.xlarge instances previously tested, so I’ll try that.

Getting set up on Google Compute Engine

I already signed up with Google Compute Engine’s free trial and created a project I’m calling marc-cpu-test. Installing the gcloud compute command-line tools.

[marc@quartz ~]$ gcloud auth login --no-launch-browser
Go to the following link in your browser:

Enter verification code: (redacted)
Saved Application Default Credentials.

You are now logged in as [].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT
[marc@quartz ~]$ gcloud config set project marc-cputest
[marc@quartz ~]$ gcloud config set compute/zone us-central1-a
[marc@quartz ~]$ gcloud compute instances create cpu-test-n4 --image centos-6 --machine-type "n1-standard-4" --zone us-central1-a
Created [].
cpu-test-n4 us-central1-a n1-standard-4 RUNNING
[marc@quartz ~]$ gcloud compute ssh cpu-test-n4
WARNING: You do not have an SSH key for Google Compute Engine.
WARNING: [/usr/bin/ssh-keygen] will be executed to generate a key.
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/marc/.ssh/google_compute_engine.
Your public key has been saved in /home/marc/.ssh/
The key fingerprint is:
Updated [].
Warning: Permanently added '' (RSA) to the list of known hosts.
Warning: Permanently added '' (RSA) to the list of known hosts.

OK, instance all set and connected. As a CentOS 6 image it doesn’t allow SSH root logins by default, so attempting to set up a gcloud environment as a root user will get you “permission denied” errors on SSH. Serves me right for trying to run these tools as root in the first place :-).

Looking around

Checking what they got us:

[marc@cpu-test-n4 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
processor       : 1
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
processor       : 2
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
processor       : 3
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2

Google has redacted the exact CPU model numbers, but has clearly marked this as a 2-core system with core IDs 0 and 1.

The single-CPU case
[marc@cpu-test-n4 ~]$ taskset -pc 0 $$
pid 1558's current affinity list: 0-3
pid 1558's new affinity list: 0
[marc@cpu-test-n4 ~]$  dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 14.3427 s, 151 MB/s
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 29.3081 s, 74.1 MB/s
2170552320 bytes (2.2 GB) copied, 29.3065 s, 74.1 MB/s

We get a nice boost in raw CPU numbers as compared to the 120 MB/s I saw in AWS. With two processes sharing this CPU, see a tiny bit less than half the throughput.

Sharing the cores (or trying to)
[marc@cpu-test-n4 ~]$ taskset -pc 0,1 $$
pid 1558's current affinity list: 0
pid 1558's new affinity list: 0,1
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c &gt; /dev/null &amp; done
[1] 1803
[2] 1805
[marc@cpu-test-n4 ~]$ 2170552320 bytes (2.2 GB) copied, 14.6959 s, 148 MB/s
2170552320 bytes (2.2 GB) copied, 14.7139 s, 148 MB/s

This is interesting; we see almost full-speed throughput in spite of processors 0 and 1 said to have a shared core. With processors 0 and 2 the situation is the same.

[marc@cpu-test-n4 ~]$ taskset -pc 0,2 $$
pid 1558's current affinity list: 0,1
pid 1558's new affinity list: 0,2
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c &gt; /dev/null &amp; done
[1] 1830
[2] 1833
[marc@cpu-test-n4 ~]$ 2170552320 bytes (2.2 GB) copied, 14.6683 s, 148 MB/s
2170552320 bytes (2.2 GB) copied, 14.6692 s, 148 MB/s

Is the CPU scheduler ignoring my taskset commands? Running mpstat 2-second samples during the test to see actual CPU usage:

[marc@cpu-test-n4 ~]$ mpstat -P ALL 2
06:08:44 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:08:46 PM  all   46.31    0.00    3.75    0.00    0.00    0.00    0.00    0.00   49.94
06:08:46 PM    0   93.50    0.00    6.50    0.00    0.00    0.00    0.00    0.00    0.00
06:08:46 PM    1   92.00    0.00    8.00    0.00    0.00    0.00    0.00    0.00    0.00
06:08:46 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:08:46 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:08:52 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:08:54 PM  all   46.75    0.00    3.25    0.00    0.00    0.00    0.00    0.00   50.00
06:08:54 PM    0   93.47    0.00    6.53    0.00    0.00    0.00    0.00    0.00    0.00
06:08:54 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:08:54 PM    2   93.50    0.00    6.50    0.00    0.00    0.00    0.00    0.00    0.00
06:08:54 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

So the taskset commands are working: when we ask for CPUs 0 and 1, we are getting them, but throughput shows that cores aren’t being shared. It means that the CPUs in the virtual machine are not statically bound to hardware threads as seen under AWS. I’d call it a win as it gets more consistent performance even if the guest operating system is forced to makes poor CPU scheduling decisions as in this case.

[marc@cpu-test-n4 ~]$ taskset -pc 0-3 $$
pid 1558's current affinity list: 0,2
pid 1558's new affinity list: 0-3
[marc@cpu-test-n4 ~]$ for i in {1..4}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 22.9823 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 22.9914 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 22.9915 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 23.1333 s, 93.8 MB/s

This is more the throughput we would expect with two shared cores.

Lessons learned

Over the course of these tests, I’ve discovered a few things:

  • Although they share virtual CPUs like competitors, Google is very upfront about this behavior.
  • Actual throughput for a simple gzip workload is excellent.
  • Google Compute Engine has an abstraction layer in front of CPUs that dynamically schedules tasks between CPU threads, in addition to the regular scheduler in the virtual machine. In my testing, it allocates tasks efficiently across CPU cores, even when the OS scheduler is configured suboptimally.
Categories: DBA Blogs

Exadata X5 – A Practical Point of View of the New Hardware and Licensing

Pythian Group - Wed, 2015-02-25 12:10

Oracle recently announced its latest iteration of Exadata – X5-2. It includes a refresh of the hardware to the most recent Xeon® E5-2699 v3 CPUs. These new CPUs boost the total cores count in a full rack to 288. This is higher than the current 8 socket “big machine” version X4-8, which has only 240 cores.

But the most exciting part is the all flash version of Exadata. In the previous generation – X4 – Oracle had to switch from 15K drives to 10K drives in order to boost capacity from 600 GB to 1200 GB per hard drive to keep disk space higher than flash cache size. At that time of X4 announcements, we were already wondering why Oracle was still offering high-speed disks and not switching to all flash, and now we know why. Because that type of high-performance flash wasn’t quite ready.

Maintaining high IO rates over long periods of times needed some changes to the ILOM in order to maintain cooling fans speed based on many individual temperature sensors inside the flash cards (details). Removing the SAS controller and using the new NVMe connectivity resulted in much higher bandwidth per hard drive – 3.2 GBytes/sec vs. the old 1.2 GBytes/sec SAS.

With temperature and bandwidth sorted out, we now have a super-high performance option (EF – Extreme Flash) for Exadata which delivers the stunning 263 GB/sec uncompressed scan speed in a full rack. The difference in performance between the High Capacity and High Performance EF flash option is now much higher. The high-performance option in Exadata X5 is now viable. In Exadata X4 it made so little difference, that it was pointless.

x4 vs x5

The one thing I wonder with the X5 announcement is why the X5-2 storage server still uses the very old and quite outdated 8 core CPUs. I’ve seen many cases where a Smart Scan on an HCC table is CPU bound on the storage server even when reading from spinning disk. I am going to guess that there’s some old CPU inventory to cleanup. But that may not end up being such a problem (see “all columnar” flash cache feature).

But above all, the most important change was the incremental licensing option. With 36 cores per server, even the 1/8th rack configuration was in the multi-million dollars in licenses, and in many cases was too much for the problem in hand.

The new smallest configuration is:

  • 1/8th rack, with 2 compute nodes
  • 8 cores enabled per compute node (16 total)
  • 256 GB RAM per node (upgradable to 768 GB per node)
  • 3 storage servers with only half the cores, disks and flash enabled

Then you can license additional cores as you need them, 2 cores at a time. Similar to how ODA licensing option worked. You cannot reduce licensed cores.

The licensing rules changes go even further. Now you can mix & match compute and storage servers to create even more extreme options. Some non-standard examples:

  • Extreme Memory – more compute nodes with max RAM, reduced licensed cores
  • Extreme Storage – replace compute node with storage nodes, reduced licensed cores

x5 custom
Link to video

In conclusion, Oracle Exadata X5 configuration options and the changes it brings to licensing allows an architect to craft a system that will meet any need and allow for easy, small step increments in the future, potentially without any hardware changes.

There are many more exciting changes in Oracle 12c, Exadata X5 and the new storage server software which I may cover in the future as I explore them in detail.

Categories: DBA Blogs

Log Buffer #411, A Carnival of the Vanities for DBAs

Pythian Group - Wed, 2015-02-25 12:00

This Log Buffer Edition brings you some blog posts from Oracle, SQL Server and MySQL.


Suppose you have a global zone with multiple zpools that you would like to convert into a native zone.

The digital revolution is creating abundance in almost every industry—turning spare bedrooms into hotel rooms, low-occupancy commuter vehicles into taxi services, and free time into freelance time

Every time I attend a conference, the Twitter traffic about said conference is obviously higher.  It starts a couple weeks or even months before, builds steadily as the conference approaches, and then hits a crescendo during the conference.

Calling All WebLogic Users: Please Help Us Improve WebLogic Documentation!

Top Two Cloud Security Concerns: Data Breaches and Data Loss

SQL Server:

This article describes a way to identify the user who truncated the table & how you can recover the data.

When SQL Server 2014 was released, it included Hekaton, Microsoft’s much talked about memory-optimized engine that brings In-Memory OLTP into play.

Learn how you can easily spread your backup across multiple files.

Daniel Calbimonte has written a code comparison for MariaDB vs. SQL Server as it pertains to how to comment, how to create functions and procedures with parameters, how to store query results in a text file, how to show the top n rows in a query, how to use loops, and more.

The article show a simple way we managed to schedule index rebuild and reorg for an SQL instance with 106 databases used by one application using a Scheduled job.


How to setup a PXC cluster with GTIDs (and have async slaves replicating from it!)

vCloud Air and business-critical MySQL

MySQL Dumping and Reloading the InnoDB Buffer Pool

How to benchmark MongoDB

MySQL Server on SUSE 12

Categories: DBA Blogs

SQL Server 2014 Cumulative Update 6

Pythian Group - Wed, 2015-02-25 11:59

Hello everyone,

Just a quick note to let you know that this week, while most of North America was enjoying a break, Microsoft released the 6th cumulative update for SQL Server 2014. This update contains fixes for 64 different issues, distributed as follows:

SQL 2014 Cumulative Update 6

As the name implies, this is a cumulative update, that means it is not necessary to install the previous 5 in case you don’t have them. Please remember to test thoroughly any update before applying to production.

The cumulative update and the full release notes can be found here:



Categories: DBA Blogs

Microsoft addresses bug that impacts all Windows iterations [VIDEO]

Chris Foot - Wed, 2015-02-25 10:33


Hi, welcome to RDX! Microsoft finally addressed a bug that could have enabled hackers to initiate man-in-the-middle attacks. The vulnerability was discovered nearly 15 years ago by JAS Global Advisors, a security firm from Chicago.

The flaw, which was dubbed Jasburg, posed grievous concerns for many organizations. In some cases, Jasburg could allow a figure to install malware, manipulate data or create administrator accounts. All of these actions were conducted through a business’s Active Directory.

Why did it take so long to address? Silicon Angle noted Jasburg was rooted in the Windows OS design. This meant that Microsoft had to re-engineer central components of its flagship OS and add a list of new functions. The patch should be implemented once users reboot their systems.

Thanks for watching!

The post Microsoft addresses bug that impacts all Windows iterations [VIDEO] appeared first on Remote DBA Experts.

Introducing Oracle Big Data Discovery Part 2: Data Transformation, Wrangling and Exploration

Rittman Mead Consulting - Wed, 2015-02-25 09:51

In yesterday’s post I looked at Oracle Big Data Discovery and how it brought the search and analytic capabilities of Endeca to Hadoop. We looked at how the Oracle Endeca Information Discovery Studio application works with a version of the Endeca Server engine to analyse and visualise sample sets of data from the Hadoop cluster, and how it uses Apache Spark to retrieve data from Hadoop and then transform that data to make it more suitable for data discovery and data analysis applications. Oracle Big Data Discovery is designed to work alongside ODI and GoldenGate for Big Data once you’ve decided on your main data flows, and Oracle Big Data SQL for BI tool and application access to the entire “data reservoir”. So how does Big Data Discovery work, and what role does it play in the overall big data project workflow?

The best way to think of Big Data Discovery, to my mind, is “Endeca on Hadoop”. Endeca Information Discovery had three main parts to it; the data loading part performed using Endeca Information Discovery Integrator and more recently, the personal data upload feature in Endeca Information Discovery Studio. Data was then ingested into the Endeca Server engine and stored in a key/value-store NoSQL database, indexed, parsed and enriched, and then analyzed using the graphical user interface provided by Studio. As I explained in more detail in my first post in the series yesterday, Big Data Discovery runs the Studio and DGraph (Endeca Server) elements on one or more dedicated nodes, and then reads data in from Hadoop and then writes it back in transformed states using Apache Spark, as shown in the diagram below:


As the data discovery and analysis features in Big Data Discovery rely on getting data into the DGraph (Endeca Server) engine first of all, this implies two things; first, we’ll need to take a subset or sample of the entire Hadoop dataset and load just that into the DGraph engine, and second we’ll need some means of transforming and “massaging” that data so it works well as a data discovery set, and then writing those changes back to the full Hadoop dataset if we want to use it with some other tool – OBIEE or Big Data SQL, for example. To see how this process works, let’s use the same Rittman Mead Apache webserver logs that I’ve used in my previous examples, and bring that data and some additional reference data into Big Data Discovery.

The log data from the RM webserver is in Apache Combined Log Format and a sample of the rows looks like this:


For data to be eligible to be ingested into Big Data Discovery, it has to be registered in the Hive Metastore and with the metadata available to use by external tools using the HCatalog service. This means that you already need to have created a Hive table over each datasource, either pointing this table to regular fixed-width or delimited files, or using a SerDe to translate another file format – say a compressed/column-store format like Parquet – into a format that Hive can understand. In our case I can use the RegEx SerDe that I first used in this blog post a while ago to create a Hive table over the log file and split out the various log file elements, with the resulting DDL looking like this:

host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) 
([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"
[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
LOCATION '/user/oracle/rm_logs';

If I then register the SerDe with Big Data Discovery I could ingest the table and file at this point, or I can use a Hive CTAS statement to remove the dependency on the SerDe and ingest into BDD without any further configuration.

create table access_logs as
select * 
from apachelog;

At this point, if you’ve got the BDD Hive Table Detector running, it should pick up the presence of the new hive table and ingest it into BDD (you can whitelist table names, and restrict it to certain Hive databases if needed). Or, you can manually trigger the ingestion from the Data Processing CLI on the BDD node, like this:

[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_logs;

The data processing process then creates an Apache Oozie job to sample a statistically relevant sample set of data into Apache Spark – with a 1% sample providing 95% sample accuracy – that is the profiled, enriched and then loaded into the Big Data Discovery DGraph engine for further transformation, then exploration and analysis within Big Data Discovery Studio.


The profiling step in this process scans the incoming data and helps BDD determine the datatype of each Hive table column, the distribution of values within the column and so on, whilst the enrichment part identifies key words and phrases and other key lexical facts about the dataset. A key concept here also is that BDD typically works with a representative sample of your Hive table contents, not the whole contents, as all the data you analyse has to fit within the memory space with the DGraph engine, just like it used to with Endeca Server. At some point its likely that the functionality of the DGraph engine will be unbundled from the Endeca Server and run natively across the actual Hadoop cluster, but for now you have to separately ingest data into the DGraph engine (which can run clustered on BDD nodes) and analyse it there – however the rules of sampling are that if you’ve got a sufficiently big sample – say, 1m rows – regardless of the actual main dataset size this sample set is considered sufficiently representative – 95% in this case – as to make loading a bigger sample set not really worth the effort. But bear in mind when working with a BDD dataset that you’re working a sample, not the full set, so if a value you’re looking for is missing it might be because it’s not in this particular sample.

Once you’ve ingested the new dataset into BDD, you see it listed amongst the others that have previously been ingested, like this:


At this point you can explore the dataset, to take an initial look at the patterns and values in the dataset in its raw form.


Unfortunately, in this raw form the data in the access_logs table isn’t all that useful – details of the page request URL are mixed in with the HTTP protocol and method, for example; dates are in strings; details of the person accession the site are in IP address format rather than a geographical location, and so on. In previous examples on this blog I’ve looked at various methods to cleanse, transform and enhance the data in log file tables like this, using tools and techniques such as Hive table transformations, Pig and Apache Spark scripts, and ODI mappings but all of these typically require some IT invovement whereas one of the hallmarks of recent versions of Endeca Information Discovery Studio was giving power-users the ability to transform and enrich data themselves. Big Data Discovery provides tools to cleanse, transform and enrich data, with menu items for common transformations and a Groovy script editor for more complex ones, including deriving sentiment values from textual data and stripping out HTML and formatting characters from text.


Once you’ve finished transforming and enriching the dataset, you can either save (commit) the changes back to the sample dataset in the BDD DGraph engine, or you can use the transformation rules you’ve defined to apply those transformations to the entire Hive table contents back on Hadoop, with the transformation work being done using Apache Spark. Datasets are loaded into “projects” and each project can have its own transformed view of the raw data, with copies of the dataset being kept in the BDD DGraph engine to represent each team’s specific view onto the raw datasets.


In practice I found this didn’t, at the current product state, completely replace the need for a Hadoop developer or R data analyst – you need to get your data files into Hive and HCatalog at the start which involves parsing and interpreting semi-structured data files, and I often did some transformations in BDD, then applied the transformations to the whole Hive dataset and then re-imported the results back into BDD to start from a simple known state. But it certainly made tasks such as turning IP addresses into countries and cities, splitting our URLs and removing HTML tags much easier and I got the data cleansing process done in a matter of hours compared to the days with manual Hive, Pig and Spark scripting.

Now the data in my log file dataset is much more usable and easy to understand, with URLs split out, status codes grouped into high-level descriptors, and other descriptive and formatting changes made.


I can also at this point bring in additional datasets, either created manually outside of BDD and ingested into the DGraph from Hive, or manually uploaded using the Studio interface. These dataset uploads then live in the BDD DGraph engine, and are then written back to Hive for long-term persistence or for sharing with other tools and processes.


These datasets can then be joined to the main dataset on matching dataset columns, giving you a table-join interface not unlike OBIEE’s physical model editor.


So now we’re in a position where our datasets have been ingested into BDD, and we’ve cleansed, transformed and joined them into a combined web activity dataset. In tomorrow’s final post I’ll look at the data visualisation part of Big Data Discovery and see how it brings the capabilities of Endeca Information Discovery Studio to Hadoop.

Categories: BI & Warehousing


Marco Gralike - Wed, 2015-02-25 08:39
When you follow the Oracle in-memory / optimizer team, then you have probably seen this…

PeopleTools 8.54: Oracle Resource Manager

David Kurtz - Wed, 2015-02-25 04:11
This is part of a series of articles about new features and differences in PeopleTools 8.54 that will be of interest to the Oracle DBA.

Oracle Resource manager is about prioritising one database session over another, or about restricting the overhead of one session for the good of the other database users.  A resource plan is a set of rules that are applied to some or all database sessions for some or all of the time.  Those rules may be simple or complex, but they need to reflect the business's view of what is most important. Either way Oracle resource manager requires careful design.
I am not going to attempt to further explain here how the Oracle feature works, I want to concentrate on how PeopleSoft interfaces with it.
PeopleTools FeatureThis feature effectively maps Oracle resource plans to PeopleSoft executables.  The resource plan will then manage the database resource consumption of that PeopleSoft process.  There is a new component that maps PeopleSoft resource names to Oracle consumer groups.  For this example I have chosen some of the delivered plans in the MIXED_WORKLOAD_GROUP that is delivered with Oracle 11g.

  • The Oracle Consumer Group field is validated against the name of the Oracle consumer groups defined in the database, using view     .
SELECT DISTINCT group_or_subplan, type
FROM dba_rsrc_plan_directives

------------------------------ --------------
If you use Oracle SQL Trace on a PeopleSoft process (in this case PSAPPSRV) you find the following query.  It returns the name of the Oracle consumer group that the session should use.The entries in the component shown above are stored in PS_PT_ORA_RESOURCE
  • PS_PTEXEC2RESOURCE is another new table that maps PeopleSoft executable name to resource name.


And then the PeopleSoft process explicitly switches its group, thus:
old_group varchar2(30);
Unfortunately, the consequence of this explicit switch is that it overrides any consumer group mapping rules, as I demonstrate below.
SetupThe PeopleSoft owner ID needs some additional privileges if it is to be able to switch to the consumer groups.

SELECT DISTINCT r.pt_ora_consumr_grp
FROM sysadm.ps_pt_ora_resource r
WHERE r.pt_ora_consumr_grp != ' '
AND r.pt_ora_consumr_grp != 'OTHER_GROUPS'
dbms_output.put_line('Grant '||i.pt_ora_consumr_grp);
,CONSUMER_GROUP => i.pt_ora_consumr_grp

The RESOURCE_MANAGER_PLAN initialisation parameters should be set to the name of the plan which contains the directives.
NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------------
resource_manager_plan string MIXED_WORKLOAD_PLAN

I question one or two of the mappings on PS_PTEXEC2RESOURCE.

-------------------------------- -----------------


  • PSNVS is the nVision Windows executable.  It is in PeopleTools resource MISCELLANEOUS.  This is nVision running in 2-tier mode.  I think I would put nVision into the same consumer group as query.  I can't see why it wouldn't be possible to create new PeopleSoft consumer groups and map them to certain executables.  nVision would be a candidate for a separate group. 
    • For example, one might want to take a different approach to parallelism in GL reporting having partitioned the LEDGER tables by FISCAL_YEAR and ACCOUNTING_PERIOD
  • PSQED is also in MISCELLANEOUS.  Some customers use it to run PS/Query in 2-tier mode, and allow some developers to use it to run queries.  Perhaps it should also be in the QUERY SERVER group.
Cannot Mix PeopleSoft Consumer Groups Settings with Oracle Consumer Group MappingsI would like to be able to blend the PeopleSoft configuration with the ability to automatically associate Oracle consumer groups with specific values of MODULE and ACTION.  Purely as an example, I am trying to move the Process Monitor component into the SYS_GROUP consumer group.

(attribute => 'MODULE_NAME'
,consumer_group => 'SYS_GROUP');

However, it doesn't work because the explicit settings overrides any rules, and you cannot prioritise other rules above explicit settings
exec dbms_application_info.set_module('PROCESSMONITOR','PMN_PRCSLIST');
SELECT REGEXP_SUBSTR(program,'[^.@]+',1,1) program
, module, action, resource_consumer_group
FROM v$session
ORDER BY program, module, action

So I have created a new SQL*Plus session and set the module/action and it has automatically mover into the SYS_GROUP.  Meanwhile, I have been into the Process Monitor in the PIA and the module and action of the PSAPPSRV session has been set, but they remain in the interactive group.
---------------- ---------------- ---------------- ------------------------

If I set the module to something that doesn't match a rule, the consumer group goes back to OTHER_GROUPS which is the default. 
exec dbms_application_info.set_module('WIBBLE','PMN_PRCSLIST');

---------------- ---------------- ---------------- ------------------------

Now, if I explicitly set the consumer group exactly as PeopleSoft does my session automatically moves into the INTERACTIVE_GROUP.
old_group varchar2(30);

---------------- ---------------- ---------------- ------------------------

Next, I will set the module back to match the rule, but the consumer group doesn't change because the explicit setting takes priority over the rules.
---------------- ---------------- ---------------- ------------------------
You can rearrange the priority of the other rule settings, but explicit must have the highest priority (if you try will get ORA-56704). So, continuing with this example, I cannot assign a specific component to a different resource group unless I don't use the PeopleSoft configuration for PSAPPSRV.
Instead, I could create a rule to assign a resource group to PSAPPSRV via the program name, and have a higher priority rule to override that when the module and/or action is set to a specific value.  However, first I have to disengage the explicit consumer group change for PSAPPSRV by removing the row from PTEXEC2RESOURCE.
UPDATE ps_ptexec2resource 
SET pt_resource_name = 'DO_NOT_USE'
WHERE pt_executable_name = 'PSAPPSRV'
AND pt_resource_name = 'APPLICATION SERVER'
(attribute => 'CLIENT_PROGRAM'
,value => 'PSAPPSRV'
,consumer_group => 'INTERACTIVE_GROUP');

(attribute => 'MODULE_NAME'
,consumer_group => 'SYS_GROUP');

explicit => 1,
oracle_user => 2,
service_name => 3,
module_name_action => 4, --note higher than just module
module_name => 5, --note higher than program
service_module => 6,
service_module_action => 7,
client_os_user => 8,
client_program => 9, --note lower than module
client_machine => 10
So, you would have to choose between using either the PeopleSoft configuration or the Oracle Resource Manager configuration.  It depends on your requirements.  This is going to be a decision you will have to take when you design your resource management.  Of course, you can always use just the mapping approach in versions of PeopleTools prior to 8.54.

ConclusionI have never seen Oracle Resource Manager used with PeopleSoft.  Probably because setting it up is not trivial, and then it is difficult to test the resource plan.  I think this enhancement is a great start, that makes it very much easier to implement Oracle Resource Manager on PeopleSoft.  However, I think we need more granularity.
  • I would like to be able to put specific process run on the process scheduler by name into specific consumer groups.  For now, you could do this with a trigger on PSPRCSRQST that fires on process start-up that makes an explicit consumer group change (and puts it back again for Application Engine on completion). 
  • I would like the ability to set different resource groups for the same process name in different application server domains.  For example,
    • I might want to distinguish between PSQRYSRV processes used for ad-hoc PS/Queries on certain domains from PSQRYSRVs used to support nVision running in 3-tier mode on other domains.
    • I might have different PIAs for backup-office and self-service users going to different applications servers.  I might want to prioritise back-office users over self-service users.
Nonetheless, I warmly welcome the new support for Oracle Resource Manager in PeopleTools.  It is going to be very useful for RAC implementations, I think it will be essential for multi-tenant implementations where different PeopleSoft product databases are plugged into the same container database overrides any rules©David Kurtz, Go-Faster Consultancy Ltd.

Partner Webcast – Oracle Business Process Management 12c : The Game Changer for your Business

The Oracle Business Process Management Suite 12c (BPM) is of one the most complete BPM suites in the market and the most feature rich BPM suite offerings. There have been a wide variety of changes...

We share our skills to maximize your revenue!
Categories: DBA Blogs

BPM12c Quickstart "invalid oramds url" solved; recommended patches

Darwin IT - Wed, 2015-02-25 01:53
A few weeks ago I mentioned that I ran into a bug relating an invalid MDS url while trying to create an XSL based on the Case.xsd.

Just this morning got a notification on my Service Request that there is a patch for an invalid MDS url while creating an XSLT based on the Case management XSD: 19775314. There are two versions of the patch: one for the and one for for homes that have the bundle patch applied.

I tried the second one on my BPM QuickStart home that was patched with the bundle patch.

So currently the recommended patches on the BPM QuickStart are:
  • 20163149: (dataobject assignment lost after dehydration), this is mentioned as a prerequesite to the bundle patch, as alternative to the wrongly mentioned patch in the Readme.
  • 19707784: SOA Bundle Patch
  • 20440332: Initiator task form does not shown up on workspace and task is auto approved
  • 19775314: java.lang.IllegalStateException: Invalid url ERROR WHEN CREATING TRANSFORMATION (Patch) 
  • 19706799: db adapter wizard mappings and xsd file creation does not trigger event in win (Patch); choose the version.

Annonce : Oracle Database In-Memory Advisor

Jean-Philippe Pinte - Wed, 2015-02-25 01:27
Oracle Database In-Memory Advisor est maintenant disponible.
Pour utiliser cet assistant, le "Database Tuning Pack" est nécessaire.

Plus d'information :
  • Page OTN
  • Note MOS 1965342.1

Coding in PL/SQL in C style, UKOUG, OUG Ireland and more

Pete Finnigan - Tue, 2015-02-24 22:05

My favourite language is hard to pin point; is it C or is it PL/SQL? My first language was C and I love the elegance and expression of C. Our product PFCLScan has its main functionallity written in C. The....[Read More]

Posted by Pete On 23/07/14 At 08:44 PM

Categories: Security Blogs

Integrating PFCLScan and Creating SQL Reports

Pete Finnigan - Tue, 2015-02-24 22:05

We were asked by a customer whether PFCLScan can generate SQL reports instead of the normal HTML, PDF, MS Word reports so that they could potentially scan all of the databases in their estate and then insert either high level....[Read More]

Posted by Pete On 25/06/14 At 09:41 AM

Categories: Security Blogs

Automatically Add License Protection and Obfuscation to PL/SQL

Pete Finnigan - Tue, 2015-02-24 22:05

Yesterday we released the new version 2.0 of our product PFCLObfuscate . This is a tool that allows you to automatically protect the intellectual property in your PL/SQL code (your design secrets) using obfuscation and now in version 2.0 we....[Read More]

Posted by Pete On 17/04/14 At 03:56 PM

Categories: Security Blogs

Twitter Oracle Security Open Chat Thursday 6th March

Pete Finnigan - Tue, 2015-02-24 22:05

I will be co-chairing/hosting a twitter chat on Thursday 6th March at 7pm UK time with Confio. The details are here . The chat is done over twitter so it is a little like the Oracle security round table sessions....[Read More]

Posted by Pete On 05/03/14 At 10:17 AM

Categories: Security Blogs

PFCLScan Reseller Program

Pete Finnigan - Tue, 2015-02-24 22:05

We are going to start a reseller program for PFCLScan and we have started the plannng and recruitment process for this program. I have just posted a short blog on the PFCLScan website titled " PFCLScan Reseller Program ". If....[Read More]

Posted by Pete On 29/10/13 At 01:05 PM

Categories: Security Blogs

PFCLScan Version 1.3 Released

Pete Finnigan - Tue, 2015-02-24 22:05

We released version 1.3 of PFCLScan our enterprise database security scanner for Oracle a week ago. I have just posted a blog entry on the PFCLScan product site blog that describes some of the highlights of the over 220 new....[Read More]

Posted by Pete On 18/10/13 At 02:36 PM

Categories: Security Blogs