Skip navigation.

Other

Javascript Driven ADF Taskflows for WebCenter Portal

This is a continuation from my previous post - Developing WebCenter Content Cross Platform iDoc Enabled Components for Mobile, ADF, Sharepoint, Liferay.

You can see a video of JIVE Forums integration with a JS Taskflows vs ADF Taskflow running in WebCenter Portal here -

Click here for hi-resolution

This post is aimed at Web Developers, Designers and Marketing web teams who aren’t familiar with ADF and want to create reusable dynamic taskflows without the need to learn ADF or Java to provide interactive dynamic regions using Javascript, HTML and CSS with custom frameworks like jQuery designed not to conflict with ADF JS environment.

Read on for a step by step run through on creating JS driven taskflows  -

    1. You will need to download JDeveloper – I’m using JDev 11.1.1.7.0 for WebCenter Portal 11g where I will deploy my custom taskflow driven entirely with Javascript.
    2. Run through the following Oracle guide to setup your project to extend Portal (11.1.1.8.3) - Developing Components for WebCenter Portal Using JDeveloper
    3. Add new taskflow to library by right-clicking WebCenterSpacesExtensions and selecting “New…”
    4. Add ADF Task Flow (JSF)
      .
      1
      .
    5. Name the xml file, leaving the Directory the JDev default
      .
      2
      .
    6. Double click the new xml file and drag a View element into the diagram from the Component Palette
      .
      3
      .
    7. Rename “view1″ to “[taskflow name]View”.
    8. Double click the new view to create a page fragment.
      Update the directory and add \taskflows\[taskflow name]\view
      This will make it easier to sort through in the future when you develop more taskflows.
      .
      4
      .
    9. Edit the JSFF and display code in source view.
      .
      5
      .
    10. Replace with the following -
      <?xml version='1.0' encoding='UTF-8'?>
      <jsp:root xmlns:jsp="http://java.sun.com/JSP/Page" version="2.1"
                xmlns:af="http://xmlns.oracle.com/adf/faces/rich"
                xmlns:f="http://java.sun.com/jsf/core">
      <af:resource type="javascript">
      <![CDATA[
      /**
       * CREATE BASE JS CONTAINER OBJ
       * This is base class to assist PSA javascript methods to init after page loaded.
       * You can add this script in the head of you template instead of the portlet.
       */
      var FB = window.FB || {},
      	Base = Base || (function() {
      		return {
      			//create multi-cast delegate.
      			onPortalInit: function(function1, function2) {
      				return function() {
      					if (function1) {
      						function1();
      					}
      					if (function2) {
      						function2();
      					}
      				}
      			},
      			//used for chaining methods
      			chainPSA: function() {}
      		}
      	})();
      
      //Use Base method if FB.Base hasn't been created
      FB.Base = FB.Base || Base;
      /************************/
      
      
      
      
      /**
       * CREATE CHAIN WRAPPER
       * Chain method will initialise from Base requirejs core script
       */
      FB.Base.chainPSA = FB.Base.onPortalInit(FB.Base.chainPSA, function() {
      	//set base mustache template name to load and inject
      	var vUID = 'FB_sampleContainer_${pageFlowScope.containerID}', //(UID) Unique Classname to inject template into - can't use IDs in portal 
      		oConstructor = {
      			vTemplate: 		'import/tpl/sampleTpl', //location of sampleTpl.mustache to load
      			oParams: { //Obj list of default params pulled from sample.xml Input definition
      				title:			'${pageFlowScope.title}',
      				displayTitle: 	'${pageFlowScope.displayTitle}',
      				activeUser: 	'${pageFlowScope.activeUser}'
      			},
      			containerID: 		vUID
      		};
      	
      	//check if array exists from other custom JS Portlets
      	if (typeof(FB.loadTemplate) === 'object') {
      		FB.loadTemplate.portletUIDList.push(vUID);
      	//create empty object
      	} else {
      		FB.loadTemplate = {
      			portletUIDList:[vUID],
      			portlets: {}
      		};
      	}
      
      	//inject params
      	FB.loadTemplate.portlets[vUID] = oConstructor;
      
      });
      /************************/
      ]]>
      </af:resource>
      
      
      <!-- Sample template will be injected here -->
      <af:panelGroupLayout layout="vertical" id="FB-SampleContainer" styleClass="FB_sampleContainer_#{pageFlowScope.containerID} portlet-sampleContainer"></af:panelGroupLayout>
      <!-- xSample template will be injected here -->
      
      
      </jsp:root>

      OVERVIEW:

      This is where the mustache template will be injected into to provide the sample component functionality.

    11. <af:panelGroupLayout layout="vertical" id="FB-SampleContainer" styleClass="FB_sampleContainer_#{pageFlowScope.containerID} portlet-sampleContainer"></af:panelGroupLayout>

      The oConstructor specifies the configuration of the the component to inject.
      vTemplate points to a JS file that requireJS imports and configures the base multiUploader components from the params defined.

      oParams contains all configuration for the App at the moment these are scoped params associated with the taskflow that you can allow the user to define and use within you sample component as a JS var.

      var vUID = 'FB_sampleContainer_${pageFlowScope.containerID}', //(UID) Unique Classname to inject template into - can't use IDs in portal 
      		oConstructor = {
      			vTemplate: 		'import/tpl/sampleTpl', //location of sampleTpl.mustache to load
      			oParams: { //Obj list of default params pulled from sample.xml Input definition
      				title:			'${pageFlowScope.title}',
      				displayTitle: 	'${pageFlowScope.displayTitle}',
      				activeUser: 	'${pageFlowScope.activeUser}'
      			},
      			containerID: 		vUID
      		};

      A simple check to see if other components exist on the page and append the new component within the JS Array “PortletUIDList” associated with a JS Object holding the component params in “portlets”

      //check if array exists from other custom JS Portlets
      	if (typeof(FB.loadTemplate) === 'object') {
      		FB.loadTemplate.portletUIDList.push(vUID);
      	//create empty object
      	} else {
      		FB.loadTemplate = {
      			portletUIDList:[vUID],
      			portlets: {}
      		};
      	}
      
      	//inject params
      	FB.loadTemplate.portlets[vUID] = oConstructor;

      Finally the JS configuration is wrapped in JS chain wrapper that will only initialise when requireJS has loaded in all its core base libraries like Jquery etc.

      FB.Base.chainPSA = FB.Base.onPortalInit(FB.Base.chainPSA, function() {
      
      //code
      
      });

      Make sure within your ADF Template you have setup requirejs core and have the following to initialise the FB.Base.chainPSA and loop through the custom taskflows to display on the page -

      //load JS Components
      		if (FB.Base.chainPSA) {
      			FB.Base.chainPSA();
      		}

      //loop and request all templates required
      			for (x;x<lPortletList;x++) {
      				var vPortletUID 	= aPortletList[x],
      					oPortlet 		= FB.loadTemplate.portlets[vPortletUID];
      				
      				//define temp object info to pass into script when init	
      				define('temp'+x, oPortlet);
      				
      				//request and initialise portlet template & pass params
      				require([oPortlet.vTemplate,'temp'+x], function(tpl,oPortlet) {
      					console.log('[IMPORTED TEMPLATE]',tpl.component,oPortlet);
      					tpl.init(oPortlet);
      				});
      			}

    12. To add taskflow parameters open the xml file again.
    13. Select Overview tab bottom left of the screen.
      Select the Parameters side tab.
      Add the following four example params -
      .
      6
      You will see these when we add and edit the taskflow to a portal page in WebCenter Composer.
    14. Deploy the taskflow to WebCenter Portal following the last steps in the Oracle GuideOnce the new taskflow / spaces extension project has been deployed load WebCenter Portal.
      The following screenshots from PS5 the UI has changed since PS7 but you should be able to work out the differences.
    1. Go into administration area of the portal and select the “Resources” Tab
    2. Select the “Resource Catalogs” from the items on the left under the “Structure” heading.
      A list of Resource Catalogs will be available. You can create a new one or use an existing one. Make sure the one you are updating is the one being used by the portal you want to add the taskflow into.
      .
      7
      .
    3. Select the resource catalogue and Edit from the Edit Menu drop down down.
      .
      8
      .
    4. A window will appear hear you can add folders and where you want your components to appear.
      I have created a Demo Taskflow folder.
    5. Select “Add From Library” from the Add dropdown menu.
      .
      9
      .
    6. Drill into Taskflows and add your [Taskflow] – I am adding the sample taskflow I created earlier.
    7. Go into your portal create a new page and add the new taskflow.
      Here is an example of the Jive Forums that I recreated as a JS driven taskflow.
      .
      11
      .
      12
      .
    8. And the final output of the taskflow on the page.
      13

 

 

 

The post Javascript Driven ADF Taskflows for WebCenter Portal appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Developing WebCenter Content Cross Platform iDoc Enabled Components for Mobile, ADF, Sharepoint, Liferay

frankensteinSo over the last couple of months I’ve been thinking and tinkering with code – “Whats the best approach for creating a WebCenter Content (WCC) components that I can consume and reuse across multiple platforms and environments?”
Is it… pagelet producer or maybe an iFrame; but these solutions just weren’t good enough or allowed the flexibility I really wanted.

I needed a WCC Solution that could easily be consumed into Mobile either cordova (Hybrid APP) or ADF Mobile (AMX views) and that worked on different devices/platforms as well as on any enterprise app ie Sharepoint (.Net), Lifreray,  WebCenter Portal (ADF) or even consumed into the new WebCenter Content ADF WebUI – Providing the added advantage that there would not need to be multiple branches of code or redevelopment of the component for each platform and environment.. 

And in the famous words of Victor Frankenstein.. “It’s Alive!!”

Yes; after tinkering around and trying different approaches this is the solution I created to support the above model.
I’m not saying this is the right approach or supported by the enterprise vendors but an approach that is reusable and can work on all enterprise apps.

 

[VIDEO CONVERTING]…

Here’s a quick video of a drag/drop MultiUploader component I created for WebCenter Content Classic that I can reuse on .Net and ADF WebCenter Portal/Content as well as mobile.

Read on to find out more on how this was achieved..

1) First I’m going dig into WebCenter Content and explain underlying structure of the component.

To create a flexible base model I created a light Javascript framework very similar to AngularJS or ReactJS.

This would be the base component that would enable additional components on the page with the use of Mustache (JS templates) to drive and inject dynamic functional areas of content into a specified DOM node by ID or className.
Any changes of layout with the component are handle via an ajax request to a cached mustache template which updates the DOM when needed (similar to ADFs PPR) any user interaction is handled through event driven actions from the imported templates.

RequireJS is used to supply a flexible module loading framework – where I do not need to be worried over conflicts of JS libraries and is used to load in mustache templates and additional JS functionality when needed.

You’re probably thinking that there are going to be a lot of AJAX requests going back and forth and its going to be slow.. Just Check out the video.. And the answer is not really – the mustache templates are going to be smaller than average images you load on a page..

So as an example for the MultiUploader I only have 1 mustache template that is 9kb.. All interaction is handled by 2 JS files that are 39kb uncompressed.

2) As mentioned a base model WCC component –  “FishbowlModuleLoader” will load in and initiate all other components on the page and will only load and cache required templates and JS files as and when is needed. There is no point to load in all templates and JS functionality on a page if it is not needed improving performance and interaction of the component.

3) The WCC components – “FishbowlMultiUploader”: I’m going to run through and give a quick overview of how this component works.

WebCenter Content Resource Asset

This is the base structure of the Content Component configuration “fb_multi_upload_page_body” it is consumed into a custom template “MULTI_UPLOAD_PAGE” which is requested via a custom service request “?IdcService=GET_FB_MULTI_UPLOAD_PAGE”.

<!--
Name:           fb_multi_upload_page_body
Author:         John Sim  [18/06/2014]
Parameters:		
Description:	Page Body for Multi Checkin used in MULTI_UPLOAD_PAGE template
-->
<@dynamichtml fb_multi_upload_page_body@>
[[% FB fb_multi_upload_page_body Template body MULTI_UPLOAD_PAGE %]]

<div id="FB-multiCheckin" class="FB_multiCheckin"></div>

<script>
/**
 * CREATE CHAIN WRAPPER
 * Chain method will load from Base ModuleLoader requirejs core script
 */
FB.Base.chainPSA = FB.Base.onPortalInit(FB.Base.chainPSA, function() {
	//set base mustache template name to load and inject
	var vUID = 'FB_multiCheckin', //(UID) Unique Classname to inject template into - can't use IDs in portal 
		oConstructor = {
			vTemplate: 'import/tpl/multiUploadTpl', //location of template.mustache to load
			oParams: { //Obj list of default params pulled from multiUploader.xml Input definition
				maxUploadSize:			'10mb',
				defaultDocType:			('<$multiUploadDefaultType$>' !== '')? '<$multiUploadDefaultType$>': 'Document', 
				defaultSecurityGroup:		('<$multiUploadDefaultSecurityGroup$>' !== '')? '<$multiUploadDefaultSecurityGroup$>': 'Public',
				defaultAccount:			'Workspace/'+userName, 
				author:				(typeof(userName) !== 'undefined')? userName: '', 
				httpEnterpriseCgiPath: 		(typeof(httpEnterpriseCgiPath) !== 'undefined')? httpEnterpriseCgiPath: '',
				idcToken: 			(typeof(idcToken) !== 'undefined')? idcToken: '',
				httpWebRoot: 			(typeof(httpWebRoot) !== 'undefined')? httpWebRoot: '',
				enableTagging:			true,
				enableEmails:			true,
				enableBarcode:			true,
				enableCheckinProfiles: 		true,
				showHelpOption: 		true
			},
			containerID: 		vUID
		};
	
	//check if array exists from other custom JS Portlets
	if (typeof(FB.loadTemplate) === 'object') {
		FB.loadTemplate.portletUID.push('FB_multiUploadContainer_' + vUID);
	//create empty object
	} else {
		FB.loadTemplate = {
			portletUIDList:['FB_multiUploadContainer_' + vUID],
			portlets: {}
		};
	}

	//inject params
	FB.loadTemplate.portlets['FB_multiUploadContainer_' + vUID] = oConstructor;
});
/************************/
</script>


<@end@>

 

This is where the mustache template will be injected into to provide the multiUpload component functionality.

<div id="FB-multiCheckin" class="FB_multiCheckin"></div>

The oConstructor specifies the configuration of the the component to inject.
vTemplate points to a JS file that requireJS imports and configures the base multiUploader components from the params defined.

oParams contains all configuration for the App at the moment these are mostly hard coded but could be defined as iDoc Variables when you install and enable the component within WCC.

var vUID = 'FB_multiCheckin', //(UID) Unique Classname to inject template into - can't use IDs in portal 
		oConstructor = {
			vTemplate: 'import/tpl/multiUploadTpl', //location of template.mustache to load
			oParams: { //Obj list of default params pulled from multiUploader.xml Input definition
				maxUploadSize:			'10mb',
				defaultDocType:			('<$multiUploadDefaultType$>' !== '')? '<$multiUploadDefaultType$>': 'Document', 
				defaultSecurityGroup:		('<$multiUploadDefaultSecurityGroup$>' !== '')? '<$multiUploadDefaultSecurityGroup$>': 'Public',
				defaultAccount:			'Workspace/'+userName, 
				author:				(typeof(userName) !== 'undefined')? userName: '', 
				httpEnterpriseCgiPath: 		(typeof(httpEnterpriseCgiPath) !== 'undefined')? httpEnterpriseCgiPath: '',
				idcToken: 			(typeof(idcToken) !== 'undefined')? idcToken: '',
				httpWebRoot: 			(typeof(httpWebRoot) !== 'undefined')? httpWebRoot: '',
				enableTagging:			true,
				enableEmails:			true,
				enableBarcode:			true,
				enableCheckinProfiles: 		true,
				showHelpOption: 		true
			},
			containerID: 		vUID
		};

A simple check to see if other components exist on the page and append the new component within the JS Array “PortletUIDList” associated with a JS Object holding the component params in “portlets”

//check if array exists from other custom JS Portlets
	if (typeof(FB.loadTemplate) === 'object') {
		FB.loadTemplate.portletUID.push('FB_multiUploadContainer_' + vUID);
	//create empty object
	} else {
		FB.loadTemplate = {
			portletUIDList:['FB_multiUploadContainer_' + vUID],
			portlets: {}
		};
	}

	//inject params
	FB.loadTemplate.portlets['FB_multiUploadContainer_' + vUID] = oConstructor;

Finally the JS configuration is wrapped in JS chain wrapper that will only initialise when requireJS has loaded in all its core base libraries like Jquery etc.

FB.Base.chainPSA = FB.Base.onPortalInit(FB.Base.chainPSA, function() {

//code

});

 

4) So lets take a look at how the base component “FishbowlModuleLoader” works.

Essentially this defines the FB.Base.chainPSA chain wrapper method in the header – does not need jquery or any other library.

<!--
Name:           std_html_head_declarations
Author:         John Sim  [18/06/2014]
Parameters:		
Description:	Add required header resources
-->
<@dynamichtml std_html_head_declarations@>
[[% FB std_html_head_declaration Update head add JS libs for module loader %]]

<$include super.std_html_head_declarations$>

<script>
/**
 * CREATE BASE JS CONTAINER OBJ
 * DONOT ADD JQUERY this is base class to assist PSA javascript methods to init after page loaded.
 */
var FB = window.FB || {},
	Base = Base || (function() {
		return {
			//create multi-cast delegate.
			onPortalInit: function(function1, function2) {
				return function() {
					if (function1) {
						function1();
					}
					if (function2) {
						function2();
					}
				}
			},
			//used for chaining methods
			chainPSA: function() {}
		}
	})();

//Use Base method if FB.Base hasn't been created
FB.Base = FB.Base || Base;
/************************/
</script>

<@end@>

You could cache this and put it in a script file I’ve just put it inline easier for you to read

In the footer we define requireJS and the configuration that loads in base libraries that we need for all components ie Jquery and maybe a few others.
Also we setup fb.core.js as our base script to import and load in the core framework I built to handle routing and DOM event interaction as well as global vars.

<!--
Name:           std_page_end
Author:         John Sim  [18/06/2014]
Parameters:		
Description:	Component Module Loader RequireJS setup
-->
<@dynamichtml std_page_end@>
[[% FB std_page_end Add Module Loader RequireJS lib %]]

<$include super.std_page_end$>


<!-- Init FB Component Module Loader -->
<script src="<$HttpWebRoot$>resources/FishbowlModuleLoader/js/core/config.js"></script>
<script src="<$HttpWebRoot$>resources/FishbowlModuleLoader/js/libs/requirejs/require.min.js" data-main="fb.core"></script>
<!-- Init FB Component Module Loader -->
<@end@>

fb.core.js so here is where the magic begins

// REQUIREJS Base configuration
require([
	//Dom ready req plugin
	'domReady',
	
	
	//core 
	'import/Layout',
	'import/Action',
	'import/Navigation',
	'import/Global',
	
	
	//Plugins
	'Moment',		//date plugin momentjs
	'ftlabsFastClick', 	//fix touch 300ms delay
	'fb'			//fb global methods
	

	
], function(domReady, Layout){
console.info('[ALL MODULES LOADED]');

	domReady(function() {
		console.info('[DOM READY]');
		
		//initialise layout DOM events ie click, touch etc.
		Layout.init();
		
		//load JS Components
		if (FB.Base.chainPSA) {
			FB.Base.chainPSA();
		}
		
		//check if any JS driven template containers exist
		if (typeof(FB.loadTemplate) !== 'undefined') {
			var aPortletList 	= FB.loadTemplate.portletUIDList,
				lPortletList 	= aPortletList.length,
				x 				= 0;
				
			//loop and request all templates required
			for (x;x<lPortletList;x++) {
				var vPortletUID 	= aPortletList[x],
					oPortlet 		= FB.loadTemplate.portlets[vPortletUID];
				
				//define temp object info to pass into script when init	
				define('temp'+x, oPortlet);
				
				//request and initialise portlet template & pass params
				require([oPortlet.vTemplate,'temp'+x], function(tpl,oPortlet) {
					console.log('[IMPORTED TEMPLATE]',tpl.component,oPortlet);
					tpl.init(oPortlet);
				});
			}
		}
		
	});
	
});

Once the Dom has fully loaded FB.Base.chainPSA(); is initiated this sets up and configures the FB.loadTemplate object that contains all information associated to required components that will need to be loaded into the page.

Here we loop through and load in all templates and pass across the component configuration to the templates to be initialised:

//loop and request all templates required
			for (x;x<lPortletList;x++) {
				var vPortletUID 	= aPortletList[x],
					oPortlet 		= FB.loadTemplate.portlets[vPortletUID];
				
				//define temp object info to pass into script when init	
				define('temp'+x, oPortlet);
				
				//request and initialise portlet template & pass params
				require([oPortlet.vTemplate,'temp'+x], function(tpl,oPortlet) {
					console.log('[IMPORTED TEMPLATE]',tpl.component,oPortlet);
					tpl.init(oPortlet);
				});
			}

And thats all there is too it..

5) Lets dig into WebCenter Portal now.. How can you reuse all that code you’ve written for WebCenter Content Classic within ADF?

Easy; lets create a JS driven taskflow template that we can dump into the resource catalogue and drag drop reuse it throughout any page where ever it is needed.

I’ve created a new post for this part -
Read on here to find out how to create JS Driven Taskflow templates.

 

Some gotchas’ - 

Somethings to think about if you do decide to use this approach.

  1. You will need to make sure that all AJAX requests are made on the same domain.
    1. or enable CORs from UCM to accepts requests cross domain. (Mobile works crossdomain)
  2. WCC needs to be accessible by the users browser
    1. You can setup a proxy service and only allow access to the custom services you require to lock down other UCM environment access if needed.

And finally - one thing that comes to mind here I am using static mustache templates but there is nothing stopping you from creating a custom WCC service to generate mustache templates with embedded idoc if you want..

The post Developing WebCenter Content Cross Platform iDoc Enabled Components for Mobile, ADF, Sharepoint, Liferay appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Teradata bought Hadapt and Revelytix

DBMS2 - Wed, 2014-07-23 02:29

My client Teradata bought my (former) clients Revelytix and Hadapt.* Obviously, I’m in confidentiality up to my eyeballs. That said — Teradata truly doesn’t know what it’s going to do with those acquisitions yet. Indeed, the acquisitions are too new for Teradata to have fully reviewed the code and so on, let alone made strategic decisions informed by that review. So while this is just a guess, I conjecture Teradata won’t say anything concrete until at least September, although I do expect some kind of stated direction in time for its October user conference.

*I love my business, but it does have one distressing aspect, namely the combination of subscription pricing and customer churn. When your customers transform really quickly, or even go out of existence, so sometimes does their reliance on you.

I’ve written extensively about Hadapt, but to review:

  • The HadoopDB project was started by Dan Abadi and two grad students.
  • HadoopDB tied a bunch of PostgreSQL instances together with Hadoop MapReduce. Lab benchmarks suggested it was more performant than the coyly named DBx (where x=2), but not necessarily competitively with top analytic RDBMS.
  • Hadapt was formed to commercialize HadoopDB.
  • After some fits and starts, Hadapt was a Cambridge-based company. Former Vertica CEO Chris Lynch invested even before he was a VC, and became an active chairman. Not coincidentally, Hadapt had a bunch of Vertica folks.
  • Hadapt decided to stick with row-based PostgreSQL, Dan Abadi’s previous columnar enthusiasm notwithstanding. Not coincidentally, Hadapt’s performance never blew anyone away.
  • Especially after the announcement of Cloudera Impala, Hadapt’s SQL-on-Hadoop positioning didn’t work out. Indeed, Hadapt laid off most or all of its sales and marketing folks. Hadapt pivoted to emphasize its schema-on-need story.
  • Chris Lynch, who generally seems to think that IT vendors are created to be sold, shopped Hadapt aggressively.

As for what Teradata should do with Hadapt:

  • My initial thought Hadapt was to just double down, pushing the technology forward, presumably including a columnar option such as the one Citus Data developed.
  • But upon reflection, if it made technical sense to merge the Aster and Hadapt products, that would be better yet.

I herewith apologize to Aster co-founder and Hadapt skeptic Tasso Argyros (who by the way has moved on from Teradata) for even suggesting such heresy. :)

Complicating the story further:

  • Impala lets you treat data in HDFS (Hadoop Distributed File System) as if it were in a SQL DBMS. So does Teradata SQL-H. But Hadapt makes you decide whether the data is in HDFS or the SQL DBMS, and it can’t be in both at once. Edit: Actually, see Dan Abadi’s comments below.
  • Impala and Oracle’s new SQL-H competitor have daemons running on every data node. So does one option in Hadapt. But I don’t think SQL-H does that yet.

I was less involved with Revelytix that with Hadapt (although I’m told I served as the “catalyst” for the original Teradata/Revelytix partnership). That said, Teradata — like Oracle — is always building out a data integration suite to cover a limited universe of data stores. And Revelytix’ dataset management technology is a nice piece toward an integrated data catalog.

Related posts

Categories: Other

Data integration as a business opportunity

DBMS2 - Sun, 2014-07-20 21:59

A significant fraction of IT professional services industry revenue comes from data integration. But as a software business, data integration has been more problematic. Informatica, the largest independent data integration software vendor, does $1 billion in revenue. INFA’s enterprise value (market capitalization after adjusting for cash and debt) is $3 billion, which puts it way short of other category leaders such as VMware, and even sits behind Tableau.* When I talk with data integration startups, I ask questions such as “What fraction of Informatica’s revenue are you shooting for?” and, as a follow-up, “Why would that be grounds for excitement?”

*If you believe that Splunk is a data integration company, that changes these observations only a little.

On the other hand, several successful software categories have, at particular points in their history, been focused on data integration. One of the major benefits of 1990s business intelligence was “Combines data from multiple sources on the same screen” and, in some cases, even “Joins data from multiple sources in a single view”. The last few years before application servers were commoditized, data integration was one of their chief benefits. Data warehousing and Hadoop both of course have a “collect all your data in one place” part to their stories — which I call data mustering — and Hadoop is a data transformation tool as well.

And it’s not as if successful data integration companies have no value. IBM bought a few EAI (Enterprise Application Integration) companies, plus top Informatica competitor Ascential, plus Cast Iron Systems. DataDirect (I mean the ODBC/JDBC guys, not the storage ones) has been a decent little business through various name changes and ownerships (independent under a couple of names, then Intersolv/Merant, then independent again, then Progress Software). Master data management (MDM) and data cleaning have had some passable exits. Talend raised $40 million last December, which is a nice accomplishment if you’re French.

I can explain much of this in seven words: Data integration is both important and fragmented. The “important” part is self-evident; I gave examples of “fragmented” a couple years back. Beyond that, I’d say:

  • A new class of “engine” can be a nice business — consider for example Informatica/Ascential/Ab Initio, or the MDM players (who sold out to bigger ETL companies), or Splunk. Indeed, much early Hadoop adoption was for its capabilities as a data transformation engine.
  • Data transformation is a better business to enter than data movement. Differentiated value in data movement comes in areas such as performance, reliability and maturity, where established players have major advantages. But differentiated value in data transformation can come from “intelligence”, which is easier to excel in as a start-up.
  • “Transparent connectivity” is a tough business. It is hard to offer true transparency, with minimal performance overhead, among enough different systems for anybody to much care. And without that you’re probably offering a low-value/niche capability. Migration aids are not an exception; the value in those is captured by the vendor of what’s being migrated to, not by the vendor who actually does the transparent translation. Indeed …
  • … I can’t think of a single case in which migration support was a big software business. (Services are a whole other story.) Perhaps Cast Iron Systems came closest, but I’m not sure I’d categorize it as either “migration support” or “big”.

And I’ll stop there, because I’m not as conversant with some of the new “smart data transformation” companies as I’d like to be.

Related links

Categories: Other

The point of predicate pushdown

DBMS2 - Tue, 2014-07-15 07:52

Oracle is announcing today what it’s calling “Oracle Big Data SQL”. As usual, I haven’t been briefed, but highlights seem to include:

  • Oracle Big Data SQL is basically data federation using the External Tables capability of the Oracle DBMS.
  • Unlike independent products — e.g. Cirro — Oracle Big Data SQL federates SQL queries only across Oracle offerings, such as the Oracle DBMS, the Oracle NoSQL offering, or Oracle’s Cloudera-based Hadoop appliance.
  • Also unlike independent products, Oracle Big Data SQL is claimed to be compatible with Oracle’s usual security model and SQL dialect.
  • At least when it talks to Hadoop, Oracle Big Data SQL exploits predicate pushdown to reduce network traffic.

And by the way – Oracle Big Data SQL is NOT “SQL-on-Hadoop” as that term is commonly construed, unless the complete Oracle DBMS is running on every node of a Hadoop cluster.

Predicate pushdown is actually a simple concept:

  • If you issue a query in one place to run against a lot of data that’s in another place, you could spawn a lot of network traffic, which could be slow and costly. However …
  • … if you can “push down” parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic.

“Predicate pushdown” gets its name from the fact that portions of SQL statements, specifically ones that filter data, are properly referred to as predicates. They earn that name because predicates in mathematical logic and clauses in SQL are the same kind of thing — statements that, upon evaluation, can be TRUE or FALSE for different values of variables or data.

The most famous example of predicate pushdown is Oracle Exadata, with the story there being:

  • Oracle’s shared-everything architecture created a huge I/O bottleneck when querying large amounts of data, making Oracle inappropriate for very large data warehouses.
  • Oracle Exadata added a second tier of servers each tied to a subset of the overall storage; certain predicates are pushed down to that tier.
  • The I/O between Exadata’s two sets of servers is now tolerable, and so Oracle is now often competitive in the high-end data warehousing market,

Oracle evidently calls this “SmartScan”, and says Oracle Big Data SQL does something similar with predicate pushdown into Hadoop.

Oracle also hints at using predicate pushdown to do non-tabular operations on the non-relational systems, rather than shoehorning operations on multi-structured data into the Oracle DBMS, but my details on that are sparse.

Related link

Categories: Other

21st Century DBMS success and failure

DBMS2 - Mon, 2014-07-14 00:37

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

  • A significant minority of enterprises are highly committed to their enterprise DBMS standards.
  • Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
  • FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

  • First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
  • Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
  • Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
  • Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say: 

  • At large enterprises, their internet operations perhaps excepted:
    • Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
    • NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
  • Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
  • The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

  • Performance and/or scalability (many examples).
  • Developer features (for example dynamic schema).
  • License/maintenance cost (for example several open source categories).
  • Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

  • There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
  • There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
  • … appliances are anti-strategic for many buyers, especially ones who demand a smooth path to the cloud.)

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

  • Reliability, especially but not only in short-request use cases.
  • Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

  • Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
    • I think this is one of several reasons NoSQL has been much more successful than NewSQL.
    • It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
  • As for MySQL:
    • MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
    • MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
    • In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
  • PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
  • SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
  • Any other new short-request SQL DBMS that sounds like is has traction is also memory-centric.
  • Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one. A couple that seem particularly germane and didn’t get already get linked above are:

Categories: Other

Big Data in the Cloud at Google I/O

William Vambenepe - Tue, 2014-07-01 00:55

Last week was a great party for the entire Google developer family, including Google Cloud Platform. And within the Cloud Platform, Big Data processing services. Which is where my focus has been in the almost two years I’ve been at Google.

It started with a bang, when our fearless leader Urs unveiled Cloud Dataflow in the keynote. Supported by a very timely demo (streaming analytics for a World Cup game) by my colleague Eric.

After the keynote, we had three live sessions:

In “Big Data, the Cloud Way“, I gave an overview of the main large-scale data processing services on Google Cloud:

  • Cloud Pub/Sub, a newly-announced service which provides reliable, many-to-many, asynchronous messaging,
  • the aforementioned Cloud Dataflow, to implement data processing pipelines which can run either in streaming or batch mode,
  • BigQuery, an existing service for large-scale SQL-based data processing at interactive speed, and
  • support for Hadoop and Spark, making it very easy to deploy and use them “the Cloud Way”, well integrated with other storage and processing services of Google Cloud Platform.

The next day, in “The Dawn of Fast Data“, Marwa and Reuven described Cloud Dataflow in a lot more details, including code samples. They showed how to easily construct a streaming pipeline which keeps a constantly-updated lookup table of most popular Twitter hashtags for a given prefix. They also explained how Cloud Dataflow builds on over a decade of data processing innovation at Google to optimize processing pipelines and free users from the burden of deploying, configuring, tuning and managing the needed infrastructure. Just like Cloud Pub/Sub and BigQuery do for event handling and SQL analytics, respectively.

Later that afternoon, Felipe and Jordan showed how to build predictive models in “Predicting the future with the Google Cloud Platform“.

We had also prepared some recorded short presentations. To learn more about how easy and efficient it is to use Hadoop and Spark on Google Cloud Platform, you should listen to Dennis in “Open Source Data Analytics“. To learn more about block storage options (including SSD, both local and remote), listen to Jay in “Optimizing disk I/O in the cloud“.

It was gratifying to see well-informed people recognize the importance of these announcement and partners understand how this will benefit their customers. As well as some good press coverage.

It’s liberating to now be able to talk freely about recent progress on our quest to equip Google Cloud users with easy to use data processing tools. Everyone can benefit from Google’s experience making developers productive while efficiently processing data at large scale. With great power comes great productivity.

Categories: Other

Using multiple data stores

DBMS2 - Wed, 2014-06-18 10:03

I’m commonly asked to assess vendor claims of the kind:

  • “Our system lets you do multiple kinds of processing against one database.”
  • “Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”

So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:

  • The most extreme vendor marketing claims are false.
  • There are many different choices that make sense in at least some use cases each.

Horses for courses

It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:

Vendors are part of this consensus; already in 2005 I observed

For all practical purposes, there are no DBMS vendors left advocating single-server strategies.

Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.

Multiple data stores for a single application

We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree: 

  • It’s common and sensible to manage authentication and authorization data in its own data store. Commonly, the data format is LDAP (Lightweight Directory Access Protocol).
  • It’s common and sensible to manage the “content” and “e-commerce transaction records” aspects of websites separately.
  • Even beyond that case, there are often performance reasons to manage BLOBs (Binary Large OBjects) outside your relational database.
  • Internet “interaction” data is also often best managed outside an RDBMS, in part because of its very non-tabular data structures.

The spectacular 2010 JP Morgan Chase outage was largely caused, I believe, by disregard of these precepts.

There also are cases in which applications dutifully get all their data via SQL queries, but send those queries to two or more DBMS. Teradata is proud that its systems can support rather transactional queries (for example in call-center use cases), but the same application may read from and write to a true OTLP database as well.

Further, many OLTP (OnLine Transaction Processing) applications do some fraction of their work via inbound or outbound messaging. Many buzzwords can come into play here, including but not limited to:

  • SOA (Service-Oriented Architecture). This is the most current and flexible one.
  • EAI (Enterprise Application Integration). This was a hot concept in the late 1990s, but was generally implemented with difficulties that SOA was later designed to alleviate.
  • Message-oriented middleware (MOM) and Publish/Subscribe. These are even older, and overlap greatly.

Finally, every dashboard that combines information from different data stores could be assigned to this category as well.

Multiple storage approaches in a single DBMS

In theory, a single DBMS could operate like two or more different ones glued together. A few functions should or must be centralized, such as administration, and communication with the outside world (connection handling, parsing, etc.). But data storage, query execution and so on could for the most part be performed by rather loosely coupled subsystems. And so you might have the best of both worlds — something that’s multiple data stores in the ways you want that diversity, but a single system in how it fits into your environment.

I discussed this idea last year with cautious optimism, writing:

So will these trends succeed? The forgoing caveats notwithstanding, my answers are more Yes than No.

  • …  multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.
  • Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …
  • … strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM [have] been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.

In 2005 I had been more ambivalent, in part because my model was a full 1990s-dream “universal” DBMS:

IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has already had separate servers for tabular data, text, and MOLAP, although like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML.

That covers most of the waterfront, but I’d like to more explicitly acknowledge three trends:

  • Among other things, Hadoop is a collection of DBMS (HBase, Impala, et al.) that in some cases are very loosely coupled to each other. The question is less how well the various data stores work together, and more how mature any one of them is on its own.
  • The multiple-data-models idea has been extended into schema-on-need, which is sometimes but not always housed in Hadoop.
  • Even on the relational side, multiple storage capabilities exist in one product.
    • Vertica was designed that way from the get-go. (Like the old joke about police duos, one is to read and one is to write.)
    • IBM, Microsoft and Oracle have all recently added some kind of in-memory columnar capability.
    • Teradata, Aster (before Teradata bought them), Greenplum and Vertica all added some variant on row/column dual stores.

Related links

Categories: Other

Announcing Fishbowl’s Technical Support Offerings for Oracle WebCenter

support_logo

Supporting an enterprise software system like Oracle WebCenter is no easy task. Technical complexities, customizations, and multiple versions make it difficult to resolve issues quickly and keep the system up and running. Without a dedicated and knowledgeable support team, WebCenter environments can suffer from system downtime, poor performance, and frustrated users.

Join Fishbowl Solutions for a webinar as they discuss their Oracle WebCenter technical support offerings. These offerings include specific technical services to support WebCenter administrators, end users, as well as customized environments. If you are a WebCenter administrator, power user, or an IT Director/Manager that oversees your company’s WebCenter environment, this webinar is for you. Come hear how Fishbowl’s support offerings could help you increase up-time, improve SR issue resolution, and ensure overall user satisfaction.

Attendees of this webinar will learn:

  • The reasons Fishbowl is best positioned to be your single point of contact for Oracle WebCenter technical support
  • What support services does Fishbowl offer and what does each include
  • The benefits Cascade Corporation has already realized with Fishbowl’s Enterprise Support offering for Oracle WebCenter

Date: Thursday, June 12th
Time: 1:00 – 2:00 PM EST, 12 – 1:00 PM CST

Register: https://www2.gotomeeting.com/register/941379506

 

The post Announcing Fishbowl’s Technical Support Offerings for Oracle WebCenter appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Optimism, pessimism, and fatalism — fault-tolerance, Part 2

DBMS2 - Sun, 2014-06-08 10:58

The pessimist thinks the glass is half-empty.
The optimist thinks the glass is half-full.
The engineer thinks the glass was poorly designed.

Most of what I wrote in Part 1 of this post was already true 15 years ago. But much gets added in the modern era, considering that:

  • Clusters will have node hiccups more often than single nodes will. (Duh.)
  • Networks are relatively slow even when uncongested, and furthermore congest unpredictably.
  • In many applications, it’s OK to sacrifice even basic-seeming database functionality.

And so there’s been innovation in numerous cluster-related subjects, two of which are:

  • Distributed query and update. When a database is distributed among many modes, how does a request access multiple nodes at once?
  • Fault-tolerance in long-running jobs.When a job is expected to run on many nodes for a long time, how can it deal with failures or slowdowns, other than through the distressing alternatives:
    • Start over from the beginning?
    • Keep (a lot of) the whole cluster’s resources tied up, waiting for things to be set right?

Distributed database consistency

When a distributed database lives up to the same consistency standards as a single-node one, distributed query is straightforward. Performance may be an issue, however, which is why we have seen a lot of:

  • Analytic RDBMS innovation.
  • Short-request applications designed to avoid distributed joins.
  • Short-request clustered RDBMS that don’t allow fully-general distributed joins in the first place.

But in workloads with low-latency writes, living up to those standards is hard. The 1980s approach to distributed writing was two-phase commit (2PC), which may be summarized as: 

  • A write is planned and parceled out to occur on all the different nodes where the data needs to be placed.
  • Each node decides it’s ready to commit the write.
  • Each node informs the others of its readiness.
  • Each node actually commits.

Unfortunately, if any of the various messages in the 2PC process is delayed, so is the write. This creates way too much likelihood of work being blocked. And so modern approaches to distributed data writing are more … well, if I may repurpose the famous Facebook slogan, they tend to be along the lines of “Move fast and break things”,* with varying tradeoffs among consistency, other accuracy, reliability, functionality, manageability, and performance.

By the way — Facebook recently renounced that motto, in favor of “Move fast with stable infrastructure.” Hmm …

Back in 2010, I wrote about various approaches to consistency, with the punch line being:

A conventional relational DBMS will almost always feature RYW consistency. Some NoSQL systems feature tunable consistency, in which — depending on your settings — RYW consistency may or may not be assured.

The core ideas of RYW consistency, as implemented in various NoSQL systems, are:

  • Let N = the number of copies of each record distributed across nodes of a parallel system.
  • Let W = the number of nodes that must successfully acknowledge a write  for it to be successfully committed. By definition, W <= N.
  • Let R = the number of nodes that must send back the same value of a unit of data for it to be accepted as read by the system. By definition, R <= N.
  • The greater N-R and N-W are, the more node or network failures you can typically tolerate without blocking work.
  • As long as R + W > N, you are assured of RYW consistency.

That bolded part is the key point, and I suggest that you stop and convince yourself of it before reading further.

Eventually :) , Dan Abadi claimed that the key distinction is synchronous/asynchronous — is anything blocked while waiting for acknowledgements? From many people, that would simply be an argument for optimistic locking, in which all writes go through, and conflicts — of the sort that locks are designed to prevent — cause them to be rolled back after-the-fact. But Dan isn’t most people, so I’m not sure — especially since the first time I met Dan was to discuss VoltDB predecessor H-Store, which favors application designs that avoid distributed transactions in the first place.

One idea that’s recently gained popularity is a kind of semi-synchronicity. Writes are acknowledged as soon as they arrive at a remote node (that’s the synchronous part). Each node then updates local permanent storage on its own, with no further confirmation. I first heard about this in the context of replication, and generally it seems designed for replication-oriented scenarios.

Single-job fault-tolerance

Finally, let’s consider fault-tolerance within a single long-running job, whether that’s a big query or some other kind of analytic task. In most systems, if there’s a failure partway through a job, they just say “Oops!” and start it over again. And in non-extreme cases, that strategy is often good enough.

Still, there are a lot of extreme workloads these days, so it’s nice to absorb a partial failure without entirely starting over.

  • Hadoop MapReduce, which stores intermediate results anyway, finds it easy to replay just the parts of the job that went awry.
  • Spark, which is more flexible in execution graph and data structures alike, has a similar capability.

Additionally, both Hadoop and Spark support speculative execution, in which several clones of a processing step are executed at once (presumably on different nodes), to hedge against the risk that any one copy of the process runs slowly or fails outright. According to my notes, speculative execution is a major part of NuoDB’ architecture as well.

Further topics

I’ve rambled on for two long posts, which seems like plenty — but this survey is in no way complete. Other subjects I could have covered include but are hardly limited to:

  • Occasionally-connected operation, which for example is a design point of CouchDB, SQL Anywhere (sort of), and most kinds of mobile business intelligence.
  • Avoiding planned downtime — i.e., operating despite self-inflicted wounds.
  • Data cleaning and master data management, both of which exist in large part to fix errors people have made in the past.

Related links

Categories: Other

Optimism, pessimism and fatalism — fault-tolerance, Part 1

DBMS2 - Sun, 2014-06-08 10:55

Writing data management or analysis software is hard. This post and its sequel are about some of the reasons why.

When systems work as intended, writing and reading data is easy. Much of what’s hard about data management is dealing with the possibility — really the inevitability — of failure. So it might be interesting to survey some of the many ways that considerations of failure come into play. Some have been major parts of IT for decades; others, if not new, are at least newly popular in this cluster-oriented, RAM-crazy era. In this post I’ll focus on topics that apply to single-node systems; in the sequel I’ll emphasize topics that are clustering-specific.

Major areas of failure-aware design — and these overlap greatly — include:

  • Backup and restore. In its simplest form, this is very basic stuff. That said — any decent database management system should let backups be made without blocking ongoing database operation, with the least performance impact possible.
  • Logging, rollback and replay. Logs are essential to DBMS. And since they’re both ubiquitous and high-performance, logs are being used in ever more ways.
  • Locking, latching, transactions and consistency. Database consistency used to be enforced in stern and pessimistic ways. That’s changing, big-time, in large part because of the requirements of …
  • distributed database operations. Increasingly, modern distributed database systems are taking the approach of getting work done first, then cleaning up messes when they occur.
  • Redundancy and replication. Parallel computing creates both a need and an opportunity to maintain multiple replicas of data at once, in very different ways than the redundancy and replication of the past.
  • Fault-tolerant execution. When one node is inoperative, inaccessible, overloaded or just slow, you may not want a whole long multi-node job to start over. A variety of techniques address this need.

Long-standing basics

In a single-server, disk-based configuration, techniques for database fault-tolerance start:

  • Database changes (inserts, deletes, updates) are applied expeditiously to disk. Furthermore …
  • … a log is kept of the (instructions for) changes. If a change is detected as not going through, it can be reapplied.
  • Data is often kept in multiple copies automagically, whether that is governed by the storage systems or the DBMS itself, with background resyncing if one copy is known to go bad.
  • These ideas can be pushed further into:
    • High availability — the database is mirrored onto storage that is controlled by a second server, which runs the same software as the primary.
    • Disaster recovery — same idea as HA, but off-site to protect against site disasters. DR often cuts various corners, as compared to same-site HA, in the speed and/or quality of service with which the remote instance would take over from the primary site.

Valuable though they are, none of these techniques protects against data corruption that occurs via software errors or security breaches, because in those cases the system will likely do a great job of making incorrect changes consistently across all copies of the data. Hence there also needs to be a backup/restore capability, key aspects of which start:

  • You periodically create and lock down snapshots of the database.
  • In case of comprehensive failure, you load the most recent snapshot you trust, and roll forward by re-applying changes memorialized in the log. (Of course, you avoid re-applying spurious changes that should not have occurred.)

For many users, it is essential that backups be online/continuous, rather than requiring the database to periodically be taken down.

None of this is restricted to what in a relational database would be single-row or at least single-table changes. Transaction semantics cover the case that several changes must either all change or all fail together. Typically all the changes are made; the system observes that they’ve all been made; only then is a commit issued which makes the changes stick. One complexity of this approach is that you need a way to quickly undo changes that don’t get committed.

Locks, latches and timestamps

If a database has more than one user, two worrisome possibilities arise:

  • Two conflicting changes might be attempted against the same data.
  • One user might try to read data at the moment another user’s request was changing it.

So it is common to lock the portion of the database that is being changed. A major area of competition in the early 1990s — and a big contributor to Sybase’s decline — was the granularity of locks (finer-grained is better, with row-level locking being the relational DBMS gold standard).

In-memory locks are generally called latches; I don’t know what the other differences between locks and latches are.

Increasingly many DBMS are designed with an alternative approach, MVCC (Multi-Version Concurrency Control); indeed, I’m now startled when I encounter one that makes a different choice. The essence of MVCC is that, to each portion (e.g. row) of data, the system appends very granular metadata — a flag for validation/invalidation and/or a timestamp. Except for the flags, the data is never changed until a cleanup operation; rather, new data is written, with later timestamps and validating flags. The payoff is that, while it may still be necessary to lock data against the possibility of two essentially simultaneous writes, reads can proceed without locks, at the cost of a minuscule degree of freshness. For if an update is in progress that affects the data needed for a read, the read just targets versions of the data slightly earlier in time.

MVCC has various performance implications — writes can be faster because they are append-mainly, reads may be slower, and of course the cleanup is needed. These tradeoffs seem to work well even for the query-intensive workloads of analytic RDBMS, perhaps because MVCC fits well with their highly sequential approach to I/O.

“In-memory” DBMS

A DBMS that truly ran only in RAM would lose data each time the power turned off. Hence a memory-centric DBMS will usually also have some strategy for persisting data.

Related links

Categories: Other

A Framework Approach to Building an Oracle WebCenter Intranet, Extranet, or Portal

Whether you already have or are planning to build an Oracle WebCenter-based intranet, extranet or customer portal, its overall success hinges on its time to market, ability to scale, and the presence of user productivity tools. Attend this webinar to see how Fishbowl’s Portal Solution Accelerator (PSA) can provide an extensible framework that bundles reusable templates and page layouts, standards-based portlets, and in-place security administration. Join us to discover how this framework can be applied to build or improve your corporate intranet, partner extranet, or customer portal.

Date: Thursday, May 22nd
Time: 1:00 PM EST
Register: https://www2.gotomeeting.com/register/236838418

 

The post A Framework Approach to Building an Oracle WebCenter Intranet, Extranet, or Portal appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Solve the network neutrality dilemma and make money too!

DBMS2 - Wed, 2014-05-14 08:46

As per the links and quotes below, my views on the network neutrality debate may be summarized as:

  • There should be a fairly good level of internet delivery that is, by regulation, available to any website or other internet service. This is essential so that ideas can blossom, speech can be shared, etc.
  • There should be a way to pay for arbitrarily good levels of internet delivery. Entertainment would benefit from that. Medicine, in the future, might require it.
  • Any payment for better delivery should happen through a marketplace open to all.

In this post I’ll add detail as to how that marketplace could work.

Personal note: When I interviewed for academic and think-tank jobs in 1981, my favorite interview speech was the one on utility regulation across different qualities-of-service in the face of uncertain supply and demand. I’m really going back to my roots here.

What I wrote in 2007 — and which garnered considerable discussion at the time — still applies:

Net neutrality is both necessary and workable for what I call Jeffersonet, which comprises the “classical”, bandwidth-light parts of the Internet. Thus, it includes e-mail, instant messaging, much e-commerce, and just about every website created in the first 13 or so years of the Web. Jeffersonet is the greatest tool in human history to communicate research, teaching, news, and political ideas, or to let tiny businesses compete worldwide. Any censorship of Jeffersonet – even if just of the self-interested large-enterprise commercial kind – would be a terrible loss. Net neutrality is workable for Jeffersonet because – well, because it’s already working just fine. Jeffersonet doesn’t need anything beyond current levels of bandwidth and reliability. So there’s no reason to mess with what’s working, other than simple profit-hungry greed.

Network neutrality opponents, however, point to evolving and future technologies, technically more demanding than what the current Internet can well support. Their uses are centered on what I call Edisonet – communication-rich applications such as entertainment, gaming, telephony, telemedicine, teleteaching, or telemeetings of all kinds. Reliable, tiered service is needed for these applications, and somebody has to pay for it. Even so – and this is a key point — the payment scheme should be as favorable to application-developer competition as possible.

So does what I wrote earlier this year:

I think the anti-discrimination argument for network neutrality has much merit. But I also think there are some kinds of payment structure that could leave the playing field fairly level. Imagine, if you will, that:

  • Consumers are charged for data, speed of connection, reliability of delivery, or anything else, but …
  • … internet companies have the ability to absorb those charges on consumers’ behalf, but can only do so …
  • one interaction at a time, with no volume discounts, via an automated system that is open to everybody.

Such a system is surely technologically feasible — indeed, it is at least as feasible as the online advertising networks that already exist. Further, it would be possible for the system to have nice features such as:

  • Telcos could implement forms of peak load pricing, for those times when their network capacity actually is under stress.
  • “Edge provider” internet companies could pay subsidies only on behalf of certain consumers, where those consumers are selected in all the complex ways that advertisements are currently targeted.

To see how this could look, let’s distinguish among some categories of market participant, and consider what kinds of business complexity they can reasonably be expected to endure.

  • True consumers.Their situation probably should and will remain much as it is today:
    • Simple choices of bundled connectivity-service plans.
    • Consumption of particular sites and services on the basis of e-commerce, subscription or free (ad-supported or otherwise).
    • Regulation needed to control — likely with partial success — the huge oligopolists who market, sell and supply the connectivity services.
  • Other kinds of end customer(e.g. businesses acting as consumers).
    • Most of their internet consumption is akin to true consumers’.
    • In addition, they should be able to obtain business-critical services with SLA (Service Level Agreement) guarantees for reliability and speed.
  • Connectivity providers. They can handle any kind of complexity their customers can, provided the equipment exists to deliver on the promises they make. This is true of customer-facing “last mile” and intermediate/wholesale telecommunication firms alike.
  • Market-makers/middlemen for the new market(s) I’m suggesting be created. (Analogous to ad-tech real-time auctioneers/clearing houses.) It’s their job to handle the complexity everybody else needs.
  • Publishers, e-commerce vendors and other internet-based enterprises – or, similarly, the internet parts of brick-and-mortar businesses.Here is where the analysis gets interesting.
    • The reason we’re having this discussion is that certain large internet-based enterprises want the ability to buy SLA guarantees for their service delivery.
    • If they’re able to do so, many of their competitors will suddenly develop that desire as well.
    • Hence the business of SLA-guarantee purchasing needs to be organized not just for the benefit of a few large internet companies, but also to suit smaller companies who might wish to compete with them.
    • What kind of SLA will a smaller/generic internet company want to be able to buy? In a nutshell, they’ll want to guarantee quality and reliability end-to-end on a customer-by-customer or interaction-by-interaction basis.

The distinction I’m drawing here is:

  • Huge companies like Netflix or Google can buy their SLAs piecemeal — one deal to set up a private wholesale network, another deal to guarantee “last-mile” delivery, etc.
  • But mom-and-pop web businesses don’t have that luxury. They need to buy delivery on an all-or-nothing basis.

Making the latter happen is, I maintain, just another job for the middlemen — provided those middlemen come into existence.

And so, net neutrality is easy to solve except for one chicken-egg problem:

  • The right middlemen need to be created …
  • … for a business that doesn’t yet exist …
  • … and which indeed can’t exist unless they’re first created …
  • … which nobody has a strong incentive to do unless the business is first shown to (be likely to) exist.

I hope somebody — perhaps an existing ad-tech company — gambles on setting up the needed clearinghouse, and gets richly rewarded for doing so.

Categories: Other

Notes and comments, May 6, 2014

DBMS2 - Tue, 2014-05-06 07:46

After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.

  • My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
  • My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
  • My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. :) Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
  • My Cloudera/SQL/Impala/Hive apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
  • My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
  • My CitusDB post picked up a few clarifying comments.

Here is a catch-all post to complete the set. 

1. The recently-announced Cloudera/MongoDB relationship* is still at the Barney stage. That said, I’m optimistic that their stated intention to add substance to the relationship will eventually come to fruition. If nothing else, the two companies have high regard for each other, at least at the Mike Olson/Max Schireson level.

*That’s one of numerous deals with my fingerprints on it, but in this case only lightly. It was probably on track to happen even without my nudges.

2. Most of what I talked about when I visited MongoDB is confidential; the public stuff was mainly in my recent MongoDB technology post. But in one exception, I asked Max for an update as to MongoDB enterprise use cases. He reported a cluster in data combination, especially but not only in use cases which have both high-volume part and dynamic-schema aspects. Specific examples Max cited included:

  • Tracking financial holdings from a variety of asset classes — especially if derivatives are involved, because they have a dynamic-schema aspect.
  • Product catalogs, including for use on web sites.
  • Customer information.
  • Patient information.

3. I didn’t ask everybody I saw in California about business trends, and much of what we did discuss was confidential. That said:

  • MapR was proud of its numbers.
  • So was DataStax.
  • ClearStory has a bunch of Very Big Enterprises as customers, mainly but not only in consumer sectors (e.g. retail, packaged goods).

4. Platfora is focusing a bit, starting with clickstream and security — i.e., event series stuff. And by the way, they report that the term “event series” is working well for them.

5. I gather from a variety of comments and conversations that Amazon Redshift has achieved considerable traction.

6. Something I can’t find evidence of having posted before: I think multiple businesses monitor online sales or similar business successes as a guide to network problems. eBay did this via a custom in-memory MOLAP (Multidimensional Online Analytic Process) system years ago. Best evidence that this is hardly restricted to eBay: all the “me-too” responses I get from telling that story.

7. Citus Data tells me that as of PostgreSQL 9.4, Postgres will be able to return just the part of a JSON column needed for a query. This is as opposed to storing the whole thing as text and only retrieving it in its entirety.

8. In the comments to my “Spark on fire” post, Patrick McFadin pointed out that Mahout is transitioning from MapReduce to Spark. (All new work will be on Spark, although old MapReduce-based routines will continue to be supported.) It turns out that Derrick Harris wrote about that over a month ago, and I just missed the news.

9. Also in predictive analytics — there are rumblings that R could eventually be supplanted by Julia, although R’s massive libraries of algorithms still give it the advantage now.

10. Multiple vendors, fed up with the intermittent slowdowns from garbage collection, are moving some processing off the Java heap. Unfortunately, I neglected to ask any of them what the remaining differences then were between Java and C++ programming.

11. And to finish on a light note: BDAS — the project of which Spark is only a part — is pronounced “bad-ass”, something I first heard from Dave Patterson.

Categories: Other

Introduction to CitusDB

DBMS2 - Fri, 2014-05-02 02:00

One of my lesser-known clients is Citus Data, a largely Turkish company that is however headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:

  • CitusDB does not permanently fork PostgreSQL; Citus Data has committed to always working with the latest PostgreSQL release, or at least with one that’s less than a year old.
  • Citus Data never made the “fat head” mistake — if a join can’t be executed directly on the CitusDB data-storing nodes, it can’t be executed in CitusDB at all.
  • CitusDB follows the modern best-practice of having many virtual nodes on each physical node. Default size of a virtual node is one gigabyte. Each virtual node is technically its own PostgreSQL table.*
  • Citus Data has already introduced an open source column-store option for PostgreSQL, which CitusDB of course exploits.

*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.

Citus has thrown a few things against the wall; for example, there are two versions of its product, one which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope — at least until the product is more built-out — is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.

Notwithstanding what I said about “fat heads”, CitusDB does have a concept of Master nodes. These:

  • Also use single-node copies of PostgreSQL.
  • Are blessedly able to scale out, although their underlying databases are entirely replicated.
  • Store no actual data, but do store metadata about each virtual node, including:
    • Structural metadata.
    • Location.
    • Min/max column values (for data skipping).
    • But not (yet) stats to help with query optimization.
  • Do some query planning and rewriting.
  • Handle administration, some of which is nicely parallelized/centralized. (E.g., an index choice can be made once and automatically propagated across all the relevant virtual nodes.)

CitusDB is definitely in its early days. For example:

  • If I understand correctly, the recent CitusDB 3.0 release is the first one on which data is redistributed among shards. Before that, you could only join tables that were either sharded on the same key, or else small enough to be broadcast-replicated across the whole cluster.
  • SQL coverage isn’t great. (E.g., no Windowing.)
  • Some hard-to-parallelize things aren’t implemented yet, e.g. exact median or generally-usable COUNT DISTINCT.
  • ACID is still lacking. Writes are batch-only, micro-batch or otherwise as the case may be.
  • CitusDB’s backup story is primitive, with the main options being:
    • You can rely on having replicas on multiple nodes, even — if you like — in different data centers.
    • You can backup each of the PostgreSQL nodes separately; CitusDB doesn’t yet offer automation for that.
  • CitusDB’s query optimization sounds pretty primitive.
  • I don’t recall Citus telling me of serious workload management.
  • CitusDB compression is block-level only. (PostgreSQL’s version of Lempel-Ziv.)

Still, the Citus Data folks seem to have good ideas, including some — as yet undisclosed — plans going forward. So if it sounds as if CitusDB might fit your needs better than more established scale-out RDBMS do, I’d encourage you to take a look at what Citus offers.

Categories: Other

MemSQL update

DBMS2 - Thu, 2014-05-01 21:40

I stopped by MemSQL last week, and got a range of new or clarified information. For starters:

  • Even though MemSQL (the product) was originally designed for OLTP (OnLine Transaction Processing), MemSQL (the company) is now focused on analytic use cases …
  • … which was the point of introducing MemSQL’s flash-based columnar option.
  • One MemSQL customer has a 100 TB “data warehouse” installation on Amazon.
  • Another has “dozens” of terabytes of data spread across 500 machines, which aggregate 36 TB of RAM.
  • At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.
  • A couple of MemSQL’s top references are also Vertica flagship customers; one of course is Zynga.
  • MemSQL reports encountering Clustrix and VoltDB in a few competitive situations, but not NuoDB. MemSQL believes that VoltDB is still hampered by its traditional issues — Java, reliance on stored procedures, etc.

On the more technical side:

  • Some MemSQL users are running 7- or 8-way joins and other long-ish SQL statements.
  • But MemSQL doesn’t yet have fully peer-to-peer data redistribution.
    • MemSQL “leaves” only talk to MemSQL “aggregator nodes,” not each other …
    • … but note the plural on “aggregator nodes”, which should immunize MemSQL from the worst of “fat head” bottlenecks.
    • Of course, you can sometimes get join locality by sharding multiple tables on the same key …
    • … or by broadcast-replicating tables that are sufficiently small.
  • Better SQL coverage — e.g. SQL Windowing — is coming soon.
  • MemSQL believes it has an aggressive data skipping story.
  • MemSQL doesn’t yet have a true workload management story; they’re still at the stage “Our queries run so fast not many of them have to be active at once,  and if things nevertheless get too busy we have some throttling capabilities.” But MemSQL at least sounds aware of the difference between that and true workload management, which puts them ahead of some other vendors I talk with.
  • MemSQL doesn’t have stored procedures. In particular, since MemSQL (the product) generates code on the fly, MemSQL (the company) doesn’t think the performance benefits of stored procedure pre-compilation are needed.

And finally, MemSQL’s column-store compression story — which I mangled in a previous post — goes like this:

  • There are numerous compression algorithm choices, both columnar (e.g. dictionary/tokenization, run-length encoding) and block (Lempel-Ziv, I presume in multiple variations).
  • Compression is block-by-block, something I hear more commonly these days than Vertica’s alternative of global compression choices.
  • The choice of compression scheme is automagic for each block, unless you give explicit hints.
  • Default block size for the columnar store is 10 million rows.
Categories: Other

Hardware and storage notes

DBMS2 - Wed, 2014-04-30 20:05

My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:

  • Rack- or data-center-scale systems.
  • The real or imagined demise of Moore’s Law.
  • Flash.

I also got updated as to typical Hadoop hardware.

If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. My most interesting of several mentions of that point was when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)

One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost than one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.

Otherwise, I didn’t gain much new insight into actual flash uptake. Everybody thinks flash is or soon will be very important; but in many segments, folks are trading off disk vs. RAM without worrying much about the intermediate flash alternative.

I visited two Hadoop distribution vendors this trip, namely the ones who are my clients – Cloudera and MapR. I remembered to ask one of them, Cloudera, about typical Hadoop hardware, and got answers that sounded consistent with hardware trends Hortonworks told me about last August. The story is, more or less:

  • The default assumption remains $20-30K/node, 2 sockets, 12 disks. (Edit: See lively price discussion in the comments below.)
  • Most hardware vendors have standard/default Hadoop boxes by now, and in many cases customers just buy what’s on offer.
  • The aforementioned disks sometimes get up to 4 terabytes now.
  • 128GB is now the norm for RAM. 256GB is common. Higher amounts are seen, up to – in rare cases – 2-4 TB.
  • Flash is of interest, but isn’t being demanded much yet. This could change when flash’s storage density matches disk’s.
  • Flash interest is highest for Impala.

Cloudera suggested that the larger amounts of RAM tend to be used when customers frame the need as putting certain analytic datasets entirely in RAM. This rings true to me; there’s lots of evidence that users think that way, and not just in analytic cases. This is probably one of the reasons that they often jump straight from disk to RAM without fully exploring the opportunities of flash.

One last thing — the big cloud vendors are at least considering the use of their own non-Intel chip designs, which might be part of the reason for Intel’s large Hadoop investment.

Categories: Other

The Intel investment in Cloudera

DBMS2 - Wed, 2014-04-30 20:04

Intel recently made a huge investment in Cloudera, stated facts about which start:

  • $740 million …
  • … for 18% of the company …
  • … as part of an overall $900 million round.
  • SEC filings coming soon with more details.

Give or take stock preferences, etc., that’s around a $4.1 billion valuation post-money, but Cloudera does say it now has “most of $1 billion” in the bank.

Cloudera further told me when I visited last Friday that the majority of the Intel investment is net new money. (I presume that the rest of the round is net-new as well.) Hence, I conclude that previous investors sold in the aggregate less than 10% of total holdings to Intel. While I’m pretty sure Mike Olson is buying himself a couple of nice toys, in most respects it’s business-as-usual at Cloudera, with the same investors, directors and managers they had before. By way of contrast, many of the “cashing-out” rumors going around are OBVIOUSLY absurd, unless you think Intel acquired a much larger fraction of Cloudera than it actually did.

That said, Intel spent a lot of money, and in connection with the investment there’s a tight Cloudera/Intel partnership. In particular,

  • Cloudera will move rapidly to make its Hadoop distribution be a superset of Intel’s.
  • When that is accomplished, Intel will get out of the Hadoop distribution business, and Cloudera will take over Intel’s Hadoop customers, which I still think are concentrated in Asia, especially China and India.

Notes on that include:

  • All of the above is planned to take place by the end of 2014.
  • The biggest case of Intel technology superseding/subsuming Cloudera’s will be security. (I.e., Cloudera Sentry will become a subset of Rhino.)
  • Intel’s Hadoop engineers will remain Intel employees.

Medium-term Intel/Cloudera collaboration plans include looking for things to push down to the chip, such as:

  • Encryption/decryption.
  • Compression/decompression.
  • Math.

I didn’t get into that in any detail with Cloudera, and am not sure I’m convinced yet much will come of it.

Finally, Cloudera and Intel will of course talk a lot about adapting to computer architecture opportunities and trends. I expect that part to go well, because Intel has strong relationships of that kind even with software companies it doesn’t own large percentages of.

Categories: Other

Cloudera, Impala, data warehousing and Hive

DBMS2 - Wed, 2014-04-30 20:03

There’s much confusion about Cloudera’s SQL plans and beliefs, and the company has mainly itself to blame. That said, here’s what I think is going on.

  • Hive is good at some tasks and terrible at others.
    • Hive is good at batch data transformation.
    • Hive is bad at ad-hoc query, unless you really, really need Hive’s scale and low license cost. One example, per Eli Collins: Facebook has a 500 petabyte Hive warehouse, but jokes that on a good day an analyst can run 6 queries against it.
  • Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.)
  • Impala is also meant to be good at what Hive is good at, and will someday from Cloudera’s standpoint completely supersede Hive, but Cloudera is in no hurry for that day to arrive. Hive is more mature. Hive still has more SQL coverage than Impala. There’s a lot of legacy investment in Hive. Cloudera gets little business advantage if a customer sunsets Hive.
  • Impala is already decent at some tasks analytic RDBMS are commonly used for. Cloudera insists that some queries run very quickly on Impala. I believe them.
  • Impala is terrible at others, including some of the ones most closely associated with the concept of “data warehousing”.  Data modeling is a big zero right now. Impala’s workload management, concurrency and all that are very immature.
  • There are some use cases for which SQL-on-Hadoop blows away analytic RDBMS, for example ones involving data transformations – perhaps on multi-structured data – that are impractical in RDBMS.

And of course, as vendors so often do, Cloudera generally overrates both the relative maturity of Impala and the relative importance of the use cases in which its offerings – Impala or otherwise – shine.

Related links

Categories: Other

Spark on fire

DBMS2 - Wed, 2014-04-30 04:41

Spark is on the rise, to an even greater degree than I thought last month.

  • Numerous clients and other companies I talk with have adopted Spark, plan to adopt Spark, or at least think it’s likely they will. In particular:
    • A number of analytic-stack companies are joining ClearStory in using Spark. Most of the specifics are confidential, but I hope some will be announced soon.
    • MapR has joined Cloudera in supporting Spark, and indeed — unlike Cloudera — is supporting the full Spark stack.
  • Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez. Spark’s biggest technical advantages as a general data processing engine are probably:
    • The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.)
    • A rich set of programming primitives in connection with that model.
    • Support also for highly-iterative processing, of the kind found in machine learning.
    • Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets).
    • A clever approach to fault-tolerance.
  • Spark is a major contender in streaming.
  • There’s some cool machine-learning innovation using Spark.
  • Spark 1.0 will drop by mid-May, Apache voters willin’ an’ the creek don’ rise. Publicity will likely ensue, with strong evidence of industry support.*

*Yes, my fingerprints are showing again.

The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, as per some email from Databricks CEO Ion Stoica (quoted with permission):

… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.

With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.

In an unfortunate non-development, Tachyon is not (yet?) part of Spark, and so it is hard for a Spark job’s data to be shared with other jobs (Spark or otherwise) or processes. That said:

  • The tight integration of data structures and processes gives similar performance benefits to those of in-process vs. out-of-process in-database analytic functions. (It also of course raises similar stability concerns, but those seem less important in the case of Spark than of a true DBMS.)
  • From a Hadoop vendor’s standpoint, Tachyon’s benefit of not requiring HDFS (Hadoop Distributed File System) isn’t important, and Tachyon somewhat conflicts with a newish effort called HDFS Caching.

A couple of Spark machine learning stories are very cool, in that they involve intra-day retraining of models. The better-known one is Yahoo’s, which in a prototype built in 120 lines of code trains a new model for recommendation of each candidate top-page news story. When I challenged that anecdote, Ion told me about his own former company Conviva, which retrains models every minute to decide which particular source of streaming video each client system will be connected to.

I am generally skeptical of immature SQL efforts, and SparkSQL is no exception. That said, it seems to be going in sensible directions, which should be welcome to those folks who used or were planning to use Shark anyway.

  • SparkSQL actually has its own optimizer, rather than using the inappropriate Hive one. As with many new optimizers, it’s starting out rule-based, but is planned to become cost-based down the road.
  • SparkSQL can run queries against data that’s either inside Spark or outside-but-accessible.
  • SparkSQL can be accessed via Python and other APIs.
  • Spark works with the Hive metastore, nee’ HCatalog.

And finally, there’s no public news as to what Databricks’ own business is. I think that’s a bit silly, but in fairness:

  • The Spark 1.0 launch will consume every bit of marketing bandwidth they have.
  • They don’t yet want to commit to a delivery date of their first offering.
Categories: Other