
“Innovation in Managing the Chaos of Everyday Project Management” is now on YouTube

If you missed Fishbowl’s recent webinar on our new Enterprise Information Portal for Project Management, you can now view a recording of it on YouTube.

 

Innovation in Managing the Chaos of Everyday Project Management discusses our strategy for leveraging the content management and collaboration features of Oracle WebCenter to enable project-centric organizations to build and deploy a project management portal. This solution was designed especially for groups like engineering and construction (E&C) firms and oil and gas companies, which need multiple applications combined into one portal for simple access.

If you’d like to learn more about the Enterprise Information Portal for Project Management, visit our website or email our sales team at sales@fishbowlsolutions.com.

The post “Innovation in Managing the Chaos of Everyday Project Management” is now on YouTube appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

WibiData’s approach to predictive modeling and experimentation

DBMS2 - Tue, 2014-12-16 06:29

A conversation I have too often with vendors goes something like:

  • “That confidential thing you told me is interesting, and wouldn’t harm you if revealed; probably quite the contrary.”
  • “Well, I guess we could let you mention a small subset of it.”
  • “I’m sorry, that’s not enough to make for an interesting post.”

That was the genesis of some tidbits I recently dropped about WibiData and predictive modeling, especially but not only in the area of experimentation. However, Wibi just reversed course and said it would be OK for me to tell more or less the full story, as long as I note that we’re talking about something that’s still in beta test, with all the limitations (to the product and my information alike) that beta implies.

As you may recall:

With that as background, WibiData’s approach to predictive modeling as of its next release will go something like this:

  • There is still a strong element of classical modeling by data scientists/statisticians, with the models re-scored in batch, perhaps nightly.
  • But of course at least some scoring should be done as real-time as possible, to accommodate fresh data such as:
    • User interactions earlier in today’s session.
    • Technology for today’s session (device, connection speed, etc.)
    • Today’s weather.
  • WibiData Express is/incorporates a Scala-based language for modeling and query.
  • WibiData believes Express plus a small algorithm library gives better results than more mature modeling libraries.
    • There is some confirming evidence of this …
    • … but WibiData’s customers have by no means switched over yet to doing the bulk of their modeling in Wibi.
  • WibiData will allow line-of-business folks to experiment with augmentations to the base models.
  • Supporting technology for predictive experimentation in WibiData will include:
    • Automated multi-armed bandit testing (in previous versions even A/B testing has been manual); a sketch of the bandit idea follows this list.
    • A facility for allowing fairly arbitrary code to be included into otherwise conventional model-scoring algorithms, where conventional scoring models can come:
      • Straight from WibiData Express.
      • Via PMML (Predictive Model Markup Language) generated by other modeling tools.
    • An appropriate user interface for the line-of-business folks to do certain kinds of injecting.
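
To make the bandit-testing idea concrete, here is a minimal epsilon-greedy sketch in Python. It is purely illustrative (not WibiData's implementation), and the variant names and conversion rates are made up.

    import random

    class EpsilonGreedyBandit:
        """Minimal epsilon-greedy multi-armed bandit over a set of variants."""
        def __init__(self, arms, epsilon=0.1):
            self.arms = list(arms)
            self.epsilon = epsilon
            self.counts = {a: 0 for a in self.arms}     # times each variant was served
            self.rewards = {a: 0.0 for a in self.arms}  # cumulative reward (e.g. conversions)

        def choose(self):
            # Explore occasionally; otherwise exploit the best-performing variant so far.
            if random.random() < self.epsilon or min(self.counts.values()) == 0:
                return random.choice(self.arms)
            return max(self.arms, key=lambda a: self.rewards[a] / self.counts[a])

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.rewards[arm] += reward

    # Hypothetical variants with made-up conversion rates.
    bandit = EpsilonGreedyBandit(["control", "streamlined_experience"])
    for _ in range(10000):
        arm = bandit.choose()
        converted = 1 if random.random() < {"control": 0.05, "streamlined_experience": 0.07}[arm] else 0
        bandit.update(arm, converted)
    print(bandit.counts)  # traffic should drift toward the better variant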

Let’s talk more about predictive experimentation. WibiData’s paradigm for that is:

  • Models are worked out in the usual way.
  • Businesspeople have reasons for tweaking the choices the models would otherwise dictate.
  • They enter those tweaks as rules.
  • The resulting combination — models plus rules — is executed and hence tested.

If those reasons for tweaking are in the form of hypotheses, then the experiment is a test of those hypotheses. However, WibiData has no provision at this time to automagically incorporate successful tweaks back into the base model.
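
As a rough way to picture the models-plus-rules combination, here is a toy Python sketch. The base model, the rule predicates, and the adjustments are all hypothetical; WibiData's actual mechanism for injecting code into scoring is not shown here.

    def base_model_score(shopper):
        """Stand-in for a score produced by the usual modeling process."""
        return min(1.0, 0.3 + 0.5 * shopper.get("past_purchase_rate", 0.0))

    # Business-entered tweaks, expressed as (predicate, adjustment) rules.
    rules = [
        (lambda s: 11 <= s.get("hour", 0) <= 13, +0.15),   # hypothesis: lunchtime shoppers are hurried
        (lambda s: s.get("device") == "mobile", +0.05),
    ]

    def combined_score(shopper):
        score = base_model_score(shopper)
        for predicate, adjustment in rules:
            if predicate(shopper):
                score += adjustment
        return max(0.0, min(1.0, score))

    # The combination (model plus rules) is what actually gets executed and tested.
    print(combined_score({"past_purchase_rate": 0.2, "hour": 12, "device": "mobile"}))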

What might those hypotheses be like? It’s a little tough to say, because I don’t know in fine detail what is already captured in the usual modeling process. WibiData gave me only one real-life example, in which somebody hypothesized that shoppers would be in more of a hurry at some times of day than others, and hence would want more streamlined experiences when they could spare less time. Tests confirmed that was correct.

That said, I did grow up around retailing, and so I’ll add:

  • Way back in the 1970s, Wal-Mart figured out that in large college towns, clothing in the football team’s colors was wildly popular. I’d hypothesize such a rule at any vendor selling clothing suitable for being worn in stadiums.
  • A news event, blockbuster movie or whatever might trigger a sudden change in/addition to fashion. An alert merchant might guess that before the models pick it up. Even better, she might guess which psychographic groups among her customers were most likely to be paying attention.
  • Similarly, if a news event caused a sudden shift in buyers’ optimism/pessimism/fear of disaster, I’d test a response to that immediately.

Finally, data scientists seem to still be a few years away from neatly solving the problem of multiple shopping personas — are you shopping in your business capacity, or for yourself, or for a gift for somebody else (and what can we infer about that person)? Experimentation could help fill the gap.

Categories: Other

Notes and links, December 12, 2014

DBMS2 - Fri, 2014-12-12 05:05

1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.

For starters:

  • The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
  • If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.

I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:

  • Suppose we have lots of logs about lots of things.* Machine learning can help:
    • Notice what’s an anomaly.
    • Group* together things that seem to be experiencing similar anomalies.
  • That can inform a BI-plus interface for a human to figure out what is happening (a minimal sketch follows the footnote below).

Makes sense to me.

* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
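
To sketch the idea at toy scale, here is one way the anomaly-detection-plus-grouping step might look in Python. The metric, the thresholds and the host names are invented; this is not ScalingData's product.

    from statistics import mean, stdev
    from collections import defaultdict

    # Hypothetical per-host error counts parsed from logs; the last value is "now".
    error_counts = {
        "web-01": [2, 3, 2, 2, 40],
        "web-02": [2, 2, 3, 2, 41],
        "db-01":  [1, 1, 1, 2, 1],
    }

    def is_anomalous(series, z=3.0):
        history = series[:-1]
        sigma = stdev(history) or 1.0          # avoid dividing by zero
        return abs(series[-1] - mean(history)) / sigma > z

    # Step 1: notice anomalies. Step 2: group hosts showing the same behavior.
    groups = defaultdict(list)
    for host, series in error_counts.items():
        groups["anomalous" if is_anomalous(series) else "normal"].append(host)

    print(groups["anomalous"])   # candidates to surface together in a BI-plus interface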

Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.

2. Discussion of graph DBMS can get confusing. For example:

  • Use cases run the gamut from short-request to highly analytic; no graph DBMS is well-suited for all graph use cases.
  • Graph DBMS have huge problems scaling, because graphs are very hard to partition usefully; hence some of the more analytic use cases may not benefit from a graph DBMS at all.
  • The term “graph” has meanings in computer science that have little to do with the problems graph DBMS try to solve, notably directed acyclic graphs for program execution, which famously are at the heart of both Spark and Tez.
  • My clients at Neo Technology/Neo4j call one of their major use cases MDM (Master Data Management), without getting much acknowledgement of that from the mainstream MDM community.

I mention this in part because that “MDM” use case actually has some merit. The idea is that hierarchies such as organization charts, product hierarchies and so on often aren’t actually strict hierarchies. And even when they are, they’re usually strict only at specific points in time; if you care about their past state as well as their present one, a hierarchical model might have trouble describing them. Thus, LDAP (Lightweight Directory Access Protocol) engines may not be an ideal way to manage and reference such “hierarchies”; a graph DBMS might do better.

3. There is a surprising degree of controversy among predictive modelers as to whether more data yields better results. Besides, the most common predictive modeling stacks have difficulty scaling. And so it is common to model against samples of a data set rather than the whole thing.*

*Strictly speaking, almost the whole thing — you’ll often want to hold at least a sample of the data back for model testing.

Well, WibiData’s couple of Very Famous Department Store customers have tested WibiData’s ability to model against an entire database vs. their alternative predictive modeling stacks’ need to sample data. WibiData says that both report significantly better results from training over the whole data set than from using just samples.
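
Here is a hedged sketch of that kind of comparison using scikit-learn on synthetic data. On a toy dataset the gap may be small or nil; the point is only the experimental setup: hold out a test set, then train once on a sample and once on the full training set.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a retailer's behavioral data.
    X, y = make_classification(n_samples=200_000, n_features=40, n_informative=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    def auc_when_training_on(fraction):
        n = int(len(X_train) * fraction)
        model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    print("1% sample :", auc_when_training_on(0.01))
    print("full set  :", auc_when_training_on(1.00))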

4. ScalingData is on the bandwagon for Spark Streaming and Kafka.

5. Derrick Harris and Pivotal turn out to have been earlier than me in posting about Tachyon bullishness.

6. With the Hortonworks deal now officially priced, Derrick was also free to post more about/from Hortonworks’ pitch. Of course, Hortonworks is saying Hadoop will be Big Big Big, and suggesting we should thus not be dismayed by Hortonworks’ financial performance so far. However, Derrick did not cite Hortonworks actually giving any reasons why its competitive position among Hadoop distribution vendors should improve.

Beyond that, Hortonworks says YARN is a big deal, but doesn’t seem to like Spark Streaming.

Categories: Other

A few numbers from MapR

DBMS2 - Wed, 2014-12-10 00:55

MapR put out a press release aggregating some customer information; unfortunately, the release is a monument to vagueness. Let me start by saying:

  • I don’t know for sure, but I’m guessing Derrick Harris was incorrect in suspecting that this release was a reaction to my recent post about Hortonworks’ numbers. For one thing, press releases usually don’t happen that quickly.
  • And as should be obvious from the previous point — notwithstanding that MapR is a client, I had no direct involvement in this release.
  • In general, I advise clients and other vendors to put out the kind of aggregate of customer success stories found in this release. However, I would like to see more substance than MapR offered.

Anyhow, the key statement in the MapR release is:

… the number of companies that have a paid subscription for MapR now exceeds 700.

Unfortunately, that includes OEM customers as well as direct ones; I imagine MapR’s direct customer count is much lower.

In one gesture to numerical conservatism, MapR did indicate by email that it counts by overall customer organization, not by department/cluster/contract (i.e., not the way Hortonworks does).

The MapR press release also said:

As of November 2014, MapR has one or more customers in eight vertical markets that have purchased more than one million dollars of MapR software and services.  These vertical markets are advertising/media, financial services, healthcare, internet, information technology, retail, security, and telecom.

Since the word “each” isn’t in that quote, we don’t even know whether MapR is referring to individual big customers or just general sector penetration. We also don’t know whether the revenue is predominantly subscription or some other kind of relationship.

MapR also indicated that the average customer more than doubled its annualized subscription rate vs. a year ago; the comparable figure — albeit with heavy disclaimers — from Hortonworks was 25%.

Categories: Other

Hadoop’s next refactoring?

DBMS2 - Sun, 2014-12-07 08:59

I believe in all of the following trends:

  • Hadoop is a Big Deal, and here to stay.
  • Spark, for most practical purposes, is becoming a big part of Hadoop.
  • Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.

Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:

  • People would like this to be true, although in most cases only as one of several cluster computing platforms.
  • Hadoop, when viewed as an operating system, is extremely primitive.
  • Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.

There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.

Recall now that Hadoop — like much else in IT — has always been about two things: data storage and program execution. The evolution of Hadoop program execution to date has been approximately:

  • Originally, MapReduce and JobTracker were the way to execute programs in Hadoop, period, at least if we leave HBase out of the discussion.
  • In a major refactoring, YARN replaced a lot of what JobTracker did, with the result that different program execution frameworks became easier to support.
  • Most of the relevant program execution frameworks — such as MapReduce, Spark or Tez — have data movement and temporary storage near their core.

Meanwhile, Hadoop data storage is mainly about HDFS (Hadoop Distributed File System). Its evolution, besides general enhancement, has included the addition of file types suitable for specific kinds of processing (e.g. Parquet and ORC to accelerate analytic database queries). Also, there have long been hacks that more or less bypassed central Hadoop data management, and let data be moved in parallel on a node-by-node basis. But several signs suggest that Hadoop data storage should and will be refactored too. Three efforts in particular point in that direction:

The part of all this I find most overlooked is inter-program data exchange. If two programs both running on Hadoop want to exchange data, what do they do, other than reading and writing to HDFS, or invoking some kind of a custom connector? What’s missing is a nice, flexible distributed memory layer, which:

  • Works well with Hadoop execution engines (Spark, Tez, Impala …).
  • Works well with other software people might want to put on their Hadoop nodes.
  • Interfaces nicely to HDFS, Isilon, object storage, et al.
  • Is fully parallel any time it needs to talk with persistent or external storage.
  • Can be fully parallel any time it needs to talk with any other software on the Hadoop cluster.

Tachyon could, I imagine, become that. HDFS caching probably could not.

In the past, I’ve been skeptical of in-memory data grids. But now I think that such a grid could take Hadoop to the next level of generality and adoption.

Related links

Categories: Other

Notes on the Hortonworks IPO S-1 filing

DBMS2 - Sun, 2014-12-07 07:53

Given my stock research experience, perhaps I should post about Hortonworks’ initial public offering S-1 filing. :) For starters, let me say:

  • Hortonworks’ subscription revenues for the 9 months ended last September 30 appear to be:
    • $11.7 million from everybody but Microsoft, …
    • … plus $7.5 million from Microsoft, …
    • … for a total of $19.2 million.
  • Hortonworks states subscription customer counts (as per Page 55 this includes multiple “customers” within the same organization) of:
    • 2 on April 30, 2012.
    • 9 on December 31, 2012.
    • 25 on April 30, 2013.
    • 54 on September 30, 2013.
    • 95 on December 31, 2013.
    • 233 on September 30, 2014.
  • Per Page 70, Hortonworks’ total September 30, 2014 customer count was 292, including professional services customers.
  • Non-Microsoft subscription revenue in the quarter ended September 30, 2014 seems to have been $5.6 million, or $22.5 million annualized. This suggests Hortonworks’ average subscription revenue per non-Microsoft customer is a little over $100K/year.
  • This IPO looks to be a sharply “down round” vs. Hortonworks’ Series D financing earlier this year.
    • In March and June, 2014, Hortonworks sold stock that subsequently was converted into 1/2 a Hortonworks share each at $12.1871 per share.
    • The tentative top of the offering’s price range is $14/share.
    • That’s also slightly down from the Series C price in mid-2013.

And, perhaps of interest only to me — there are approximately 50 references to YARN in the Hortonworks S-1, but only 1 mention of Tez.

Overall, the Hortonworks S-1 is about 180 pages long, and — as is typical — most of it is boilerplate, minutiae or drivel. As is also typical, two of the most informative sections of the Hortonworks S-1 are:

The clearest financial statements in the Hortonworks S-1 are probably the quarterly figures on Page 62, along with the tables on Pages F3, F4, and F7.

Special difficulties in interpreting Hortonworks’ numbers include:

  • A large fraction of revenue has come from a few large customers, most notably Microsoft. Details about those revenues are further confused by:
    • Difficulty in some cases getting a fix on the subscription/professional services split. (It does seem clear that Microsoft revenues are 100% subscription.)
    • Some revenue deductions associated with stock deals, called “contra-revenue”.
  • Hortonworks changed the end of its fiscal year from April to December, leading to comparisons of a couple of eight-month periods.
  • There was a $6 million lawsuit settlement (some kind of employee poaching/trade secrets case), discussed on Page F-21.
  • There is some counter-intuitive treatment of Windows-related development (cost of revenue rather than R&D).

One weirdness is that cost of professional services revenue far exceeds 100% of such revenue in every period Hortonworks reports. Hortonworks suggests that this is because:

  • Professional services revenue is commonly bundled with support contracts.
  • Such revenue is recognized ratably over the life of the contract, as opposed to a more natural policy of recognizing professional services revenue when the services are actually performed.

I’m struggling to come up with a benign explanation for this.

In the interest of space, I won’t quote Hortonworks’ S-1 verbatim; instead, I’ll just note where some of the more specifically informative parts may be found.

  • Page 53 describes Hortonworks’ typical sales cycles (they’re long).
  • Page 54 says the average customer has increased subscription payments 25% year over year, but emphasizes that the sample size is too small to be reliable.
  • Pages 55-63 have a lot of revenue and expense breakdowns.
  • Deferred revenue numbers (which are a proxy for billings and thus signed contracts) are on Page 65.
  • Pages II 2-3 list all (I think) Hortonworks financings in a concise manner.

And finally, Hortonworks’ dealings with its largest customers and strategic partners are cited in a number of places. In particular:

  • Pages 52-3 cover dealings with Yahoo, Teradata, Microsoft, and AT&T.
  • Pages 82-3 discuss OEM revenue from Hewlett-Packard, Red Hat, and Teradata, none of which amounts to very much.
  • Page 109 covers the Teradata agreement. It seems that there’s less going on than originally envisioned, in that Teradata made a nonrefundable prepayment far greater than turns out to have been necessary for subsequent work actually done. That could produce a sudden revenue spike or else positive revenue restatement as of February, 2015.
  • Page F-10 has a table showing revenue from Hortonworks’ biggest customers (Company A is Microsoft and Company B is Yahoo).
  • Pages F37-38 further cover Hortonworks’ relationships with Yahoo, Teradata and AT&T.

Correction notice: Some of the page numbers in this post were originally wrong, surely because Hortonworks posted an original and amended version of this filing, and I got the two documents mixed up.  A huge Thank You goes to Merv Adrian for calling my attention to this, and I think I’ve now fixed them. I apologize for the errors!

Related links

Categories: Other

Reminder: Fishbowl Solutions Webinar Tomorrow at 1 PM CST

There’s still time to register for the webinar that Fishbowl Solutions and Oracle will be holding tomorrow from 1 PM-2 PM CST! Innovation in Managing the Chaos of Everyday Project Management will feature Fishbowl’s AEC Practice Director Cole Orndorff. Orndorff, who has a great deal of experience with enterprise information portals, said the following about the webinar:

“According to Psychology Today, the average employee can lose up to 40% of their productivity switching from task to task. The number of tasks executed across a disparate set of systems over the lifecycle of a complex project is overwhelming, and in most cases, 20% of each solution is utilized 80% of the time.

I am thrilled to have the opportunity to present on how improving workforce effectiveness can enhance your margins. This can be accomplished by providing a consistent, intuitive user experience across the diverse systems project teams use and by reusing the intellectual assets that already exist in your organization.”

To register for the webinar, visit Oracle’s website. To learn more about Fishbowl’s new Enterprise Information Portal for Project Management, visit our website.

The post Reminder: Fishbowl Solutions Webinar Tomorrow at 1 PM CST appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Thoughts and notes, Thanksgiving weekend 2014

DBMS2 - Sun, 2014-11-30 19:48

I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:

1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:

  • Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
  • Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.

The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.

What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.

2. Three years ago I posted about agile (predictive) analytics. One of the points was:

… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.

Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.

3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with: 

Also, the flashiest application I know of for only-moderately-successful KXEN came when one or more large retailers decided to run separate models for each of thousands of stores.

4. MongoDB, the product, has been refactored to support pluggable storage engines. In connection with that, MongoDB does/will ship with two storage engines – the traditional one and a new one from WiredTiger (but not TokuMX). Both will be equally supported by MongoDB, the company, although there surely are some tiers of support that will get bounced back to WiredTiger.

WiredTiger has the same techie principals as Sleepycat – get the wordplay?! – which was Mike Olson’s company before Cloudera. When asked, Mike spoke of those techies in remarkably glowing terms.

I wouldn’t be shocked if WiredTiger wound up playing the role for MongoDB that InnoDB played for MySQL. What I mean is that there were a lot of use cases for which the MySQL/MyISAM combination was insufficiently serious, but InnoDB turned MySQL into a respectable DBMS.

5. Hadoop’s traditional data distribution story goes something like:

  • Data lives on every non-special Hadoop node that does processing.
  • This gives the advantage of parallel data scans.
  • Sometimes data locality works well; sometimes it doesn’t.
  • Of course, if the output of every MapReduce step is persisted to disk, as is the case with Hadoop MapReduce 1, you might create some of your own data locality …
  • … but Hadoop is getting away from that kind of strict, I/O-intensive processing model.

However, Cloudera has noticed that some large enterprises really, really like to have storage separate from processing. Hence its recent partnership to work with EMC Isilon. Other storage partnerships, as well as a better fit with S3/object storage kinds of environments, are sure to follow, but I have no details to offer at this time.

6. Cloudera’s count of Spark users in its customer base is currently around 60. That includes everything from playing around to full production.

7. Things still seem to be going well at MemSQL, but I didn’t press for any details that I would be free to report.

8. Speaking of MemSQL, one would think that at some point something newer would replace Oracle et al. in the general-purpose RDBMS world, much as Unix and Linux grew to overshadow the powerful, secure, reliable, cumbersome IBM mainframe operating systems. On the other hand:

  • IBM blew away its mainframe competitors and had pretty close to a monopoly. But Oracle has some close and somewhat newer competitors in DB2 and Microsoft SQL Server. Therefore …
  • … upstarts have three behemoths to outdo, not just one.
  • MySQL, PostgreSQL and to some extent Sybase are still around as well.

Also, perhaps no replacement will be needed. If we subdivide the database management world into multiple categories including:

  • General-purpose RDBMS.
  • Analytic RDBMS.
  • NoSQL.
  • Non-relational analytic data stores (perhaps Hadoop-based).

it’s not obvious that the general-purpose RDBMS category on its own requires any new entrants to ever supplant the current leaders.

All that said – if any of the current new entrants do pull off the feat, SAP HANA is probably the best (longshot) guess to do so, and MemSQL the second-best.

9. If you’re a PostgreSQL user with performance or scalability concerns, you might want to check what Citus Data is doing.

Categories: Other

Upcoming Webinar: Innovation in Managing the Chaos of Everyday Project Management

On Thursday, December 4th from 1 PM-2 PM CST, Fishbowl Solutions will hold a webinar in conjunction with Oracle about our new solution for enterprise project management. This solution transforms how project-based tools, like Oracle Primavera, and project assets, such as documents and diagrams, are accessed and shared.

With this solution:

  • Project teams will have access to the most accurate and up-to-date project assets based on their role within a specific project
  • Through a single dashboard, project managers will gain new real-time insight to the overall status of even the most complex projects
  • The new mobile workforce will now have direct access to the same insight and project assets through an intuitive mobile application

With real-time insight and enhanced information sharing and access, this solution can help project teams increase the ability to deliver on time and on budget. To learn more about our Enterprise Information Portal for Project Management, visit Fishbowl’s website.

Fishbowl’s Cole Orndorff, who has 10+ years in the engineering and construction industry, will keynote and share how a mobile-ready portal can integrate project information from Oracle Primavera and other sources to serve information up to users in a personalized, intuitive user experience.

Register here

The post Upcoming Webinar: Innovation in Managing the Chaos of Everyday Project Management appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Technical differentiation

DBMS2 - Sat, 2014-11-15 06:00

I commonly write about real or apparent technical differentiation, in a broad variety of domains. But actually, computers only do a couple of kinds of things:

  • Accept instructions.
  • Execute them.

And hence almost all IT product differentiation fits into two buckets:

  • Easier instruction-giving, whether that’s in the form of a user interface, a language, or an API.
  • Better execution, where “better” usually boils down to “faster”, “more reliable” or “more reliably fast”.

As examples of this reductionism, please consider:

  • Application development is of course a matter of giving instructions to a computer.
  • Database management systems accept and execute data manipulation instructions.
  • Data integration tools accept and execute data integration instructions.
  • System management software accepts and executes system management instructions.
  • Business intelligence tools accept and execute instructions for data retrieval, navigation, aggregation and display.

Similar stories are true about application software, or about anything that has an API (Application Programming Interface) or SDK (Software Development Kit).

Yes, all my examples are in software. That’s what I focus on. If I wanted to be more balanced in including hardware or data centers, I might phrase the discussion a little differently — but the core points would still remain true.

What I’ve said so far should make more sense if we combine it with the observation that differentiation is usually restricted to particular domains. I mean several different things by that last bit. First, most software only purports to do a limited class of things — manage data, display query results, optimize analytic models, manage a cluster, run a payroll, whatever. Even beyond that, any inherent superiority is usually restricted to a subset of potential use cases. For example:

  • Relational DBMS presuppose that data fits well (enough) into tabular structures. Further, most RDBMS differentiation is restricted to a further subset of such cases; there are many applications that don’t require — for example — columnar query selectivity or declarative referential integrity or Oracle’s elite set of security certifications.
  • Some BI tools are great for ad-hoc navigation. Some excel at high-volume report displays, perhaps with a particular flair for mobile devices. Some are learning how to query non-tabular data.
  • Hadoop, especially in its early days, presupposed data volumes big enough to cluster and application models that fit well with MapReduce.
  • A lot of distributed computing aids presuppose particular kinds of topologies.

A third reason for technical superiority to be domain-specific is that advantages are commonly coupled with drawbacks. Common causes of that include:

  • Many otherwise-advantageous choices strain hardware budgets. Examples include:
    • Robust data protection features (most famously RAID and two-phase commit)
    • Various kinds of translation or interpretation overhead.
  • Yet other choices are good for some purposes but bad for others. It’s fastest to write data in the exact way it comes in, but then it would be slow to retrieve later on.
  • Innovative technical strategies are likely to be found in new products that haven’t had time to become mature yet.

And that brings us to the main message of this post: Your spiffy innovation is important in fewer situations than you would like to believe. Many, many other smart organizations are solving the same kinds of problems as you; their solutions just happen to be effective in somewhat different scenarios than yours. This is especially true when your product and company are young. You may eventually grow to cover a broad variety of use cases, but to get there you’ll have to more or less match the effects of many other innovations that have come along before yours.

When advising vendors, I tend to think in terms of the layered messaging model, and ask the questions:

  • Which of your architectural features gives you sustainable advantages in features or performance?
  • Which of your sustainable advantages in features or performance provides substantial business value in which use cases?

Closely connected are the questions:

  • What lingering disadvantages, if any, does your architecture create?
  • What maturity advantages do your competitors have, and when (if ever) will you be able to catch up with them?
  • In which use cases are your disadvantages important?

Buyers and analysts should think in such terms as well.

Related links

Daniel Abadi, who is now connected to Teradata via their acquisition of Hadapt, put up a post promoting some interesting new features of theirs. Then he tweeted that this was an example of what I call Bottleneck Whack-A-Mole. He’s right. But since much of his theme was general praise of Teradata’s mature DBMS technology, it would also have been accurate to reference my post about The Cardinal Rules of DBMS Development.

Categories: Other

Enterprise Libraries: The Next Iteration of WebCenter Folders

This blog post was written by Matt Rudd, Enterprise Support Team Lead at Fishbowl Solutions. Matt has participated in multiple WebCenter 11g upgrades during his time with Fishbowl, and recently developed a solution for an issue he ran into frequently while performing upgrades.

With the release of the ADF Content UI for WebCenter Content, it has become clear that the long-term road map for folder-based storage within WebCenter Content is based on enterprise libraries. The new Content UI only allows you to browse content contained within these libraries, which are top-level “buckets” of logically grouped content for your enterprise. However, content (i.e. files) cannot be added directly under an enterprise library. One or more folders must be added under an enterprise library, and then files can be directly added to the folders. The enterprise libraries container can also be viewed via the legacy WebCenter Content UI by navigating to Browse Content->Folders, as shown below.

WebCenter Content screenshot

In order to use the ADF Content UI with existing folders and content, they need to be migrated to enterprise libraries via the Move command on a folder’s action menu.

WebCenter 11g screenshot

For customers that have been using folder-based storage within WebCenter Content for a number of years, this migration can be especially difficult. Changes involving special characters and double spaces have presented problems, as well as other more challenging issues. The most challenging has been the nondescript error message “Unable to update the content item information for MyContent.” This error message pops up repeatedly for content that is not in workflow, is in Released status, has no other content-related errors of any kind, and to which the moving user has full admin permissions. In addition, the content can be moved individually without issue, but not as part of a Framework Folders to enterprise libraries migration.

During the course of alleviating errors for a successful enterprise libraries migration, we discovered that if we copied the folders but moved the content, we were able to successfully migrate the majority of the content while still being able to clean up special character issues as necessary. In order to do this efficiently for large folder structures, the process needed to be automated.

Rather than building a custom component, we opted to build a custom RIDC application to recursively copy all, or any portion of, a folder structure from one parent to another while moving the content to the newly copied destination. This flexibility, along with ensuring that duplicate folders were not created in the destination folder structure, allowed us to run the application as many times as necessary. If a folder failed to move due to an issue (e.g. a disallowed special character), the folder name could be changed and the application could be re-run to recursively process only that folder. The number of content items under a particular level of the folder structure was verified with database queries to ensure all content was moved before deleting the old, and now empty, folder structure.

This iterative process allowed us to migrate approximately 50,000 folders containing 400,000 content items in about 15 hours. However, this was after rigorously testing the content migration in development and staging environments to alleviate as many content and folder issues as possible prior to the go-live migration. The RIDC application used no custom services of any kind and relied solely on those provided by core WebCenter Content and the Framework Folders component.

 

 

The post Enterprise Libraries: The Next Iteration of WebCenter Folders appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Notes on predictive modeling, November 2, 2014

DBMS2 - Sun, 2014-11-02 05:49

Following up on my notes on predictive modeling post from three weeks ago, I’d like to tackle some areas of recurring confusion.

Why are we modeling?

Ultimately, there are two reasons to model some aspect of your business:

  • You generally want insight and understanding.
    • This is analogous to why you might want to do business intelligence.
    • It commonly includes a search for causality, whether or not “root cause analysis” is exactly the right phrase to describe the process.
  • You want to do calculations from the model to drive wholly or partially automated decisions.
    • A big set of examples can be found in website recommenders and personalizers.
    • Another big set of examples can be found in marketing campaigns.
    • For an example of partial automation, consider a tool that advises call center workers.

How precise do models need to be?

Use cases vary greatly with respect to the importance of modeling precision. If you’re doing an expensive mass mailing, 1% additional accuracy is a big deal. But if you’re doing root cause analysis, a 10% error may be immaterial.
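
A toy calculation, with entirely made-up numbers, shows why: in a mass mailing even a small relative lift flows straight to profit, whereas in root-cause analysis a 10% error rarely changes which cause you investigate first.

    # Hypothetical mailing: 1,000,000 pieces at $0.60 each, $50 margin per response,
    # 2% baseline response rate. All figures are invented for illustration.
    pieces, cost_per_piece, margin, base_rate = 1_000_000, 0.60, 50.0, 0.02

    def profit(response_rate):
        return pieces * response_rate * margin - pieces * cost_per_piece

    baseline = profit(base_rate)              # $400,000
    improved = profit(base_rate * 1.01)       # 1% relative lift in targeting accuracy
    print(improved - baseline)                # about $10,000 of additional profit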

Who is doing the work?

It is traditional to have a modeling department of “data scientists” or SAS programmers, as the case may be. While it seems cool to put predictive modeling straight in the hands of business users — some business users, at least — it’s rare for them to use predictive modeling tools more sophisticated than Excel. For example, KXEN never did all that well.

That said, I support the idea of putting more modeling in the hands of business users. Just be aware that doing so is still a small business at this time.

“Operationalizing” predictive models

The topic of “operationalizing” models arises often, and it turns out to be rather complex. Usually, to operationalize a model, you need:

  • A program that generates scores, based on the model.
  • A program that consumes scores (for example a recommender or fraud alerter).

In some cases, the two programs might be viewed as different modules of the same system.

While it is not actually necessary for there to be a numerical score — or scores — in the process, it seems pretty common that there are such. Certainly the score calculations can create a boundary for loose-coupling between model evaluation and the rest of the system.
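
A minimal Python sketch of that loose coupling might look like the following. The model, the score format and the threshold are all hypothetical; the point is just that one program writes scores and another consumes them.

    import json

    def score_customers(model, customers):
        """Program #1: generate scores from the model (e.g. a nightly batch job)."""
        return [{"customer_id": c["id"], "score": model(c)} for c in customers]

    def alert_on_scores(scores, threshold=0.8):
        """Program #2: consume scores (e.g. a fraud alerter or recommender)."""
        return [s["customer_id"] for s in scores if s["score"] >= threshold]

    # Stand-in model; in practice the scoring logic might come from PMML or run in-database.
    toy_model = lambda c: min(1.0, c["disputed_charges"] / 5.0)

    customers = [{"id": 1, "disputed_charges": 4}, {"id": 2, "disputed_charges": 0}]
    scores = score_customers(toy_model, customers)

    # The serialized scores are the loose-coupling boundary between the two programs.
    print(alert_on_scores(json.loads(json.dumps(scores))))   # -> [1]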

That said:

  • Sometimes the scoring is done on the fly. In that case, the two programs mentioned above are closely integrated.
  • Sometimes the scoring is done in batch. In that case, loose coupling seems likely. Often, there will be ETL (Extract/Transform/Load) to make the scores available to the program that will eventually use them.
  • PMML (Predictive Model Markup Language) is good for some kinds of scoring but not others. (I’m not clear on the details.)

In any case, operationalizing a predictive model can or should include:

  • A process for creating the model.
  • A process for validating and refreshing the model.
  • A flow of derived data.
  • A program that consumes the model’s outputs.

Traditional IT considerations, such as testing and versioning, apply.

What do we call it anyway?

The term “predictive analytics” was coined by SPSS. It basically won. However, some folks — including whoever named PMML — like the term “predictive modeling” better. I’m in that camp, since “modeling” seems to be a somewhat more accurate description of what’s going on, but I’m fine with either phrase.

Some marketers now use the term “prescriptive analytics”. In theory that makes sense, since:

  • “Prescriptive” can be taken to mean “operationalized predictive”, saving precious syllables and pixels.
  • What’s going on is usually more directly about prescription than prediction anyway.

Edit: Ack! I left the final paragraph out of the post, namely:

In practice, however, the term “prescriptive analytics” is a strong indicator of marketing nonsense. Predictive modeling has long been used to — as it were — prescribe business decisions; marketers who use the term “prescriptive analytics” are usually trying to deny that very obvious fact.

Categories: Other

Analytics for lots and lots of business users

DBMS2 - Sun, 2014-11-02 05:45

A common marketing theme in the 2010s decade has been to claim that you make analytics available to many business users, as opposed to your competition, who only make analytics available to (pick one):

  • Specialists (with “PhD”s).
  • Fewer business users (a thinner part of the horizontally segmented pyramid — perhaps inverted — on your marketing slide, not to be confused with the horizontally segmented pyramids — perhaps inverted — on your competition’s marketing slides).

Versions of this claim were also common in the 1970s, 1980s, 1990s and 2000s.

Some of that is real. In particular:

  • Early adoption of analytic technology is often in line-of-business departments.
  • Business users on average really do get more numerate over time, my three favorite examples of that being:
    • Statistics is taught much more in business schools than it used to be.
    • Statistics is taught much more in high schools than it used to be.
    • Many people use Excel.

Even so, for most analytic tools, power users tend to be:

  • People with titles or roles like “business analyst”.
  • More junior folks pulling things together for their bosses.
  • A hardcore minority who fall into neither of the first two categories.

Asserting otherwise is rarely more than marketing hype.

Related link

Categories: Other

Datameer at the time of Datameer 5.0

DBMS2 - Sun, 2014-10-26 02:42

Datameer checked in, having recently announced general availability of Datameer 5.0. So far as I understood, Datameer is still clearly in the investigative analytics business, in that:

  • Datameer does business intelligence, but not at human real-time speeds. Datameer query durations are sometimes sub-minute, but surely not sub-second.
  • Datameer also does lightweight predictive analytics/machine learning — k-means clustering, decision trees, and so on.

Key aspects include:

  • Datameer runs straight against Hadoop.
  • Like many other analytic offerings, Datameer is meant to be “self-service”, for line-of-business business analysts, and includes some “data preparation”. Datameer also has had some data profiling since Datameer 4.0.
  • The main way of interacting with Datameer seems to be visual analytic programming. However, I gather Datameer has evolved somewhat away from its original spreadsheet metaphor.
  • Datameer’s primitives resemble those you’d find in SQL (e.g. JOINs, GROUPBYs). More precisely, that would be SQL with a sessionization extension; e.g., there’s a function called GROUPBYGAP. (A sketch of gap-based sessionization follows this list.)
  • Datameer lets you write derived data back into Hadoop.
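
For what a gap-based sessionization primitive does, here is a generic Python sketch; it is not Datameer's GROUPBYGAP implementation, and the 30-minute gap is just a conventional default.

    from datetime import datetime, timedelta

    def sessionize(timestamps, gap=timedelta(minutes=30)):
        """Split a stream of event times into sessions whenever the
        gap to the previous event exceeds the threshold."""
        sessions, current = [], []
        for ts in sorted(timestamps):
            if current and ts - current[-1] > gap:
                sessions.append(current)
                current = []
            current.append(ts)
        if current:
            sessions.append(current)
        return sessions

    clicks = [datetime(2014, 10, 26, 9, 0), datetime(2014, 10, 26, 9, 10),
              datetime(2014, 10, 26, 14, 0)]
    print(len(sessionize(clicks)))   # 2 sessions; the long midday gap splits them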

Datameer use cases sound like the usual mix, consisting mainly of a lot of customer analytics, a bit of anti-fraud, and some operational analytics/internet-of-things. Datameer claims 200 customers and 240 installations, the majority of which are low-end/single-node users, but at least one of which is a multi-million dollar relationship. I don’t think those figures include OEM sell-through. I forgot to ask for any company size metrics, such as headcount.

In a chargeable add-on, Datameer 5.0 has an interesting approach to execution. (The lower-cost version just uses MapReduce.)

  • An overall task can of course be regarded as a DAG (Directed Acyclic Graph).
  • Datameer automagically picks an execution strategy for each node. Administrator hints are allowed. (A toy sketch of this kind of per-node choice appears after the notes below.)
  • There are currently three choices for execution: MapReduce, clustered in-memory, or single-node. This all works over Tez and YARN.
  • Spark is a likely future option.

Datameer calls this “Smart Execution”. Notes on Smart Execution include:

  • Datameer sees a lot of tasks that look at 10-100 megabytes of data, especially in malware/anomaly detection. Datameer believes there can be a huge speed-up from running those on a single node rather than in a clustered mode that requires data (re)distribution, with at least one customer reporting >20X speedup of at least one job.
  • Yes, each step of the overall DAG might look to the underlying execution engine as a DAG of its own.
  • Tez can fire up processes ahead of when they’re needed, so you don’t have to wait for all the process start-up delays in series.
  • Datameer had a sampling/preview engine from the get-go that ran outside of Hadoop MapReduce. That’s the basis for the non-MapReduce options now.
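
Purely as a way to picture per-node engine selection, here is a toy sketch. The thresholds, engine names and DAG are invented; the real product presumably uses a richer cost model plus the administrator hints mentioned above.

    SINGLE_NODE_LIMIT = 100 * 1024 ** 2    # ~100 MB: small enough to skip redistribution
    IN_MEMORY_LIMIT = 50 * 1024 ** 3       # ~50 GB: assumed to fit the cluster's memory

    def choose_engine(node, hint=None):
        if hint:
            return hint                    # administrator hints win
        size = node["estimated_input_bytes"]
        if size <= SINGLE_NODE_LIMIT:
            return "single-node"
        if size <= IN_MEMORY_LIMIT:
            return "clustered-in-memory"
        return "mapreduce"

    dag = [{"name": "parse_clickstream", "estimated_input_bytes": 20 * 1024 ** 2},
           {"name": "join_full_history", "estimated_input_bytes": 2 * 1024 ** 4}]
    print({node["name"]: choose_engine(node) for node in dag})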

Strictly from a BI standpoint, Datameer seems clunky.

  • Datameer doesn’t have drilldown.
  • Datameer certainly doesn’t let you navigate from one visualization to the next ala QlikView/Tableau/et al. (Note to self: I really need to settle on a name for that feature.)
  • While Datameer does have a bit in the way of event series visualization, it seems limited.
  • Of course, Datameer doesn’t have streaming-oriented visualizations.
  • I’m not aware of any kind of text search navigation.

Datameer does let you publish BI artifacts, but doesn’t seem to have any collaboration features beyond that.

Last and also least: In an earlier positioning, Datameer made a big fuss about an online app store. Since analytic app stores never amount to much, I scoffed.* That said, they do have it, so I asked which apps got the most uptake. Most of them seem to be apps which boil down to connectors, access to outside data sets, and/or tutorials. Also mentioned were two more substantive apps, one for path-oriented clickstream analysis, and one for funnel analysis combining several event series.

*I once had a conversation with a client that ended:

  • “This app store you’re proposing will not be a significant success.”
  • “Are you sure?”
  • “Almost certain. It really just sounds like StreamBase’s.”
  • “I’m not familiar with StreamBase’s app store.”
  • “My point exactly.”
Categories: Other

The Benefits of Integrating a Google Search Appliance with an Oracle WebCenter or Liferay Portal

This month, the Fishbowl team presented two webinars on integrating a Google Search Appliance with a WebCenter or Liferay Portal. Our new product, the GSA Portal Search Suite, makes integration simple and also allows for customization to create a seamless, secure search experience. It brings a powerful, Google-like search experience directly to your portal.

The first webinar, “The Benefits of Google Search for your Oracle WebCenter or Liferay Portal”, focused on the Google Search Appliance and the positive experiences users have had with incorporating Google search in the enterprise.

 

The second webinar, “Integrating the Google Search Appliance with a WebCenter or Liferay Portal”, dove deeper into the GSA Portal Search Suite and how it improves the integration process.

 

The following is a list of questions and answers from the webinar series. If you have any other questions, please feel free to reach out to the Fishbowl team!

Q. What version of SharePoint does this product work with?

A. This product is not designed to work with SharePoint. Google has a SharePoint connector that indexes content from SharePoint and pulls it into the GSA, and then the GSA Portal Search Suite would allow any of that content to be served up in your portal.

Fishbowl also has a product called SharePoint Connector that connects SharePoint with Oracle WebCenter Content.

Q. Is Fishbowl a reseller of the GSA? Where can I get a GSA?

A. Yes, we sell the GSA, as well as add-on products and consulting services for the GSA. Visit our website for more information about our GSA services.

Q. What is the difficulty level of customizing the XSLT front end? How long would it take to roll out?

A. This will depend on what you’re trying to customize. If it’s just colors, headers, etc., you could do it pretty quickly because the difficulty level is fairly low. If you’re looking at doing a full-scale customization and entirely changing the look and feel, that could take a lot longer – I would say upwards of a month. The real challenge is that there isn’t a lot of documentation from Google on how to do it, so you would have to do a lot of experimentation.

One of the reasons we created this product is because most customers haven’t been able to fully customize their GSA with a portal, partly because Google didn’t design it to be customizable in this way.

Q. What versions of Liferay does this product support?

A. It supports version 6.2. If you have another version you’d like to integrate with, you can follow up with our team and we can discuss the possibility of working with other versions.

Q. Do you have a connector for IBM WCM?

A. Fishbowl does not have a connector, but Google has a number of connectors that can integrate with many different types of software.

Q. Are you talking about WebCenter Portal or WCM?

A. This connector is designed for WebCenter Portal. If you’re talking about WCM as in SiteStudio or WebCenter Content, we have done a number of projects with those programs. This particular product wouldn’t apply to those situations, but we have other connectors that would work with programs such as WebCenter Content.

Q. Where is the portlet deployed? Is it on the same managed node?

A. The portlets are deployed on the portlet server in WebCenter Portal.

Q. Where can we get the documentation for this product?

A. While the documentation is not publicly available, we do have a product page on the website that includes a lot of information on the Portal Search Suite. Contact your Fishbowl representative if you’d like to learn more about it.

Q. What are the server requirements?

A. WebCenter Portal 11g or Liferay 6.2 and Google Search Appliance 7.2.

Q. Does this product include the connector for indexing content?

A. No, this product does not include a connector. We do have a product called GSA Connector for WebCenter that indexes content and then allows you to integrate that content with a portal. Depending on how your portal is configured, you could also crawl the portal just like you would in a regular website. However, this product focuses exclusively on serving and not on indexing.

Q. How many portals will a GSA support? I have several WebCenter Content domains on the same server.

A. The GSA is licensed according to number of content items, not number of sources. You purchase a license for a certain number of content items and then it doesn’t matter how many domains the content is coming from.

The post The Benefits of Integrating a Google Search Appliance with an Oracle WebCenter or Liferay Portal appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Is analytic data management finally headed for the cloud?

DBMS2 - Wed, 2014-10-22 02:48

It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:

  • Amazon Redshift appears to be prospering.
  • So are some SaaS (Software as a Service) business intelligence vendors.
  • Amazon Elastic MapReduce is still around.
  • Snowflake Computing launched with a cloud strategy.
  • Cazena, with vague intentions for cloud data warehousing, destealthed.*
  • Cloudera made various cloud-related announcements.
  • Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
  • The general argument for cloud-or-at-least-colocation has compelling aspects.
  • Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.

Also — although the specifics on this are generally vague and/or confidential — I sense a narrowing of the gap between:

  • The hardware + networking required for performant analytic data management.
  • The hardware + networking available in the cloud.

*Cazena is proud of its team of advisors. However, the only person yet announced for a Cazena operating role is Prat Moghe, and his time period in Netezza’s mainstream happens not to have been one in which Netezza had much technical or market accomplishment.

On the other hand:

  • If you have processing power very close to the data, then you can avoid a lot of I/O or data movement. Many cloud configurations do not support this.
  • Many optimizations depend upon controlling or at least knowing the hardware and networking set-up. Public clouds rarely offer that level of control.

And so I’m still more confident in SaaS/colocation analytic data management, or in Redshift, than I am in true arm’s-length cloud-based systems.

Categories: Other

Snowflake Computing

DBMS2 - Wed, 2014-10-22 02:45

I talked with the Snowflake Computing guys Friday. For starters:

  • Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
  • The Snowflake DBMS is built from scratch (as opposed to, for example, being based on PostgreSQL or Hadoop).
  • The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
  • Snowflake claims excellent SQL coverage for a 1.0 product.
  • Snowflake, the company, has:
    • 50 people.
    • A similar number of current or past users.
    • 5 referenceable customers.
    • 2 techie founders out of Oracle, plus Marcin Zukowski.
    • Bob Muglia as CEO.

Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*

*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.

In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:

  • Ingest formats are JSON, XML or AVRO for now.
  • I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves.

I don’t know enough details to judge whether I’d call that an example of schema-on-need.

A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather:

  • A lateral view is something like a join on a table function, inner or outer join as the case may be.
  • “Lateral view” is an Oracle term, while “Cross apply” is the term for the same thing in Microsoft SQL Server.
  • Lateral views are one of the ways of making SQL handle hierarchical data structures (others evidently are WITH and CONNECT BY).

Lateral views seem central to how Snowflake handles nested data structures. I presume Snowflake also uses or plans to use them in more traditional ways (subqueries, table functions, and/or complex FROM clauses).

If anybody has a good link explaining lateral views, please be so kind as to share! Elementary googling isn’t turning much up, and the Snowflake folks didn’t send over anything clearer than this and this.
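
For what it’s worth, here is one rough way to picture a lateral view combined with a flatten/explode-style table function, written as Python rather than SQL. Each nested element is joined back to its parent row’s columns, and the outer flag mimics the inner/outer join distinction. This is a generic illustration, not Snowflake’s exact semantics.

    # Rows with scalar columns plus a nested array of sub-records.
    orders = [
        {"order_id": 1, "customer": "a", "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]},
        {"order_id": 2, "customer": "b", "items": []},
    ]

    def lateral_flatten(rows, array_col, outer=False):
        for row in rows:
            elements = row[array_col] or ([None] if outer else [])
            for element in elements:
                flat = {k: v for k, v in row.items() if k != array_col}
                flat["element"] = element        # one output row per nested element
                yield flat

    print(list(lateral_flatten(orders, "items")))               # inner: order 2 drops out
    print(list(lateral_flatten(orders, "items", outer=True)))   # outer: order 2 kept, element=None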

Highlights of Snowflake’s cloud/elastic/simple/inexpensive story include:

  • Snowflake’s product is SaaS-only for the foreseeable future.
  • Data is stored in compressed 16 megabyte files on Amazon S3, and pulled into Amazon EC2 servers for query execution on an as-needed basis. Allegedly …
  • … this makes data storage significantly cheaper than it would be in, for example, an Amazon version of HDFS (Hadoop Distributed File System).
  • When you fire up Snowflake, you get a “virtual data warehouse” across one or more nodes. You can have multiple “virtual data warehouses” accessing identical or overlapping sets of data. Each of these “virtual data warehouses” has a physical copy of the data; i.e., this is not related to the Oliver Ratzesberger concept of a virtual data mart defined by workload management.
  • Snowflake has no indexes. It does have zone maps, aka data skipping; a toy sketch of the idea follows this list. (Speaking of simple/inexpensive, both those aspects remind me of Netezza.)
  • Snowflake doesn’t distribute data on any kind of key. I.e. it’s round-robin. (I think that’s accurate; they didn’t have time to get back to me and confirm.)
  • This is not an in-memory story. Data pulled onto Snowflake’s EC2 nodes will commonly wind up in their local storage.
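
As for zone maps/data skipping, here is a minimal Python sketch of the idea: keep min/max statistics per file, and scan only the files whose range overlaps a query’s predicate. The per-file statistics and file names below are my own illustrative assumptions, not Snowflake specifics.

```python
# Minimal sketch of zone maps / data skipping: keep min/max statistics per
# file, and scan only files whose range overlaps the query predicate. The
# file names and statistics are my illustrative assumptions, not Snowflake's.

files = [
    {"name": "part-0001", "min_date": "2014-01-01", "max_date": "2014-03-31"},
    {"name": "part-0002", "min_date": "2014-04-01", "max_date": "2014-06-30"},
    {"name": "part-0003", "min_date": "2014-07-01", "max_date": "2014-09-30"},
]

def files_to_scan(files, lo, hi):
    """Return only the files whose [min, max] zone overlaps the predicate [lo, hi]."""
    return [f["name"] for f in files
            if not (f["max_date"] < lo or f["min_date"] > hi)]

# e.g. a predicate like: WHERE order_date BETWEEN '2014-05-15' AND '2014-08-01'
print(files_to_scan(files, "2014-05-15", "2014-08-01"))
# ['part-0002', 'part-0003'] -- part-0001 is never read
```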

Snowflake pricing is based on the sum of:

  • Per EC2 server-hour, for a couple classes of node.
  • Per S3 terabyte-month of compressed storage.

Right now the cheaper class of EC2 node uses spinning disk, while the more expensive uses flash; soon they’ll both use flash.

DBMS 1.0 versions are notoriously immature, but Snowflake seems — or at least seems to think it is — further ahead than is typical.

  • Snowflake’s optimizer is fully cost-based.
  • Snowflake thinks it has strong SQL coverage, including a large fraction of SQL 2003 Analytics. Apparently Snowflake has run every TPC-H and TPC-DS query in-house, except that one TPC-DS query relied on a funky rewrite or something like that.
  • Snowflake bravely thinks that it’s licked concurrency from Day 1; you just fire up multiple identical virtual DWs if needed to handle the query load. (Note: The set of Version 1 DBMS without concurrent-usage bottlenecks has cardinality very close to 0.)
  • Similarly, Snowflake encourages you to fire up a separate load-only DW instance, and load mainly through trickle feeds.
  • Snowflake’s SaaS-only deployment obviates — or at least obscures :) — a variety of management, administration, etc. features that often are lacking in early DBMS releases.

Other DBMS technology notes include:

  • Compression is columnar (various algorithms, including file-at-a-time dictionary/token); a toy sketch of dictionary encoding follows this list.
  • Joins and other database operations are performed on compressed data. (Yay!)
  • Those 16-megabyte files are column-organized and immutable. This strongly suggests which kinds of writes can or can’t be done efficiently. :) Note that adding a column — perhaps of derived data — is one of the things that could go well.
  • There’s some kind of conflict resolution if multiple virtual DWs try to write the same records — but as per the previous point, the kinds of writes for which that’s an issue should be rare anyway.
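
To illustrate the dictionary/token point, and the general idea of operating on compressed data, here is a toy Python sketch. It shows one column compressed file-at-a-time with a dictionary, and an equality predicate evaluated directly against the integer tokens. The details of Snowflake’s actual implementation are not public, so treat this purely as a sketch of the concept.

```python
# Toy sketch of file-at-a-time dictionary/token compression for one column,
# plus an equality filter evaluated directly on the integer tokens. This is
# the general concept only; Snowflake's actual implementation isn't public.

column = ["US", "US", "DE", "FR", "US", "DE"]

# Build one dictionary for the whole file, then store small integer tokens.
dictionary = sorted(set(column))                  # ['DE', 'FR', 'US']
code_of = {value: code for code, value in enumerate(dictionary)}
tokens = [code_of[value] for value in column]     # [2, 2, 0, 1, 2, 0]

# WHERE country = 'US': translate the literal to a token once, then compare
# integers, without ever decompressing the column back to strings.
target = code_of["US"]
matching_rows = [i for i, tok in enumerate(tokens) if tok == target]
print(matching_rows)  # [0, 1, 4]
```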

In the end, a lot boils down to how attractive Snowflake’s prices wind up being. What I can say now is:

  • I don’t actually know Snowflake’s pricing …
  • … nor the amount of work it can do per node.
  • It’s hard to imagine that pulling data from S3 into EC2 at query time is going to give great performance. So Snowflake is more likely to do well when whatever parts of the database wind up being “cached” in the flash of the EC2 servers suffice to answer most queries.
  • In theory, Snowflake could offer aggressive loss-leader pricing for a while. But nobody should make a major strategic bet on Snowflake’s offerings unless it shows it has a sustainable business model.
Categories: Other

Cloudera’s announcements this week

DBMS2 - Thu, 2014-10-16 09:05

This week being Hadoop World, Cloudera naturally put out a flurry of press releases. In anticipation, I put out a context-setting post last weekend. That said, the gist of the news seems to be:

  • Cloudera continued to improve various aspects of its product line, especially Impala with a Version 2.0. Good for them. One should always be making one’s products better.
  • Cloudera announced a variety of partnerships with companies one would think are opposed to it. Not all are Barney. I’m now hard-pressed to think of any sustainable-looking relationship advantage Hortonworks has left in the Unix/Linux world. (However, I haven’t heard a peep about any kind of Cloudera/Microsoft/Windows collaboration.)
  • Cloudera is getting more cloud-friendly, via a new product — Cloudera Director. Probably there are or will be some cloud-services partnerships as well.

Notes on Cloudera Director start:

  • It’s closed-source.
  • Code and support are included in any version of Cloudera Enterprise.
  • It’s a management tool. Indeed, Cloudera characterized it to me as a sort of manager of Cloudera Managers.

What I have not heard is any answer for the traditional performance challenge of Hadoop-in-the-cloud, which is:

  • Hadoop, like most analytic RDBMS, tightly couples processing and storage in a shared-nothing way.
  • Standard cloud architectures, however, decouple them, thus mooting a considerable fraction of Hadoop performance engineering.

Maybe that problem isn’t — or is no longer — as big a deal as I’ve been told.

Categories: Other

Context for Cloudera

DBMS2 - Mon, 2014-10-13 02:02

Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which given Cloudera’s Hadoop market stature is in my opinion an interesting thing to analyze.

So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):

  • CDH (Cloudera Distribution … Hadoop) contains a lot of Apache open source code.
  • Cloudera has a much longer list of Apache projects that it thinks comprise “Core Hadoop” than, say, Hortonworks does.
    • Specifically, that list currently is: Hadoop, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sentry, Sqoop, Whirr, ZooKeeper.
    • In addition to those projects, CDH also includes HBase, Impala, Spark and Cloudera Search.
  • Cloudera Manager is closed-source code, much of which is free to use. (I.e., “free like beer” but not “free like speech”.)
  • Cloudera Navigator is closed-source code that you have to pay for (free trials and the like excepted).
  • Cloudera Express is Cloudera’s favorite free subscription offering. It combines CDH with the free part of Cloudera Manager. Note: Cloudera Express was previously called Cloudera Standard, and that terminology is still reflected in parts of Cloudera’s website.
  • Cloudera Enterprise is the umbrella name for Cloudera’s three favorite paid offerings.
  • Cloudera Enterprise Basic Edition contains:
    • All the code in CDH and Cloudera Manager, and I guess Accumulo code as well.
    • Commercial licenses for all that code.
    • A license key to use the entirety of Cloudera Manager, not just the free part.
    • Support for the “Core Hadoop” part of CDH.
    • Support for Cloudera Manager. Note: Cloudera is lazy about saying this explicitly, but it seems obvious.
    • The code for Cloudera Navigator, but that’s moot, as the corresponding license key for Cloudera Navigator is not part of the package.
  • Cloudera Enterprise Data Hub Edition contains:
    • Everything in Cloudera Basic Edition.
    • A license key for Cloudera Navigator.
    • Support for all of HBase, Accumulo, Impala, Spark, Cloudera Search and Cloudera Navigator.
  • Cloudera Enterprise Flex Edition contains everything in Cloudera Basic Edition, plus support for one of the extras in Data Hub Edition.

In analyzing all this, I’m focused on two particular aspects:

  • The “zero, one, many” system for defining the editions of Cloudera Enterprise.
  • The use of “Data Hub” as a general marketing term.

Given its role as a highly influential yet still small “platform” vendor in a competitive open source market, Cloudera even more than most vendors faces the dilemma:

  • Cloudera wants customers to adopt its views as to which Hadoop-related technologies they should use.
  • However, Cloudera doesn’t want to be in the position of trying to ram some particular unwanted package down a customer’s throat.

The Flex/Data Hub packaging fits great with that juggling act, because Cloudera — and hence also Cloudera salespeople — get paid exactly as much when customers pick 2 Flex options as when they use all 5-6. If you prefer Cassandra or MongoDB to HBase, Cloudera is fine with that. Ditto if you prefer CitusDB or Vertica or Teradata Hadapt to Impala. Thus Cloudera can avoid a lot of religious wars, even if it can’t entirely escape Hortonworks’ “More open source than thou” positioning.

Meanwhile, so far as I can tell, Cloudera currently bets on the “Enterprise Data Hub” as its core proposition, as evidenced by that term being baked into the name of Cloudera’s most comprehensive and expensive offering. Notes on the EDH start:

  • Cloudera also portrays “enterprise data hub” as an architectural/reference architecture concept.
  • “Enterprise data hub” doesn’t really mean anything very different from “data lake” + “data refinery”; Cloudera just thinks it sounds more important. Indeed, Cloudera claims that the other terms are dismissive or disparaging, at least in some usages.

Cloudera’s long-term dream is clearly to make Hadoop the central data platform for an enterprise, while RDBMS fill more niche (or of course also legacy) roles. I don’t think that will ever happen, because I don’t think there really will be one central data platform in the future, any more than there has been in the past. As I wrote last year on appliances, clusters and clouds,

Ceteris paribus, fewer clusters are better than more of them. But all things are not equal, and it’s not reasonable to try to reduce your clusters to one — not even if that one is administered with splendid efficiency by low-cost workers, in a low-cost building, drawing low-cost electric power, in a low-cost part of the world.

and earlier in the same post

… these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.

One system is not going to be optimal for all computing purposes.

Categories: Other

Notes on predictive modeling, October 10, 2014

DBMS2 - Fri, 2014-10-10 02:40

As planned, I’m getting more active in predictive modeling. Anyhow …

1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.

2. The most controversial part of that post was probably the claim:

I think the predictive modeling state of the art has become:

  • Cluster in some way.
  • Model separately on each cluster.

In particular:

  • Formally, it is always possible to go with a single overall model instead.
  • A lot of people think accuracy, ease-of-use, or both are better served by a true single-model approach.
  • Conversely, if you have a single model that’s pretty good, it’s natural to look at the subset of the data for which it works poorly and examine that first. Voila! You’ve just done a kind of clustering. (A minimal sketch of the cluster-then-model approach follows this list.)
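
To spell out what I mean by the cluster-then-model approach, here is a minimal sketch using generic scikit-learn pieces. It is not any particular vendor’s implementation, and the synthetic data and the choice of k-means plus linear regression are just placeholders.

```python
# Minimal sketch of "cluster, then model separately on each cluster", using
# generic scikit-learn pieces on synthetic data. Nothing here is specific to
# any vendor; k-means and linear regression are just placeholder choices.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.where(X[:, 0] > 0, 3 * X[:, 1], -2 * X[:, 2]) + rng.normal(scale=0.1, size=300)

# Step 1: cluster in some way.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Step 2: fit a separate model on each cluster.
models = {int(k): LinearRegression().fit(X[labels == k], y[labels == k])
          for k in np.unique(labels)}

# Scoring routes each row to its cluster's model.
def score(x_row):
    k = int(kmeans.predict([x_row])[0])
    return float(models[k].predict([x_row])[0])

print(score(X[0]))
```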

3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start:

  • While KXEN was distinguished by how limited its choice of model templates was, Nutonian is distinguished by its remarkable breadth. Is the best model for your data a quadratic polynomial in which some of the terms are trigonometric functions? Nutonian is happy to find that for you.
  • Nutonian is starting out as a SaaS (Software as a Service) vendor.
  • A big part of Nutonian’s goal is to find a simple/parsimonious model, because — although this is my phrasing rather than theirs — the simpler the model, the more likely it is to have robust explanatory power.

With all those possibilities, what do Nutonian models actually wind up looking like? In internet/log analysis/whatever kinds of use cases, I gather that:

  • The model is likely to be a polynomial — of multiple variables of course — of order no more than 3 or 4.
  • Variables can have time delays built into them (e.g., sales today depend on email sent 2 weeks ago). Indeed, some of Nutonian’s flashiest early modeling successes seem to be based around the ease with which they capture time-delayed causality.
  • In each monomial, all variables except 1 are likely to be “control”/”capping”/”transition-point”/”on-off switch”/logical/conditional/whatever variables, i.e. variables whose range is likely to be either {0,1} or perhaps [0,1] instead. (A toy sketch of this kind of model shape appears below.)

Nutonian also serves real scientists, however, and their models can be all over the place.
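
To make the internet/log-analysis model shape concrete, here is a purely illustrative toy in Python: a low-order polynomial with one time-delayed input and an on-off switch variable gating some terms. The coefficients, the 14-day delay, and the variable names are all made up by me; this is not an actual Nutonian model.

```python
# Purely illustrative toy of the model shape described above: a low-order
# polynomial with a time-delayed input and an on-off switch variable gating
# some terms. The coefficients, 14-day delay, and names are my inventions.

import numpy as np

def predicted_sales(email_volume, price, is_holiday, t):
    """is_holiday is a 0/1 switch variable; email has a delayed effect."""
    delayed_email = email_volume[t - 14]              # time-delayed input
    return (
        1.8 * delayed_email                           # linear, delayed term
        + 0.4 * delayed_email ** 2                    # low-order polynomial term
        - 3.0 * price[t]
        + is_holiday[t] * (5.0 + 0.9 * price[t])      # switch variable gates a whole term
    )

days = 60
email_volume = np.random.default_rng(1).poisson(100, size=days).astype(float)
price = np.full(days, 9.99)
is_holiday = np.zeros(days)
is_holiday[50] = 1.0

print(predicted_sales(email_volume, price, is_holiday, t=55))
```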

4. One set of predictive modeling complexities goes something like this:

  • A modeling exercise may have 100s or 1000s of potential variables to work with. (For simplicity, think of a potential variable as a column or field in the input data.)
  • The winning models are likely to use only a small fraction of these variables.
  • Those may not be variables you’re thrilled about using.
  • Fortunately, many variables have strong covariances with each other, so it’s often possible to exclude your disfavored variables and come out with a model almost as good. (A toy demonstration follows this list.)
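
Here is that toy demonstration of the covariance point, on synthetic data: when a disfavored variable is strongly correlated with an allowed one, dropping it costs little in model fit. The variable names and numbers are purely illustrative.

```python
# Toy demonstration that a strongly correlated stand-in can replace a
# disfavored variable with little loss of fit. Synthetic data; the variable
# names and numbers are illustrative only.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
allowed = rng.normal(size=1000)
disfavored = allowed + rng.normal(scale=0.1, size=1000)  # highly correlated proxy
other = rng.normal(size=1000)
y = 2.0 * disfavored + 0.5 * other + rng.normal(scale=0.2, size=1000)

X_full = np.column_stack([allowed, disfavored, other])
X_restricted = np.column_stack([allowed, other])  # disfavored variable excluded

print(LinearRegression().fit(X_full, y).score(X_full, y))              # R^2 with all variables
print(LinearRegression().fit(X_restricted, y).score(X_restricted, y))  # almost as good without it
```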

I pushed the Nutonian folks to brainstorm with me about why one would want to exclude variables, and quite a few kinds of reasons came up, including:

  • (My top example.) Regulatory compliance may force you to exclude certain variables. E.g., credit scores in the US mustn’t be based on race.
  • (Their top example.) Some data is just expensive to get. E.g., a life insurer would like to come up with a way to avoid using blood test results in their decision making, because they’d like to drop the expense of the blood tests.
  • (Perhaps our joint other top example.) Clarity of explanation is an important goal. Some models are black boxes, and that’s that. Others are also supposed to uncover causality that helps humans make all kinds of better decisions. Regulators may also want clear models. Note: Model clarity can be affected by model structure and variable(s) choice alike.
  • Certain variables can simply be more or less trusted, in terms of the accuracy of the data.
  • Certain variables can be more or less certain to be available in the future. However, I wonder how big a concern that is in a world where models are frequently retrained anyway.

5. I’m not actually seeing much support for the theory that Julia will replace R except perhaps from Revolution Analytics, the company most identified with R. Go figure.

6. And finally, I don’t think it’s wholly sunk in among predictive modeling folks that Spark both:

  • Has great momentum.
  • Was designed with machine learning in mind.
Categories: Other