1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.
- The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
- If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.
I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says it is “working at the intersection of big data and IT operations”. The idea goes something like this:
- Suppose we have lots of logs about lots of things.* Machine learning can help:
- Notice what’s an anomaly.
- Group* together things that seem to be experiencing similar anomalies.
- That can inform a BI-plus interface for a human to figure out what is happening.
Makes sense to me.
* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
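To make the "notice anomalies, then group similar ones" idea concrete, here is a minimal sketch. It is not ScalingData's method, about which I know nothing specific; the host names, the counts and the z-score threshold are all made up for illustration. Each host's latest log-event count is compared with its own history, and hosts showing the same kind of anomaly are grouped together.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical input: per-host counts of some log event in successive time windows.
error_counts = {
    "web-1": [3, 4, 2, 3, 40],
    "web-2": [2, 3, 3, 2, 38],
    "db-1": [10, 11, 10, 11, 0],
    "cache-1": [5, 6, 5, 6, 5],
}


def latest_anomaly(series, threshold=3.0):
    """Classify the last point against the history before it, using a z-score.

    Returns "spike", "drop" or None. The threshold of 3.0 is an assumption.
    """
    history, latest = series[:-1], series[-1]
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        # Flat history: any change at all counts as an anomaly.
        if latest > mu:
            return "spike"
        if latest < mu:
            return "drop"
        return None
    z = (latest - mu) / sigma
    if z > threshold:
        return "spike"
    if z < -threshold:
        return "drop"
    return None


def group_by_anomaly(counts):
    """Group hosts whose latest readings show the same kind of anomaly."""
    groups = defaultdict(list)
    for host, series in counts.items():
        kind = latest_anomaly(series)
        if kind is not None:
            groups[kind].append(host)
    return dict(groups)


print(group_by_anomaly(error_counts))
```

A BI-plus interface could then start from those groups, rather than from the raw logs, when a human tries to figure out what is happening.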
Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.
2. Discussion of graph DBMS can get confusing. For example:
- Use cases run the gamut from short-request to highly analytic; no graph DBMS is well-suited for all graph use cases.
- Graph DBMS have huge problems scaling, because graphs are very hard to partition usefully; hence some of the more analytic use cases may not benefit from a graph DBMS at all.
- The term “graph” has meanings in computer science that have little to do with the problems graph DBMS try to solve, notably directed acyclic graphs for program execution, which famously are at the heart of both Spark and Tez.
- My clients at Neo Technology/Neo4j call one of their major use cases MDM (Master Data Management), without getting much acknowledgement of that from the mainstream MDM community.
I mention this in part because that “MDM” use case actually has some merit. The idea is that hierarchies such as organization charts, product hierarchies and so on often aren’t actually strict hierarchies. And even when they are, they’re usually strict only at specific points in time; if you care about their past state as well as their present one, a hierarchical model might have trouble describing them. Thus, LDAP (Lightweight Directory Access Protocol) engines may not be an ideal way to manage and reference such “hierarchies”; a graph DBMS might do better.
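A small sketch of why a graph-style model fits here, using plain Python rather than any particular graph DBMS. The names, dates and the `REPORTS_TO` edge list are hypothetical. Each reporting relationship is an edge with a validity interval, so a node can have more than one parent at once, and the "hierarchy" can be queried as of any date.

```python
from datetime import date

# Hypothetical reporting edges: (child, parent, valid_from, valid_to).
# valid_to=None means the edge is still current.
REPORTS_TO = [
    ("alice", "carol", date(2012, 1, 1), date(2014, 1, 1)),
    ("alice", "dave", date(2014, 1, 1), None),
    ("alice", "erin", date(2013, 6, 1), None),  # dotted-line second manager
    ("bob", "carol", date(2012, 1, 1), None),
]


def parents_of(node, as_of):
    """Return the sorted parents of `node` on the date `as_of`."""
    return sorted(
        parent
        for child, parent, start, end in REPORTS_TO
        if child == node and start <= as_of and (end is None or as_of < end)
    )


print(parents_of("alice", date(2013, 1, 1)))  # one parent: a strict hierarchy
print(parents_of("alice", date(2013, 7, 1)))  # two parents: no longer a tree
print(parents_of("alice", date(2014, 6, 1)))  # a different pair of parents
```

A strict tree with one parent pointer per node can't represent either the second manager or the change over time without extra machinery; edges with attributes handle both directly.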
3. There is a surprising degree of controversy among predictive modelers as to whether more data yields better results. Besides, the most common predictive modeling stacks have difficulty scaling. And so it is common to model against samples of a data set rather than the whole thing.*
*Strictly speaking, almost the whole thing — you’ll often want to hold at least a sample of the data back for model testing.
Well, WibiData’s couple of Very Famous Department Store customers have tested WibiData’s ability to model against an entire database vs. their alternative predictive modeling stacks’ need to sample data. WibiData says that both report significantly better results from training over the whole data set than from using just samples.
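For clarity about what is being compared, here is a minimal sketch of the two training sets involved. It says nothing about WibiData's actual tooling; the function names, the 20% holdout and the 10% sample rate are assumptions for illustration. Either way, a holdout is set aside first for testing; the difference is whether the model then trains on everything that remains or on a sample of it.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (training, holdout)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def sample_rows(rows, fraction, seed=0):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


rows = list(range(1000))  # stand-in for real records

# "Whole thing" approach: train on everything except the holdout.
train_all, holdout = split_holdout(rows)

# Sampling approach: train on a fraction of the same training rows.
train_sample = sample_rows(train_all, fraction=0.1)

print(len(train_all), len(train_sample), len(holdout))
```

Both models would be scored against the same holdout, which is what makes a "whole data set vs. sample" comparison meaningful.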
4. ScalingData is on the bandwagon for Spark Streaming and Kafka.
6. With the Hortonworks deal now officially priced, Derrick was also free to post more about/from Hortonworks’ pitch. Of course, Hortonworks is saying Hadoop will be Big Big Big, and suggesting we should thus not be dismayed by Hortonworks’ financial performance so far. However, Derrick did not cite Hortonworks actually giving any reasons why its competitive position among Hadoop distribution vendors should improve.
Beyond that, Hortonworks says YARN is a big deal, but doesn’t seem to like Spark Streaming.
MapR put out a press release aggregating some customer information; unfortunately, the release is a monument to vagueness. Let me start by saying:
- I don’t know for sure, but I’m guessing Derrick Harris was incorrect in suspecting that this release was a reaction to my recent post about Hortonworks’ numbers. For one thing, press releases usually don’t happen that quickly.
- And as should be obvious from the previous point — notwithstanding that MapR is a client, I had no direct involvement in this release.
- In general, I advise clients and other vendors to put out the kind of aggregate of customer success stories found in this release. However, I would like to see more substance than MapR offered.
Anyhow, the key statement in the MapR release is:
… the number of companies that have a paid subscription for MapR now exceeds 700.
Unfortunately, that includes OEM customers as well as direct ones; I imagine MapR’s direct customer count is much lower.
In one gesture to numerical conservatism, MapR did indicate by email that it counts by overall customer organization, not by department/cluster/contract (i.e., not the way Hortonworks does).
The MapR press release also said:
As of November 2014, MapR has one or more customers in eight vertical markets that have purchased more than one million dollars of MapR software and services. These vertical markets are advertising/media, financial services, healthcare, internet, information technology, retail, security, and telecom.
Since the word “each” isn’t in that quote, we don’t even know whether MapR is referring to individual big customers or just general sector penetration. We also don’t know whether the revenue is predominantly subscription or some other kind of relationship.
MapR also indicated that the average customer more than doubled its annualized subscription rate vs. a year ago; the comparable figure — albeit with heavy disclaimers — from Hortonworks was 25%.
I believe in all of the following trends:
- Hadoop is a Big Deal, and here to stay.
- Spark, for most practical purposes, is becoming a big part of Hadoop.
- Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.
Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:
- People would like this to be true, although in most cases only as one of several cluster computing platforms.
- Hadoop, when viewed as an operating system, is extremely primitive.
- Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.
There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.
Recall now that Hadoop — like much else in IT — has always been about two things: data storage and program execution. The evolution of Hadoop program execution to date has been approximately:
- Originally, MapReduce and JobTracker were the way to execute programs in Hadoop, period, at least if we leave HBase out of the discussion.
- In a major refactoring, YARN replaced a lot of what JobTracker did, with the result that different program execution frameworks became easier to support.
- Most of the relevant program execution frameworks — such as MapReduce, Spark or Tez — have data movement and temporary storage near their core.
Meanwhile, Hadoop data storage is mainly about HDFS (Hadoop Distributed File System). Its evolution, besides general enhancement, has included the addition of file types suitable for specific kinds of processing (e.g. Parquet and ORC to accelerate analytic database queries). Also, there have long been hacks that more or less bypassed central Hadoop data management, and let data be moved in parallel on a node-by-node basis. But several signs suggest that Hadoop data storage should and will be refactored too. Three efforts in particular point in that direction:
- HDFS caching, about which I know relatively little.
- Tachyon, in which I gather Spark’s creators continue to believe.
- Cloudera’s deal to run Hadoop against Isilon storage and its associated interest in also supporting other kinds of storage, such as the object storage commonly found in clouds.
The part of all this I find most overlooked is inter-program data exchange. If two programs both running on Hadoop want to exchange data, what do they do, other than reading and writing to HDFS, or invoking some kind of a custom connector? What’s missing is a nice, flexible distributed memory layer, which:
- Works well with Hadoop execution engines (Spark, Tez, Impala …).
- Works well with other software people might want to put on their Hadoop nodes.
- Interfaces nicely to HDFS, Isilon, object storage, et al.
- Is fully parallel any time it needs to talk with persistent or external storage.
- Can be fully parallel any time it needs to talk with any other software on the Hadoop cluster.
Tachyon could, I imagine, become that. HDFS caching probably could not.
In the past, I’ve been skeptical of in-memory data grids. But now I think that such a grid could take Hadoop to the next level of generality and adoption.
- Hortonworks’ subscription revenues for the 9 months ended last September 30 appear to be:
- $11.7 million from everybody but Microsoft, …
- … plus $7.5 million from Microsoft, …
- … for a total of $19.2 million.
- Hortonworks states subscription customer counts (as per Page 55 this includes multiple “customers” within the same organization) of:
- 2 on April 30, 2012.
- 9 on December 31, 2012.
- 25 on April 30, 2013.
- 54 on September 30, 2013.
- 95 on December 31, 2013.
- 233 on September 30, 2014.
- Per Page 70, Hortonworks’ total September 30, 2014 customer count was 292, including professional services customers.
- Non-Microsoft subscription revenue in the quarter ended September 30, 2014 seems to have been $5.6 million, or $22.5 million annualized. This suggests Hortonworks’ average subscription revenue per non-Microsoft customer is a little over $100K/year.
- This IPO looks to be a sharply “down round” vs. Hortonworks’ Series D financing earlier this year.
- In March and June, 2014, Hortonworks sold stock that subsequently was converted into 1/2 a Hortonworks share each at $12.1871 per share.
- The tentative top of the offering’s price range is $14/share.
- That’s also slightly down from the Series C price in mid-2013.
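The subscription revenue arithmetic in the bullets above can be checked with a few lines. This uses only the rounded figures quoted in the text, so results can differ slightly from the stated ones. Dividing by the September 30, 2014 count of 233 is my assumption about the denominator; that count includes Microsoft and multiple "customers" within the same organization, and a smaller denominator, such as an average count over the quarter, would push the per-customer figure higher.

```python
# All revenue figures are in millions of dollars, as rounded in the text.
non_microsoft_9mo = 11.7
microsoft_9mo = 7.5
total_9mo = non_microsoft_9mo + microsoft_9mo

# Annualize the most recent quarter's non-Microsoft subscription revenue.
non_microsoft_quarter = 5.6
annualized = non_microsoft_quarter * 4

# Assumed denominator: subscription customer count as of September 30, 2014.
subscription_customers = 233
per_customer = annualized * 1_000_000 / subscription_customers

print(f"9-month subscription revenue: ${total_9mo:.1f}M")
print(f"Annualized non-Microsoft subscription revenue: ${annualized:.1f}M")
print(f"Per subscription customer: ${per_customer:,.0f}/year")
```

With these inputs the per-customer result lands in the neighborhood of $100K/year.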
And, perhaps of interest only to me — there are approximately 50 references to YARN in the Hortonworks S-1, but only 1 mention of Tez.
Overall, the Hortonworks S-1 is about 180 pages long, and — as is typical — most of it is boilerplate, minutiae or drivel. As is also typical, two of the most informative sections of the Hortonworks S-1 are:
- The section called Management’s Discussion and Analysis.
- The footnotes to the numbers, starting a couple of pages in.
The clearest financial statements in the Hortonworks S-1 are probably the quarterly figures on Page 62, along with the tables on Pages F3, F4, and F7.
Special difficulties in interpreting Hortonworks’ numbers include:
- A large fraction of revenue has come from a few large customers, most notably Microsoft. Details about those revenues are further confused by:
- Difficulty in some cases getting a fix on the subscription/professional services split. (It does seem clear that Microsoft revenues are 100% subscription.)
- Some revenue deductions associated with stock deals, called “contra-revenue”.
- Hortonworks changed the end of its fiscal year from April to December, leading to comparisons of a couple of eight-month periods.
- There was a $6 million lawsuit settlement (some kind of employee poaching/trade secrets case), discussed on Page F-21.
- There is some counter-intuitive treatment of Windows-related development (cost of revenue rather than R&D).
One weirdness is that cost of professional services revenue far exceeds 100% of such revenue in every period Hortonworks reports. Hortonworks suggests that this is because:
- Professional services revenue is commonly bundled with support contracts.
- Such revenue is recognized ratably over the life of the contract, as opposed to a more natural policy of recognizing professional services revenue when the services are actually performed.
I’m struggling to come up with a benign explanation for this.
In the interest of space, I won’t quote Hortonworks’ S-1 verbatim; instead, I’ll just note where some of the more specifically informative parts may be found.
- Page 53 describes Hortonworks’ typical sales cycles (they’re long).
- Page 54 says the average customer has increased subscription payments 25% year over year, but emphasizes that the sample size is too small to be reliable.
- Pages 55-63 have a lot of revenue and expense breakdowns.
- Deferred revenue numbers (which are a proxy for billings and thus signed contracts) are on Page 65.
- Pages II 2-3 list all (I think) Hortonworks financings in a concise manner.
And finally, Hortonworks’ dealings with its largest customers and strategic partners are cited in a number of places. In particular:
- Pages 52-3 cover dealings with Yahoo, Teradata, Microsoft, and AT&T.
- Pages 82-3 discuss OEM revenue from Hewlett-Packard, Red Hat, and Teradata, none of which amounts to very much.
- Page 109 covers the Teradata agreement. It seems that there’s less going on than originally envisioned, in that Teradata made a nonrefundable prepayment far greater than turns out to have been necessary for subsequent work actually done. That could produce a sudden revenue spike or else positive revenue restatement as of February, 2015.
- Page F-10 has a table showing revenue from Hortonworks’ biggest customers (Company A is Microsoft and Company B is Yahoo).
- Pages F37-38 further cover Hortonworks’ relationships with Yahoo, Teradata and AT&T.
Correction notice: Some of the page numbers in this post were originally wrong, surely because Hortonworks posted an original and amended version of this filing, and I got the two documents mixed up. A huge Thank You goes to Merv Adrian for calling my attention to this, and I think I’ve now fixed them. I apologize for the errors!