Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.
1. There are many kinds of machine-generated data. Important categories include:
- Web, network and other IT logs.
- Game and mobile app event data.
- CDRs (telecom Call Detail Records).
- “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
- Sensor network output (for example from a pipeline or other utility network).
- Vehicle telemetry.
- Health care data, in hospitals.
- Digital health data from consumer devices.
- Images from public-safety camera networks.
- Stock tickers (if you regard them as being machine-generated, which I do).
That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.
2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:
- Government snooping on the contents of communications.
- Communication traffic analysis.
- Photos and videos (airport scanners, public cameras, etc.)
- Commercial ad targeting.
- Traditional medical records.
Other areas, however, continue to be overlooked, with the two biggies in my opinion being:
- The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.
- The ability to track people’s movements in great detail, which will increase greatly yet again as the consumer digital health market matures (and some think that will happen soon).
3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …
4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
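To make "event-series styles of analytics" concrete, here is a minimal sketch of one classic example, sessionization: splitting each user's time-ordered events into sessions wherever there is a long idle gap. The function name, the 30-minute gap and the sample events are all assumptions chosen for illustration; nothing here reflects how Vertica or any other product implements it.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Group (user, timestamp) events into per-user sessions.

    A new session starts whenever the idle time since the user's
    previous event exceeds `gap`. Returns a list of
    (user, session_start, session_end, event_count) tuples.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev > gap:
                # Idle gap exceeded: close the current session, open a new one.
                sessions.append((user, start, prev, count))
                start, count = ts, 0
            prev = ts
            count += 1
        sessions.append((user, start, prev, count))
    return sessions

# Hypothetical click events.
events = [
    ("u1", datetime(2014, 12, 1, 9, 0)),
    ("u1", datetime(2014, 12, 1, 9, 10)),
    ("u1", datetime(2014, 12, 1, 11, 0)),
    ("u2", datetime(2014, 12, 1, 9, 5)),
]
sessions = sessionize(events)
for s in sessions:
    print(s)
```

The point of the sketch is that the logic depends on the order of rows and on the previous row for the same user, which is what makes this kind of query awkward to express in plain set-oriented SQL.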
5. Event series even aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.
6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. is still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.
Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:
- High-frequency trading is an extreme race; microseconds matter.
- Internet interaction applications increasingly require data freshness to the last click or other user action. Computational latency requirements can go down to the single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.
- Minute-plus response can be fine for individual remote systems. Sometimes they ping home more rarely than that.
There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.
7. My views about predictive analytics are still somewhat confused. For starters:
- The math and technology of predictive modeling both still seem pretty simple …
- … but sometimes achieve mind-blowing results even so.
- There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.
- Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud.
So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.
- WibiData has some innovative ideas in predictive experimentation.
- Nutonian has some innovative ideas in non-linear modeling for pattern detection/root-cause analysis.
- It’s still at the anecdotal level, but there have been interesting ideas in the rapid retraining of models.
- Ayasdi reminded us that there’s room for innovation in clustering.
- My Thanksgiving round-up post points to a lot of my prior comments on predictive modeling.
Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back before World War II. Perhaps such breakthroughs are already underway; the Conviva retraining example listed above is certainly imaginative. But I’d like to see a lot more in the area.
Even more important, of course, could be some kind of revolution in predictive modeling for medicine.
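For readers who haven't met statistical process control, the simplest version is a control chart: estimate a center line and spread from readings taken while the process was known to be healthy, then flag later readings that fall outside a band around the center. The sketch below assumes the conventional three-sigma band; the function names and sensor readings are hypothetical.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits from in-control baseline readings."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(readings, limits):
    """Return (index, value) pairs for readings outside the control limits."""
    low, high = limits
    return [(i, x) for i, x in enumerate(readings) if x < low or x > high]

# Hypothetical sensor readings from a machine running normally.
baseline = [50.1, 49.8, 50.3, 49.9, 50.0, 50.2, 49.7, 50.0]
limits = control_limits(baseline)

# New readings; the third one is far from the baseline center.
flagged = out_of_control([50.1, 49.9, 53.0, 50.2], limits)
print(limits, flagged)
```

This is deliberately the pre-war baseline technique, not a breakthrough; the interesting work would be in doing better than this across many interacting signals.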
A conversation I have too often with vendors goes something like:
- “That confidential thing you told me is interesting, and wouldn’t harm you if revealed; probably quite the contrary.”
- “Well, I guess we could let you mention a small subset of it.”
- “I’m sorry, that’s not enough to make for an interesting post.”
That was the genesis of some tidbits I recently dropped about WibiData and predictive modeling, especially but not only in the area of experimentation. However, Wibi just reversed course and said it would be OK for me to tell more or less the full story, as long as I note that we’re talking about something that’s still in beta test, with all the limitations (to the product and my information alike) that beta implies.
As you may recall:
- WibiData started out with a rich technology stack …
- … but decided to cast itself as an application company …
- … whose first vertical market is retailing.
With that as background, WibiData’s approach to predictive modeling as of its next release will go something like this:
- There is still a strong element of classical modeling by data scientists/statisticians, with the models re-scored in batch, perhaps nightly.
- But of course at least some scoring should be done as real-time as possible, to accommodate fresh data such as:
- User interactions earlier in today’s session.
- Technology for today’s session (device, connection speed, etc.)
- Today’s weather.
- WibiData Express is/incorporates a Scala-based language for modeling and query.
- WibiData believes Express plus a small algorithm library gives better results than more mature modeling libraries.
- There is some confirming evidence of this …
- … but WibiData’s customers have by no means switched over yet to doing the bulk of their modeling in Wibi.
- WibiData will allow line-of-business folks to experiment with augmentations to the base models.
- Supporting technology for predictive experimentation in WibiData will include:
- Automated multi-armed bandit testing (in previous versions even A/B testing has been manual).
- A facility for allowing fairly arbitrary code to be included into otherwise conventional model-scoring algorithms, where conventional scoring models can come:
- Straight from WibiData Express.
- Via PMML (Predictive Modeling Markup Language) generated by other modeling tools.
- An appropriate user interface for the line-of-business folks to do certain kinds of injecting.
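As background on the multi-armed bandit testing mentioned in the list above: unlike a fixed-split A/B test, a bandit shifts traffic toward whichever option is performing better while the test is still running. Below is a minimal sketch of one common strategy, epsilon-greedy. It is an illustration only; I don't know which bandit algorithm WibiData uses, and the class name, arm names and conversion rates are all made up.

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy bandit: mostly exploit the best arm, sometimes explore."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.pulls = {a: 0 for a in self.arms}
        self.rewards = {a: 0.0 for a in self.arms}
        self.rng = random.Random(seed)

    def mean_reward(self, arm):
        n = self.pulls[arm]
        return self.rewards[arm] / n if n else 0.0

    def choose(self):
        # Try every arm once before trusting any estimate.
        untried = [a for a in self.arms if self.pulls[a] == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=self.mean_reward)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.rewards[arm] += reward

# Simulated experiment with hypothetical true conversion rates.
true_rates = {"control": 0.05, "variant": 0.15}
bandit = EpsilonGreedyBandit(true_rates, epsilon=0.1, seed=42)
visitors = random.Random(7)
for _ in range(5000):
    arm = bandit.choose()
    converted = visitors.random() < true_rates[arm]
    bandit.update(arm, 1.0 if converted else 0.0)

print(bandit.pulls)
```

In a run like this the better arm typically ends up with most of the traffic, which is the practical appeal over manual A/B testing: less revenue is spent on the losing option.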
Let’s talk more about predictive experimentation. WibiData’s paradigm for that is:
- Models are worked out in the usual way.
- Businesspeople have reasons for tweaking the choices the models would otherwise dictate.
- They enter those tweaks as rules.
- The resulting combination (models plus rules) is executed and hence tested.
If those reasons for tweaking are in the form of hypotheses, then the experiment is a test of those hypotheses. However, WibiData has no provision at this time to automagically incorporate successful tweaks back into the base model.
What might those hypotheses be like? It’s a little tough to say, because I don’t know in fine detail what is already captured in the usual modeling process. WibiData gave me only one real-life example, in which somebody hypothesized that shoppers would be in more of a hurry at some times of day than others, and hence would want more streamlined experiences when they could spare less time. Tests confirmed that was correct.
That said, I did grow up around retailing, and so I’ll add:
- Way back in the 1970s, Wal-Mart figured out that in large college towns, clothing in the football team’s colors was wildly popular. I’d hypothesize such a rule at any vendor selling clothing suitable for being worn in stadiums.
- A news event, blockbuster movie or whatever might trigger a sudden change in/addition to fashion. An alert merchant might guess that before the models pick it up. Even better, she might guess which psychographic groups among her customers were most likely to be paying attention.
- Similarly, if a news event caused a sudden shift in buyers’ optimism/pessimism/fear of disaster, I’d test a response to that immediately.
Finally, data scientists seem to still be a few years away from neatly solving the problem of multiple shopping personas — are you shopping in your business capacity, or for yourself, or for a gift for somebody else (and what can we infer about that person)? Experimentation could help fill the gap.
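The "models plus rules" paradigm described above can be sketched roughly as follows. This is my own illustration of the general shape, not WibiData's API: `base_model_scores` is a hypothetical stand-in for whatever the data scientists' model returns, and the rule encodes the time-of-day hypothesis from the one real-life example, with invented hours and weights.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Context = Dict[str, int]
Scores = Dict[str, float]

@dataclass
class Rule:
    """A businessperson's tweak: when it applies, adjust the model's scores."""
    name: str
    applies: Callable[[Context], bool]
    adjust: Callable[[Scores], Scores]

def base_model_scores(ctx: Context) -> Scores:
    # Hypothetical stand-in for a conventionally built scoring model.
    return {"full_catalog": 0.6, "streamlined_checkout": 0.4}

def choose_experience(ctx: Context, rules: List[Rule]) -> Tuple[str, List[str]]:
    """Score with the base model, layer on any applicable rules, pick the best."""
    scores = base_model_scores(ctx)
    fired = []
    for rule in rules:
        if rule.applies(ctx):
            scores = rule.adjust(scores)
            fired.append(rule.name)  # recorded so the tweak's effect can be measured
    best = max(scores, key=scores.get)
    return best, fired

# Hypothesis: shoppers are in more of a hurry around midday.
hurried_hours = Rule(
    name="midday_hurry",
    applies=lambda ctx: 11 <= ctx["hour"] < 14,
    adjust=lambda s: {**s, "streamlined_checkout": s["streamlined_checkout"] + 0.3},
)

print(choose_experience({"hour": 12}, [hurried_hours]))
print(choose_experience({"hour": 20}, [hurried_hours]))
```

Recording which rules fired is the part that turns a tweak into an experiment: outcomes can then be compared between sessions where the rule changed the choice and sessions where it didn't.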
1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.
- The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
- If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.
I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:
- Suppose we have lots of logs about lots of things.* Machine learning can help:
- Notice what’s an anomaly.
- Group* together things that seem to be experiencing similar anomalies.
- That can inform a BI-plus interface for a human to figure out what is happening.
Makes sense to me.
* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
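A toy version of the notice-then-group idea above, to show the shape of it: flag each host's metrics that are far from their own history, then group hosts that share the same set of anomalous metrics. This is only a sketch under my own assumptions (a simple z-score test, made-up host and metric names); I have no information on what ScalingData actually does.

```python
from collections import defaultdict
from statistics import mean, pstdev

def anomalous_metrics(history, latest, threshold=3.0):
    """Return the set of metrics whose latest value is far from its own history."""
    flagged = set()
    for metric, values in history.items():
        mu, sigma = mean(values), pstdev(values)
        if sigma > 0 and abs(latest[metric] - mu) / sigma > threshold:
            flagged.add(metric)
    return frozenset(flagged)

def group_by_signature(hosts):
    """Group hosts that are currently anomalous on the same set of metrics."""
    groups = defaultdict(list)
    for host, (history, latest) in hosts.items():
        signature = anomalous_metrics(history, latest)
        if signature:  # skip hosts with nothing unusual
            groups[signature].append(host)
    return dict(groups)

# Hypothetical per-host metric history plus the latest reading.
normal = {"errors": [1, 2, 1, 2, 1, 2], "latency": [100, 110, 90, 100, 105, 95]}
hosts = {
    "web1": (normal, {"errors": 9, "latency": 102}),
    "web2": (normal, {"errors": 9, "latency": 99}),
    "db1": (normal, {"errors": 2, "latency": 98}),
}
groups = group_by_signature(hosts)
print(groups)
```

The output of something like this is what would feed the BI-plus interface: a human sees that two web servers share an error spike and can start asking why.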
Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.
2. Discussion of graph DBMS can get confusing. For example:
- Use cases run the gamut from short-request to highly analytic; no graph DBMS is well-suited for all graph use cases.
- Graph DBMS have huge problems scaling, because graphs are very hard to partition usefully; hence some of the more analytic use cases may not benefit from a graph DBMS at all.
- The term “graph” has meanings in computer science that have little to do with the problems graph DBMS try to solve, notably directed acyclic graphs for program execution, which famously are at the heart of both Spark and Tez.
- My clients at Neo Technology/Neo4j call one of their major use cases MDM (Master Data Management), without getting much acknowledgement of that from the mainstream MDM community.
I mention this in part because that “MDM” use case actually has some merit. The idea is that hierarchies such as organization charts, product hierarchies and so on often aren’t actually strict hierarchies. And even when they are, they’re usually strict only at specific points in time; if you care about their past state as well as their present one, a hierarchical model might have trouble describing them. Thus, LDAP (Lightweight Directory Access Protocol) engines may not be an ideal way to manage and reference such “hierarchies”; a graph DBMS might do better.
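Here is a small sketch of why a graph fits better than a strict tree. Each reporting relationship is an edge with a validity interval, so an employee can have more than one manager at once (say, a dotted-line report) and the past state remains queryable. The names, dates and the `ReportsTo` structure are hypothetical, and this is plain Python rather than any particular graph DBMS's data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass(frozen=True)
class ReportsTo:
    """A directed edge employee -> manager, valid over [valid_from, valid_to)."""
    employee: str
    manager: str
    valid_from: date
    valid_to: Optional[date] = None  # None means still in effect

    def active_on(self, day: date) -> bool:
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)

def managers_of(edges: List[ReportsTo], employee: str, day: date) -> List[str]:
    """All managers of `employee` on `day`; may be more than one."""
    return sorted(e.manager for e in edges
                  if e.employee == employee and e.active_on(day))

edges = [
    ReportsTo("alice", "bob", date(2013, 1, 1), date(2014, 7, 1)),
    ReportsTo("alice", "carol", date(2014, 7, 1)),
    ReportsTo("alice", "dave", date(2014, 1, 1)),  # dotted-line report
]

print(managers_of(edges, "alice", date(2013, 6, 1)))   # one manager
print(managers_of(edges, "alice", date(2014, 3, 1)))   # two managers at once
print(managers_of(edges, "alice", date(2014, 12, 1)))  # after the reorganization
```

A single-parent tree can represent only the first of those three answers without contortions.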
3. There is a surprising degree of controversy among predictive modelers as to whether more data yields better results. Besides, the most common predictive modeling stacks have difficulty scaling. And so it is common to model against samples of a data set rather than the whole thing.*
*Strictly speaking, almost the whole thing — you’ll often want to hold at least a sample of the data back for model testing.
Well, WibiData’s couple of Very Famous Department Store customers have tested WibiData’s ability to model against an entire database vs. their alternative predictive modeling stacks’ need to sample data. WibiData says that both report significantly better results from training over the whole data set than from using just samples.
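For clarity about what is being compared, here is a minimal sketch of the two training sets, assuming a simple random holdout and a simple random subsample. The function names and the 20%/10% fractions are my assumptions, and no actual model is trained; the sketch only shows how "almost the whole thing" and "a sample" relate to the held-back test data.

```python
import random

def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle and split rows into (training rows, held-back test rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    return shuffled[cut:], shuffled[:cut]

def subsample(rows, fraction, seed=0):
    """Draw a simple random sample of the given fraction, without replacement."""
    k = max(1, int(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)

rows = list(range(1000))                      # stand-in for customer records
train_all, holdout = split_holdout(rows)      # "almost the whole thing"
train_sample = subsample(train_all, 0.1)      # what a sampling-based stack sees

print(len(train_all), len(train_sample), len(holdout))
```

Both candidate models would be evaluated on the same `holdout` rows, so that any difference in results reflects the amount of training data rather than a different test set.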
4. ScalingData is on the bandwagon for Spark Streaming and Kafka.
6. With the Hortonworks deal now officially priced, Derrick was also free to post more about/from Hortonworks’ pitch. Of course, Hortonworks is saying Hadoop will be Big Big Big, and suggesting we should thus not be dismayed by Hortonworks’ financial performance so far. However, Derrick did not cite Hortonworks actually giving any reasons why its competitive position among Hadoop distribution vendors should improve.
Beyond that, Hortonworks says YARN is a big deal, but doesn’t seem to like Spark Streaming.
MapR put out a press release aggregating some customer information; unfortunately, the release is a monument to vagueness. Let me start by saying:
- I don’t know for sure, but I’m guessing Derrick Harris was incorrect in suspecting that this release was a reaction to my recent post about Hortonworks’ numbers. For one thing, press releases usually don’t happen that quickly.
- And, as should be obvious from the previous point, I had no direct involvement in this release, notwithstanding that MapR is a client.
- In general, I advise clients and other vendors to put out the kind of aggregate of customer success stories found in this release. However, I would like to see more substance than MapR offered.
Anyhow, the key statement in the MapR release is:
… the number of companies that have a paid subscription for MapR now exceeds 700.
Unfortunately, that includes OEM customers as well as direct ones; I imagine MapR’s direct customer count is much lower.
In one gesture to numerical conservatism, MapR did indicate by email that it counts by overall customer organization, not by department/cluster/contract (i.e., not the way Hortonworks does).
The MapR press release also said:
As of November 2014, MapR has one or more customers in eight vertical markets that have purchased more than one million dollars of MapR software and services. These vertical markets are advertising/media, financial services, healthcare, internet, information technology, retail, security, and telecom.
Since the word “each” isn’t in that quote, we don’t even know whether MapR is referring to individual big customers or just general sector penetration. We also don’t know whether the revenue is predominantly subscription or some other kind of relationship.
MapR also indicated that the average customer more than doubled its annualized subscription rate vs. a year ago; the comparable figure — albeit with heavy disclaimers — from Hortonworks was 25%.
I believe in all of the following trends:
- Hadoop is a Big Deal, and here to stay.
- Spark, for most practical purposes, is becoming a big part of Hadoop.
- Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.
Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:
- People would like this to be true, although in most cases only as one of several cluster computing platforms.
- Hadoop, when viewed as an operating system, is extremely primitive.
- Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.
There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.
Recall now that Hadoop — like much else in IT — has always been about two things: data storage and program execution. The evolution of Hadoop program execution to date has been approximately:
- Originally, MapReduce and JobTracker were the way to execute programs in Hadoop, period, at least if we leave HBase out of the discussion.
- In a major refactoring, YARN replaced a lot of what JobTracker did, with the result that different program execution frameworks became easier to support.
- Most of the relevant program execution frameworks — such as MapReduce, Spark or Tez — have data movement and temporary storage near their core.
Meanwhile, Hadoop data storage is mainly about HDFS (Hadoop Distributed File System). Its evolution, besides general enhancement, has included the addition of file types suitable for specific kinds of processing (e.g. Parquet and ORC to accelerate analytic database queries). Also, there have long been hacks that more or less bypassed central Hadoop data management, and let data be moved in parallel on a node-by-node basis. But several signs suggest that Hadoop data storage should and will be refactored too. Three efforts in particular point in that direction:
- HDFS caching, about which I know relatively little.
- Tachyon, in which I gather Spark’s creators continue to believe.
- Cloudera’s deal to run Hadoop against Isilon storage and its associated interest in also supporting other kinds of storage, such as the object storage commonly found in clouds.
The part of all this I find most overlooked is inter-program data exchange. If two programs both running on Hadoop want to exchange data, what do they do, other than reading and writing to HDFS, or invoking some kind of a custom connector? What’s missing is a nice, flexible distributed memory layer, which:
- Works well with Hadoop execution engines (Spark, Tez, Impala …).
- Works well with other software people might want to put on their Hadoop nodes.
- Interfaces nicely to HDFS, Isilon, object storage, et al.
- Is fully parallel any time it needs to talk with persistent or external storage.
- Can be fully parallel any time it needs to talk with any other software on the Hadoop cluster.
Tachyon could, I imagine, become that. HDFS caching probably could not.
In the past, I’ve been skeptical of in-memory data grids. But now I think that such a grid could take Hadoop to the next level of generality and adoption.
- Hortonworks’ subscription revenues for the 9 months ended last September 30 appear to be:
- $11.7 million from everybody but Microsoft, …
- … plus $7.5 million from Microsoft, …
- … for a total of $19.2 million.
- Hortonworks states subscription customer counts (as per Page 55 this includes multiple “customers” within the same organization) of:
- 2 on April 30, 2012.
- 9 on December 31, 2012.
- 25 on April 30, 2013.
- 54 on September 30, 2013.
- 95 on December 31, 2013.
- 233 on September 30, 2014.
- Per Page 70, Hortonworks’ total September 30, 2014 customer count was 292, including professional services customers.
- Non-Microsoft subscription revenue in the quarter ended September 30, 2014 seems to have been $5.6 million, or $22.5 million annualized. This suggests Hortonworks’ average subscription revenue per non-Microsoft customer is a little over $100K/year.
- This IPO looks to be a sharply “down round” vs. Hortonworks’ Series D financing earlier this year.
- In March and June, 2014, Hortonworks sold stock that subsequently was converted into 1/2 a Hortonworks share each at $12.1871 per share.
- The tentative top of the offering’s price range is $14/share.
- That’s also slightly down from the Series C price in mid-2013.
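The arithmetic behind the bullets above can be checked in a few lines. This sketch uses only the rounded figures quoted in this post, so results differ slightly from numbers derived from the unrounded filing (for example, 4 × $5.6 million is $22.4 million rather than $22.5 million). Two things are my assumptions rather than statements from the S-1: dividing by the full September 30, 2014 subscription customer count of 233, and reading the Series D terms as each $12.1871 share becoming half a share at the IPO.

```python
# Figures quoted in the bullets above, in millions of dollars.
non_microsoft_9mo = 11.7
microsoft_9mo = 7.5
total_9mo = non_microsoft_9mo + microsoft_9mo

# Quarter ended September 30, 2014, non-Microsoft subscription revenue.
non_microsoft_quarter = 5.6
annualized = non_microsoft_quarter * 4

# Assumption: divide by the full quarter-end subscription customer count.
subscription_customers = 233
revenue_per_customer_k = annualized / subscription_customers * 1000

# Assumption: each Series D share bought at $12.1871 converts into half a
# share, so the price per post-conversion share is double that.
series_d_price = 12.1871
series_d_per_ipo_share = series_d_price / 0.5
ipo_range_top = 14.0
discount_vs_series_d = 1 - ipo_range_top / series_d_per_ipo_share

print(f"9-month subscription revenue: ${total_9mo:.1f}M")
print(f"Annualized non-Microsoft run rate: ${annualized:.1f}M")
print(f"Revenue per subscription customer: ${revenue_per_customer_k:.0f}K/year")
print(f"Top of IPO range vs. Series D: {discount_vs_series_d:.0%} lower")
```

With the quarter-end count as the divisor, the per-customer figure comes out in the neighborhood of $100K/year, a bit under rather than over; a smaller divisor, such as an average customer count during the quarter, would push it higher. Under the conversion assumption, the top of the offering range is a bit more than 40% below the Series D price, which is what makes this a sharply down round.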
And, perhaps of interest only to me — there are approximately 50 references to YARN in the Hortonworks S-1, but only 1 mention of Tez.
Overall, the Hortonworks S-1 is about 180 pages long, and — as is typical — most of it is boilerplate, minutiae or drivel. As is also typical, two of the most informative sections of the Hortonworks S-1 are:
- The section called Management’s Discussion and Analysis.
- The footnotes to the numbers, starting a couple of pages in.
The clearest financial statements in the Hortonworks S-1 are probably the quarterly figures on Page 62, along with the tables on Pages F3, F4, and F7.
Special difficulties in interpreting Hortonworks’ numbers include:
- A large fraction of revenue has come from a few large customers, most notably Microsoft. Details about those revenues are further confused by:
- Difficulty in some cases getting a fix on the subscription/professional services split. (It does seem clear that Microsoft revenues are 100% subscription.)
- Some revenue deductions associated with stock deals, called “contra-revenue”.
- Hortonworks changed the end of its fiscal year from April to December, leading to comparisons of a couple of eight-month periods.
- There was a $6 million lawsuit settlement (some kind of employee poaching/trade secrets case), discussed on Page F-21.
- There is some counter-intuitive treatment of Windows-related development (cost of revenue rather than R&D).
One weirdness is that cost of professional services revenue far exceeds 100% of such revenue in every period Hortonworks reports. Hortonworks suggests that this is because:
- Professional services revenue is commonly bundled with support contracts.
- Such revenue is recognized ratably over the life of the contract, as opposed to a more natural policy of recognizing professional services revenue when the services are actually performed.
I’m struggling to come up with a benign explanation for this.
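To see mechanically how ratable recognition could push cost of services above 100% of services revenue, consider a made-up engagement: the services are delivered (and their cost incurred) in the first quarter, but the fee is recognized evenly over a multi-year support contract. All the numbers below are hypothetical, and the sketch says nothing about whether this effect actually accounts for Hortonworks' reported figures.

```python
def ratable_schedule(fee, periods):
    """Recognize a fee evenly across the contract's periods."""
    return [fee / periods] * periods

# Hypothetical engagement bundled with a three-year support contract.
services_fee = 120_000.0
contract_quarters = 12
delivery_cost = [100_000.0] + [0.0] * (contract_quarters - 1)  # all work in Q1

revenue = ratable_schedule(services_fee, contract_quarters)

# In the quarter the work is done, cost dwarfs recognized revenue ...
cost_ratio_q1 = delivery_cost[0] / revenue[0]
# ... even though the engagement is profitable over its whole life.
cost_ratio_lifetime = sum(delivery_cost) / sum(revenue)

print(f"Q1 cost as a multiple of Q1 revenue: {cost_ratio_q1:.1f}x")
print(f"Lifetime cost as a share of revenue: {cost_ratio_lifetime:.0%}")
```

In a fast-growing business, new engagements in their expensive first quarter keep outweighing old ones in their cheap later quarters, so the aggregate ratio can stay above 100% for a long time. Whether the gap should be as large as the one Hortonworks reports is a separate question.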
In the interest of space, I won’t quote Hortonworks’ S-1 verbatim; instead, I’ll just note where some of the more specifically informative parts may be found.
- Page 53 describes Hortonworks’ typical sales cycles (they’re long).
- Page 54 says the average customer has increased subscription payments 25% year over year, but emphasizes that the sample size is too small to be reliable.
- Pages 55-63 have a lot of revenue and expense breakdowns.
- Deferred revenue numbers (which are a proxy for billings and thus signed contracts) are on Page 65.
- Pages II 2-3 list all (I think) Hortonworks financings in a concise manner.
And finally, Hortonworks’ dealings with its largest customers and strategic partners are cited in a number of places. In particular:
- Pages 52-3 cover dealings with Yahoo, Teradata, Microsoft, and AT&T.
- Pages 82-3 discuss OEM revenue from Hewlett-Packard, Red Hat, and Teradata, none of which amounts to very much.
- Page 109 covers the Teradata agreement. It seems that there’s less going on than originally envisioned, in that Teradata made a nonrefundable prepayment far greater than turns out to have been necessary for subsequent work actually done. That could produce a sudden revenue spike or else positive revenue restatement as of February, 2015.
- Page F-10 has a table showing revenue from Hortonworks’ biggest customers (Company A is Microsoft and Company B is Yahoo).
- Pages F37-38 further cover Hortonworks’ relationships with Yahoo, Teradata and AT&T.
Correction notice: Some of the page numbers in this post were originally wrong, surely because Hortonworks posted an original and amended version of this filing, and I got the two documents mixed up. A huge Thank You goes to Merv Adrian for calling my attention to this, and I think I’ve now fixed them. I apologize for the errors!