DBMS2

Choices in data management and analysis

Data models

Sun, 2015-02-22 21:08

7-10 years ago, I repeatedly argued the following viewpoints:

  • Relational DBMS were the right choice in most cases.
  • Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
  • There were a variety of specialized use cases in which non-relational data models were best.

Since then, however:

  • Hadoop has flourished.
  • NoSQL has flourished.
  • Graph DBMS have matured somewhat.
  • Much of the action has shifted to machine-generated data, of which there are many kinds.

So it’s probably best to revisit all that in a somewhat organized way.

To make the subject somewhat manageable, I’ll focus on fielded data — i.e. data that represents values of something — rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of <name, value> pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.

Important distinctions include:

  • Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be implicit as well, to save space. In other logs, XML streams, JSON streams and so on they are explicit. (A small sketch after this list illustrates the difference.)
  • If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data.
  • In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
  • Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
  • Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.
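
To make the implicit/explicit distinction concrete, here is a minimal Python sketch; the field names, values and delimiter are all invented for illustration.

```python
import json

# Explicit field names: a JSON event carries its own <name, value> pairs.
json_record = '{"user_id": 42, "action": "login", "latency_ms": 87}'
explicit = json.loads(json_record)

# Implicit field names: a delimited log line is just values; the names live in
# metadata (here, a hard-coded column list standing in for a schema or regex).
log_line = "42|login|87"
columns = ["user_id", "action", "latency_ms"]
implicit = dict(zip(columns, log_line.split("|")))

# Either way the record boils down to <name, value> pairs; the difference is
# how much work it takes to recover the names.
assert set(explicit) == set(implicit)
```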

Some major data models can be put into a fairly strict ordering of query desirability by noting:

  • The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology that in many cases offers great query performance.
  • The next-best thing to query is another kind of data store with known field names. In such data stores:
    • SQL or SQL-like SELECTs will still work, or can easily be made to.
    • Useful indexing systems can be grafted on to them (although they are typically less mature than in RDBMS).
    • In the (mainly) future, perhaps JOINs can be grafted on as well.
  • The worst thing to query is a data store in which you only have a schema on read. You have to do work to make the thing queryable in the first place.

Unsurprisingly, that ordering is reversed when it comes to writing data.

  • The easiest thing to write to is a data store with no structure.
  • Next-easiest is to write to a data store that lets you make up the structure as you go along.
  • The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed, notably:
    • Implicit field names, governed by metadata.
    • Unique field names within any one record.
    • The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.

And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:

  • Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
    • I have in the past also advocated for a mid-range — i.e. lighter-weight — general purpose RDBMS.
    • SAP really, really wants you to use HANA to run SAP’s apps.
    • You might want an in-memory RDBMS (MemSQL) or a particularly cloudy one or whatever.
  • Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
    • They’re all immature right now, with advantages over each other.
    • The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
  • Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.

Beyond that:

  • You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system.
  • Splunk is cool.
  • Use cases for various other kinds of data stores can often be found.
  • Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.

And finally, I think in-memory data grids:


Greenplum is being open sourced

Wed, 2015-02-18 15:51

While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:

  • Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
  • In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
  • That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
  • The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.

Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.

Further, the low-cost alternatives for analytic RDBMS are adding up.

  • Amazon Redshift has considerable traction.
  • Hadoop (even just with Hive) has offloaded a lot of ELT (Extract/Load/Transform) from analytic RDBMS such as Teradata.
  • Now Greenplum is in the mix as well.

For many analytic RDBMS use cases, at least one of those three will be an appealing possibility.

By no means do I want to suggest those are the only alternatives.

  • Smaller-vendor offerings, such as CitusDB or Infobright, may well be competitive too.
  • Larger vendors can always slash price in specific deals.
  • MonetDB is still around.

But the three possibilities I cited first should suffice as proof for almost all enterprises that, for most use cases not requiring high concurrency, analytic RDBMS need not cost an arm and a leg.


Hadoop: And then there were three

Wed, 2015-02-18 15:50

Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:

  • An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
  • A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
  • An excuse for press releases.
  • A source of an extra logo graphic to put on marketing slides.

Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).

The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming Interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?

Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:

  • Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
  • MapR’s flavor, as software from MapR.
  • Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.

In saying that, I’m glossing over a few points, such as:

  • There are various remote services that run Hadoop, most famously Amazon’s Elastic MapReduce.
  • You could get Apache Hadoop directly, rather than using the free or paid versions of a vendor distro. But why would you make that choice, unless you’re an internet bad-ass on the level of Facebook, or at least think that you are?
  • There will surely always be some proprietary stuff mixed into, for example, IBM’s BigInsights, so as to preserve at least the perception of all-important vendor lock-in.

But the main point stands — big computer companies, such as IBM, EMC (Pivotal) and previously Intel, are figuring out that they can’t bigfoot something that started out as an elephant — stuffed or otherwise — in the first place.

If you think I’m not taking this whole ODP thing very seriously, you’re right.

Related links

  • It’s a bit eyebrow-raising to see Mike Olson take a “more open source than thou” stance about something, but basically his post about this news is spot-on.
  • My take on Hadoop distributions two years ago might offer context. Trivia question: What’s the connection between the song that begins that post and the joke that ends it?

MongoDB 3.0

Thu, 2015-02-12 13:44

Old joke:

  • Question: Why do policemen work in pairs?
  • Answer: One to read and one to write.

A lot has happened in MongoDB technology over the past year. For starters:

  • The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
    • 7-10X improvement in write performance.
    • No change in read performance (which however was boosted in MongoDB 2.6).
    • ~70% reduction in data size due to compression (disk only).
    • ~50% reduction in index size due to compression (disk and memory both).
  • MongoDB has been adding administration modules.
    • A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
    • An on-premise version came out with 3.0.
    • They have similar features, but are expected to grow apart from each other over time. They have different names.

*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.

To forestall confusion, let me quickly add:

  • MongoDB acquired the WiredTiger product and company, and continues to sell the product on a standalone basis, as well as bundling a version into MongoDB. This could cause confusion because …
  • … the standalone version of WiredTiger has numerous capabilities that are not in the bundled MongoDB storage engine.
  • There’s some ambiguity as to when MongoDB first “ships” a feature, in that …
  • … code goes to open source with an earlier version number than it goes into the packaged product.

I should also clarify that the addition of WiredTiger is really two different events:

  • MongoDB added the ability to have multiple plug-compatible storage engines. Depending on how one counts, MongoDB now ships two or three engines:
    • Its legacy engine, now called MMAP v1 (for “Memory Map”). MMAP continues to be enhanced.
    • The WiredTiger engine.
    • A “please don’t put this immature thing into production yet” memory-only engine.
  • WiredTiger is now the particular storage engine MongoDB recommends for most use cases.

I’m not aware of any other storage engines using this architecture at this time. In particular, last I heard TokuMX was not an example. (Edit: Actually, see Tim Callaghan’s comment below.)

Most of the issues in MongoDB write performance have revolved around locking, the story on which is approximately:

  • Until MongoDB 2.2, locks were held at the process level. (One MongoDB process can control multiple databases.)
  • As of MongoDB 2.2, locks were held at the database level, and some sanity was added as to how long they would last.
  • As of MongoDB 3.0, MMAP locks are held at the collection level.
  • WiredTiger locks are held at the document level. Thus MongoDB 3.0 with WiredTiger breaks what was previously a huge write performance bottleneck.

In understanding that, I found it helpful to do a partial review of what “documents” and so on in MongoDB really are.

  • A MongoDB document is somewhat like a record, except that it can be more like what in a relational database would be all the records that define a business object, across dozens or hundreds of tables.*
  • A MongoDB collection is somewhat like a table, although the documents that comprise it do not need to each have the same structure.
  • MongoDB documents want to be capped at 16 MB in size. If you need one bigger, there’s a special capability called GridFS to break it into lots of little pieces (default = 1KB) while treating it as a single document logically.

*One consequence — MongoDB’s single-document ACID guarantees aren’t quite as lame as single-record ACID guarantees would be in an RDBMS.
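
For concreteness, here is a minimal PyMongo sketch of documents, collections and GridFS; the database, collection and field names are made up, and this is generic driver usage rather than anything specific to MongoDB 3.0.

```python
from pymongo import MongoClient
import gridfs

client = MongoClient()                      # assumes a local mongod
db = client.shop                            # hypothetical database name

# A "document" can bundle what an RDBMS would spread across many tables.
db.customers.insert_one({
    "name": "Alice",
    "addresses": [{"city": "Boston"}, {"city": "New York"}],
    "orders": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
})

# A "collection" holds documents that need not share the same structure.
db.customers.insert_one({"name": "Bob", "loyalty_tier": "gold"})

# Anything over the 16 MB document cap goes through GridFS, which chunks the
# payload while letting you treat it as one logical file.
fs = gridfs.GridFS(db)
file_id = fs.put(b"x" * (20 * 1024 * 1024), filename="clickstream_dump.bin")
```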

By the way:

  • Row-level locking was a hugely important feature in RDBMS about 20 years ago. Sybase’s lack of it is a big part of what doomed them to second-tier status.
  • Going forward, MongoDB has made the unsurprising marketing decision to talk about “locks” as little as possible, relying instead on alternate terms such as “concurrency control”.

Since its replication mechanism is transparent to the storage engine, MongoDB allows one to use different storage engines for different replicas of data. Reasons one might want to do this include:

  • Fastest persistent writes (WiredTiger engine).
  • Fastest reads (wholly in-memory engine).
  • Migration from one engine to another.
  • Integration with some other data store. (Imagine, for example, a future storage engine that works over HDFS. It probably wouldn’t have top performance, but it might make Hadoop integration easier.)

In theory one can even do a bit of information lifecycle management (ILM), using different storage engines for different subsets of the database, by:

  • Pinning specific shards of data to specific servers.
  • Using different storage engines on those different servers.

That said, similar stories have long been told about MySQL, and I’m not aware of many users who run multiple storage engines side by side.

The MongoDB WiredTiger option is shipping with a couple of options for block-level compression (plus prefix compression that is being used for indexes only). The full WiredTiger product also has some forms of columnar compression for data.

One other feature in MongoDB 3.0 is the ability to have 50 replicas of data (the previous figure was 12). MongoDB can’t think of a great reason to have more than 3 replicas per data center or more than 2 replicas per metropolitan area, but some customers want to replicate data to numerous locations around the world.


Information technology for personal safety

Sun, 2015-02-01 05:34

There are numerous ways that technology, now or in the future, can significantly improve personal safety. Three of the biggest areas of application are or will be:

  • Crime prevention.
  • Vehicle accident prevention.
  • Medical emergency prevention and response.

Implications will be dramatic for numerous industries and government activities, including but not limited to law enforcement, automotive manufacturing, infrastructure/construction, health care and insurance. Further, these technologies create a near-certainty that individuals’ movements and status will be electronically monitored in fine detail. Hence their development and eventual deployment constitutes a ticking clock toward a deadline for society deciding what to do about personal privacy.

Theoretically, humans aren’t the only potential kind of tyrant. Science fiction author Jack Williamson postulated a depressing nanny-technology in With Folded Hands, the idea for which was later borrowed by the humorous Star Trek episode I, Mudd.

Of these three areas, crime prevention is the furthest along; in particular, sidewalk cameras, license plate cameras and internet snooping are widely deployed around the world. So let’s consider the other two.

Vehicle accident prevention

Suppose every automobile on the road “knew” where all nearby vehicles were, and their speed and direction as well. Then it could also “know” the safest and fastest ways to move you along. You might actively drive, while it advised and warned you; it might be the default “driver”, with you around to override. In-between possibilities exist as well.

Frankly, I don’t know how expensive a suitably powerful and rugged transponder for such purposes would be. I also don’t know to what extent the most efficient solutions would involve substantial investment in complementary, stationary equipment. But I imagine the total cost would be relatively small compared to that of automobiles or auto insurance.

Universal deployment of such technology could be straightforward. If the government can issue you license plates, it can issue transponders as well, or compel you to get your own. It would have several strong motivations to do so, including:

  • Electronic toll collection — this is already happening in a significant fraction of automobiles around the world.
  • Snooping for the purpose of law enforcement.
  • Accident prevention.
  • (The biggest of all.) Easing the transition to autonomous vehicles.

Insurance companies have their own motivations to support safety-related technology. And the automotive industry has long been aggressive in incorporating microprocessor technology. Putting that all together, I am confident in the prediction: Smart cars are going to happen.

The story goes further yet. Despite improvements in safety technology, accidents will still happen. And the same location-tracking technology used for real-time accident avoidance should provide a massive boost to post-accident forensics, for use in:

  • Insurance adjudication (obviously and often),
  • Criminal justice (when the accident has criminal implications), and
  • Predictive modeling.

The predictive modeling, in turn, could influence (among other areas):

  • General automobile design (if a lot of accidents have a common cause, re-engineer to address it).
  • Maintenance of specific automobiles (if the car’s motion is abnormal, have it checked out).
  • Individual drivers’ insurance rates.

Transportation is going to change a lot.

Medical emergency prevention and response

I both expect and welcome the rise of technology that helps people who can’t reliably take care of themselves (babies, the elderly) to be continually monitored. My father and aunt might each have lived longer if such technology had been available sooner. But while the life-saving emergency response uses will be important enough, emergency avoidance may be an even bigger deal. Much as in my discussion above of cars, the technology could also be used to analyze when an old person is at increasing risk of falls or other incidents. In a world where families live apart but nursing homes are terrible places, this could all be a very important set of developments.

Another area where the monitoring/response/analysis/early-warning cycle could work is cardio-vascular incidents. I imagine we’ll soon have wearable devices that help detect the development or likelihood of various kinds of blockages, and hence forestall cardiovascular emergencies, such as those that often befall seemingly-healthy middle-aged people. Over time, I think those devices will become pretty effective. The large market opportunity should be obvious.

Once life-and-death benefits lead the way, I expect less emergency-focused kinds of fitness monitoring to find receptive consumers as well. (E.g. in the intestinal/nutrition domain.) And so I have another prediction (with an apology to Socrates): The unexamined life will seem too dangerous to continue living.

Trivia challenge: Where was the wordplay in that last paragraph?

Related links

  • My overview of innovation opportunities ended by saying there was great opportunity in devices. It also offered notes on predictive modeling and so on.
  • My survey of technologies around machine-generated data ended by focusing on predictive modeling for problem and anomaly detection and diagnosis, for machines and bodies alike.
  • The topics of this post are part of why I’m bullish on machine-generated data growth.
  • I think soft robots that also provide practical assistance could become a big part of health-related monitoring.

Growth in machine-generated data

Fri, 2015-01-30 13:31

In one of my favorite posts, namely When I am a VC Overlord, I wrote:

I will not fund any entrepreneur who mentions “market projections” in other than ironic terms. Nobody who talks of market projections with a straight face should be trusted.

Even so, I got talked today into putting on the record a prediction that machine-generated data will grow at more than 40% for a while.

My reasons for this opinion are little more than:

  • Moore’s Law suggests that the same expenditure will buy 40% or so more machine-generated data each year.
  • Budgets spent on producing machine-generated data seem to be going up.

I was referring to the creation of such data, but the growth rates of new creation and of persistent storage are likely, at least at this back-of-the-envelope level, to be similar.

Anecdotal evidence actually suggests 50-60%+ growth rates, so >40% seemed like a responsible claim.
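
For scale, here is a tiny arithmetic sketch of what such growth rates compound to; the five-year horizon is arbitrary.

```python
# How a starting data volume multiplies over five years at various annual growth rates.
for rate in (0.40, 0.50, 0.60):
    multiple = (1 + rate) ** 5
    print(f"{rate:.0%} per year -> {multiple:.1f}x after 5 years")

# 40% per year -> 5.4x after 5 years
# 50% per year -> 7.6x after 5 years
# 60% per year -> 10.5x after 5 years
```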


Soft robots, Part 2 — implications

Tue, 2015-01-27 06:31

What will soft, mobile robots be able to do that previous generations cannot? A lot. But I’m particularly intrigued by two large categories:

  • Inspection, maintenance and repair.
  • Health care/family care assistance.

There are still many things that are hard for humans to keep in good working order, including:

  • Power lines.
  • Anything that’s underwater (cables, drilling platforms, etc.)
  • Pipelines, ducts, and water mains (especially from the inside).
  • Any kind of geographically remote power station or other installation.

Sometimes the issue is (hopefully minor) repairs. Sometimes it’s cleaning or lubrication. In some cases one might want to upgrade a structure with fixed sensors, and the “repair” is mainly putting those sensors in place. In all these cases, it seems that soft robots could eventually offer a solution. Further examples, I’m sure, could be found in factories, mines, or farms.

Of course, if there’s a maintenance/repair need, inspection is at least part of the challenge; in some cases it’s almost the whole thing. And so this technology will help lead us toward the point that substantially all major objects will be associated with consistent flows of data. Opportunities for data analysis will abound.

One other point about data flows — suppose you have two kinds of machines that can do a task, one of which is flexible, the other rigid. The flexible one will naturally have much more variance in what happens from one instance of the task to the next one. That’s just another way in which soft robots will induce greater quantities of machine-generated data.

Let’s now consider health care, whose basic characteristics include:

  • It’s done to people …
  • … especially ones who don’t feel very good.

People who are sick, elderly or whatever can often use help with simple tasks — e.g., taking themselves to the bathroom, or fetching a glass of water. So can their caretakers — e.g., turning a patient over in bed. That’s even before we get to more medical tasks such as checking and re-bandaging an awkwardly-placed wound. And on the healthier side, I wouldn’t mind having a robot around the house that could, for example, spot me with free weights. Fully general forms of this seem rather futuristic. But even limited forms might augment skilled-nurse labor, or let people stay in their own homes who at the moment can’t quite make it there.

And, once again, any of these use cases would likely be associated with its own stream(s) of observational and introspective data.


Soft robots, Part 1 — introduction

Tue, 2015-01-27 06:29

There may be no other subject on which I’m so potentially biased as robotics, given that:

  • I don’t spend a lot of time on the area, but …
  • … one of the better robotics engineers in the world (Kevin Albert) just happens to be in my family …
  • … and thus he’s overwhelmingly my main source on the general subject of robots.

Still, I’m solely responsible for my own posts and opinions, while Kevin is busy running his startup (Pneubotics) and raising my grandson. And by the way — I’ve been watching the robotics industry slightly longer than Kevin has been alive. ;)

My overview messages about all this are:

  • Historically, robots have been very limited in their scope of motion and action. Indeed, most successful robots to date have been immobile, metallic programmable machines, serving on classic assembly lines.
  • Next-generation robots should and will be much more able to safely and effectively navigate through and work within general human-centric environments.
  • This will affect a variety of application areas in ways that readers of this blog may care about.

Examples of the first point may be found in any number of automobile factory videos, such as:

A famous example of the second point is a 5-year-old video of Kevin’s work on prototype robot locomotion, namely:

Walking robots (such as Big Dog) and general soft robots (such as those from Pneubotics) rely on real-time adaptation to physical feedback. Robots have long enjoyed machine vision,* but their touch capabilities have been very limited. Current research/development proposes to solve that problem, hence allowing robots that can navigate uneven real-world surfaces, grip and lift objects of unpredictable weight or position, and minimize consequences when unwanted collisions do occur. (See for example in the video where Big Dog is kicked sideways across a nasty patch of ice.)

*Little-remembered fact — Symantec spun out ~30 years ago from a vision company called Machine Intelligence, back when “artificial intelligence” was viewed as a meaningful product category. Symantec’s first product — which explains the company name — was in natural language query.

Pneubotics and others take this further, by making robots out of soft, light, flexible materials. Benefits will/could include:

  • Safety (obviously).
  • Cost-effectiveness (better weight/strength ratios -> less power needed -> less lugging of batteries or whatever -> much more capability for actual work).
  • Operation in varied environments (underwater, outer space, etc.).
  • Better locomotion even on dry land (because of weight and safety).

Above all, soft robots will have more effective senses of touch, as they literally bend and conform to contact with real-world surfaces and objects.

Now let’s turn to some of the implications of soft and mobile robotic technology.

Related links

  • This series partially fulfils an IOU left in my recent post on IT innovation.
  • Business Week is one of several publications that have written about soft robots.
  • Kevin shared links to three more videos on robot locomotion.
  • What I wrote about analyst bias back in 2006 still applies.

Where the innovation is

Mon, 2015-01-19 02:27

I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. :) But if we abandon any hope that this post could be comprehensive, I can at least say:

1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.

  • Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
  • Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
  • Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.

2. Even so, there’s much room for innovation around data movement and management. I’d start with:

  • Product maturity is a huge issue for all the above, and will remain one for years.
  • Hadoop and Spark show that application execution engines:
    • Have a lot of innovation ahead of them.
    • Are tightly entwined with data management, and with data movement as well.
  • Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
  • There are many issues in storage that can affect data technologies as well, including but not limited to:
    • Solid-state (flash or post-flash) vs. spinning disk.
    • Networked vs. direct-attached.
    • Virtualized vs. identifiable-physical.
    • Object/file/block.
  • Graph analytics and data management are still confused.

3. As I suggested last year, data transformation is an important area for innovation. 

  • MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
  • The smart data preparation crowd is deservedly getting attention.
  • The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.

4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:

Beyond that:

  • Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
  • I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
  • While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.

5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:

  • “Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
  • Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
  • … the cost of such process isolation may need to be borne.
  • Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.

More specifically,

  • It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
  • It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a performance hit for security without much pain to users’ wallets.
  • On servers, we may in many cases be talking about lightweight virtual machines.

And to be clear:

  • What I’m talking about would do little to help the authentication/authorization aspects of security, but …
  • … those will never be perfect in any case (because they depend upon fallible humans) …
  • … which is exactly why other forms of security will always be needed.

6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.

7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:

  • Application development technology, languages, frameworks, etc.
  • The integration of analytics into old-style operational apps.
  • The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.

8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.

Related link

In many cases, I think that innovations will prove more valuable — or at least much easier to monetize — when presented to particular vertical markets.


Migration

Sat, 2015-01-10 00:45

There is much confusion about migration, by which I mean applications or investment being moved from one “platform” technology — hardware, operating system, DBMS, Hadoop, appliance, cluster, cloud, etc. — to another. Let’s sort some of that out. For starters:

  • There are several fundamentally different kinds of “migration”.
    • You can re-host an existing application.
    • You can replace an existing application with another one that does similar (and hopefully also new) things. This new application may be on a different platform than the old one.
    • You can build or buy a wholly new application.
    • There’s also the in-between case in which you extend an old application with significant new capabilities — which may not be well-suited for the existing platform.
  • Motives for migration generally fall into a few buckets. The main ones are:
    • You want to use a new app, and it only runs on certain platforms.
    • The new platform may be cheaper to buy, rent or lease.
    • The new platform may have lower operating costs in other ways, such as administration.
    • Your employees may like the new platform’s “cool” aspect. (If the employee is sufficiently high-ranking, substitute “strategic” for “cool”.)
  • Different apps may be much easier or harder to re-host. At two extremes:
    • It can be forbiddingly difficult to re-host an OLTP (OnLine Transaction Processing) app that is heavily tuned, tightly integrated with your other apps, and built using your DBMS vendor’s proprietary stored-procedure language.
    • It might be trivial to migrate a few long-running SQL queries to a new engine, and pretty easy to handle the data connectivity part of the move as well.
  • Certain organizations, usually packaged software companies, design portability into their products from the get-go, with at least partial success.

I mixed together true migration and new-app platforms in a post last year about DBMS architecture choices, when I wrote:

  • Sometimes something isn’t broken, and doesn’t need fixing.
  • Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
  • Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.

In particular, migration away from legacy DBMS raises many issues:

  • Feature incompatibility (especially in stored-procedure languages and/or other vendor-specific SQL).
  • Your staff’s programming and administrative skill-sets.
  • Your investment in DBMS-related tools.
  • Your supply of hockey tickets from the vendor’s salesman.

Except for the first, those concerns can apply to new applications as well. So if you’re going to use something other than your enterprise-standard RDBMS, you need a good reason.

I then argued that such reasons are likely to exist for NoSQL DBMS, but less commonly for NewSQL. My views on that haven’t changed in the interim.

More generally, my pro-con thoughts on migration start:

  • Pure application re-hosting is rarely worthwhile. Migration risks and costs outweigh the benefits, except in a few cases, one of which is the migration of ELT (Extract/Load/Transform) from expensive analytic RDBMS to Hadoop.
  • Moving from in-house to co-located data centers can offer straightforward cost savings, because it’s not accompanied by much in the way of programming costs, risks, or delays. Hence Rackspace’s refocus on colo at the expense of cloud. (But it can be hard on your data center employees.)
  • Moving to an in-house cluster can be straightforward, and is common. VMware is the most famous such example. Exadata consolidation is another.
  • Much of new application/new functionality development is in areas where application lifespans are short — e.g. analytics, or customer-facing internet. Platform changes are then more practical as well.
  • New apps or app functionality often should and do go where the data already is. This is especially true in the case of cloud/colo/on-premises decisions. Whether it’s important in a single location may depend upon the challenges of data integration.

I’m also often asked for predictions about migration. In light of the above, I’d say:

  • Successful DBMS aren’t going away.
    • OLTP workloads can usually be lost only so fast as applications are replaced, and that tends to be a slow process. Claims to the contrary are rarely persuasive.
    • Analytic DBMS can lose workloads more easily — but their remaining workloads often grow quickly, creating an offset.
  • A large fraction of new apps are up for grabs. Analytic applications go well on new data platforms. So do internet apps of many kinds. The underlying data for these apps often starts out in the cloud. SaaS (Software as a Service) is coming on strong. Etc.
  • I stand by my previous view that most computing will wind up on appliances, clusters or clouds.
  • New relational DBMS will be slow to capture old workloads, even if they are slathered with in-memory fairy dust.

And for a final prediction — discussion of migration isn’t going to go away either. :)


Notes on machine-generated data, year-end 2014

Wed, 2014-12-31 21:49

Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.

1. There are many kinds of machine-generated data. Important categories include:

  • Web, network and other IT logs.
  • Game and mobile app event data.
  • CDRs (telecom Call Detail Records).
  • “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
  • Sensor network output (for example from a pipeline or other utility network).
  • Vehicle telemetry.
  • Health care data, in hospitals.
  • Digital health data from consumer devices.
  • Images from public-safety camera networks.
  • Stock tickers (if you regard them as being machine-generated, which I do).

That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.

2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:

  • Government snooping on the contents of communications.
  • Communication traffic analysis.
  • Photos and videos (airport scanners, public cameras, etc.)
  • Commercial ad targeting.
  • Traditional medical records.

Other areas, however, continue to be overlooked, with the two biggies in my opinion being:

  • The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.
  • The ability to track people’s movements in great detail, which will be increased greatly yet again as the market matures — and some think this will happen soon — for consumer digital health.

My core arguments about privacy and surveillance seem as valid as ever.

3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …

4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
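
To make the event-series point concrete, here is a small pandas sketch of one canonical task, sessionizing events per user, which is awkward in plain SQL but natural with ordered, window-style operations. The column names, timestamps and 30-minute gap rule are all invented.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "ts": pd.to_datetime([
        "2015-02-01 09:00", "2015-02-01 09:10", "2015-02-01 11:00",
        "2015-02-01 09:05", "2015-02-01 09:06",
    ]),
}).sort_values(["user_id", "ts"])

# A new session starts whenever a user's gap between events exceeds 30 minutes.
gap = events.groupby("user_id")["ts"].diff() > pd.Timedelta(minutes=30)
events["session"] = gap.groupby(events["user_id"]).cumsum()

# Typical event-series questions then reduce to ordinary group-bys.
session_lengths = events.groupby(["user_id", "session"]).size()
```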

5. Event series even aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.

6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. is still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.

Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:

  • High-frequency trading is an extreme race; microseconds matter.
  • Internet interaction applications increasingly require data freshness to the last click or other user action. Computational latency requirements can go down to the single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.
  • Minute-plus response can be fine for individual remote systems. Sometimes they ping home more rarely than that.

There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.

7. My views about predictive analytics are still somewhat confused. For starters:

  • The math and technology of predictive modeling both still seem pretty simple …
  • … but sometimes achieve mind-blowing results even so.
  • There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.
  • Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud.

So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.

Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back before World War II. Perhaps such breakthroughs are already underway; the Conviva retraining example listed above is certainly imaginative. But I’d like to see a lot more in the area.

Even more important, of course, could be some kind of revolution in predictive modeling for medicine.


WibiData’s approach to predictive modeling and experimentation

Tue, 2014-12-16 06:29

A conversation I have too often with vendors goes something like:

  • “That confidential thing you told me is interesting, and wouldn’t harm you if revealed; probably quite the contrary.”
  • “Well, I guess we could let you mention a small subset of it.”
  • “I’m sorry, that’s not enough to make for an interesting post.”

That was the genesis of some tidbits I recently dropped about WibiData and predictive modeling, especially but not only in the area of experimentation. However, Wibi just reversed course and said it would be OK for me to tell more or less the full story, as long as I note that we’re talking about something that’s still in beta test, with all the limitations (to the product and my information alike) that beta implies.

As you may recall:

With that as background, WibiData’s approach to predictive modeling as of its next release will go something like this:

  • There is still a strong element of classical modeling by data scientists/statisticians, with the models re-scored in batch, perhaps nightly.
  • But of course at least some scoring should be done as real-time as possible, to accommodate fresh data such as:
    • User interactions earlier in today’s session.
    • Technology for today’s session (device, connection speed, etc.)
    • Today’s weather.
  • WibiData Express is/incorporates a Scala-based language for modeling and query.
  • WibiData believes Express plus a small algorithm library gives better results than more mature modeling libraries.
    • There is some confirming evidence of this …
    • … but WibiData’s customers have by no means switched over yet to doing the bulk of their modeling in Wibi.
  • WibiData will allow line-of-business folks to experiment with augmentations to the base models.
  • Supporting technology for predictive experimentation in WibiData will include:
    • Automated multi-armed bandit testing (in previous versions even A/B testing has been manual). A generic sketch of the bandit idea appears after this list.
    • A facility for allowing fairly arbitrary code to be included into otherwise conventional model-scoring algorithms, where conventional scoring models can come:
      • Straight from WibiData Express.
      • Via PMML (Predictive Modeling Markup Language) generated by other modeling tools.
    • An appropriate user interface for the line-of-business folks to do certain kinds of injection.
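
As background for the multi-armed bandit item above, here is a generic Thompson-sampling sketch of the technique. It is not WibiData's implementation; the variant names and the success/failure counts are invented.

```python
import random

# Two arms: the baseline model vs. a business-rule-tweaked variant.
# alpha/beta track conversions and non-conversions observed so far.
arms = {
    "baseline":     {"alpha": 120, "beta": 880},
    "rule_tweaked": {"alpha": 140, "beta": 860},
}

def choose_arm():
    # Sample a plausible conversion rate for each arm and play the best draw.
    draws = {name: random.betavariate(a["alpha"], a["beta"]) for name, a in arms.items()}
    return max(draws, key=draws.get)

def record_outcome(arm, converted):
    arms[arm]["alpha" if converted else "beta"] += 1

# Over time, traffic drifts toward whichever variant keeps converting better,
# without anyone scheduling a manual A/B test.
arm = choose_arm()
record_outcome(arm, converted=True)
```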

Let’s talk more about predictive experimentation. WibiData’s paradigm for that is:

  • Models are worked out in the usual way.
  • Businesspeople have reasons for tweaking the choices the models would otherwise dictate.
  • They enter those tweaks as rules.
  • The resulting combination — models plus rules — are executed and hence tested.

If those reasons for tweaking are in the form of hypotheses, then the experiment is a test of those hypotheses. However, WibiData has no provision at this time to automagically incorporate successful tweaks back into the base model.

What might those hypotheses be like? It’s a little tough to say, because I don’t know in fine detail what is already captured in the usual modeling process. WibiData gave me only one real-life example, in which somebody hypothesized that shoppers would be in more of a hurry at some times of day than others, and hence would want more streamlined experiences when they could spare less time. Tests confirmed that was correct.
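
To illustrate the models-plus-rules idea with that time-of-day example, here is a purely hypothetical sketch; none of the function or field names come from WibiData, and the base score and boost factor are invented.

```python
from datetime import datetime

def base_model_score(user_id, experience):
    # Stand-in for the score a conventionally trained model would produce.
    return 0.31

def apply_business_rules(score, context):
    # A line-of-business tweak entered as a rule: during the morning rush,
    # favor the streamlined experience for shoppers in a hurry.
    if 7 <= context["hour"] < 9 and context["experience"] == "streamlined":
        score *= 1.2
    return score

context = {"hour": datetime.now().hour, "experience": "streamlined"}
final_score = apply_business_rules(base_model_score("u1", "streamlined"), context)
```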

That said, I did grow up around retailing, and so I’ll add:

  • Way back in the 1970s, Wal-Mart figured out that in large college towns, clothing in the football team’s colors was wildly popular. I’d hypothesize such a rule at any vendor selling clothing suitable for being worn in stadiums.
  • A news event, blockbuster movie or whatever might trigger a sudden change in/addition to fashion. An alert merchant might guess that before the models pick it up. Even better, she might guess which psychographic groups among her customers were most likely to be paying attention.
  • Similarly, if a news event caused a sudden shift in buyers’ optimism/pessimism/fear of disaster, I’d test a response to that immediately.

Finally, data scientists seem to still be a few years away from neatly solving the problem of multiple shopping personas — are you shopping in your business capacity, or for yourself, or for a gift for somebody else (and what can we infer about that person)? Experimentation could help fill the gap.


Notes and links, December 12, 2014

Fri, 2014-12-12 05:05

1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.

For starters:

  • The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
  • If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.

I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:

  • Suppose we have lots of logs about lots of things.* Machine learning can help:
    • Notice what’s an anomaly.
    • Group* together things that seem to be experiencing similar anomalies.
  • That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me.

* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
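
As a toy illustration of that idea (generic scikit-learn, not anything ScalingData has described in detail), imagine per-host metrics derived from logs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows = hosts, columns = [error rate, median latency in ms]; all numbers invented.
metrics = np.array([
    [0.20, 110], [0.30, 120], [0.25, 115], [0.20, 105],  # normal-looking hosts
    [2.50, 130], [2.70, 125],                            # high errors, normal latency
    [0.30, 900], [0.25, 950],                            # normal errors, high latency
])

# Step 1: notice what's an anomaly, here via a crude per-metric z-score test.
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)
anomalous = (np.abs(z) > 1.0).any(axis=1)

# Step 2: group hosts experiencing similar anomalies, so a BI-plus interface
# can show a human two incidents rather than four separate alerts.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(metrics[anomalous])
```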

Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.

2. Discussion of graph DBMS can get confusing. For example:

  • Use cases run the gamut from short-request to highly analytic; no graph DBMS is well-suited for all graph use cases.
  • Graph DBMS have huge problems scaling, because graphs are very hard to partition usefully; hence some of the more analytic use cases may not benefit from a graph DBMS at all.
  • The term “graph” has meanings in computer science that have little to do with the problems graph DBMS try to solve, notably directed acyclic graphs for program execution, which famously are at the heart of both Spark and Tez.
  • My clients at Neo Technology/Neo4j call one of their major use cases MDM (Master Data Management), without getting much acknowledgement of that from the mainstream MDM community.

I mention this in part because that “MDM” use case actually has some merit. The idea is that hierarchies such as organization charts, product hierarchies and so on often aren’t actually strict hierarchies. And even when they are, they’re usually strict only at specific points in time; if you care about their past state as well as their present one, a hierarchical model might have trouble describing them. Thus, LDAP (Lightweight Directory Access Protocol) engines may not be an ideal way to manage and reference such “hierarchies”; a graph DBMS might do better.

3. There is a surprising degree of controversy among predictive modelers as to whether more data yields better results. Besides, the most common predictive modeling stacks have difficulty scaling. And so it is common to model against samples of a data set rather than the whole thing.*

*Strictly speaking, almost the whole thing — you’ll often want to hold at least a sample of the data back for model testing.

Well, WibiData’s couple of Very Famous Department Store customers have tested WibiData’s ability to model against an entire database vs. their alternative predictive modeling stacks’ need to sample data. WibiData says that both report significantly better results from training over the whole data set than from using just samples.

4. ScalingData is on the bandwagon for Spark Streaming and Kafka.

5. Derrick Harris and Pivotal turn out to have been earlier than me in posting about Tachyon bullishness.

6. With the Hortonworks deal now officially priced, Derrick was also free to post more about/from Hortonworks’ pitch. Of course, Hortonworks is saying Hadoop will be Big Big Big, and suggesting we should thus not be dismayed by Hortonworks’ financial performance so far. However, Derrick did not cite Hortonworks actually giving any reasons why its competitive position among Hadoop distribution vendors should improve.

Beyond that, Hortonworks says YARN is a big deal, but doesn’t seem to like Spark Streaming.


A few numbers from MapR

Wed, 2014-12-10 00:55

MapR put out a press release aggregating some customer information; unfortunately, the release is a monument to vagueness. Let me start by saying:

  • I don’t know for sure, but I’m guessing Derrick Harris was incorrect in suspecting that this release was a reaction to my recent post about Hortonworks’ numbers. For one thing, press releases usually don’t happen that quickly.
  • And as should be obvious from the previous point — notwithstanding that MapR is a client, I had no direct involvement in this release.
  • In general, I advise clients and other vendors to put out the kind of aggregate of customer success stories found in this release. However, I would like to see more substance than MapR offered.

Anyhow, the key statement in the MapR release is:

… the number of companies that have a paid subscription for MapR now exceeds 700.

Unfortunately, that includes OEM customers as well as direct ones; I imagine MapR’s direct customer count is much lower.

In one gesture to numerical conservatism, MapR did indicate by email that it counts by overall customer organization, not by department/cluster/contract (i.e., not the way Hortonworks does).

The MapR press release also said:

As of November 2014, MapR has one or more customers in eight vertical markets that have purchased more than one million dollars of MapR software and services.  These vertical markets are advertising/media, financial services, healthcare, internet, information technology, retail, security, and telecom.

Since the word “each” isn’t in that quote, we don’t even know whether MapR is referring to individual big customers or just general sector penetration. We also don’t know whether the revenue is predominantly subscription or some other kind of relationship.

MapR also indicated that the average customer more than doubled its annualized subscription rate vs. a year ago; the comparable figure — albeit with heavy disclaimers — from Hortonworks was 25%.


Hadoop’s next refactoring?

Sun, 2014-12-07 08:59

I believe in all of the following trends:

  • Hadoop is a Big Deal, and here to stay.
  • Spark, for most practical purposes, is becoming a big part of Hadoop.
  • Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.

Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:

  • People would like this to be true, although in most cases only as one of several cluster computing platforms.
  • Hadoop, when viewed as an operating system, is extremely primitive.
  • Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.

There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.

Recall now that Hadoop — like much else in IT — has always been about two things: data storage and program execution. The evolution of Hadoop program execution to date has been approximately:

  • Originally, MapReduce and JobTracker were the way to execute programs in Hadoop, period, at least if we leave HBase out of the discussion.
  • In a major refactoring, YARN replaced a lot of what JobTracker did, with the result that different program execution frameworks became easier to support.
  • Most of the relevant program execution frameworks — such as MapReduce, Spark or Tez — have data movement and temporary storage near their core.

Meanwhile, Hadoop data storage is mainly about HDFS (Hadoop Distributed File System). Its evolution, besides general enhancement, has included the addition of file types suitable for specific kinds of processing (e.g. Parquet and ORC to accelerate analytic database queries). Also, there have long been hacks that more or less bypassed central Hadoop data management, and let data be moved in parallel on a node-by-node basis. But several signs suggest that Hadoop data storage should and will be refactored too. Three efforts in particular point in that direction:

The part of all this I find most overlooked is inter-program data exchange. If two programs both running on Hadoop want to exchange data, what do they do, other than reading and writing to HDFS, or invoking some kind of custom connector? What’s missing is a nice, flexible distributed memory layer, which:

  • Works well with Hadoop execution engines (Spark, Tez, Impala …).
  • Works well with other software people might want to put on their Hadoop nodes.
  • Interfaces nicely to HDFS, Isilon, object storage, et al.
  • Is fully parallel any time it needs to talk with persistent or external storage.
  • Can be fully parallel any time it needs to talk with any other software on the Hadoop cluster.

Tachyon could, I imagine, become that. HDFS caching probably could not.
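To make the interchange point concrete, here's a minimal sketch, assuming a Tachyon-style memory layer that exposes a Hadoop-compatible filesystem URI (as Tachyon in fact does, via its Hadoop-compatible client). The host, port, and paths are invented, and in practice the producer and consumer would be separate jobs, possibly on different engines.

```python
# Hypothetical sketch of two Hadoop-resident programs exchanging data through a
# shared memory-speed layer rather than round-tripping through HDFS proper.
# Assumes the Tachyon client is on the classpath; host, port, and paths are
# made up, and both roles are shown in one script purely for brevity.
from pyspark import SparkContext

sc = SparkContext(appName="interchange-sketch")

# Producer side: an ETL-ish job writes its output to the shared memory layer.
cleaned = sc.textFile("hdfs:///logs/raw/2014-12-07") \
            .filter(lambda line: "DEBUG" not in line)
cleaned.saveAsTextFile("tachyon://tachyon-master:19998/shared/cleaned-logs")

# Consumer side: a later job (possibly a different engine entirely) reads the
# same path back, with no custom connector in between.
shared = sc.textFile("tachyon://tachyon-master:19998/shared/cleaned-logs")
print(shared.count())
```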

In the past, I’ve been skeptical of in-memory data grids. But now I think that such a grid could take Hadoop to the next level of generality and adoption.

Related links

Categories: Other

Notes on the Hortonworks IPO S-1 filing

Sun, 2014-12-07 07:53

Given my stock research experience, perhaps I should post about Hortonworks’ initial public offering S-1 filing. :) For starters, let me say:

  • Hortonworks’ subscription revenues for the 9 months ended last September 30 appear to be:
    • $11.7 million from everybody but Microsoft, …
    • … plus $7.5 million from Microsoft, …
    • … for a total of $19.2 million.
  • Hortonworks states subscription customer counts (as per Page 55, this includes multiple “customers” within the same organization) of:
    • 2 on April 30, 2012.
    • 9 on December 31, 2012.
    • 25 on April 30, 2013.
    • 54 on September 30, 2013.
    • 95 on December 31, 2013.
    • 233 on September 30, 2014.
  • Per Page 70, Hortonworks’ total September 30, 2014 customer count was 292, including professional services customers.
  • Non-Microsoft subscription revenue in the quarter ended September 30, 2014 seems to have been $5.6 million, or $22.5 million annualized. This suggests Hortonworks’ average subscription revenue per non-Microsoft customer is a little over $100K/year.
  • This IPO looks to be a sharply “down round” vs. Hortonworks’ Series D financing earlier this year.
    • In March and June 2014, Hortonworks sold stock that was subsequently converted into half a Hortonworks share each, at $12.1871 per share.
    • The tentative top of the offering’s price range is $14/share.
    • That’s also slightly down from the Series C price in mid-2013.

And, perhaps of interest only to me — there are approximately 50 references to YARN in the Hortonworks S-1, but only 1 mention of Tez.

Overall, the Hortonworks S-1 is about 180 pages long, and — as is typical — most of it is boilerplate, minutiae or drivel. As is also typical, two of the most informative sections of the Hortonworks S-1 are:

The clearest financial statements in the Hortonworks S-1 are probably the quarterly figures on Page 62, along with the tables on Pages F3, F4, and F7.

Special difficulties in interpreting Hortonworks’ numbers include:

  • A large fraction of revenue has come from a few large customers, most notably Microsoft. Details about those revenues are further confused by:
    • Difficulty in some cases getting a fix on the subscription/professional services split. (It does seem clear that Microsoft revenues are 100% subscription.)
    • Some revenue deductions associated with stock deals, called “contra-revenue”.
  • Hortonworks changed the end of its fiscal year from April to December, leading to comparisons of a couple of eight-month periods.
  • There was a $6 million lawsuit settlement (some kind of employee poaching/trade secrets case), discussed on Page F-21.
  • There is some counter-intuitive treatment of Windows-related development (cost of revenue rather than R&D).

One weirdness is that cost of professional services revenue far exceeds 100% of such revenue in every period Hortonworks reports. Hortonworks suggests that this is because:

  • Professional services revenue is commonly bundled with support contracts.
  • Such revenue is recognized ratably over the life of the contract, as opposed to a more natural policy of recognizing professional services revenue when the services are actually performed.
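In stylized, made-up numbers, the mechanism Hortonworks describes would look something like this:

```python
# Made-up numbers illustrating ratable recognition of bundled services revenue;
# these are NOT actual Hortonworks figures.
services_billed = 120000      # services delivered early in a 12-month contract
cost_of_delivery = 100000     # consultants' time, expensed when the work is done
contract_months = 12

recognized_per_month = services_billed / contract_months           # $10,000 per month
first_month_cost_ratio = cost_of_delivery / recognized_per_month   # 10x, i.e. 1,000%

# Early in the contract, reported cost of services revenue can thus dwarf the
# services revenue recognized in the same period, even for work that is
# profitable over the contract's full life.
```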

I’m struggling to come up with a benign explanation for this.

In the interest of space, I won’t quote Hortonworks’ S-1 verbatim; instead, I’ll just note where some of the more specifically informative parts may be found.

  • Page 53 describes Hortonworks’ typical sales cycles (they’re long).
  • Page 54 says the average customer has increased subscription payments 25% year over year, but emphasizes that the sample size is too small to be reliable.
  • Pages 55-63 have a lot of revenue and expense breakdowns.
  • Deferred revenue numbers (which are a proxy for billings and thus signed contracts) are on Page 65.
  • Pages II 2-3 list all (I think) Hortonworks financings in a concise manner.

And finally, Hortonworks’ dealings with its largest customers and strategic partners are cited in a number of places. In particular:

  • Pages 52-3 cover dealings with Yahoo, Teradata, Microsoft, and AT&T.
  • Pages 82-3 discuss OEM revenue from Hewlett-Packard, Red Hat, and Teradata, none of which amounts to very much.
  • Page 109 covers the Teradata agreement. It seems that there’s less going on than originally envisioned, in that Teradata made a nonrefundable prepayment far greater than turned out to be necessary for the subsequent work actually done. That could produce a sudden revenue spike, or else a positive revenue restatement, as of February 2015.
  • Page F-10 has a table showing revenue from Hortonworks’ biggest customers (Company A is Microsoft and Company B is Yahoo).
  • Pages F37-38 further cover Hortonworks’ relationships with Yahoo, Teradata and AT&T.

Correction notice: Some of the page numbers in this post were originally wrong, surely because Hortonworks posted an original and amended version of this filing, and I got the two documents mixed up.  A huge Thank You goes to Merv Adrian for calling my attention to this, and I think I’ve now fixed them. I apologize for the errors!

Related links

Categories: Other

Thoughts and notes, Thanksgiving weekend 2014

Sun, 2014-11-30 19:48

I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:

1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:

  • Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
  • Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.

The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.

What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
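As a minimal pymongo sketch of the full-replication/latency-reduction case (hostnames invented, and with the caveat that in a 50-member set most members would be non-voting):

```python
# Minimal sketch (invented hostnames) of reading from the nearest replica in a
# geo-distributed replica set, while writes still go to the single primary.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongo-us-east:27017,mongo-eu-west:27017,mongo-ap-south:27017",
    replicaSet="rs0",
    readPreference="nearest",   # route reads to the lowest-latency member
)

orders = client.shop.orders
recent_open = list(orders.find({"status": "open"}).limit(10))  # may be served by a nearby secondary
```

The same connection options apply regardless of replica count; raising the limit from 12 to 50 just extends how far the pattern can be pushed.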

2. Three years ago I posted about agile (predictive) analytics. One of the points was:

… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.

Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
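For readers who haven't run into bandit testing, here's a toy epsilon-greedy sketch of the core idea (nothing WibiData- or retailer-specific; the offer names are made up):

```python
# Toy epsilon-greedy bandit: mostly show the best-performing offer so far, but
# keep exploring, so that new information feeds back into decisions immediately.
import random

offers = ["offer_a", "offer_b", "offer_c"]
shown = {o: 0 for o in offers}
converted = {o: 0 for o in offers}
epsilon = 0.1   # fraction of traffic reserved for exploration

def choose_offer():
    untried = [o for o in offers if shown[o] == 0]
    if untried:
        return random.choice(untried)        # try every offer at least once
    if random.random() < epsilon:
        return random.choice(offers)         # explore
    return max(offers, key=lambda o: converted[o] / shown[o])   # exploit

def record_outcome(offer, did_convert):
    shown[offer] += 1
    if did_convert:
        converted[offer] += 1
```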

3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with: 

Also, the flashiest application I know of for only-moderately-successful KXEN came when one or more large retailers decided to run separate models for each of thousands of stores.

4. MongoDB, the product, has been refactored to support pluggable storage engines. In connection with that, MongoDB does/will ship with two storage engines – the traditional one and a new one from WiredTiger (but not TokuMX). Both will be equally supported by MongoDB, the company, although there surely are some tiers of support that will get bounced back to WiredTiger.

WiredTiger has the same techie principals as Sleepycat – get the wordplay?! – which was Mike Olson’s company before Cloudera. When asked, Mike spoke of those techies in remarkably glowing terms.

I wouldn’t be shocked if WiredTiger wound up playing the role for MongoDB that InnoDB played for MySQL. What I mean is that there were a lot of use cases for which the MySQL/MyISAM combination was insufficiently serious, but InnoDB turned MySQL into a respectable DBMS.
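If you want to check which engine a given mongod is actually running, something like this should work (MongoDB 3.0+ reports the engine in serverStatus; the connection string is a placeholder):

```python
# Quick check of which pluggable storage engine a mongod is running; MongoDB
# 3.0+ includes a storageEngine section in serverStatus. Host is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")
print(status.get("storageEngine", {}).get("name"))   # e.g. "mmapv1" or "wiredTiger"
```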

5. Hadoop’s traditional data distribution story goes something like:

  • Data lives on every non-special Hadoop node that does processing.
  • This gives the advantage of parallel data scans.
  • Sometimes data locality works well; sometimes it doesn’t.
  • Of course, if the output of every MapReduce step is persisted to disk, as is the case with Hadoop MapReduce 1, you might create some of your own data locality …
  • … but Hadoop is getting away from that kind of strict, I/O-intensive processing model.

However, Cloudera has noticed that some large enterprises really, really like to have storage separate from processing. Hence its recent partnership with EMC Isilon. Other storage partnerships, as well as a better fit with S3/object storage kinds of environments, are sure to follow, but I have no details to offer at this time.
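A rough sketch of why this matters so little to the application code: the same job can point at co-located HDFS, an Isilon-backed HDFS endpoint, or an object store, with only the URI changing. All hostnames, bucket names, and paths below are placeholders.

```python
# Rough sketch of the storage-separate-from-processing pattern: identical logic,
# different storage URIs. The object store case assumes S3 credentials are
# already configured for the cluster.
from pyspark import SparkContext

sc = SparkContext(appName="storage-agnostic-job")

sources = [
    "hdfs://namenode:8020/events/2014/11/",        # co-located HDFS
    "hdfs://isilon-cluster:8020/events/2014/11/",  # Isilon's HDFS-compatible endpoint
    "s3n://my-bucket/events/2014/11/",             # object storage
]

for uri in sources:
    purchases = sc.textFile(uri).filter(lambda line: "purchase" in line)
    print(uri, purchases.count())
```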

6. Cloudera’s count of Spark users in its customer base is currently around 60. That includes everything from playing around to full production.

7. Things still seem to be going well at MemSQL, but I didn’t press for any details that I would be free to report.

8. Speaking of MemSQL, one would think that at some point something newer would replace Oracle et al. in the general-purpose RDBMS world, much as Unix and Linux grew to overshadow the powerful, secure, reliable, cumbersome IBM mainframe operating systems. On the other hand:

  • IBM blew away its mainframe competitors and had pretty close to a monopoly. But Oracle has some close and somewhat newer competitors in DB2 and Microsoft SQL Server. Therefore …
  • … upstarts have three behemoths to outdo, not just one.
  • MySQL, PostgreSQL and to some extent Sybase are still around as well.

Also, perhaps no replacement will be needed. If we subdivide the database management world into multiple categories including:

  • General-purpose RDBMS.
  • Analytic RDBMS.
  • NoSQL.
  • Non-relational analytic data stores (perhaps Hadoop-based).

it’s not obvious that the general-purpose RDBMS category on its own requires any new entrants to ever supplant the current leaders.

All that said – if any of the current new entrants do pull off the feat, SAP HANA is probably the best (longshot) guess to do so, and MemSQL the second-best.

9. If you’re a PostgreSQL user with performance or scalability concerns, you might want to check what Citus Data is doing.

Categories: Other

Technical differentiation

Sat, 2014-11-15 06:00

I commonly write about real or apparent technical differentiation, in a broad variety of domains. But actually, computers only do a couple of kinds of things:

  • Accept instructions.
  • Execute them.

And hence almost all IT product differentiation fits into two buckets:

  • Easier instruction-giving, whether that’s in the form of a user interface, a language, or an API.
  • Better execution, where “better” usually boils down to “faster”, “more reliable” or “more reliably fast”.

As examples of this reductionism, please consider:

  • Application development is of course a matter of giving instructions to a computer.
  • Database management systems accept and execute data manipulation instructions.
  • Data integration tools accept and execute data integration instructions.
  • System management software accepts and executes system management instructions.
  • Business intelligence tools accept and execute instructions for data retrieval, navigation, aggregation and display.

Similar stories are true about application software, or about anything that has an API (Application Programming Interface) or SDK (Software Development Kit).

Yes, all my examples are in software. That’s what I focus on. If I wanted to be more balanced in including hardware or data centers, I might phrase the discussion a little differently — but the core points would still remain true.

What I’ve said so far should make more sense if we combine it with the observation that differentiation is usually restricted to particular domains. I mean several different things by that last bit. First, most software only purports to do a limited class of things — manage data, display query results, optimize analytic models, manage a cluster, run a payroll, whatever. Even beyond that, any inherent superiority is usually restricted to a subset of potential use cases. For example:

  • Relational DBMS presuppose that data fits well (enough) into tabular structures. Further, most RDBMS differentiation is restricted to a further subset of such cases; there are many applications that don’t require — for example — columnar query selectivity or declarative referential integrity or Oracle’s elite set of security certifications.
  • Some BI tools are great for ad-hoc navigation. Some excel at high-volume report displays, perhaps with a particular flair for mobile devices. Some are learning how to query non-tabular data.
  • Hadoop, especially in its early days, presupposed data volumes big enough to cluster and application models that fit well with MapReduce.
  • A lot of distributed computing aids presuppose particular kinds of topologies.

A third reason for technical superiority to be domain-specific is that advantages are commonly coupled with drawbacks. Common causes of that include:

  • Many otherwise-advantageous choices strain hardware budgets. Examples include:
    • Robust data protection features (most famously RAID and two-phase commit).
    • Various kinds of translation or interpretation overhead.
  • Yet other choices are good for some purposes but bad for others. For example, it’s fastest to write data in exactly the form it arrives in, but data stored that way can be slow to retrieve later on.
  • Innovative technical strategies are likely to be found in new products that haven’t had time to become mature yet.

And that brings us to the main message of this post: Your spiffy innovation is important in fewer situations than you would like to believe. Many, many other smart organizations are solving the same kinds of problems as you; their solutions just happen to be effective in somewhat different scenarios than yours. This is especially true when your product and company are young. You may eventually grow to cover a broad variety of use cases, but to get there you’ll have to more or less match the effects of many other innovations that have come along before yours.

When advising vendors, I tend to think in terms of the layered messaging model, and ask the questions:

  • Which of your architectural features gives you sustainable advantages in features or performance?
  • Which of your sustainable advantages in features or performance provides substantial business value in which use cases?

Closely connected are the questions:

  • What lingering disadvantages, if any, does your architecture create?
  • What maturity advantages do your competitors have, and when (if ever) will you be able to catch up with them?
  • In which use cases are your disadvantages important?

Buyers and analysts should think in such terms as well.

Related links

Daniel Abadi, who is now connected to Teradata via their acquisition of Hadapt, put up a post promoting some interesting new features of theirs. Then he tweeted that this was an example of what I call Bottleneck Whack-A-Mole. He’s right. But since much of his theme was general praise of Teradata’s mature DBMS technology, it would also have been accurate to reference my post about The Cardinal Rules of DBMS Development.

Categories: Other