I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.
Here are some thoughts as to why, and also as to challenges that need to be overcome.
There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:
- When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, perhaps unless you’re content with the durability provided by an in-memory data grid.
- Queries are important. They are also formally present in other tasks, such as data transformation and movement. That’s why data manipulation languages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.
Trivial query routing or federation is … trivial.
- Databases have or can be given some kind of data catalog interface. Of course, this is easier for databases that are tabular, whether relational or MOLAP (Multidimensional OnLine Analytic Processing), but to some extent it can be done for anything.
- Combining the catalogs can be straightforward. So can routing queries through the system to the underlying data stores.
In fact, what I just described is Business Objects’ original innovation — the semantic layer — two decades ago.
Careless query routing or federation can be a performance nightmare. Do a full scan. Move all the data to some intermediate server that lacks capacity or optimization to process it quickly. Wait. Wait. Wait. Wait … hmmm, maybe this wasn’t the best data-architecture strategy.
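To make the contrast concrete, here is a minimal sketch of the "trivial" kind of federation: merge per-store catalogs, push filters down to each store, and union the results. All class names, stores and data below are invented for illustration.

```python
# A minimal sketch of catalog-merging and filter-pushdown federation.
# Everything here (Store, Federator, the sample data) is hypothetical.

class Store:
    def __init__(self, name, catalog, rows):
        self.name = name
        self.catalog = catalog      # table -> list of column names
        self.rows = rows            # table -> list of dict rows

    def scan(self, table, predicate):
        # Push the filter down to the store instead of shipping all rows.
        return [r for r in self.rows[table] if predicate(r)]

class Federator:
    def __init__(self, stores):
        # The combined catalog: table -> which stores hold it.
        self.catalog = {}
        for s in stores:
            for table in s.catalog:
                self.catalog.setdefault(table, []).append(s)

    def query(self, table, predicate):
        # Route to every store that has the table; union the results.
        out = []
        for s in self.catalog.get(table, []):
            out.extend(s.scan(table, predicate))
        return out

warehouse = Store("warehouse", {"orders": ["id", "amt"]},
                  {"orders": [{"id": 1, "amt": 50}, {"id": 2, "amt": 900}]})
hadoop = Store("hadoop", {"orders": ["id", "amt"]},
               {"orders": [{"id": 3, "amt": 1200}]})

fed = Federator([warehouse, hadoop])
big_orders = fed.query("orders", lambda r: r["amt"] > 100)
```

The "careless" version is what you get when `scan` ignores the predicate and ships every row to the middle tier.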
Streaming goes well with federation. Some data just arrived, and you want to analyze it before it ever gets persisted. You want to analyze it in conjunction with data that’s been around longer. That’s a form of federation right there.
There are ways to navigate schema messes. Sometimes they work.
- Polishing one neat relational schema for all your data is exactly what people didn’t want to do when they decided to store a lot of the data non-relationally instead. Still, memorializing some schema after the fact may not be terribly painful.
- Even so, text search can help you navigate the data wilds. So can collaboration tools. Neither helps all the time, however.
Neither extreme view here — “It’s easy!” or “It will never work!” — seems right. Rather, I think there’s room for a lot of effort and differentiation in exposing cross-database schema information.
I’m leaving out one part of the story on purpose — how these data layers are going to be packaged, and specifically what other functionality they will be bundled with. Confidentiality would screw up that part of the discussion; so also would my doubts as to whether some of those plans are fully baked yet. That said, there’s an aspect of logical data layer to CDAP, and to Kiji as well. And of course it’s central to BI (Business Intelligence) and ETL (Extract/Transform/Load) alike.
One way or another, I don’t think the subject of logical data layers is going away any time soon.
- Implicit in this post is the belief that enterprises should and do use many different data stores (June, 2014)
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
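As a hedged illustration of the min/max file-marking point, here is a toy version of data skipping: a point lookup consults each file's recorded key range and only reads files that might contain the key. This is simplified far below what HBase actually does; the classes and data are invented.

```python
# Toy illustration of min/max-based data skipping over sorted files,
# in the spirit of the HBase file metadata described above.

class DataFile:
    def __init__(self, rows):
        # Rows sorted by key; record min/max so whole files can be skipped.
        self.rows = sorted(rows, key=lambda r: r["key"])
        self.min_key = self.rows[0]["key"]
        self.max_key = self.rows[-1]["key"]

    def might_contain(self, key):
        return self.min_key <= key <= self.max_key

def point_lookup(files, key):
    files_read = 0
    for f in files:
        if not f.might_contain(key):
            continue            # skipped without touching the file body
        files_read += 1
        for r in f.rows:
            if r["key"] == key:
                return r, files_read
    return None, files_read

files = [DataFile([{"key": k, "v": k * 10} for k in rng])
         for rng in (range(0, 100), range(100, 200), range(200, 300))]
row, touched = point_lookup(files, 150)
```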
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:
- Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
- eBay has an HBase database largely on spinning disk, used to inform its search engine.
Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.
4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.
5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.
- It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
- An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
- Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.
In general, Cloudera suggested that HBase was used in a fair number of OEM situations.
6. Finally, I have one number: As of January, 2014 there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.
- A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.
Over the past couple years, there have been various quick comments and vague press releases about “BI for NoSQL”. I’ve had trouble, however, imagining what it could amount to that was particularly interesting, with my confusion boiling down to “Just what are you aggregating over what?” Recently I raised the subject with a few leading NoSQL companies. The result is that my confusion was expanded. Here’s the small amount that I have actually figured out.
As I noted in a recent post about data models, many databases — in particular SQL and NoSQL ones — can be viewed as collections of <name, value> pairs.
- In a relational database, a record is a collection of <name, value> pairs with a particular and predictable — i.e. derived from the table definition — sequence of names. Further, a record usually has an identifying key (commonly one of the first values).
- Something similar can be said about structured-document stores — i.e. JSON or XML — except that the sequence of names may not be consistent from one document to the next. Further, there’s commonly a hierarchical relationship among the names.
- For these purposes, a “wide-column” NoSQL store like Cassandra or HBase can be viewed much as a structured-document store, albeit with different performance optimizations and characteristics and a different flavor of DML (Data Manipulation Language).
Consequently, a NoSQL database can often be viewed as a table or a collection of tables, except that:
- The NoSQL database is likely to have more null values.
- The NoSQL database, in a naive translation toward relational, may have repeated values. So a less naive translation might require extra tables.
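As a sketch of that "less naive translation" (my example data, not anyone's actual schema), a document with a repeated field splits into a parent row plus a separate child table:

```python
# A document with a repeated field becomes a parent table row plus a
# child table, joined on the key. Sample data is invented.

doc = {"order_id": 17, "customer": "acme",
       "items": [{"sku": "A1", "qty": 2}, {"sku": "B9", "qty": 1}]}

def to_tables(doc):
    # Scalar fields go into the parent row; the repeated field becomes
    # one child row per element, carrying the parent's key.
    parent = {k: v for k, v in doc.items() if not isinstance(v, list)}
    children = [{"order_id": doc["order_id"], **item} for item in doc["items"]]
    return parent, children

orders_row, items_rows = to_tables(doc)
```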
That’s all straightforward to deal with if you’re willing to write scripts to extract the NoSQL data and transform or aggregate it as needed. But things get tricky when you try to insist on some kind of point-and-click. And by the way, that last comment pertains to BI and ETL (Extract/Transform/Load) alike. Indeed, multiple people I talked with on this subject conflated BI and ETL, and they were probably right to do so.
Another set of issues arises on the performance side. Many NoSQL systems have indexes, and thus some kind of filtering capability. Some — e.g. MongoDB — have aggregation frameworks as well. So if you’re getting at the data with some combination of a BI tool, ETL tool or ODBC/JDBC drivers — are you leveraging the capabilities in place? Or are you doing the simplest and slowest thing, which is to suck data out en masse and operate on it somewhere else? Getting good answers to those questions is a work-in-progress at best.
Having established that NoSQL data structures cause problems for BI, let’s turn that around. Is there any way that they actually help? I want to say “NoSQL data often comes in hierarchies, and hierarchies are good for roll-up/drill-down.” But the hierarchies that describe NoSQL data aren’t necessarily the same kinds of hierarchies that are useful for BI aggregation, and I’m indeed skeptical as to how often those two categories overlap.
Hierarchies aside, I do think there are use cases for fundamentally non-tabular BI. For example, consider the following scenario, typically implemented with the help of NoSQL today:
- You have more data — presumably machine-generated — than you can afford to keep.
- So you keep time-sliced aggregates.
- You also keep selective details, namely ones that you identified when they streamed in as being interesting in some way.
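That ingestion pattern is easy to sketch. In this hypothetical Python version (the bucket size and the "interesting" test are invented), aggregates are kept per time slice while full detail is retained only for flagged events:

```python
# Keep time-sliced aggregates for everything, full detail only for
# events flagged as interesting on the way in.

from collections import defaultdict

def ingest(events, bucket_seconds=60, interesting=lambda e: e["value"] > 1000):
    aggregates = defaultdict(lambda: {"count": 0, "total": 0})
    details = []
    for e in events:
        bucket = e["ts"] - e["ts"] % bucket_seconds
        agg = aggregates[bucket]
        agg["count"] += 1
        agg["total"] += e["value"]
        if interesting(e):
            details.append(e)        # full detail kept only for flagged events
    return dict(aggregates), details

events = [{"ts": 5, "value": 10}, {"ts": 30, "value": 2000}, {"ts": 65, "value": 7}]
aggs, kept = ingest(events)
```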
Visualizing that properly would be very hard in a traditional tabularly-oriented BI tool. So it could end up with NoSQL-oriented BI tools running over NoSQL data stores. Event series BI done right also seems to be quite non-tabular. That said, I don’t know for sure about the actual data structures used under the best event series BI today.
And at that inconclusive point, I’ll stop for now. If you have something to add, please share it in the comments below, or hit me up as per my Contact link above.
Fishbowl team members competed in our third annual Hackathon over the weekend. Our consultants split up into four groups to create an innovative solution, and were given just over 24 hours to complete the task. The goal was to create usable software that could be used in the future when working with clients; some of our most creative products and services originally derived from Hackathon events.
Here’s a summary of what the teams came up with this year:
Team 1: Integrated Fishbowl’s Control Center with Oracle WebCenter Portal
Team 2: Integrated the Google Search Appliance with Oracle’s Document Cloud Service
Team 3: Worked to connect WebCenter Portal with Oracle’s new Document Cloud Service
Team 4: Got Fishbowl’s ControlCenter to run in the cloud with Google Compute Engine
At the end of the Hackathon, all participants voted on the best solution. The winning team was #2 (GSA-Doc Cloud Service integration), made up of Kim Negaard, Andy Weaver, and Mike Hill. Although there could only be one winner, overall consensus was that each team came up with something extremely useful that could help solve common problems we’ve run into while working with WebCenter and the GSA. If you’re interested in learning more about any of the teams’ programs, feel free to contact us at email@example.com.
The post Hackathon Provides Opportunity for Collaboration and Innovation appeared first on Fishbowl Solutions' C4 Blog.
I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:
- The closest thing to an HBase company, ala MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
- Cloudera still uses a figure of 20% of its customers being HBase-centric.
- HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
- With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.
- Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
- As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
- Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)
Cloudera’s views on HBase history — in response to the priorities I brought to the conversation — include:
- HBase initially favored consistency over performance/availability, while Cassandra initially favored the opposite choice. Both products, however, have subsequently become more tunable in those tradeoffs.
- Cloudera’s initial contributions to HBase focused on replication, disaster recovery and so on. I guess that could be summarized as “scaling”.
- Hortonworks’ early HBase contributions included (but were not necessarily limited to):
- Making recovery much faster (10s of seconds or less, rather than minutes or more).
- Some of that consistency vs. availability tuning.
- “Coprocessors” were added to HBase ~3 years ago, to add extensibility, with the first use being in security/permissions.
- With more typical marketing-oriented version numbers:
- HBase .90, the first release that did a good job on durability, could have been 1.0.
- HBase .92 and .94, which introduced coprocessors, could have been Version 2.
- HBase .96 and .98 could have been Version 3.
- The recent HBase 1.0 could have been 4.0.
The HBase roadmap includes:
- A kind of BLOB/CLOB (Binary/Character Large OBject) support.
- Intel is heavily involved in this feature.
- The initial limit is 10 megabytes or so, due to some limitations in the API (I didn’t ask why that made sense). This happens to be all the motivating Chinese customer needs for the traffic photographs it wants to store.
- Various kinds of “multi-tenancy” support (multi-tenancy is one of those terms whose meaning is getting stretched beyond recognition), including:
- Mixed workload support (short-request and analytic) on the same nodes.
- Mixed workload support on different nodes in the same cluster.
- Security between different apps in the same cluster.
- (Still in the design phase) Bottleneck Whack-A-Mole, with goals including but not limited to:
- Scale-out beyond the current assumed limit of ~1200 nodes.
- More predictable performance, based on smaller partition sizes.
- (Possibly) Multi-data-center fail-over.
Not on the HBase roadmap per se are global/secondary indexes. Rather, we talked about projects on top of HBase which are meant to provide those. One is Apache Phoenix, which supposedly:
- Makes it simple to manage compound keys. (E.g., City/State/ZipCode)
- Provides global secondary indexes (but not in a fully ACID way).
- Offers some very basic JOIN support.
- Provides a JDBC interface.
- Offers efficiencies in storage utilization, scan optimizations, and aggregate calculations.
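The compound-key point is the easiest to illustrate. The sketch below shows the general idea of packing City/State/ZipCode into one sorted row key so that leading fields can be range-scanned; the encoding is invented for illustration and is not Phoenix's actual byte format.

```python
# Pack a compound key into a single sortable row key, as SQL-on-HBase
# layers do. The separator and field order here are illustrative only.

SEP = "\x00"

def encode_key(state, city, zipcode):
    # Leading fields of the compound key can be prefix/range-scanned.
    return SEP.join([state, city, zipcode])

def prefix_scan(sorted_keys, state, city=None):
    prefix = state + SEP + (city + SEP if city else "")
    return [k for k in sorted_keys if k.startswith(prefix)]

keys = sorted(encode_key(*t) for t in [
    ("CA", "Oakland", "94601"), ("CA", "San Jose", "95112"),
    ("NY", "Albany", "12207")])
ca_keys = prefix_scan(keys, "CA")
```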
Another such project is Trafodion — supposedly the Welsh word for “transaction” — open sourced by HP. This seems to be based on NonStop SQL and Neoview code, which counter-intuitively have always been joined at the hip.
There was a lot more to the conversation, but I’ll stop here for two reasons:
- This post is pretty long already.
- I’m reserving some of the discussion until after I’ve chatted with vendors of other NoSQL systems.
- My July 2011 post on HBase offers context, as do the comments on it.
I found yesterday’s news quite unpleasant.
- A guy I knew and had a brief rivalry with in high school died of colon cancer, a disease that I’m at high risk for myself.
- GigaOm, in my opinion the best tech publication — at least for my interests — shut down.
- The sex discrimination trial around Kleiner Perkins is undermining some people I thought well of.
So I want to unclutter my mind a bit. Here goes.
1. There are a couple of stories involving Sam Simon and me that are too juvenile to tell on myself, even now. But I’ll say that I ran for senior class president, in a high school where the main way to campaign was via a single large poster, against a guy with enough cartoon-drawing talent to be one of the creators of the Simpsons. Oops.
2. If one suffers from ulcerative colitis as my mother did, one is at high risk of getting colon cancer, as she also did. Mine isn’t as bad as hers was, due to better tolerance for medication controlling the disease. Still, I’ve already had a double-digit number of colonoscopies in my life. They’re not fun. I need another one soon; in fact, I canceled one due to the blizzards.
Pro-tip — never, ever have a colonoscopy without some kind of anesthesia or sedation. Besides the unpleasantness, the lack of meds increases the risk that the colonoscopy will tear you open and make things worse. I learned that the hard way in New York in the early 1980s.
3. Five years ago I wrote optimistically about the evolution of the information ecosystem, specifically using the example of the IT sector. One could argue that I was right. After all:
- Gartner still seems to be going strong.
- O’Reilly, Gartner and vendors probably combine to produce enough good conferences.
- A few traditional journalists still do good work (in the areas covered by this blog Doug Henschen comes to mind).
- A few vendor folks are talented and responsible enough to add to the discussion. A few small-operation folks — e.g. me — are still around.
Still, the GigaOm news is not encouraging.
4. As TechCrunch and Pando reported, plaintiff Ellen Pao took the stand and sounded convincing in her sexual harassment suit against Kleiner Perkins (but of course she hadn’t been cross-examined yet). Apparently there was a major men-only party hosted by partner Al Gore, a candidate I first supported in 1988. And partner Ray Lane, somebody who at Oracle showed tremendous management effectiveness, evidently didn’t do much to deal with Pao’s situation.
At some point I want to write about a few women who were prominent in my part of the tech industry in the 1980s — at least Ann Winblad, Esther Dyson, and Sandy Kurtzig, maybe analyst/investment banker folks Cristina Morgan and Ruthann Quindlen as well. We’ve come a long way since those days (when, in particular, I could briefly list a significant fraction of the important women in the industry). There seems to be a lot further yet to go.
5. All that said — I’m indeed working on some cool stuff. Some is evident from recent posts. Other may be reflected in an upcoming set of posts that focus on NoSQL, business intelligence, and — I hope — the intersection of the two areas.
6. Speaking of recent posts, I did one on marketing for young companies that brings a lot of advice and tips together. I think it’s close to being a must-read.
- Continuuity toured in 2012 and touted its “app server for Hadoop” technology.
- Continuuity recently changed its name to Cask and went open source.
- Cask’s product is now called CDAP (Cask Data Application Platform). It’s still basically an app server for Hadoop and other “big data” — ouch do I hate that phrase — data stores.
- Cask and Cloudera partnered.
- I got a more technical Cask briefing this week.
- App servers are a notoriously amorphous technology. The focus of how they’re used can change greatly every couple of years.
- Partly for that reason, I was unimpressed by Continuuity’s original hype-filled positioning.
So far as I can tell:
- Cask’s current focus is to orchestrate job flows, with lots of data mappings.
- This is supposed to provide lots of developer benefits, for fairly obvious reasons. Those are pitched in terms of an integration story, more in a “free you from the mess of a many-part stack” sense than strictly in terms of data integration.
- CDAP already has a GUI to monitor what’s going on. A GUI to specify workflows is coming very soon.
- CDAP doesn’t consume a lot of cycles itself, and hence isn’t a real risk for unpleasant overhead, if “overhead” is narrowly defined. Rather, performance drags could come from …
- … sub-optimal choices in data mapping, database design or workflow composition.
I didn’t push the competition point hard (the call was generally a bit rushed due to a hard stop on my side), but:
- Cask thinks it doesn’t have much in the way of exact or head-to-head competitors, but cites Spring and WibiData/Kiji as coming closest.
- I’d think that data integration vendors who use Hadoop as an execution engine (Informatica, Syncsort and many more) would be in the mix as well.
- Cask disclaimed competition with Teradata Revelytix, on the theory that Cask is focused on operational/”real-time” use cases, while Revelytix Loom is focused on data science/investigative analytics.
To reiterate part of that last bullet — like much else we’re hearing about these days, CDAP is focused on operational apps, perhaps with a streaming aspect.
To some extent CDAP can be viewed as restoring the programmer/DBA distinction to the non-SQL and streaming worlds. That is:
- Somebody creates a data mapping “pattern”.
- Programmers (including perhaps the creator) write to that pattern.
- Somebody (perhaps the creator) tweaks the mapping to optimize performance, or to reflect changes in the underlying data management.
Further notes on CDAP data access include:
- Cask is proud that a pattern can literally be remapped from one data store to another, although I wonder how often that is likely to happen in practice.
- Also, a single “row” can reference multiple data stores.
- Cask’s demo focused on imposing a schema on a log file, something you might do incrementally as you decide to extract another field of information. This is similar to major use cases for schema-on-need and for Splunk.
- For most SQL-like access and operations, CDAP relies on Hive, even to external data stores or non-tabular data. Cask is working with Cloudera on Impala access.
Examples of things that Cask supposedly makes easy include:
- Chunking streaming data by time (e.g. 1 minute buckets).
- Generating database stats (histograms and so on).
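Both of those tasks are indeed simple to express. A hypothetical Python rendering (field names and bin sizes are invented):

```python
# Bucket a stream by minute, and build a simple histogram over a field.

from collections import Counter

def minute_bucket(ts):
    return ts - ts % 60

def bucket_stream(events):
    buckets = {}
    for e in events:
        buckets.setdefault(minute_bucket(e["ts"]), []).append(e)
    return buckets

def histogram(events, field, bin_width):
    # Count events per value bin, keyed by the bin's lower bound.
    return Counter((e[field] // bin_width) * bin_width for e in events)

events = [{"ts": 3, "latency": 12}, {"ts": 59, "latency": 48},
          {"ts": 61, "latency": 14}]
buckets = bucket_stream(events)
hist = histogram(events, "latency", 10)
```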
Tidbits as to how Cask perceives other technologies, and how CDAP plays with them, include:
- Kafka is hot.
- Spark Streaming is hot enough to be on the CDAP roadmap.
- Cask believes that its administrative tools don’t conflict with Cloudera Manager or Ambari, because they’re more specific to an application, job or dataset.
- CDAP is built on Twill, which is a thread-like abstraction over YARN that Cask contributed to Apache. Mesos is in the picture as well, as a YARN alternative.
- Cask is seeing some interest in Flink. (Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded.)
Cask has ~40 people, multiple millions of dollars in trailing revenue, and — naturally — high expectations for future growth. I neglected, however, to ask how that revenue was split between subscription, professional services and miscellaneous. Cask expects to finish 2015 with a healthy two-digit number of customers.
Cask’s customers seem concentrated in usual-suspect internet-related sectors, although Cask gave it a bit of an enterprise-y spin by specifically citing SaaS (Software as a Service) and telecom. When I asked who else seems to be a user or interested based on mailing list activity, Cask mentioned a lot of financial services and some health care as well.
- Cask doesn’t have the obvious .com URL.
I’m on record as believing that:
- Hadoop needs a memory-centric storage grid.
- Tachyon is a strong candidate to fill the role.
- It’s an open secret that there will be a Tachyon company. However, …
- … no details have been publicized. Indeed, the open secret itself is still officially secret.
- Tachyon technology, which just hit 0.6 a couple of days ago, still lacks many features I regard as essential.
- As a practical matter, most Tachyon interest to date has been associated with Spark. This makes perfect sense given Tachyon’s origin and initial technical focus.
- Tachyon was in 50 or more sites last year. Most of these sites were probably just experimenting with it. However …
- … there are production Tachyon clusters with >100 nodes.
As a reminder of Tachyon basics:
- You do I/O with Tachyon in memory.
- Tachyon data can optionally be persisted.
- That “tiered storage” capability — including SSDs — was just introduced in 0.6. So in particular …
- … it’s very primitive and limited at the moment.
- I’ve heard it said that Intel was a big contributor to tiered storage/SSD support.
- Tachyon has some ability to understand “lineage” in the Spark sense of the term. (In essence, that amounts to knowing what operations created a set of data, and potentially replaying them.)
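A loose sketch of that lineage idea, with all class names invented: record the operations that produced a dataset, and replay them if the cached result is lost.

```python
# Record the operations that created a dataset ("lineage") and replay
# them when the cached copy is lost. Purely illustrative.

class Dataset:
    def __init__(self, source, lineage=()):
        self.source = source
        self.lineage = lineage      # tuple of functions applied to the source
        self.cache = None

    def map(self, fn):
        # Transformations extend the lineage; nothing is computed yet.
        return Dataset(self.source, self.lineage + (fn,))

    def collect(self):
        if self.cache is None:      # lost or never computed: replay lineage
            data = list(self.source)
            for fn in self.lineage:
                data = [fn(x) for x in data]
            self.cache = data
        return self.cache

ds = Dataset([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
first = ds.collect()
ds.cache = None                     # simulate losing the in-memory copy
replayed = ds.collect()
```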
Beyond that, I get the impressions:
- Synchronous write-through from Tachyon to persistent storage is extremely primitive right now — but even so I am told it is being used in production by multiple companies already.
- Asynchronous write-through, relying on lineage tracking to recreate any data that gets lost, is slightly further along.
- One benefit of adding Tachyon to your Spark installation is a reduction in garbage collection issues.
And with that I have little more to say than my bottom lines:
- If you’re writing your own caching layer for some project you should seriously consider adapting Tachyon instead.
- If you’re using Spark you should seriously consider using Tachyon as well.
- I think Tachyon will be a big deal, but it’s far too early to be sure.
I chatted last night with Ion Stoica, CEO of my client Databricks, for an update both on his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:
- Databricks Cloud is:
- Currently running on Amazon only.
- Not dependent on Hadoop.
- Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
- Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
- Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
- Databricks Cloud was initially focused on ad-hoc use. A few days ago the capability was added to schedule jobs and so on.
- Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
- Databricks Cloud customers are concentrated (but not unanimously so) in the usual-suspect internet-centric business sectors.
- The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
- The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.
I do not expect all of the above to remain true as Databricks Cloud matures.
Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as:
- 15 certified distributions.
- ~40 certified applications.
- 2000 people trained last year by Databricks alone.
Please note that certification of a Spark distribution is a free service from Databricks, and amounts to checking that the API works against a test harness. Speaking of certification, Ion basically agrees with my views on ODP, although like many — most? — people he expresses himself more politely than I do.
We talked briefly about several aspects of Spark or related projects. One was DataFrames. Per Databricks:
In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
I gather this is modeled on Python pandas, and extends an earlier Spark capability for RDDs (Resilient Distributed Datasets) to carry around metadata that was tantamount to a schema.
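To make the RDD-vs-DataFrame distinction concrete without depending on Spark or pandas, here is a tiny invented stand-in: the same tuples, but with named columns attached so queries can reference fields instead of positions.

```python
# A library-free miniature of what a "DataFrame" adds over a bare RDD:
# named columns over otherwise schema-less tuples. All names invented.

class MiniFrame:
    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns

    def select(self, *cols):
        idx = [self.columns.index(c) for c in cols]
        return MiniFrame([tuple(r[i] for i in idx) for r in self.rows],
                         list(cols))

    def where(self, col, pred):
        i = self.columns.index(col)
        return MiniFrame([r for r in self.rows if pred(r[i])], self.columns)

rdd_like = [("alice", 34), ("bob", 28)]     # bare tuples, no metadata
df = MiniFrame(rdd_like, ["name", "age"])   # same data plus a "schema"
names = df.where("age", lambda a: a > 30).select("name").rows
```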
SparkR is also on the rise, although it has the usual parallel R story, to the effect that:
- You can partition data, run arbitrary R on every partition, and aggregate the results.
- A handful of algorithms are truly parallel.
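The first of those bullets is the classic split-apply-combine pattern. A Python stand-in for "arbitrary R on every partition" (all function names invented):

```python
# Split the data, run an arbitrary per-partition function, then combine
# the partial results into a global answer.

def partition(data, n):
    # Round-robin split into n partitions.
    return [data[i::n] for i in range(n)]

def per_partition_stats(chunk):
    # Stand-in for the arbitrary per-partition computation.
    return {"n": len(chunk), "sum": sum(chunk)}

def combine(partials):
    n = sum(p["n"] for p in partials)
    total = sum(p["sum"] for p in partials)
    return total / n

parts = partition(list(range(1, 101)), 4)
mean = combine([per_partition_stats(p) for p in parts])
```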
So of course is Spark Streaming. And then there are Spark Packages, which are — and I’m speaking loosely here — a kind of user-defined function.
- Thankfully, Ion did not give me the usual hype about how a public repository of user-created algorithms is a Great Big Deal.
- Ion did point out that providing an easy way for people to publish their own algorithms is a lot easier than evaluating every candidate contribution to the Spark project itself.
I’ll stop here. However, I have a couple of other Spark-related posts in the research pipeline.
7-10 years ago, I repeatedly argued the viewpoints:
- Relational DBMS were the right choice in most cases.
- Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
- There were a variety of specialized use cases in which non-relational data models were best.
Since then, however:
- Hadoop has flourished.
- NoSQL has flourished.
- Graph DBMS have matured somewhat.
- Much of the action has shifted to machine-generated data, of which there are many kinds.
So it’s probably best to revisit all that in a somewhat organized way.
To make the subject somewhat manageable, I’ll focus on fielded data — i.e. data that represents values of something — rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of <name, value> pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.
Important distinctions include:
- Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be space-savingly implicit as well. In other logs, XML streams, JSON streams and so on they are explicit.
- If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data.
- In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
- Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
- Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.
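The implicit/explicit distinction above can be sketched in a few lines of Python. The field names and sample records here are invented for illustration:

```python
import json

# Explicit field names: a JSON record carries its own <name, value> pairs.
explicit = json.loads('{"ts": "2015-03-01T12:00:00", "user": "alice", "bytes": 512}')

# Implicit field names: a space-delimited log line whose meaning is positional.
# The "metadata" is just the agreed-upon column order, known out of band.
FIELDS = ["ts", "user", "bytes"]
line = "2015-03-01T12:00:00 alice 512"
implicit = dict(zip(FIELDS, line.split()))

# Either way, the record boils down to the same set of <name, value> pairs.
assert explicit["user"] == implicit["user"] == "alice"
```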
Some major data models can be put into a fairly strict ordering of query desirability by noting:
- The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology with in many cases great query performance.
- The next-best thing to query is another kind of data store with known field names. In such data stores:
- SQL or SQL-like SELECTs will still work, or can easily be made to work.
- Useful indexing systems can be grafted on to them (although they are typically less mature than in RDBMS).
- In the (mainly) future, perhaps JOINs can be grafted on as well.
- The worst thing to query is a data store in which you only have a schema on read. You have to do work to make the thing queryable in the first place.
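To make the schema-on-read point concrete, here is a minimal Python sketch, with made-up log lines, of the parsing work that a pre-structured store would have done for you at write time:

```python
import re

# Hypothetical "dumb-looking" log lines; the field names exist only in our
# heads until we apply a parse pattern at read time (schema on read).
raw = [
    "ERROR 2015-03-01 disk /dev/sda1 full",
    "WARN 2015-03-02 cpu load high",
]

PATTERN = re.compile(r"(?P<level>\w+) (?P<date>\S+) (?P<subsystem>\w+) (?P<message>.+)")

# This per-record work happens on every query, not once at load time.
records = [PATTERN.match(line).groupdict() for line in raw]
errors = [r for r in records if r["level"] == "ERROR"]
```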
Unsurprisingly, that ordering is reversed when it comes to writing data.
- The easiest thing to write to is a data store with no structure.
- Next-easiest is to write to a data store that lets you make up the structure as you go along.
- The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed, notably:
- Implicit field names, governed by metadata.
- Unique field names within any one record.
- The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.
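A toy writer enforcing those relational-style requirements might look like the following. The class and method names are illustrative, not any real DBMS’s API:

```python
class Table:
    """Enforces metadata-governed, unique field names, identical per record."""
    def __init__(self, columns):
        self.columns = tuple(columns)   # the per-table metadata
        self.rows = []

    def insert(self, record):
        # Reject any record whose field names don't match the table's schema.
        if set(record) != set(self.columns):
            raise ValueError("record does not match table schema")
        # Store values in the metadata-defined column order.
        self.rows.append(tuple(record[c] for c in self.columns))

t = Table(["id", "name"])
t.insert({"id": 1, "name": "widget"})        # conforming record: accepted
try:
    t.insert({"id": 2, "extra": "oops"})     # nonconforming record: rejected
except ValueError:
    rejected = True
```

That rejection step is exactly the friction the easier-to-write-to stores above dispense with.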
And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:
- Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
- I have in the past also advocated for a mid-range — i.e. lighter-weight — general-purpose RDBMS.
- SAP really, really wants you to use HANA to run SAP’s apps.
- You might want an in-memory RDBMS (e.g., MemSQL) or a particularly cloudy one or whatever.
- Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
- They’re all immature right now, with advantages over each other.
- The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
- Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.
- You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system.
- Splunk is cool.
- Use cases for various other kinds of data stores can often be found.
- Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.
And finally, I think in-memory data grids:
- Will be widely used and important.
- Will be used to instantiate multiple data models at once.
- One reason for writing this post was for some deck-clearing before I revisit the white-hot topic of data streaming. (October, 2014)
- I’ve long mused about the challenges of getting by without joins. (November, 2010)
- In 2013 I observed that data models will be in perpetual, rapid flux.
- In 2013 I also discussed attempts to combine multiple data models (or access methods) in a single DBMS.
- I surveyed data models and access methods back in 2008.
While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:
- Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
- In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
- That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
- The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.
Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.
Further, the low-cost alternatives for analytic RDBMS are adding up.
- Amazon Redshift has considerable traction.
- Hadoop (even just with Hive) has offloaded a lot of ELT (Extract/Load/Transform) from analytic RDBMS such as Teradata.
- Now Greenplum is in the mix as well.
For many analytic RDBMS use cases, at least one of those three will be an appealing possibility.
By no means do I want to suggest those are the only alternatives.
- Smaller-vendor offerings, such as CitusDB or Infobright, may well be competitive too.
- Larger vendors can always slash price in specific deals.
- MonetDB is still around.
But the three possibilities I cited first should suffice as proof for almost all enterprises that, for most use cases not requiring high concurrency, analytic RDBMS need not cost an arm and a leg.
- Greenplum revenue at EMC was problematic from the get-go.
Hortonworks, IBM, EMC (Pivotal) and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:
- An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
- A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
- An excuse for press releases.
- A source of an extra logo graphic to put on marketing slides.
Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).
The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming Interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?
Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:
- Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
- MapR’s flavor, as software from MapR.
- Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.
In saying that, I’m glossing over a few points, such as:
- There are various remote services that run Hadoop, most famously Amazon’s Elastic MapReduce.
- You could get Apache Hadoop directly, rather than using the free or paid versions of a vendor distro. But why would you make that choice, unless you’re an internet bad-ass on the level of Facebook, or at least think that you are?
- There will surely always be some proprietary stuff mixed into, for example, IBM’s BigInsights, so as to preserve at least the perception of all-important vendor lock-in.
But the main point stands — big computer companies, such as IBM, EMC (Pivotal) and previously Intel, are figuring out that they can’t bigfoot something that started out as an elephant — stuffed or otherwise — in the first place.
If you think I’m not taking this whole ODP thing very seriously, you’re right.
- It’s a bit eyebrow-raising to see Mike Olson take a “more open source than thou” stance about something, but basically his post about this news is spot-on.
- My take on Hadoop distributions two years ago might offer context. Trivia question: What’s the connection between the song that begins that post and the joke that ends it?
- Question: Why do policemen work in pairs?
- Answer: One to read and one to write.
A lot has happened in MongoDB technology over the past year. For starters:
- The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
- 7-10X improvement in write performance.
- No change in read performance (which however was boosted in MongoDB 2.6).
- ~70% reduction in data size due to compression (disk only).
- ~50% reduction in index size due to compression (disk and memory both).
- MongoDB has been adding administration modules.
- A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
- An on-premise version came out with 3.0.
- They have similar features but different names, and they are expected to grow apart from each other over time.
*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.
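The compression claims are easy to sanity-check in spirit. WiredTiger’s actual block compressors are snappy and zlib; as a rough stand-in, here is a stdlib sketch showing why repetitive, log-like document data compresses so dramatically (the ~70% figure will of course vary with the data; the documents below are invented):

```python
import json
import zlib

# Generate repetitive, log-like documents, as real operational data often is.
docs = [{"user": "alice", "action": "login", "status": "ok", "n": i}
        for i in range(1000)]
raw = json.dumps(docs).encode()

compressed = zlib.compress(raw)
ratio = 1 - len(compressed) / len(raw)   # fraction of space saved
```

On data this repetitive the savings exceed the quoted ~70%; messier real-world collections will do worse.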
To forestall confusion, let me quickly add:
- MongoDB acquired the WiredTiger product and company, and continues to sell the product on a standalone basis, as well as bundling a version into MongoDB. This could cause confusion because …
- … the standalone version of WiredTiger has numerous capabilities that are not in the bundled MongoDB storage engine.
- There’s some ambiguity as to when MongoDB first “ships” a feature, in that …
- … code goes to open source with an earlier version number than it goes into the packaged product.
I should also clarify that the addition of WiredTiger is really two different events:
- MongoDB added the ability to have multiple plug-compatible storage engines. Depending on how one counts, MongoDB now ships two or three engines:
- Its legacy engine, now called MMAP v1 (for “Memory Map”). MMAP continues to be enhanced.
- The WiredTiger engine.
- A “please don’t put this immature thing into production yet” memory-only engine.
- WiredTiger is now the particular storage engine MongoDB recommends for most use cases.
I’m not aware of any other storage engines using this architecture at this time. In particular, last I heard TokuMX was not an example. (Edit: Actually, see Tim Callaghan’s comment below.)
Most of the issues in MongoDB write performance have revolved around locking, the story on which is approximately:
- Until MongoDB 2.2, locks were held at the process level. (One MongoDB process can control multiple databases.)
- As of MongoDB 2.2, locks were held at the database level, and some sanity was added as to how long they would last.
- As of MongoDB 3.0, MMAP locks are held at the collection level.
- WiredTiger locks are held at the document level. Thus MongoDB 3.0 with WiredTiger breaks what was previously a huge write performance bottleneck.
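The granularity difference can be sketched with ordinary locks. This is a toy contrast, not anything resembling a real storage engine:

```python
import threading

# MMAP-style coarse locking: one lock guards the whole collection.
collection_lock = threading.Lock()
# WiredTiger-style fine locking: one lock per document.
doc_locks = {"doc1": threading.Lock(), "doc2": threading.Lock()}

# Coarse: while one writer holds the collection lock, a second writer to an
# unrelated document cannot proceed (non-blocking acquire fails).
collection_lock.acquire()
second_writer_can_proceed = collection_lock.acquire(False)
collection_lock.release()

# Fine: writers to different documents don't contend at all.
doc_locks["doc1"].acquire()
other_doc_free = doc_locks["doc2"].acquire(False)
doc_locks["doc2"].release()
doc_locks["doc1"].release()
```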
In understanding that, I found it helpful to do a partial review of what “documents” and so on in MongoDB really are.
- A MongoDB document is somewhat like a record, except that it can be more like what in a relational database would be all the records that define a business object, across dozens or hundreds of tables.*
- A MongoDB collection is somewhat like a table, although the documents that comprise it do not need to each have the same structure.
- MongoDB documents want to be capped at 16 MB in size. If you need one bigger, there’s a special capability called GridFS to break it into lots of little pieces (default chunk size = 255KB) while treating it as a single document logically.
*One consequence — MongoDB’s single-document ACID guarantees aren’t quite as lame as single-record ACID guarantees would be in an RDBMS.
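To illustrate that consequence, here is an invented “order” document. What an RDBMS might spread across orders, order_lines, addresses and payments tables lives in one document, so a single-document ACID guarantee covers the whole business object:

```python
# A hypothetical business object as one MongoDB-style document (a plain dict
# here; nothing below depends on MongoDB itself).
order = {
    "_id": 1001,
    "customer": {"name": "Alice", "address": {"city": "Springfield"}},
    "lines": [
        {"sku": "A1", "qty": 2, "price": 9.99},
        {"sku": "B2", "qty": 1, "price": 24.50},
    ],
    "payment": {"method": "card", "status": "captured"},
}

# Traversing the embedded structure replaces what would be multi-table JOINs.
total = sum(line["qty"] * line["price"] for line in order["lines"])
```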
By the way:
- Row-level locking was a hugely important feature in RDBMS about 20 years ago. Sybase’s lack of it is a big part of what doomed them to second-tier status.
- Going forward, MongoDB has made the unsurprising marketing decision to talk about “locks” as little as possible, relying instead on alternate terms such as “concurrency control”.
Since its replication mechanism is transparent to the storage engine, MongoDB allows one to use different storage engines for different replicas of data. Reasons one might want to do this include:
- Fastest persistent writes (WiredTiger engine).
- Fastest reads (wholly in-memory engine).
- Migration from one engine to another.
- Integration with some other data store. (Imagine, for example, a future storage engine that works over HDFS. It probably wouldn’t have top performance, but it might make Hadoop integration easier.)
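A mixed-engine replica set is just a matter of per-process startup flags. The hosts and paths below are hypothetical, and the flags are the MongoDB 3.0-era ones (check your version’s documentation):

```shell
# Three members of one replica set, two on WiredTiger and one on MMAPv1.
# Replication is transparent to the storage engine, so this mix is legal.
mongod --replSet rs0 --port 27017 --dbpath /data/n1 --storageEngine wiredTiger
mongod --replSet rs0 --port 27018 --dbpath /data/n2 --storageEngine wiredTiger
mongod --replSet rs0 --port 27019 --dbpath /data/n3 --storageEngine mmapv1
```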
In theory one can even do a bit of information lifecycle management (ILM), by using different storage engines for different subsets of a database, by:
- Pinning specific shards of data to specific servers.
- Using different storage engines on those different servers.
That said, similar stories have long been told about MySQL, and I’m not aware of many users who run multiple storage engines side by side.
The MongoDB WiredTiger option is shipping with a couple of options for block-level compression (plus prefix compression that is being used for indexes only). The full WiredTiger product also has some forms of columnar compression for data.
One other feature in MongoDB 3.0 is the ability to have 50 replicas of data (the previous figure was 12). MongoDB can’t think of a great reason to have more than 3 replicas per data center or more than 2 replicas per metropolitan area, but some customers want to replicate data to numerous locations around the world.
- I occasionally post a few notes about MongoDB use cases, e.g. last May.
There are numerous ways that technology, now or in the future, can significantly improve personal safety. Three of the biggest areas of application are or will be:
- Crime prevention.
- Vehicle accident prevention.
- Medical emergency prevention and response.
Implications will be dramatic for numerous industries and government activities, including but not limited to law enforcement, automotive manufacturing, infrastructure/construction, health care and insurance. Further, these technologies create a near-certainty that individuals’ movements and status will be electronically monitored in fine detail. Hence their development and eventual deployment constitutes a ticking clock toward a deadline for society deciding what to do about personal privacy.
Theoretically, humans aren’t the only potential kind of tyrants. Science fiction author Jack Williamson postulated a depressing nanny-technology in With Folded Hands, the idea for which was later borrowed by the humorous Star Trek episode I, Mudd.
Of these three areas, crime prevention is the furthest along; in particular, sidewalk cameras, license plate cameras and internet snooping are widely deployed around the world. So let’s consider the other two.
Vehicle accident prevention
Suppose every automobile on the road “knew” where all nearby vehicles were, and their speed and direction as well. Then it could also “know” the safest and fastest ways to move you along. You might actively drive, while it advised and warned you; it might be the default “driver”, with you around to override. In-between possibilities exist as well.
Frankly, I don’t know how expensive a suitably powerful and rugged transponder for such purposes would be. I also don’t know to what extent the most efficient solutions would involve substantial investment in complementary, stationary equipment. But I imagine the total cost would be relatively small compared to that of automobiles or auto insurance.
Universal deployment of such technology could be straightforward. If the government can issue you license plates, it can issue transponders as well, or compel you to get your own. It would have several strong motivations to do so, including:
- Electronic toll collection — this is already happening in a significant fraction of automobiles around the world.
- Snooping for the purpose of law enforcement.
- Accident prevention.
- (The biggest of all.) Easing the transition to autonomous vehicles.
Insurance companies have their own motivations to support safety-related technology. And the automotive industry has long been aggressive in incorporating microprocessor technology. Putting that all together, I am confident in the prediction: Smart cars are going to happen.
The story goes further yet. Despite improvements in safety technology, accidents will still happen. And the same location-tracking technology used for real-time accident avoidance should provide a massive boost to post-accident forensics, for use in:
- Insurance adjudication (obviously and often),
- Criminal justice (when the accident has criminal implications), and
- Predictive modeling.
The predictive modeling, in turn, could influence (among other areas):
- General automobile design (if a lot of accidents have a common cause, re-engineer to address it).
- Maintenance of specific automobiles (if the car’s motion is abnormal, have it checked out).
- Individual drivers’ insurance rates.
Transportation is going to change a lot.
Medical emergency prevention and response
I both expect and welcome the rise of technology that helps people who can’t reliably take care of themselves (babies, the elderly) to be continually monitored. My father and aunt might each have lived longer if such technology had been available sooner. But while the life-saving emergency response uses will be important enough, emergency avoidance may be an even bigger deal. Much as in my discussion above of cars, the technology could also be used to analyze when an old person is at increasing risk of falls or other incidents. In a world where families live apart but nursing homes are terrible places, this could all be a very important set of developments.
Another area where the monitoring/response/analysis/early-warning cycle could work is cardio-vascular incidents. I imagine we’ll soon have wearable devices that help detect the development or likelihood of various kinds of blockages, and hence forestall cardiovascular emergencies, such as those that often befall seemingly-healthy middle-aged people. Over time, I think those devices will become pretty effective. The large market opportunity should be obvious.
Once life-and-death benefits lead the way, I expect less emergency-focused kinds of fitness monitoring to find receptive consumers as well. (E.g. in the intestinal/nutrition domain.) And so I have another prediction (with an apology to Socrates): The unexamined life will seem too dangerous to continue living.
Trivia challenge: Where was the wordplay in that last paragraph?
- My overview of innovation opportunities ended by saying there was great opportunity in devices. It also offered notes on predictive modeling and so on.
- My survey of technologies around machine-generated data ended by focusing on predictive modeling for problem and anomaly detection and diagnosis, for machines and bodies alike.
- The topics of this post are part of why I’m bullish on machine-generated data growth.
- I think soft robots that also provide practical assistance could become a big part of health-related monitoring.
In one of my favorite posts, namely When I am a VC Overlord, I wrote:
I will not fund any entrepreneur who mentions “market projections” in other than ironic terms. Nobody who talks of market projections with a straight face should be trusted.
Even so, I got talked today into putting on the record a prediction that machine-generated data will grow at more than 40% for a while.
My reasons for this opinion are little more than:
- Moore’s Law suggests that the same expenditure will buy 40% or so more machine-generated data each year.
- Budgets spent on producing machine-generated data seem to be going up.
I was referring to the creation of such data, but the growth rates of new creation and of persistent storage are likely, at least at this back-of-the-envelope level, to be similar.
Anecdotal evidence actually suggests 50-60%+ growth rates, so >40% seemed like a responsible claim.
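The Moore’s Law arithmetic behind that figure is simple, assuming the customary doubling every two years:

```python
# Doubling every two years implies an annual growth rate of 2^(1/2) - 1.
annual = 2 ** 0.5 - 1      # ~0.414, i.e. ~41% per year

# Compounding 40%/year for five years multiplies data volume by ~5.4x.
five_year = 1.4 ** 5
```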
- My recent survey of machine-generated data topics started with a list of many different kinds of the stuff.
- My 2009 post on data warehouse volume growth makes similar points, and notes that high growth rates mean we likely can never afford to keep all machine-generated data permanently.
- My 2011 claim that traditional databases will migrate into RAM is sort of this argument’s flipside.
What will soft, mobile robots be able to do that previous generations cannot? A lot. But I’m particularly intrigued by two large categories:
- Inspection, maintenance and repair.
- Health care/family care assistance.
There are still many things that are hard for humans to keep in good working order, including:
- Power lines.
- Anything that’s underwater (cables, drilling platforms, etc.).
- Pipelines, ducts, and water mains (especially from the inside).
- Any kind of geographically remote power station or other installation.
Sometimes the issue is (hopefully minor) repairs. Sometimes it’s cleaning or lubrication. In some cases one might want to upgrade a structure with fixed sensors, and the “repair” is mainly putting those sensors in place. In all these cases, it seems that soft robots could eventually offer a solution. Further examples, I’m sure, could be found in factories, mines, or farms.
Of course, if there’s a maintenance/repair need, inspection is at least part of the challenge; in some cases it’s almost the whole thing. And so this technology will help lead us toward the point that substantially all major objects will be associated with consistent flows of data. Opportunities for data analysis will abound.
One other point about data flows — suppose you have two kinds of machines that can do a task, one of which is flexible, the other rigid. The flexible one will naturally have much more variance in what happens from one instance of the task to the next one. That’s just another way in which soft robots will induce greater quantities of machine-generated data.
Let’s now consider health care, whose basic characteristics include:
- It’s done to people …
- … especially ones who don’t feel very good.
People who are sick, elderly or whatever can often use help with simple tasks — e.g., taking themselves to the bathroom, or fetching a glass of water. So can their caretakers — e.g., turning a patient over in bed. That’s even before we get to more medical tasks such as checking and re-bandaging an awkwardly-placed wound. And on the healthier side, I wouldn’t mind having a robot around the house that could, for example, spot me with free weights. Fully general forms of this seem rather futuristic. But even limited forms might augment skilled-nurse labor, or let people stay in their own homes who at the moment can’t quite make it there.
And, once again, any of these use cases would likely be associated with its own stream(s) of observational and introspective data.
- Part 1 of this series was a quick introduction to soft and mobile robotics.
There may be no other subject on which I’m so potentially biased as robotics, given that:
- I don’t spend a lot of time on the area, but …
- … one of the better robotics engineers in the world (Kevin Albert) just happens to be in my family …
- … and thus he’s overwhelmingly my main source on the general subject of robots.
Still, I’m solely responsible for my own posts and opinions, while Kevin is busy running his startup (Pneubotics) and raising my grandson. And by the way — I’ve been watching the robotics industry slightly longer than Kevin has been alive.
My overview messages about all this are:
- Historically, robots have been very limited in their scope of motion and action. Indeed, most successful robots to date have been immobile, metallic programmable machines, serving on classic assembly lines.
- Next-generation robots should and will be much more able to safely and effectively navigate through and work within general human-centric environments.
- This will affect a variety of application areas in ways that readers of this blog may care about.
Examples of the first point may be found in any number of automobile factory videos, such as:
A famous example of the second point is a 5-year-old video of Kevin’s work on prototype robot locomotion, namely:
Walking robots (such as Big Dog) and general soft robots (such as those from Pneubotics) rely on real-time adaptation to physical feedback. Robots have long enjoyed machine vision,* but their touch capabilities have been very limited. Current research/development proposes to solve that problem, hence allowing robots that can navigate uneven real-world surfaces, grip and lift objects of unpredictable weight or position, and minimize consequences when unwanted collisions do occur. (See for example in the video where Big Dog is kicked sideways across a nasty patch of ice.)
*Little-remembered fact — Symantec spun out ~30 years ago from a vision company called Machine Intelligence, back when “artificial intelligence” was viewed as a meaningful product category. Symantec’s first product — which explains the company name — was in natural language query.
Pneubotics and others take this further, by making robots out of soft, light, flexible materials. Benefits will/could include:
- Safety (obviously).
- Cost-effectiveness (better weight/strength ratios -> less power needed -> less lugging of batteries or whatever -> much more capability for actual work).
- Operation in varied environments (underwater, outer space, etc.).
- Better locomotion even on dry land (because of weight and safety).
Above all, soft robots will have more effective senses of touch, as they literally bend and conform to contact with real-world surfaces and objects.
Now let’s turn to some of the implications of soft and mobile robotic technology.
I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:
1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.
- Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
- Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
- Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.
2. Even so, there’s much room for innovation around data movement and management. I’d start with:
- Product maturity is a huge issue for all the above, and will remain one for years.
- Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
- Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
- There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Graph analytics and data management are still confused.
3. As I suggested last year, data transformation is an important area for innovation.
- MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
- The smart data preparation crowd is deservedly getting attention.
- The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.
4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:
- Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones.
- Similarly, more complex clustering.
- Predictive experimentation.
- The use of business intelligence and predictive modeling to inform each other.
- Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
- I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
- While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.
5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:
- “Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
- Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
- … the cost of such process isolation may need to be borne.
- Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.
- It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
- It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
- On servers, we may in many cases be talking about lightweight virtual machines.
And to be clear:
- What I’m talking about would do little to help the authentication/authorization aspects of security, but …
- … those will never be perfect in any case (because they depend upon fallible humans) …
- … which is exactly why other forms of security will always be needed.
6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.
7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:
- Application development technology, languages, frameworks, etc.
- The integration of analytics into old-style operational apps.
- The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.
8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.
There is much confusion about migration, by which I mean applications or investment being moved from one “platform” technology — hardware, operating system, DBMS, Hadoop, appliance, cluster, cloud, etc. — to another. Let’s sort some of that out. For starters:
- There are several fundamentally different kinds of “migration”.
- You can re-host an existing application.
- You can replace an existing application with another one that does similar (and hopefully also new) things. This new application may be on a different platform than the old one.
- You can build or buy a wholly new application.
- There’s also the in-between case in which you extend an old application with significant new capabilities — which may not be well-suited for the existing platform.
- Motives for migration generally fall into a few buckets. The main ones are:
- You want to use a new app, and it only runs on certain platforms.
- The new platform may be cheaper to buy, rent or lease.
- The new platform may have lower operating costs in other ways, such as administration.
- Your employees may like the new platform’s “cool” aspect. (If the employee is sufficiently high-ranking, substitute “strategic” for “cool”.)
- Different apps may be much easier or harder to re-host. At two extremes:
- It can be forbiddingly difficult to re-host an OLTP (OnLine Transaction Processing) app that is heavily tuned, tightly integrated with your other apps, and built using your DBMS vendor’s proprietary stored-procedure language.
- It might be trivial to migrate a few long-running SQL queries to a new engine, and pretty easy to handle the data connectivity part of the move as well.
- Certain organizations, usually packaged software companies, design portability into their products from the get-go, with at least partial success.
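To illustrate the easy end of that spectrum: if a query sticks to the common SQL dialect, re-hosting it can amount to little more than swapping the connection. This is a minimal Python sketch using SQLite as a stand-in for both the old and new engines; a real migration would swap a driver and connection string instead (any DSNs involved would be site-specific).

```python
import sqlite3

# A report query written in plain, portable SQL.
QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"

def run_report(conn):
    # The report itself doesn't care which engine answers,
    # as long as the SQL stays within the common dialect.
    return conn.execute(QUERY).fetchall()

# Stand-in for the "old" engine; migrating means handing run_report()
# a connection to the new engine instead.
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE sales (region TEXT, amount REAL)")
old.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 10.0), ("west", 5.0), ("east", 2.5)])
print(run_report(old))  # [('east', 12.5), ('west', 5.0)]
```

The hard end of the spectrum — tuned, stored-procedure-heavy OLTP — offers no such one-line swap.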
I mixed together true migration and new-app platforms in a post last year about DBMS architecture choices, when I wrote:
- Sometimes something isn’t broken, and doesn’t need fixing.
- Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
- Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.
In particular, migration away from legacy DBMS raises many issues:
- Feature incompatibility (especially in stored-procedure languages and/or other vendor-specific SQL).
- Your staff’s programming and administrative skill-sets.
- Your investment in DBMS-related tools.
- Your supply of hockey tickets from the vendor’s salesman.
Except for the first, those concerns can apply to new applications as well. So if you’re going to use something other than your enterprise-standard RDBMS, you need a good reason.
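To make feature incompatibility concrete: hierarchical queries are a classic case of vendor-specific SQL. The sketch below shows Oracle’s CONNECT BY syntax as an illustrative string (it will not run outside Oracle) next to the portable ANSI recursive-CTE form, which many engines — SQLite here — will execute. The table and names are made up for illustration.

```python
import sqlite3

# Oracle-specific hierarchy query (illustrative only; Oracle dialect):
ORACLE_SQL = """
SELECT emp, mgr FROM staff
START WITH mgr IS NULL
CONNECT BY PRIOR emp = mgr
"""

# The portable ANSI equivalent, accepted by SQLite and many others:
ANSI_SQL = """
WITH RECURSIVE tree(emp, mgr) AS (
  SELECT emp, mgr FROM staff WHERE mgr IS NULL
  UNION ALL
  SELECT s.emp, s.mgr FROM staff s JOIN tree t ON s.mgr = t.emp
)
SELECT emp, mgr FROM tree
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (emp TEXT, mgr TEXT)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("alice", None), ("bob", "alice"), ("carol", "bob")])
rows = conn.execute(ANSI_SQL).fetchall()  # walks the whole hierarchy
```

Multiply this by every proprietary construct in a codebase — stored procedures above all — and the cost of the first bucket becomes clear.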
I then argued that such reasons are likely to exist for NoSQL DBMS, but less commonly for NewSQL. My views on that haven’t changed in the interim.
More generally, my pro-con thoughts on migration start:
- Pure application re-hosting is rarely worthwhile. Migration risks and costs outweigh the benefits, except in a few cases, one of which is the migration of ELT (Extract/Load/Transform) from expensive analytic RDBMS to Hadoop.
- Moving from in-house to co-located data centers can offer straightforward cost savings, because it’s not accompanied by much in the way of programming costs, risks, or delays. Hence Rackspace’s refocus on colo at the expense of cloud. (But it can be hard on your data center employees.)
- Moving to an in-house cluster can be straightforward, and is common. VMware is the most famous such example. Exadata consolidation is another.
- Much of new application/new functionality development is in areas where application lifespans are short — e.g. analytics, or customer-facing internet. Platform changes are then more practical as well.
- New apps or app functionality often should and do go where the data already is. This is especially true in the case of cloud/colo/on-premises decisions. Whether it’s important in a single location may depend upon the challenges of data integration.
I’m also often asked for predictions about migration. In light of the above, I’d say:
- Successful DBMS aren’t going away.
- OLTP workloads can usually be lost only as fast as applications are replaced, and that tends to be a slow process. Claims to the contrary are rarely persuasive.
- Analytic DBMS can lose workloads more easily — but their remaining workloads often grow quickly, creating an offset.
- A large fraction of new apps are up for grabs. Analytic applications go well on new data platforms. So do internet apps of many kinds. The underlying data for these apps often starts out in the cloud. SaaS (Software as a Service) is coming on strong. Etc.
- I stand by my previous view that most computing will wind up on appliances, clusters or clouds.
- New relational DBMS will be slow to capture old workloads, even if they are slathered with in-memory fairy dust.
And for a final prediction — discussion of migration isn’t going to go away either.
Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.
1. There are many kinds of machine-generated data. Important categories include:
- Web, network and other IT logs.
- Game and mobile app event data.
- CDRs (telecom Call Detail Records).
- “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
- Sensor network output (for example from a pipeline or other utility network).
- Vehicle telemetry.
- Health care data in hospitals.
- Digital health data from consumer devices.
- Images from public-safety camera networks.
- Stock tickers (if you regard them as being machine-generated, which I do).
That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.
2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:
- Government snooping on the contents of communications.
- Communication traffic analysis.
- Photos and videos (airport scanners, public cameras, etc.)
- Commercial ad targeting.
- Traditional medical records.
Other areas, however, continue to be overlooked, with the two biggies in my opinion being:
- The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.
- The ability to track people’s movements in great detail, which will grow greatly yet again as the market for consumer digital health matures (and some observers think that will happen soon).
3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …
4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
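To see why event series strain plain SQL: a typical task is sessionization, whose answer depends on the ordering of rows and the gaps between them, not just their values — awkward for classic set-oriented GROUP BY. Here is a minimal sketch; the 30-minute gap threshold is an illustrative assumption, not a standard.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Split a time-sorted event series into sessions wherever the
    inter-event gap exceeds `gap` -- ordering-dependent logic that
    engines express with temporal/window extensions, if at all."""
    sessions, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

base = datetime(2015, 1, 1)
events = [base + timedelta(minutes=m) for m in (0, 10, 50, 55)]
# The 40-minute silence between minute 10 and 50 splits two sessions.
result = sessionize(events)
```

Vertica-style temporal predicates, and SQL window functions generally, exist precisely to pull this kind of logic back into the database.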
5. Even leaving event series aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.
6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. is still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.
Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:
- High-frequency trading is an extreme race; microseconds matter.
- Internet interaction applications increasingly require data freshness to the last click or other user action. Computational latency requirements can go down to the single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.
- Minute-plus response can be fine for individual remote systems. Sometimes they ping home more rarely than that.
There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.
7. My views about predictive analytics are still somewhat confused. For starters:
- The math and technology of predictive modeling both still seem pretty simple …
- … but sometimes achieve mind-blowing results even so.
- There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.
- Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud.
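As an illustration of just how simple the core math can be, here is logistic regression trained by plain gradient descent — the workhorse behind a great deal of marketing and anti-fraud scoring — in a few lines of Python. The one-feature toy data is made up; a real model would add regularization, many features, and proper evaluation.

```python
import math

def train_logistic(points, labels, lr=0.1, steps=2000):
    """Fit one weight and a bias by stochastic gradient descent
    on the logistic loss -- the 'simple math' of much production
    predictive modeling."""
    w = b = 0.0
    for _ in range(steps):
        for x, y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            w -= lr * (p - y) * x                      # gradient step
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5 else 0

xs, ys = [-2, -1, 1, 2], [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
```

The mind-blowing results come less from the math than from the data volumes and feedback loops it gets applied to.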
So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.
- WibiData has some innovative ideas in predictive experimentation.
- Nutonian has some innovative ideas in non-linear modeling for pattern detection/root-cause analysis.
- It’s still at the anecdotal level, but there have been interesting ideas in the rapid retraining of models.
- Ayasdi reminded us that there’s room for innovation in clustering.
- My Thanksgiving round-up post points to a lot of my prior comments on predictive modeling.
Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back before World War II. Perhaps such breakthroughs are already underway; the Conviva retraining example listed above is certainly imaginative. But I’d like to see a lot more in the area.
Even more important, of course, could be some kind of revolution in predictive modeling for medicine.