Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:
- An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
- A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
- An excuse for press releases.
- A source of an extra logo graphic to put on marketing slides.
Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).
The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?
Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:
- Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
- MapR’s flavor, as software from MapR.
- Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.
In saying that, I’m glossing over a few points, such as:
- There are various remote services that run Hadoop, most famously Amazon’s Elastic MapReduce.
- You could get Apache Hadoop directly, rather than using the free or paid versions of a vendor distro. But why would you make that choice, unless you’re an internet bad-ass on the level of Facebook, or at least think that you are?
- There will surely always be some proprietary stuff mixed into, for example, IBM’s BigInsights, so as to preserve at least the perception of all-important vendor lock-in.
But the main point stands — big computer companies, such as IBM, EMC (Pivotal) and previously Intel, are figuring out that they can’t bigfoot something that started out as an elephant — stuffed or otherwise — in the first place.
If you think I’m not taking this whole ODP thing very seriously, you’re right.
- It’s a bit eyebrow-raising to see Mike Olson take a “more open source than thou” stance about something, but basically his post about this news is spot-on.
- My take on Hadoop distributions two years ago might offer context. Trivia question: What’s the connection between the song that begins that post and the joke that ends it?
- Question: Why do policemen work in pairs?
- Answer: One to read and one to write.
A lot has happened in MongoDB technology over the past year. For starters:
- The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
- 7-10X improvement in write performance.
- No change in read performance (which however was boosted in MongoDB 2.6).
- ~70% reduction in data size due to compression (disk only).
- ~50% reduction in index size due to compression (disk and memory both).
- MongoDB has been adding administration modules.
- A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
- An on-premise version came out with 3.0.
- They have similar features, but are expected to grow apart from each other over time. They have different names.
*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.
To forestall confusion, let me quickly add:
- MongoDB acquired the WiredTiger product and company, and continues to sell the product on a standalone basis, as well as bundling a version into MongoDB. This could cause confusion because …
- … the standalone version of WiredTiger has numerous capabilities that are not in the bundled MongoDB storage engine.
- There’s some ambiguity as to when MongoDB first “ships” a feature, in that …
- … code goes to open source with an earlier version number than it goes into the packaged product.
I should also clarify that the addition of WiredTiger is really two different events:
- MongoDB added the ability to have multiple plug-compatible storage engines. Depending on how one counts, MongoDB now ships two or three engines:
- Its legacy engine, now called MMAP v1 (for “Memory Map”). MMAP continues to be enhanced.
- The WiredTiger engine.
- A “please don’t put this immature thing into production yet” memory-only engine.
- WiredTiger is now the particular storage engine MongoDB recommends for most use cases.
I’m not aware of any other storage engines using this architecture at this time. In particular, last I heard TokuMX was not an example. (Edit: Actually, see Tim Callaghan’s comment below.)
Most of the issues in MongoDB write performance have revolved around locking, the story on which is approximately:
- Until MongoDB 2.2, locks were held at the process level. (One MongoDB process can control multiple databases.)
- As of MongoDB 2.2, locks were held at the database level, and some sanity was added as to how long they would last.
- As of MongoDB 3.0, MMAP locks are held at the collection level.
- WiredTiger locks are held at the document level. Thus MongoDB 3.0 with WiredTiger breaks what was previously a huge write performance bottleneck.
In understanding that, I found it helpful to do a partial review of what “documents” and so on in MongoDB really are.
- A MongoDB document is somewhat like a record, except that it can be more like what in a relational database would be all the records that define a business object, across dozens or hundreds of tables.*
- A MongoDB collection is somewhat like a table, although the documents that comprise it do not need to each have the same structure.
- MongoDB documents want to be capped at 16 MB in size. If you need one bigger, there’s a special capability called GridFS to break it into lots of little pieces (default = 1KB) while treating it as a single document logically.
*One consequence — MongoDB’s single-document ACID guarantees aren’t quite as lame as single-record ACID guarantees would be in an RDBMS.
By the way:
- Row-level locking was a hugely important feature in RDBMS about 20 years ago. Sybase’s lack of it is a big part of what doomed them to second-tier status.
- Going forward, MongoDB has made the unsurprising marketing decision to talk about “locks” as little as possible, relying instead on alternate terms such as “concurrency control”.
Since its replication mechanism is transparent to the storage engine, MongoDB allows one to use different storage engines for different replicas of data. Reasons one might want to do this include:
- Fastest persistent writes (WiredTiger engine).
- Fastest reads (wholly in-memory engine).
- Migration from one engine to another.
- Integration with some other data store. (Imagine, for example, a future storage engine that works over HDFS. It probably wouldn’t have top performance, but it might make Hadoop integration easier.)
In theory one can even do a bit of information lifecycle management (ILM), by using different storage engines for different subsets of database, by:
- Pinning specific shards of data to specific servers.
- Using different storage engines on those different servers.
That said, similar stories have long been told about MySQL, and I’m not aware of many users who run multiple storage engines side by side.
The MongoDB WiredTiger option is shipping with a couple of options for block-level compression (plus prefix compression that is being used for indexes only). The full WiredTiger product also has some forms of columnar compression for data.
One other feature in MongoDB 3.0 is the ability to have 50 replicas of data (the previous figure was 12). MongoDB can’t think of a great reason to have more than 3 replicas per data center or more than 2 replicas per metropolitan area, but some customers want to replicate data to numerous locations around the world.
- I occasionally post a few notes about MongoDB use cases, e.g. last May.
There are numerous ways that technology, now or in the future, can significantly improve personal safety. Three of the biggest areas of application are or will be:
- Crime prevention.
- Vehicle accident prevention.
- Medical emergency prevention and response.
Implications will be dramatic for numerous industries and government activities, including but not limited to law enforcement, automotive manufacturing, infrastructure/construction, health care and insurance. Further, these technologies create a near-certainty that individuals’ movements and status will be electronically monitored in fine detail. Hence their development and eventual deployment constitutes a ticking clock toward a deadline for society deciding what to do about personal privacy.
Theoretically, humans aren’t the only potential kind of tyrants. Science fiction author Jack Williamson postulated a depressing nanny-technology in With Folded Hands, the idea for which was later borrowed by the humorous Star Trek episode I, Mudd.
Of these three areas, crime prevention is the furthest along; in particular, sidewalk cameras, license plate cameras and internet snooping are widely deployed around the world. So let’s consider the other two.
Vehicle accident prevention
Suppose every automobile on the road “knew” where all nearby vehicles were, and their speed and direction as well. Then it could also “know” the safest and fastest ways to move you along. You might actively drive, while it advised and warned you; it might be the default “driver”, with you around to override. Inbetween possibilities exist as well.
Frankly, I don’t know how expensive a suitably powerful and rugged transponder for such purposes would be. I also don’t know to what extent the most efficient solutions would involve substantial investment in complementary, stationary equipment. But I imagine the total cost would be relatively small compared to that of automobiles or auto insurance.
Universal deployment of such technology could be straightforward. If the government can issue you license plates, it can issue transponders as well, or compel you to get your own. It would have several strong motivations to do so, including:
- Electronic toll collection — this is already happening in a significant fraction of automobiles around the world.
- Snooping for the purpose of law enforcement.
- Accident prevention.
- (The biggest of all.) Easing the transition to autonomous vehicles.
Insurance companies have their own motivations to support safety-related technology. And the automotive industry has long been aggressive in incorporating microprocessor technology. Putting that all together, I am confident in the prediction: Smart cars are going to happen.
The story goes further yet. Despite improvements in safety technology, accidents will still happen. And the same location-tracking technology used for real-time accident avoidance should provide a massive boost to post-accident forensics, for use in:
- Insurance adjudication (obviously and often),
- Criminal justice (when the accident has criminal implications), and
- Predictive modeling.
The predictive modeling, in turn, could influence (among other areas):
- General automobile design (if a lot of accidents have a common cause, re-engineer to address it).
- Maintenance of specific automobiles (if the car’s motion is abnormal, have it checked out).
- Individual drivers’ insurance rates.
Transportation is going to change a lot.
Medical emergency prevention and response
I both expect and welcome the rise of technology that helps people who can’t reliably take care of themselves (babies, the elderly) to be continually monitored. My father and aunt might each have lived longer if such technology had been available sooner. But while the life-saving emergency response uses will be important enough, emergency avoidance may be an even bigger deal. Much as in my discussion above of cars, the technology could also be used to analyze when an old person is at increasing risk of falls or other incidents. In a world where families live apart but nursing homes are terrible places, this could all be a very important set of developments.
Another area where the monitoring/response/analysis/early-warning cycle could work is cardio-vascular incidents. I imagine we’ll soon have wearable devices that help detect the development or likelihood of various kinds of blockages, and hence forestall cardiovascular emergencies, such as those that often befall seemingly-healthy middle-aged people. Over time, I think those devices will become pretty effective. The large market opportunity should be obvious.
Once life-and-death benefits lead the way, I expect less emergency-focused kinds of fitness monitoring to find receptive consumers as well. (E.g. in the intestinal/nutrition domain.) And so I have another prediction (with an apology to Socrates): The unexamined life will seem too dangerous to continue living.
Trivia challenge: Where was the wordplay in that last paragraph?
- My overview of innovation opportunities ended by saying there was great opportunity in devices. It also offered notes on predictive modeling and so on.
- My survey of technologies around machine-generated data ended by focusing on predictive modeling for problem and anomaly detection and diagnosis, for machines and bodies alike.
- The topics of this post are part of why I’m bullish on machine-generated data growth.
- I think soft robots that also provide practical assistance could become a big part of health-related monitoring.
In one of my favorite posts, namely When I am a VC Overlord, I wrote:
I will not fund any entrepreneur who mentions “market projections” in other than ironic terms. Nobody who talks of market projections with a straight face should be trusted.
Even so, I got talked today into putting on the record a prediction that machine-generated data will grow at more than 40% for a while.
My reasons for this opinion are little more than:
- Moore’s Law suggests that the same expenditure will buy 40% or so more machine-generated data each year.
- Budgets spent on producing machine-generated data seem to be going up.
I was referring to the creation of such data, but the growth rates of new creation and of persistent storage are likely, at least at this back-of-the-envelope level, to be similar.
Anecdotal evidence actually suggests 50-60%+ growth rates, so >40% seemed like a responsible claim.
- My recent survey of machine-generated data topics started with a list of many different kinds of the stuff.
- My 2009 post on data warehouse volume growth makes similar points, and notes that high growth rates mean we likely can never afford to keep all machine-generated data permanently.
- My 2011 claim that traditional databases will migrate into RAM is sort of this argument’s flipside.
What will soft, mobile robots be able to do that previous generations cannot? A lot. But I’m particularly intrigued by two large categories:
- Inspection, maintenance and repair.
- Health care/family care assistance.
There are still many things that are hard for humans to keep in good working order, including:
- Power lines.
- Anything that’s underwater (cables, drilling platforms, etc.)
- Pipelines, ducts, and water mains (especially from the inside).
- Any kind of geographically remote power station or other installation.
Sometimes the issue is (hopefully minor) repairs. Sometimes it’s cleaning or lubrication. In some cases one might want to upgrade a structure with fixed sensors, and the “repair” is mainly putting those sensors in place. In all these cases, it seems that soft robots could eventually offer a solution. Further examples, I’m sure, could be found in factories, mines, or farms.
Of course, if there’s a maintenance/repair need, inspection is at least part of the challenge; in some cases it’s almost the whole thing. And so this technology will help lead us toward the point that substantially all major objects will be associated with consistent flows of data. Opportunities for data analysis will abound.
One other point about data flows — suppose you have two kinds of machines that can do a task, one of which is flexible, the other rigid. The flexible one will naturally have much more variance in what happens from one instance of the task to the next one. That’s just another way in which soft robots will induce greater quantities of machine-generated data.
Let’s now consider health care, whose basic characteristics include:
- It’s done to people …
- … especially ones who don’t feel very good.
People who are sick, elderly or whatever can often use help with simple tasks — e.g., taking themselves to the bathroom, or fetching a glass water. So can their caretakers — e.g., turning a patient over in bed. That’s even before we get to more medical tasks such as checking and re-bandaging an awkwardly-placed wound. And on the healthier side, I wouldn’t mind having a robot around the house that could, for example, spot me with free weights. Fully general forms of this seem rather futuristic. But even limited forms might augment skilled-nurse labor, or let people stay in their own homes who at the moment can’t quite make it there.
And, once again, any of these use cases would likely be associated with its own stream(s) of observational and introspective data.
- Part 1 of this series was a quick introduction to soft and mobile robotics.
There may be no other subject on which I’m so potentially biased as robotics, given that:
- I don’t spend a lot of time on the area, but …
- … one of the better robotics engineers in the world (Kevin Albert) just happens to be in my family …
- … and thus he’s overwhelmingly my main source on the general subject of robots.
Still, I’m solely responsible for my own posts and opinions, while Kevin is busy running his startup (Pneubotics) and raising my grandson. And by the way — I’ve been watching the robotics industry slightly longer than Kevin has been alive.
My overview messages about all this are:
- Historically, robots have been very limited in their scope of motion and action. Indeed, most successful robots to date have been immobile, metallic programmable machines, serving on classic assembly lines.
- Next-generation robots should and will be much more able to safely and effectively navigate through and work within general human-centric environments.
- This will affect a variety of application areas in ways that readers of this blog may care about.
Examples of the first point may be found in any number of automobile factory videos, such as:
A famous example of the second point is a 5-year-old video of Kevin’s work on prototype robot locomotion, namely:
Walking robots (such as Big Dog) and general soft robots (such as those from Pneubotics) rely on real-time adaptation to physical feedback. Robots have long enjoyed machine vision,* but their touch capabilities have been very limited. Current research/development proposes to solve that problem, hence allowing robots that can navigate uneven real-world surfaces, grip and lift objects of unpredictable weight or position, and minimize consequences when unwanted collisions do occur. (See for example in the video where Big Dog is kicked sideways across a nasty patch of ice.)
*Little-remembered fact — Symantec spun out ~30 years ago from a vision company called Machine Intelligence, back when “artificial intelligence” was viewed as a meaningful product category. Symantec’s first product — which explains the company name — was in natural language query.
Pneubotics and others take this further, by making robots out of soft, light, flexible materials. Benefits will/could include:
- Safety (obviously).
- Cost-effectiveness (better weight/strength ratios –> less power needed –> less lugging of batteries or whatever –> much more capability for actual work).
- Operation in varied environments (underwater, outer space, etc.).
- Better locomotion even on dry land (because of weight and safety).
Above all, soft robots will have more effective senses of touch, as they literally bend and conform to contact with real-world surfaces and objects.
Now let’s turn to some of the implications of soft and mobile robotic technology.
I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:
1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.
- Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
- Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
- Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.
2. Even so, there’s much room for innovation around data movement and management. I’d start with:
- Product maturity is a huge issue for all the above, and will remain one for years.
- Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
- Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
- There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Graph analytics and data management are still confused.
3. As I suggested last year, data transformation is an important area for innovation.
- MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
- The smart data preparation crowd is deservedly getting attention.
- The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.
4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:
- Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones.
- Similarly, more complex clustering.
- Predictive experimentation.
- The use of business intelligence and predictive modeling to inform each other.
- Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
- I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
- While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.
5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:
- “Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
- Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
- … the cost of such process isolation may need to be borne.
- Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.
- It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
- It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
- On servers, we may in many cases be talking about lightweight virtual machines.
And to be clear:
- What I’m talking about would do little to help the authentication/authorization aspects of security, but …
- … those will never be perfect in any case (because they depend upon fallible humans) …
- … which is exactly why other forms of security will always be needed.
6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.
7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:
- Application development technology, languages, frameworks, etc.
- The integration of analytics into old-style operational apps.
- The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.
8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.