In one of my favorite posts, namely "When I am a VC Overlord", I wrote:
I will not fund any entrepreneur who mentions “market projections” in other than ironic terms. Nobody who talks of market projections with a straight face should be trusted.
Even so, today I got talked into putting on the record a prediction that machine-generated data will grow at more than 40% per year for a while.
My reasons for this opinion are little more than:
- Moore’s Law suggests that the same expenditure will buy 40% or so more machine-generated data each year.
- Budgets spent on producing machine-generated data seem to be going up.
I was referring to the creation of such data, but the growth rates of new creation and of persistent storage are likely, at least at this back-of-the-envelope level, to be similar.
Anecdotal evidence actually suggests 50-60%+ growth rates, so >40% seemed like a responsible claim.
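To put those percentages in perspective, here's a minimal back-of-the-envelope calculation (in Python; the rates are just the 40-60% range mentioned above):

```python
# Back-of-the-envelope: how do 40-60% annual growth rates compound?
for annual_rate in (0.40, 0.50, 0.60):
    for years in (5, 10):
        multiple = (1 + annual_rate) ** years
        print(f"{annual_rate:.0%}/year for {years} years -> {multiple:.1f}x the data")
```

At 40% per year, volumes multiply roughly 29-fold in a decade, which is why questions about affording permanent retention keep coming up.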
Related posts of mine include:
- My recent survey of machine-generated data topics started with a list of many different kinds of the stuff.
- My 2009 post on data warehouse volume growth makes similar points, and notes that high growth rates mean we likely can never afford to keep all machine-generated data permanently.
- My 2011 claim that traditional databases will migrate into RAM is sort of this argument’s flipside.
What will soft, mobile robots be able to do that previous generations cannot? A lot. But I’m particularly intrigued by two large categories:
- Inspection, maintenance and repair.
- Health care/family care assistance.
There are still many things that are hard for humans to keep in good working order, including:
- Power lines.
- Anything that's underwater (cables, drilling platforms, etc.).
- Pipelines, ducts, and water mains (especially from the inside).
- Any kind of geographically remote power station or other installation.
Sometimes the issue is (hopefully minor) repairs. Sometimes it’s cleaning or lubrication. In some cases one might want to upgrade a structure with fixed sensors, and the “repair” is mainly putting those sensors in place. In all these cases, it seems that soft robots could eventually offer a solution. Further examples, I’m sure, could be found in factories, mines, or farms.
Of course, if there’s a maintenance/repair need, inspection is at least part of the challenge; in some cases it’s almost the whole thing. And so this technology will help lead us toward the point that substantially all major objects will be associated with consistent flows of data. Opportunities for data analysis will abound.
One other point about data flows — suppose you have two kinds of machines that can do a task, one of which is flexible, the other rigid. The flexible one will naturally have much more variance in what happens from one instance of the task to the next one. That’s just another way in which soft robots will induce greater quantities of machine-generated data.
Let’s now consider health care, whose basic characteristics include:
- It’s done to people …
- … especially ones who don’t feel very good.
People who are sick, elderly or whatever can often use help with simple tasks — e.g., taking themselves to the bathroom, or fetching a glass of water. So can their caretakers — e.g., turning a patient over in bed. That's even before we get to more medical tasks such as checking and re-bandaging an awkwardly-placed wound. And on the healthier side, I wouldn't mind having a robot around the house that could, for example, spot me with free weights. Fully general forms of this seem rather futuristic. But even limited forms might augment skilled-nurse labor, or let people stay in their own homes who at the moment can't quite make it there.
And, once again, any of these use cases would likely be associated with its own stream(s) of observational and introspective data.
Part 1 of this series was a quick introduction to soft and mobile robotics.
There may be no other subject on which I’m so potentially biased as robotics, given that:
- I don’t spend a lot of time on the area, but …
- … one of the better robotics engineers in the world (Kevin Albert) just happens to be in my family …
- … and thus he’s overwhelmingly my main source on the general subject of robots.
Still, I’m solely responsible for my own posts and opinions, while Kevin is busy running his startup (Pneubotics) and raising my grandson. And by the way — I’ve been watching the robotics industry slightly longer than Kevin has been alive.
My overview messages about all this are:
- Historically, robots have been very limited in their scope of motion and action. Indeed, most successful robots to date have been immobile, metallic programmable machines, serving on classic assembly lines.
- Next-generation robots should and will be much more able to safely and effectively navigate through and work within general human-centric environments.
- This will affect a variety of application areas in ways that readers of this blog may care about.
Examples of the first point may be found in any number of automobile factory videos.
A famous example of the second point is a 5-year-old video of Kevin's work on prototype robot locomotion, the Big Dog walking robot.
Walking robots (such as Big Dog) and general soft robots (such as those from Pneubotics) rely on real-time adaptation to physical feedback. Robots have long enjoyed machine vision,* but their touch capabilities have been very limited. Current research/development proposes to solve that problem, hence allowing robots that can navigate uneven real-world surfaces, grip and lift objects of unpredictable weight or position, and minimize consequences when unwanted collisions do occur. (See, for example, the moment in the video where Big Dog is kicked sideways across a nasty patch of ice.)
*Little-remembered fact — Symantec spun out ~30 years ago from a vision company called Machine Intelligence, back when “artificial intelligence” was viewed as a meaningful product category. Symantec’s first product — which explains the company name — was in natural language query.
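To make "real-time adaptation to physical feedback" a bit more concrete, here's a toy proportional-control loop in Python. Every number in it is invented, and real robot controllers are enormously more sophisticated, but the sense-compare-correct cycle is the same primitive:

```python
# Toy feedback loop: sense -> compare -> correct, repeated in real time.
# All values are invented for illustration.
target_height = 1.0   # desired body height of a walking robot, in meters
height = 0.6          # current sensed height (e.g., after a stumble)
gain = 0.5            # proportional gain: how aggressively to correct

for step in range(6):
    error = target_height - height   # compare sensed state to the target
    height += gain * error           # actuate a proportional correction
    print(f"step {step}: height={height:.3f}, error was {error:+.3f}")
```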
Pneubotics and others take this further, by making robots out of soft, light, flexible materials. Benefits will/could include:
- Safety (obviously).
- Cost-effectiveness (better weight/strength ratios -> less power needed -> less lugging of batteries or whatever -> much more capability for actual work).
- Operation in varied environments (underwater, outer space, etc.).
- Better locomotion even on dry land (because of weight and safety).
Above all, soft robots will have more effective senses of touch, as they literally bend and conform to contact with real-world surfaces and objects.
Now let’s turn to some of the implications of soft and mobile robotic technology.
I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:
1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.
- Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
- Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
- Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.
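As a hedged illustration of what "schema-on-need" can mean in practice (the records and field names below are hypothetical): variably-shaped records are stored as-is, and a "column" is materialized only when a query first asks for it.

```python
import json

# Hypothetical machine-generated records of varying shape ("variety").
raw_events = [
    '{"device": "pump-7", "ts": 1, "pressure": 87.2}',
    '{"device": "pump-7", "ts": 2, "pressure": 88.0, "alarm": "over-temp"}',
    '{"device": "cam-3", "ts": 2, "frame_id": 9912}',
]

def column(events, field):
    """Schema-on-need: extract a column only when it is first queried."""
    return [json.loads(e).get(field) for e in events]

print(column(raw_events, "pressure"))  # [87.2, 88.0, None]
print(column(raw_events, "alarm"))     # [None, 'over-temp', None]
```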
2. Even so, there’s much room for innovation around data movement and management. I’d start with:
- Product maturity is a huge issue for all the above, and will remain one for years.
- Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
- Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
- There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Graph analytics and graph data management remain confused, immature areas.
3. As I suggested last year, data transformation is an important area for innovation.
- MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop. (A minimal sketch of the pattern follows this list.)
- The smart data preparation crowd is deservedly getting attention.
- The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.
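Here's the promised sketch: a single-process caricature of the MapReduce pattern as applied to data transformation, over hypothetical weblog lines. It is not any particular Hadoop API, just the shape of the computation.

```python
from collections import defaultdict

# Hypothetical raw weblog lines to be transformed into aggregates.
log_lines = [
    "2015-02-01 GET /index.html 200",
    "2015-02-01 GET /missing 404",
    "2015-02-02 GET /index.html 200",
]

def map_phase(line):
    """Map: parse a raw line into (key, value) pairs."""
    date, _method, _path, status = line.split()
    yield (date, status), 1

def reduce_phase(pairs):
    """Reduce: aggregate values that share a key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

shuffled = [pair for line in log_lines for pair in map_phase(line)]
print(reduce_phase(shuffled))  # {('2015-02-01', '200'): 1, ...}
```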
4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:
- Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones.
- Similarly, more complex clustering.
- Predictive experimentation.
- The use of business intelligence and predictive modeling to inform each other.
- Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
- I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
- While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.
5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:
- “Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
- Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
- … the cost of such process isolation may need to be borne.
- Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.
- It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
- It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a performance hit for security's sake without much pain to users' wallets.
- On servers, we may in many cases be talking about lightweight virtual machines.
And to be clear:
- What I’m talking about would do little to help the authentication/authorization aspects of security, but …
- … those will never be perfect in any case (because they depend upon fallible humans) …
- … which is exactly why other forms of security will always be needed.
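For concreteness, here's a minimal sketch, in Python, of the kind of fencing I have in mind: work runs in a separate OS process and talks to its parent only through an explicit channel, so faults are contained, at the price of process-management and context-switching overhead.

```python
from multiprocessing import Pipe, Process

def fenced_worker(conn):
    """Runs in its own OS process: a crash here cannot corrupt the parent."""
    try:
        task = conn.recv()
        conn.send(("ok", sum(task)))   # stand-in for real, possibly risky work
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    worker = Process(target=fenced_worker, args=(child_end,))
    worker.start()
    parent_end.send([1, 2, 3])   # all interaction goes through an explicit channel
    print(parent_end.recv())     # ('ok', 6)
    worker.join()
```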
6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.
7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:
- Application development technology, languages, frameworks, etc.
- The integration of analytics into old-style operational apps.
- The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.
8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.
There is much confusion about migration, by which I mean applications or investment being moved from one “platform” technology — hardware, operating system, DBMS, Hadoop, appliance, cluster, cloud, etc. — to another. Let’s sort some of that out. For starters:
- There are several fundamentally different kinds of “migration”.
- You can re-host an existing application.
- You can replace an existing application with another one that does similar (and hopefully also new) things. This new application may be on a different platform than the old one.
- You can build or buy a wholly new application.
- There's also the in-between case in which you extend an old application with significant new capabilities — which may not be well-suited for the existing platform.
- Motives for migration generally fall into a few buckets. The main ones are:
- You want to use a new app, and it only runs on certain platforms.
- The new platform may be cheaper to buy, rent or lease.
- The new platform may have lower operating costs in other ways, such as administration.
- Your employees may like the new platform’s “cool” aspect. (If the employee is sufficiently high-ranking, substitute “strategic” for “cool”.)
- Different apps may be much easier or harder to re-host. At two extremes:
- It can be forbiddingly difficult to re-host an OLTP (OnLine Transaction Processing) app that is heavily tuned, tightly integrated with your other apps, and built using your DBMS vendor’s proprietary stored-procedure language.
- It might be trivial to migrate a few long-running SQL queries to a new engine, and pretty easy to handle the data connectivity part of the move as well. (There's a sketch of that easy case just after this list.)
- Certain organizations, usually packaged software companies, design portability into their products from the get-go, with at least partial success.
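Here's that sketch, using SQLAlchemy as the generic interface. Both connection strings are hypothetical, and the Hive URL assumes the PyHive dialect is installed; the point is just that when an application speaks portable SQL through a generic layer, changing engines can be close to a configuration change.

```python
from sqlalchemy import create_engine, text

# The query itself is portable SQL; only the engine URL names a platform.
LONG_RUNNING_QUERY = text("SELECT region, SUM(amount) FROM sales GROUP BY region")

# Hypothetical connection strings; swapping engines is a configuration change.
old_engine = create_engine("postgresql://user:pw@legacy-dw/warehouse")
new_engine = create_engine("hive://analyst@hadoop-cluster:10000/default")

def run_report(engine):
    with engine.connect() as conn:
        return conn.execute(LONG_RUNNING_QUERY).fetchall()

# run_report(old_engine) and run_report(new_engine) should agree, assuming
# the data has been copied over and the SQL dialects overlap.
```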
I mixed together true migration and new-app platforms in a post last year about DBMS architecture choices, when I wrote:
- Sometimes something isn’t broken, and doesn’t need fixing.
- Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
- Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.
In particular, migration away from legacy DBMS raises many issues:
- Feature incompatibility (especially in stored-procedure languages and/or other vendor-specific SQL).
- Your staff’s programming and administrative skill-sets.
- Your investment in DBMS-related tools.
- Your supply of hockey tickets from the vendor’s salesman.
Except for the first, those concerns can apply to new applications as well. So if you’re going to use something other than your enterprise-standard RDBMS, you need a good reason.
I then argued that such reasons are likely to exist for NoSQL DBMS, but less commonly for NewSQL. My views on that haven’t changed in the interim.
More generally, my pro-con thoughts on migration start:
- Pure application re-hosting is rarely worthwhile. Migration risks and costs outweigh the benefits, except in a few cases, one of which is the migration of ELT (Extract/Load/Transform) from expensive analytic RDBMS to Hadoop.
- Moving from in-house to co-located data centers can offer straightforward cost savings, because it’s not accompanied by much in the way of programming costs, risks, or delays. Hence Rackspace’s refocus on colo at the expense of cloud. (But it can be hard on your data center employees.)
- Moving to an in-house cluster can be straightforward, and is common. VMware is the most famous such example. Exadata consolidation is another.
- Much of new application/new functionality development is in areas where application lifespans are short — e.g. analytics, or customer-facing internet. Platform changes are then more practical as well.
- New apps or app functionality often should and do go where the data already is. This is especially true in the case of cloud/colo/on-premises decisions. Whether it’s important in a single location may depend upon the challenges of data integration.
I’m also often asked for predictions about migration. In light of the above, I’d say:
- Successful DBMS aren’t going away.
- OLTP workloads can usually be lost only as fast as applications are replaced, and that tends to be a slow process. Claims to the contrary are rarely persuasive.
- Analytic DBMS can lose workloads more easily — but their remaining workloads often grow quickly, creating an offset.
- A large fraction of new apps are up for grabs. Analytic applications go well on new data platforms. So do internet apps of many kinds. The underlying data for these apps often starts out in the cloud. SaaS (Software as a Service) is coming on strong. Etc.
- I stand by my previous view that most computing will wind up on appliances, clusters or clouds.
- New relational DBMS will be slow to capture old workloads, even if they are slathered with in-memory fairy dust.
And for a final prediction — discussion of migration isn’t going to go away either.
Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.
1. There are many kinds of machine-generated data. Important categories include:
- Web, network and other IT logs.
- Game and mobile app event data.
- CDRs (telecom Call Detail Records).
- “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
- Sensor network output (for example from a pipeline or other utility network).
- Vehicle telemetry.
- Health care data, in hospitals.
- Digital health data from consumer devices.
- Images from public-safety camera networks.
- Stock tickers (if you regard them as being machine-generated, which I do).
That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.
2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:
- Government snooping on the contents of communications.
- Communication traffic analysis.
- Photos and videos (airport scanners, public cameras, etc.)
- Commercial ad targeting.
- Traditional medical records.
Other areas, however, continue to be overlooked, with the two biggies in my opinion being:
- The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.
- The ability to track people's movements in great detail, which will increase greatly yet again as the market for consumer digital health matures (and some think that will happen soon).
3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …
4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
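Since the event-series point is easier to show than to describe, here's a small pandas sketch (the data is hypothetical) of an archetypal event-series question: time elapsed since the previous event from the same device. It's one line given ordered, partitioned operations, which is exactly what the temporal SQL extensions try to provide.

```python
import pandas as pd

# Hypothetical sensor events; the archetypal event-series question is
# "what happened relative to the previous event from the same source?"
events = pd.DataFrame({
    "device": ["pump-7", "pump-7", "cam-3", "pump-7", "cam-3"],
    "ts": pd.to_datetime(["2015-02-01 00:00", "2015-02-01 00:05",
                          "2015-02-01 00:06", "2015-02-01 01:00",
                          "2015-02-01 00:07"]),
})

events = events.sort_values(["device", "ts"])
# Gap since the previous event from the same device: an ordered, partitioned
# (window-style) operation.
events["gap"] = events.groupby("device")["ts"].diff()
print(events)
```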
5. Event series even aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.
6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: a Spark/Kafka-style stack is still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.
Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:
- High-frequency trading is an extreme race; microseconds matter.
- Internet interaction applications increasingly require data freshness to the last click or other user action. Computational latency requirements can go down to single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.
- Minute-plus response can be fine for individual remote systems. Sometimes they ping home more rarely than that.
There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.
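As a toy illustration of the batch-vs-interactive point (the data is entirely hypothetical): an incremental aggregate is fresh to the last event, whereas a batch job is only as fresh as its last run.

```python
from collections import Counter

def event_stream():
    """Stand-in for a real feed (a Kafka topic, a log tail, etc.)."""
    yield from [("pump-7", "ok"), ("pump-7", "alarm"), ("cam-3", "ok")]

# Incremental processing: state is updated per event, so any query against
# status_counts reflects the last event, not the last batch run.
status_counts = Counter()
for device, status in event_stream():
    status_counts[status] += 1
    print(f"after {device}: {dict(status_counts)}")
```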
7. My views about predictive analytics are still somewhat confused. For starters:
- The math and technology of predictive modeling both still seem pretty simple …
- … but sometimes achieve mind-blowing results even so.
- There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.
- Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud.
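To ground the claim that the simple stuff is indeed simple, here's a hedged scikit-learn sketch on synthetic data (a stand-in for, say, a churn or anti-fraud dataset). A few lines suffice to fit both a basic linear model and a somewhat more complex one; which scores better depends on the data, and the point is only how little code and math the baseline requires.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a customer-understanding dataset (churn, fraud, etc.).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000),       # the "simple stuff"
              RandomForestClassifier(random_state=0)):  # a (mildly) complex model
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```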
So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.
- WibiData has some innovative ideas in predictive experimentation.
- Nutonian has some innovative ideas in non-linear modeling for pattern detection/root-cause analysis.
- It’s still at the anecdotal level, but there have been interesting ideas in the rapid retraining of models.
- Ayasdi reminded us that there’s room for innovation in clustering.
- My Thanksgiving round-up post points to a lot of my prior comments on predictive modeling.
Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I'd say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit-score areas that are already going strong. That's an old discipline; the concept of statistical process control dates back before World War II. Perhaps such breakthroughs are already underway; the Conviva retraining example mentioned above is certainly imaginative. But I'd like to see a lot more in the area.
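For reference, the pre-World-War-II baseline I mean is the Shewhart-style control chart: flag any reading more than three standard deviations from the in-control mean. A minimal sketch, with hypothetical readings:

```python
import statistics

# Hypothetical sensor readings from a machine believed to be in control.
baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 9.7, 10.3, 10.0]
mean = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
upper, lower = mean + 3 * sigma, mean - 3 * sigma

new_readings = [10.1, 9.9, 11.5, 10.0]
for i, x in enumerate(new_readings):
    if not (lower <= x <= upper):
        print(f"reading {i} = {x}: out of control (limits {lower:.2f}..{upper:.2f})")
```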
Even more important, of course, could be some kind of revolution in predictive modeling for medicine.
If you missed Fishbowl’s recent webinar on our new Enterprise Information Portal for Project Management, you can now view a recording of it on YouTube.
The webinar, "Innovation in Managing the Chaos of Everyday Project Management", discusses our strategy for leveraging the content management and collaboration features of Oracle WebCenter to enable project-centric organizations to build and deploy a project management portal. This solution was designed especially for groups like engineering and construction (E&C) firms and oil and gas companies that need multiple applications combined into one portal for simple access.
If you’d like to learn more about the Enterprise Information Portal for Project Management, visit our website or email our sales team at email@example.com.