It’s difficult to project the rate of IT change in health care, because:
- Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
- … health care is heavily bureaucratic, political and regulated.
Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:
- The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
- Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
- Most tests exploit electronic technology. Progress in electronics is intense.
- Biomedical research is itself intense.
- In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
These vastly greater amounts of data cited above will allow for greatly changed analytics.
- Right now, medical decisions are made based on research that looks at a few data points each for a specially-recruited sample of patients, then draws conclusions based on simplistic and questionable statistical methods.
- More sophisticated analytic methods are commonly used, but almost always still to aid in the discovery and formation of hypotheses that will then be validated, if at all, using the bad old analytic techniques.
- State of the art predictive modeling, applied to vastly more data, will surely yield greatly better results.
And so I believe that health care itself will be revolutionized.
- Diagnosis will be much more accurate, pretty much across the board, except in those limited areas where it’s already excellent today.
- Medication regimens will be much more personalized. (Pharma manufacturing may have to change greatly as a result.) So will other treatments. So will diet/fitness regimens.
- The vulnerable (elderly, hospital patients) will be more accurately and comprehensively monitored. Also, their care will likely be aided by robotics.
- Some of the same things will be true of infants and toddlers. (In other cases, they get such close attention today that I can’t imagine how it could be greatly increased. )
I believe that this will all happen because I believe that it will make health care vastly more successful. And if I’m right about that, no obstacles will be able to prevent it from coming into play — not cost (which will keep going down in a quasi-Moore’s-Law way), not bureaucratic inertia (although that will continue to slow things greatly), and not privacy fears (despite the challenges cited below).
So what are the IT implications of all this?
- I already mentioned the need for new (or newly-used) kinds of predictive modeling.
- Probably in association with those, event detection — which in many but not all cases will amount to anomaly detection — will be huge. If one goal is to let the elderly and ailing live independently, but receive help when it’s needed — well, recognizing when that help is needed will be crucial. Similar dynamics will occur in hospitals.
- And in support of that, there will be great amount of monitoring, and hence strong demands upon sensors and recognition. Potentially, all five human senses will be mimicked, among others. These technologies will become even more important in health care if I’m right that robotics will play a big role.
- Data quality will be a major challenge, especially in the doctors’-notes parts of health records. Reasons start:
- Different medical professionals might evaluate the same situation differently; diagnosis is a craft rather than a dumb, repeatable skill.
- If entries are selected from a predefined set of options, none may be a perfect match to the doctor’s actual opinion.
- Doctors often say what’s needful to have their decisions (care, tests, etc.) approved, whether or not it precisely matches what they really think. Thus, there are significant incentives to enter bad data.
- Free-text data is more central to health care than to many other application areas, and text data is inherently dirty.
- Health records are decades later than many other applications in moving from paper to IT.
- Data integration problems will also be and indeed already are huge, because different health care providers have addressed the tough challenges of record-keeping in different ways.
As for data management — well, almost everything discussed in this blog could come into play.
- A person’s entire medical record resembles the kind of mess increasingly often dumped these days into NoSQL — typically MongoDB, Cassandra, or HBase.
- There are plenty of business-transaction records in the mix, of the kind that have long been managed by RDBMS.
- There are a whole lot of diverse machines in the mix, and managing the data to keep such a menagerie running is commonly the job of Splunk or streaming-enhanced Hadoop.
- There’s a lot of free text in medical records. Also images, video and so on.
- Since graph analytics is used in research today, it might at some point make its way into clinical use.
Finally, let me say:
- Data-driven medicine cannot live up to its potential unless researchers can investigate data sets comprising private information of large numbers of people.
- Researchers will not have the appropriate permissions unless privacy law moves toward a basis in data use, rather than exclusively regulating data possession.
- The New York Times and Hacker News discussed the benefits of using your own medical records a couple months ago.
- I wrote about the monitoring/early response aspects of health care in February, 2015.
- Perhaps my most recent survey of privacy issues was in September, 2014.
- A pretty good survey of the debate about statistical methods in medical research came out in December, 2013.
I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:
- MemSQL started out as in-memory OTLP (OnLine Transaction Processing) DBMS …
- … but quickly positioned with “We also do ‘real-time’ analytic processing” …
- … and backed that up by adding a flash-based column store option …
- … before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
- There’s also a JSON option.
The main new aspects of MemSQL 4.0 are:
- Geospatial indexing. This is for me the most interesting part.
- A new optimizer and, I suppose, query planner …
- … which in particular allow for serious distributed joins.
- Some rather parallel-sounding connectors to Spark. Hadoop and Amazon S3.
- Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.
There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint is as well.
Before MemSQL 4.0, distributed joins were restricted to the easy cases:
- Two tables are distributed (i.e. sharded) on the same key.
- One table is small enough to be broadcast to each node.
Now arbitrary tables can be joined, with data reshuffling as needed. Notes on MemSQL 4.0 joins include:
- Join algorithms are currently nested-loop and hash, and in “narrow cases” also merge.
- MemSQL fondly believes that its in-memory indexes work very well for nested-loop joins.
- The new optimizer is fully cost-based (but I didn’t get much clarity as to the cost estimators for JSON).
- MemSQL’s indexing scheme, skip lists, had histograms anyway, with the cutesy name skiplistogram.
- MemSQL’s queries have always been compiled, and of course have to be planned before compilation. However, there’s a little bit of plan flexibility built in based on the specific values queried for, aka “parameter-sensitive plans” or “run-time plan choosing”.
To understand the Spark/MemSQL connector, recall that MemSQL has “leaf” nodes, which store data, and “aggregator” nodes, which combine query results and ship them back to the requesting client. The Spark/MemSQL connector manages to skip the aggregation step, instead shipping data directly from the various MemSQL leaf nodes to a Spark cluster. In the other direction, a Spark RDD can be saved into MemSQL as a table. This is also somehow parallel, and can be configured either as a batch update or as an append; intermediate “conflict resolution” policies are possible as well.
In other connectivity notes:
- MemSQL’s idea of a lambda architecture involves a Kafka stream, with data likely being stored twice (in Hadoop and MemSQL).
- MemSQL likes and supports the Spark DataFrame API, and says financial trading firms are already using it.
Other application areas cited for streaming/lambda kinds of architectures are — you guessed it! — ad-tech and “anomaly detection”.
And now to the geospatial stuff. I thought I heard:
- A “point” is actually a square region less than 1 mm per side.
- There are on the order of 2^30 such points on the surface of the Earth.
Given that Earth’s surface area is a little over 500,000,000 square meters, I’d think 2^50 would be a better figure, but fortunately that discrepancy doesn’t matter to the rest of the discussion. (Edit: As per a comment below, that’s actually square kilometers, so unless I made further errors we’re up to the 2^70 range.)
Anyhow, if the two popular alternatives for geospatial indexing are R-trees or space-filling curves, MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on space-filling curves start:
- In this context, a space-filling curve is a sequential numbering of points in a higher-dimensional space. (In MemSQL’s case, the dimension is two.)
- Hilbert curves seem to be in vogue, including at MemSQL.
- Nice properties of Hilbert space-filling curves include:
- Numbers near each other always correspond to points near each other.
- The converse is almost always true as well.*
- If you take a sequence of numbers that is simply the set of all possibilities with a particular prefix string, that will correspond to a square region. (The shorter the prefix, the larger the square.)
*You could say it’s true except in edge cases … but then you’d deserve to be punished.
Given all that, my understanding of the way MemSQL indexes geospatial stuff — specifically points and polygons — is:
- Points have numbers assigned to them by the space-filling curve; those are indexed in MemSQL’s usual way. (Skip lists.)
- A polygon is represented by its vertices. Take the longest prefix they share. That could be used to index them (you’d retrieve a square region that includes the polygon). But actually …
- … a polygon is covered by a union of such special square regions, and indexed accordingly, and I neglected to ask exactly how the covering set of squares was chosen.
As for company metrics — MemSQL cites >50 customers and >60 employees.
1. There are multiple ways in which analytics is inherently modular. For example:
- Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
- The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
- Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.
Also, analytics is inherently iterative.
- Everything I just called “modular” can reasonably be called “iterative” as well.
- So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”
If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.
2. In 2011, I wrote, in the context of agile predictive analytics, that
… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.
I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises.
3. Speaking of Gartner, Mark Beyer tweeted
In data management’s future “hybrid” becomes a useless term. Data management is mutable, location agnostic and services oriented.
And that’s why I launched DBMS2 a decade ago, for “DataBase Management System SERVICES”.
A post earlier this year offers a strong clue as to why Mark’s tweet was at least directionally correct: The best structures for writing data are the worst for query, and vice-versa.
4. The foregoing notwithstanding, I continue to believe that there’s a large place in the world for “full-stack” analytics. Of course, some stacks are fuller than others, with SaaS (Software as a Service) offerings probably being the only true complete-stack products.
5. Speaking of full-stack vendors, some of the thoughts in this post were sparked by a recent conversation with Platfora. Platfora, of course, is full-stack except for the Hadoop underneath. They’ve taken to saying “data lake” instead of Hadoop, because they believe:
- It’s a more benefits-oriented than geek-oriented term.
- It seems to be more popular than the roughly equivalent terms “data hub” or “data reservoir”.
6. Platfora is coy about metrics, but does boast of high growth, and had >100 employees earlier this year. However, they are refreshingly precise about competition, saying they primarily see four competitors — Tableau, SAS Visual Analytics, Datameer (“sometimes”), and Oracle Data Discovery (who they view as flatteringly imitative of them).
Platfora seems to have a classic BI “land-and-expand” kind of model, with initial installations commonly being a few servers and a few terabytes. Applications cited were the usual suspects — customer analytics, clickstream, and compliance/governance. But they do have some big customer/big database stories as well, including:
- 100s of terabytes or more (but with a “lens” typically being 5 TB or less).
- 4-5 customers who pressed them to break a previous cap of 2 billion discrete values.
7. Another full-stack vendor, ScalingData, has been renamed to Rocana, for “root cause analysis”. I’m hearing broader support for their ideas about BI/predictive modeling integration. For example, Platfora has something similar on its roadmap.
- I did a kind of analytics overview last month, which had a whole lot of links in it. This post is meant to be additive to that one.
I’m going to be out-of-sorts this week, due to a colonoscopy. (Between the prep, the procedure, and the recovery, that’s a multi-day disablement.) In the interim, here’s a collection of links, quick comments and the like.
1. Are you an engineer considering a start-up? This post is for you. It’s based on my long experience in and around such scenarios, and includes a section on “Deadly yet common mistakes”.
2. There seems to be a lot of confusion regarding the business model at my clients Databricks. Indeed, my own understanding of Databricks’ on-premises business has changed recently. There are no changes in my beliefs that:
- Databricks does not directly license or support on-premises Spark users. Rather …
- … it helps partner companies to do so, where:
- Examples of partner companies include usual-suspect Hadoop distribution vendors, and DataStax.
- “Help” commonly includes higher-level support.
However, I now get the impression that revenue from such relationships is a bigger deal to Databricks than I previously thought.
Databricks, by the way, has grown to >50 people.
3. DJ Patil and Ruslan Belkin apparently had a great session on lessons learned, covering a lot of ground. Many of the points are worth reading, but one in particular echoed something I’m hearing lots of places — “Data is super messy, and data cleanup will always be literally 80% of the work.” Actually, I’d replace the “always” by something like “very often”, and even that mainly for newish warehouses, data marts or datasets. But directionally the comment makes a whole lot of sense.
4. Of course, dirty data is a particular problem when the data is free-text.
5. In 2010 I wrote that the use of textual news information in investment algorithms had become “more common”. It’s become a bigger deal since. For example:
- It seems to be quite profitable to do automated options trading based on the parsing of tweets.
- In a funny example, Tesla motors stock gyrated due to Tesla’s April Fool’s press release about a new wristwatch product.
6. Sometimes a post here gets a comment thread so rich it’s worth doubling back to see what other folks added. I think the recent geek-out on indexes is one such case. Good stuff was added by multiple smart people.
7. Finally, I’ve been banging the drum for electronic health records for a long time, arguing that the great difficulties should be solved due to the great benefits of doing so. The Hacker News/New York Times combo offers a good recent discussion of the subject.
Indexes are central to database management.
- My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
- … and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
- Recently, I’ve had numerous conversations in which indexing strategies played a central role.
Perhaps it’s time for a round-up post on indexing.
1. First, let’s review some basics. Classically:
- An index is a DBMS data structure that you probe to discover where to find the data you really want.
- Indexes make data retrieval much more selective and hence faster.
- While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
- Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)
- A DBMS or other system can index data it doesn’t control.
- This is common in the case of text indexing, and not just in public search engines like Google. Performance design might speak against recopying text documents. So might security.
- This capability overlaps with but isn’t exactly the same thing as an “external tables” feature in an RDBMS.
- Indexes can be updated in batch mode, rather than real time.
- Most famously, this is why Google invented MapReduce.
- Indeed, in cases where you index external data, it’s almost mandatory.
- Indexes written in real-time are often cleaned up in batch, or at least asynchronously with the writes.
- The most famous example is probably the rebalancing of B-trees.
- Append-only index writes call for later clean-up as well.
3. There are numerous short-request RDBMS indexing strategies, with various advantages and drawbacks. But better indexing, as a general rule, does not a major DBMS product make.
- The latest example is my former clients at Tokutek, who just got sold to Percona in a presumably small deal — regrettably without having yet paid me all the money I’m owed. (By the way, the press release for that acquisition highlights TokuDB’s advantages in compression much more than it mentions straight performance.)
- In a recent conversation with my clients at MemSQL, I basically heard from Nikita Shamgunov that:
- He felt that lockless indexes were essential to scale-out, and to that end …
- … he picked skip lists, not because they were the optimal lockless index, but because they were good enough and a lot easier to implement than the alternatives. (Edit: Actually, see Nikita’s comment below.)
- Red-black trees are said to be better than B-trees. But they come up so rarely that I don’t really understand how they work.
- solidDB did something cool with Patricia tries years ago. McObject and ScaleDB tried them too. Few people noticed or cared.
I’ll try to explain this paradox below.
4. The analytic RDBMS vendors who arose in the previous decade were generally index-averse. Netezza famously does not use indexes at all. Neither does Vertica, although the columns themselves played some of the role of indexes, especially give the flexibility in their sort orders. Others got by with much less indexing than was common in, for example, Oracle data warehouses.
Some of the reason was indexes’ drawbacks in terms of storage space and administrative overhead. Also, sequential scans can be much faster from spinning disk than more selective retrieval, so table scans often outperformed index-driven retrieval.
5. It is worth remembering that almost any data access method brings back more data than you really need, at least as an intermediate step. For starters, data is usually retrieved in whole pages, whether you need all their contents or not. But some indexing and index-alternative technologies go well beyond that.
- To avoid doing true full table scans, Netezza relies on “zone maps”. These are a prominent example of what is now often called data skipping.
- Bloom filters in essence hash data into a short string of bits. If there’s a hash collision, excess data is returned.
- Geospatial queries often want to return data for regions that have no simple representation in the database. So instead they bring back data for a superset of the desired region, which the DBMS does know how to return.
6. Geospatial indexing is actually one of the examples that gave me the urge to write this post. There are two main geospatial indexing strategies I hear about. One is the R-tree, which basically divides things up into rectangles, rectangles within those rectangles, rectangles within those smaller rectangles, and so on. A query initially brings back the data within a set of rectangles whose union contains the desired region; that intermediate result is then checked row by row for whether it belongs in the final result set.
The other main approach to geospatial indexing is the space-filling curve. The idea behind this form of geospatial indexing is roughly:
- For computational purposes, a geographic region is of course a lattice of points rather than a true 2-dimensional continuum.
- So you take a lattice — perhaps in the overall shape of a square — and arrange its points in a sequence, so that each point is adjacent in some way to its predecessor.
- Then regions on a plane are covered by subsequences (or unions of same).
The idea gets its name because, if you trace a path through the sequence of points, what you get is an approximation to a true space-filling curve.
7. And finally — mature DBMS use multiple indexing strategies. One of the best examples of a DBMS winning largely on the basis of its indexing approach is Sybase IQ, which popularized bitmap indexing. But when last I asked, some years ago, Sybase IQ actually used 9 different kinds of indexing. Oracle surely has yet more. This illustrates that different kinds of indexes are good in different use cases, which in turn suggests obvious reasons why clever indexing rarely gives a great competitive advantage.
I chatted with the MariaDB folks on Tuesday. Let me start by noting:
- MariaDB, the product, is a MySQL fork.
- MariaDB, product and company alike, are essentially a reaction to Oracle’s acquisition of MySQL. A lot of the key players are previously from MySQL.
- MariaDB, the company, is the former SkySQL …
- … which acquired or is the surviving entity of a merger with The Monty Program, which originated MariaDB. According to Wikipedia, something called the MariaDB Foundation is also in the mix.
- I get the impression SkySQL mainly provided services around MySQL, especially remote DBA.
- It appears that a lot of MariaDB’s technical differentiation going forward is planned to be in a companion product called MaxScale, which was released into Version 1.0 general availability earlier this year.
The numbers around MariaDB are a little vague. I was given the figure that there were ~500 customers total, but I couldn’t figure out what they were customers for. Remote DBA services? MariaDB support subscriptions? Something else? I presume there are some customers in each category, but I don’t know the mix. Other notes on MariaDB the company are:
- ~80 people in ~15 countries.
- 20-25 engineers, which hopefully doesn’t count a few field support people.
- “Tiny” headquarters in Helsinki.
- Business leadership growing in the US and especially the SF area.
MariaDB, the company, also has an OEM business. Part of their pitch is licensing for connectors — specifically LGPL — that hopefully gets around some of the legal headaches for MySQL engine suppliers.
MaxScale is a proxy, which starts out by intercepting and parsing MariaDB queries.
- As you might guess, MaxScale has a sharding story.
- All MaxScale sharding is transparent.
- Right now MaxScale sharding is “schema-based”, which I interpret to mean as different tables potentially being on different servers.
- Planned to come soon is “key-based” sharding, which I interpret to mean as the kind of sharding that lets you scale a table across multiple servers without the application needing to know that is happening.
- I didn’t ask about join performance when tables are key-sharded.
- MaxScale includes a firewall.
- MaxScale has 5 “well-defined” APIs, which were described as:
- I think MaxScale’s development schedule is “asynchronous” from that of the MariaDB product.
- Further, MaxScale has a “plug-in” architecture that is said to make it easy to extend.
- One plug-in on the roadmap is replication into Hadoop-based tables. (I think “into” is correct.)
I had trouble figuring out the differences between MariaDB’s free and enterprise editions. Specifically, I thought I heard that there were no feature differences, but I also thought I heard examples of feature differences. Further, there are third-party products included, but plans to replace some of those with in-house developed products in the future.
A few more notes:
- MariaDB’s optimizer is rewritten vs. MySQL.
- Like other vendors before it, MariaDB has gotten bored with its old version numbering scheme and jumped to 10.0.
- One of the storage engines MariaDB ships is TokuDB. Surprisingly, TokuDB’s most appreciated benefit seems to be compression, not performance.
- As an example of significant outside code contributions, MariaDB cites Google contributing whole-database encryption into what will be MariaDB 10.1.
- Online schema change is on the roadmap.
- There’s ~$20 million of venture capital in the backstory.
- Engineering is mainly in Germany, Eastern Europe, and the US.
- MariaDB Power8 performance is reportedly great (2X Intel Sandy Bridge or a little better). Power8 sales are mainly in Europe.
I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.
In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks:
- Reporting for regulatory compliance (financial or otherwise). The purpose of this is to follow rules.
- This is non-innovative almost by design.
- Somebody probably originally issued the regulations for a reason, so the reports may be useful for monitoring purposes. Failing that, they probably are supported by the same infrastructure that also tries to do useful monitoring.
- Data governance is crucial. Submitting incorrect data to regulators can have dire consequences. That said, when we hear about poor governance of poly-structured data, I question whether that data is being used in the applications where strong governance is actually needed.
- Other routine, monitoring-oriented business intelligence. The purpose can be general monitoring or general communication. Sometimes the purpose is lost to history entirely. This is generally lame, at least technically, unless interesting requirements are added.
- Displaying it on mobile devices makes it snazzier, and in some cases more convenient. Whoop-de-do.
- Usually what makes it interesting these days is the desire to actually explore the data and gain new insights. More on that below.
- BI for inherently non-tabular data is definitely an unsolved problem.
- Integration of BI with enterprise apps continues to be an interesting subject, but one I haven’t learned anything new about recently.
- All that said, this is an area for some of the most demanding classical data warehouse installations, specifically ones that are demanding along dimensions such as concurrency or schema complexity. (Recall that the most complicated data warehouses are often not the largest ones.) Data governance can be important here as well.
- Investigation by business analysts or line-of-business executives. Much of the action is here, not least because …
- … it’s something of a catch-all category.
- “Business analyst” is a flexible job description, and business analysts can have a variety of goals.
- Alleged line-of-business executives doing business-analyst work are commonly delegating it to fuller-time business analysts.
- These folks can probably manage departmental analytic RDBMS if they need to (that was one of Netezza’s early value propositions), but a Hadoop cluster stretches them. So easy deployment and administration stories — e.g. “Hadoop with less strain”/”Spark with less strain” — can have merit. This could be true even if there’s a separate team of data wranglers pre-processing data that the analysts will then work with.
- Further, when it comes to business intelligence:
- Tableau and its predecessors have set a high bar for quality of user interface.
- The non-tabular BI challenges are present in spades.
- ETL reduction/elimination (Extract/Transform/Load) is a major need.
- Predictive modeling by business analysts is problematic from beginning to end; much progress needs to be made here.
- … it’s something of a catch-all category.
- Investigation by data scientists. The “data scientist”/”business analyst” distinction is hardly precise. But for the purpose of this post, a business analyst may be presumed to excel at elementary mathematics — even stock analysts just use math at a high school level — and at using tabular databases, while data scientists (individuals or teams) have broader skill sets and address harder technical or mathematical problems.
- Rapid-response trouble-shooting. There are some folks — for example network operators — whose job includes monitoring things moment to moment and, when there’s a problem, reacting quickly.
- “Operationalization” of investigative results. This is a hot area, because doing something with insights — “insights” being a hot analytic buzzword these days — is more valuable than merely having them.
- This is where short-request kinds of data stores — NoSQL or otherwise — are often stressed, especially in the low-latency analytics they need to support.
- This is the big area for any kind of “closed loop” predictive modeling story, e.g. in experimentation.
- At least in theory, this is another big area for streaming.
And finally — across multiple kinds of user group and use case, there are some applications that will only be possible when sensors or other data sources improve.
Bottom line: Almost every interesting analytic technology problem is worth solving for some market, but please be careful about finding the right match.
- Where the innovation is (January, 2015)
- Various notes (November, 2014)
- “Freeing business analysts from IT” (August, 2014)
- Data integration as a business opportunity (July, 2014)
- Differentiation in BI usability (March, 2014)
- Analytic database distinctions (February, 2013)
- Juggling analytic databases (March, 2012)
- Applications of an analytic kind (February, 2012)
- Agile predictive analytics (November, 2011)
- Eight kinds of analytic database (July, 2011)
- Use cases for low-latency analytics (April, 2011)
- The three principle kinds of analytic business benefit (March, 2011)
My favorite educational video growing up, by far, was a 1960 film embedded below. I love it because it pranks its viewers, starting right in the opening scene. (Start at the 0:50 mark to see what I mean.)
If you’re ever in the position of helping a kid or young adult understand physics, this video could be a great help. Frankly, it could help in political discussions as well …
I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.
Here are some thoughts as to why, and also as to challenges that need to be overcome.
There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:
- When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, perhaps unless you’re content with the durability provided by an in-memory data grid.
- Queries are important. Also, they formally are present in other tasks, such as data transformation and movement. That’s why data manipulation packages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.
Trivial query routing or federation is … trivial.
- Databases have or can be given some kind of data catalog interface. Of course, this is easier for databases that are tabular, whether relational or MOLAP (Multidimensional OnLine Analytic Processing), but to some extent it can be done for anything.
- Combining the catalogs can be straightforward. So can routing queries through the system to the underlying data stores.
In fact, what I just described is Business Objects’ original innovation — the semantic layer — two decades ago.
Careless query routing or federation can be a performance nightmare. Do a full scan. Move all the data to some intermediate server that lacks capacity or optimization to process it quickly. Wait. Wait. Wait. Wait … hmmm, maybe this wasn’t the best data-architecture strategy.
Streaming goes well with federation. Some data just arrived, and you want to analyze it before it ever gets persisted. You want to analyze it in conjunction with data that’s been around longer. That’s a form of federation right there.
There are ways to navigate schema messes. Sometimes they work.
- Polishing one neat relational schema for all your data is exactly what people didn’t want to do when they decided to store a lot of the data non-relationally instead. Still, memorializing some schema after that fact may not be terribly painful.
- Even so, text search can help you navigate the data wilds. So can collaboration tools. Neither helps all the time, however.
Neither extreme view here — “It’s easy!” or “It will never work!” — seems right. Rather, I think there’s room for a lot of effort and differentiation in exposing cross-database schema information.
I’m leaving out one part of the story on purpose — how these data layers are going to be packaged, and specifically what other functionality they will be bundled with. Confidentially would screw up that part of the discussion; so also would my doubts as to whether some of those plans are fully baked yet. That said, there’s an aspect of logical data layer to CDAP, and to Kiji as well. And of course it’s central to BI (Business Intelligence) and ETL (Extract/Transform/Load) alike.
One way or another, I don’t think the subject of logical data layers is going away any time soon.
- Implicit in this post is the belief that enterprises should and do use many different data stores (June, 2014)
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:
- Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
- eBay has an HBase database largely on spinning disk, used to inform its search engine.
Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.
4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.
5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.
- It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
- An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
- Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.
In general, Cloudera suggested that HBase was used in a fair number of OEM situations.
6. Finally, I have one number: As of January, 2014 there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.
- A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.
Over the past couple years, there have been various quick comments and vague press releases about “BI for NoSQL”. I’ve had trouble, however, imagining what it could amount to that was particularly interesting, with my confusion boiling down to “Just what are you aggregating over what?” Recently I raised the subject with a few leading NoSQL companies. The result is that my confusion was expanded. Here’s the small amount that I have actually figured out.
As I noted in a recent post about data models, many databases — in particular SQL and NoSQL ones — can be viewed as collections of <name, value> pairs.
- In a relational database, a record is a collection of <name, value> pairs with a particular and predictable — i.e. derived from the table definition — sequence of names. Further, a record usually has an identifying key (commonly one of the first values).
- Something similar can be said about structured-document stores — i.e. JSON or XML — except that the sequence of names may not be consistent from one document to the next. Further, there’s commonly a hierarchical relationship among the names.
- For these purposes, a “wide-column” NoSQL store like Cassandra or HBase can be viewed much as a structured-document store, albeit with different performance optimizations and characteristics and a different flavor of DML (Data Manipulation Language).
Consequently, a NoSQL database can often be viewed as a table or a collection of tables, except that:
- The NoSQL database is likely to have more null values.
- The NoSQL database, in a naive translation toward relational, may have repeated values. So a less naive translation might require extra tables.
That’s all straightforward to deal with if you’re willing to write scripts to extract the NoSQL data and transform or aggregate it as needed. But things get tricky when you try to insist on some kind of point-and-click. And by the way, that last comment pertains to BI and ETL (Extract/Transform/Load) alike. Indeed, multiple people I talked with on this subject conflated BI and ETL, and they were probably right to do so.
Another set of issues arise on the performance side. Many NoSQL systems have indexes, and thus some kind of filtering capability. Some — e.g. MongoDB — have aggregation frameworks as well. So if you’re getting at the data with some combination of a BI tool, ETL tool or ODBC/JDBC drivers — are you leveraging the capabilities in place? Or are you doing the simplest and slowest thing, which is to suck data out en masse and operate on it somewhere else? Getting good answers to those questions is a work-in-progress at best.
Having established that NoSQL data structures cause problems for BI, let’s turn that around. Is there any way that they actually help? I want to say “NoSQL data often comes in hierarchies, and hierarchies are good for roll-up/drill-down.” But the hierarchies that describe NoSQL data aren’t necessarily the same kinds of hierarchies that are useful for BI aggregation, and I’m indeed skeptical as to how often those two categories overlap.
Hierarchies aside, I do think there are use cases for fundamentally non-tabular BI. For example, consider the following scenario, typically implemented with the help of NoSQL today:
- You have more data — presumably machine-generated — than you can afford to keep.
- So you keep time-sliced aggregates.
- You also keep selective details, namely ones that you identified when they streamed in as being interesting in some way.
Visualizing that properly would be very hard in a traditional tabularly-oriented BI tool. So it could end up with NoSQL-oriented BI tools running over NoSQL data stores. Event series BI done right also seems to be quite non-tabular. That said, I don’t know for sure about the actual data structures used under the best event series BI today.
And at that inconclusive point, I’ll stop for now. If you have something to add, please share it in the comments below, or hit me up as per my Contact link above.
I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:
- The closest thing to an HBase company, ala MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
- Cloudera still uses a figure of 20% of its customers being HBase-centric.
- HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
- With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.
- Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
- As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
- Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)
Cloudera’s views on HBase history — in response to the priorities I brought to the conversation — include:
- HBase initially favored consistency over performance/availability, while Cassandra initially favored the opposite choice. Both products, however, have subsequently become more tunable in those tradeoffs.
- Cloudera’s initial contributions to HBase focused on replication, disaster recovery and so on. I guess that could be summarized as “scaling”.
- Hortonworks’ early HBase contributions included (but were not necessarily limited to):
- Making recovery much faster (10s of seconds or less, rather than minutes or more).
- Some of that consistency vs. availability tuning.
- “Coprocessors” were added to HBase ~3 years ago, to add extensibility, with the first use being in security/permissions.
- With more typical marketing-oriented version numbers:
- HBase .90, the first release that did a good job on durability, could have been 1.0.
- HBase .92 and .94, which introduced coprocessors, could have been Version 2.
- HBase .96 and .98 could have been Version 3.
- The recent HBase 1.0 could have been 4.0.
The HBase roadmap includes:
- A kind of BLOB/CLOB (Binary/Character Large OBject) support.
- Intel is heavily involved in this feature.
- The initial limit is 10 megabytes or so, due to some limitations in the API (I didn’t ask why that made sense). This happens to be all the motivating Chinese customer needs for the traffic photographs it wants to store.
- Various kinds of “multi-tenancy” support (multi-tenancy is one of those terms whose meaning is getting stretched beyond recognition), including:
- Mixed workload support (short-request and analytic) on the same nodes.
- Mixed workload support on different nodes in the same cluster.
- Security between different apps in the same cluster.
- (Still in the design phase) Bottleneck Whack-A-Mole, with goals including but not limited to:
- Scale-out beyond the current assumed limit of ~1200 nodes.
- More predictable performance, based on smaller partition sizes.
- (Possibly) Multi-data-center fail-over.
Not on the HBase roadmap per se are global/secondary indexes. Rather, we talked about projects on top of HBase which are meant to provide those. One is Apache Phoenix, which supposedly:
- Makes it simple to manage compound keys. (E.g., City/State/ZipCode)
- Provides global secondary indexes (but not in a fully ACID way).
- Offers some very basic JOIN support.
- Provides a JDBC interface.
- Offers efficiencies in storage utilization, scan optimizations, and aggregate calculations.
Another such project is Trafodion — supposedly the Welsh word for “transaction” — open sourced by HP. This seems to be based on NonStop SQL and Neoview code, which counter-intuitively have always been joined at the hip.
There was a lot more to the conversation, but I’ll stop here for two reasons:
- This post is pretty long already.
- I’m reserving some of the discussion until after I’ve chatted with vendors of other NoSQL systems.
- My July 2011 post on HBase offers context, as do the comments on it.
I found yesterday’s news quite unpleasant.
- A guy I knew and had a brief rivalry with in high school died of colon cancer, a disease that I’m at high risk for myself.
- GigaOm, in my opinion the best tech publication — at least for my interests — shut down.
- The sex discrimination trial around Kleiner Perkins is undermining some people I thought well of.
So I want to unclutter my mind a bit. Here goes.
1. There are a couple of stories involving Sam Simon and me that are too juvenile to tell on myself, even now. But I’ll say that I ran for senior class president, in a high school where the main way to campaign was via a single large poster, against a guy with enough cartoon-drawing talent to be one of the creators of the Simpsons. Oops.
2. If one suffers from ulcerative colitis as my mother did, one is at high risk of getting colon cancer, as she also did. Mine isn’t as bad as hers was, due to better tolerance for medication controlling the disease. Still, I’ve already had a double-digit number of colonoscopies in my life. They’re not fun. I need another one soon; in fact, I canceled one due to the blizzards.
Pro-tip — never, ever have a colonoscopy without some kind of anesthesia or sedation. Besides the unpleasantness, the lack of meds increases the risk that the colonoscopy will tear you open and make things worse. I learned that the hard way in New York in the early 1980s.
3. Five years ago I wrote optimistically about the evolution of the information ecosystem, specifically using the example of the IT sector. One could argue that I was right. After all:
- Gartner still seems to be going strong.
- O’Reilly, Gartner and vendors probably combine to produce enough good conferences.
- A few traditional journalists still do good work (in the areas covered by this blog Doug Henschen comes to mind).
- A few vendor folks are talented and responsible enough to add to the discussion. A few small-operation folks — e.g. me — are still around.
Still, the GigaOm news is not encouraging.
4. As TechCrunch and Pando reported, plaintiff Ellen Pao took the stand and sounded convincing in her sexual harassment suit against Kleiner Perkins (but of course she hadn’t been cross-examined yet). Apparently there was a major men-only party hosted by partner Al Gore, a candidate I first supported in 1988. And partner Ray Lane, somebody who at Oracle showed tremendous management effectiveness, evidently didn’t do much to deal with Pao’s situation.
At some point I want to write about a few women who were prominent in my part of the tech industry in the 1980s — at least Ann Winblad, Esther Dyson, and Sandy Kurtzig, maybe analyst/investment banker folks Cristina Morgan and Ruthann Quindlen as well. We’ve come a long way since those days (when, in particular, I could briefly list a significant fraction of the important women in the industry). There seems to be a lot further yet to go.
5. All that said — I’m indeed working on some cool stuff. Some is evident from recent posts. Other may be reflected in an upcoming set of posts that focus on NoSQL, business intelligence, and — I hope — the intersection of the two areas.
6. Speaking of recent posts, I did one on marketing for young companies that brings a lot of advice and tips together. I think it’s close to being a must-read.
- Continuuity toured in 2012 and touted its “app server for Hadoop” technology.
- Continuuity recently changed its name to Cask and went open source.
- Cask’s product is now called CDAP (Cask Data Application Platform). It’s still basically an app server for Hadoop and other “big data” — ouch do I hate that phrase — data stores.
- Cask and Cloudera partnered.
- I got a more technical Cask briefing this week.
- App servers are a notoriously amorphous technology. The focus of how they’re used can change greatly every couple of years.
- Partly for that reason, I was unimpressed by Continuuity’s original hype-filled positioning.
So far as I can tell:
- Cask’s current focus is to orchestrate job flows, with lots of data mappings.
- This is supposed to provide lots of developer benefits, for fairly obvious reasons. Those are pitched in terms of an integration story, more in a “free you from the mess of a many-part stack” sense than strictly in terms of data integration.
- CDAP already has a GUI to monitor what’s going on. A GUI to specify workflows is coming very soon.
- CDAP doesn’t consume a lot of cycles itself, and hence isn’t a real risk for unpleasant overhead, if “overhead” is narrowly defined. Rather, performance drags could come from …
- … sub-optimal choices in data mapping, database design or workflow composition.
I’d didn’t push the competition point hard (the call was generally a bit rushed due to a hard stop on my side), but:
- Cask thinks it doesn’t have much in the way of exact or head-to-head competitors, but cites Spring and WibiData/Kiji as coming closest.
- I’d think that data integration vendors who use Hadoop as an execution engine (Informatica, Syncsort and many more) would be in the mix as well.
- Cask disclaimed competition with Teradata Revelytix, on the theory that Cask is focused on operational/”real-time” use cases, while Revelytix Loom is focused on data science/investigative analytics.
To reiterate part of that last bullet — like much else we’re hearing about these days, CDAP is focused on operational apps, perhaps with a streaming aspect.
To some extent CDAP can be viewed as restoring the programmer/DBA distinction to the non-SQL world and streaming worlds. That is:
- Somebody creates a data mapping “pattern”.
- Programmers (including perhaps the creator) write to that pattern.
- Somebody (perhaps the creator) tweaks the mapping to optimize performance, or to reflect changes in the underlying data management.
Further notes on CDAP data access include:
- Cask is proud that a pattern can literally be remapped from one data store to another, although I wonder how often that is likely to happen in practice.
- Also, a single “row” can reference multiple data stores.
- Cask’s demo focused on imposing a schema on a log file, something you might do incrementally as you decide to extract another field of information. This is similar to major use cases for schema-on-need and for Splunk.
- For most SQL-like access and operations, CDAP relies on Hive, even to external data stores or non-tabular data. Cask is working with Cloudera on Impala access.
Examples of things that Cask supposedly makes easy include:
- Chunking streaming data by time (e.g. 1 minute buckets).
- Generating database stats (histograms and so on).
Tidbits as to how Cask perceives or CDAP plays with other technologies include:
- Kafka is hot.
- Spark Streaming is hot enough to be on the CDAP roadmap.
- Cask believes that its administrative tools don’t conflict with Cloudera Manager or Ambari, because they’re more specific to an application, job or dataset.
- CDAP is built on Twill, which is a thread-like abstraction over YARN that Cask contributed to Apache. Mesos is in the picture as well, as a YARN alternative.
- Cask is seeing some interest in Flink. (Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded.)
Cask has ~40 people, multiple millions of dollars in trailing revenue, and — naturally — high expectations for future growth. I neglected, however, to ask how that revenue was split between subscription, professional services and miscellaneous. Cask expects to finish 2015 with a healthy two-digit number of customers.
Cask’s customers seem concentrated in usual-suspect internet-related sectors, although Cask gave it a bit of an enterprise-y spin by specifically citing SaaS (Software as a Service) and telecom. When I asked who else seems to be a user or interested based on mailing list activity, Cask mentioned a lot of financial services and some health care as well.
- Cask doesn’t have the obvious .com URL.
I’m on record as believing that:
- Hadoop needs a memory-centric storage grid.
- Tachyon is a strong candidate to fill the role.
- It’s an open secret that there will be a Tachyon company. However, …
- … no details have been publicized. Indeed, the open secret itself is still officially secret.
- Tachyon technology, which just hit 0.6 a couple of days ago, still lacks many features I regard as essential.
- As a practical matter, most Tachyon interest to date has been associated with Spark. This makes perfect sense given Tachyon’s origin and initial technical focus.
- Tachyon was in 50 or more sites last year. Most of these sites were probably just experimenting with it. However …
- … there are production Tachyon clusters with >100 nodes.
As a reminder of Tachyon basics:
- You do I/O with Tachyon in memory.
- Tachyon data can optionally be persisted.
- That “tiered storage” capability — including SSDs — was just introduced in 0.6. So in particular …
- … it’s very primitive and limited at the moment.
- I’ve heard it said that Intel was a big contributor to tiered storage/SSD support. (Solid-State Drives.)
- Tachyon has some ability to understand “lineage” in the Spark sense of term. (In essence, that amounts to knowing what operations created a set of data, and potentially replaying them.)
Beyond that, I get the impressions:
- Synchronous write-through from Tachyon to persistent storage is extremely primitive right now — but even so I am told it is being used in production by multiple companies already.
- Asynchronous write-through, relying on lineage tracking to recreate any data that gets lost, is slightly further along.
- One benefit of adding Tachyon to your Spark installation is a reduction in garbage collection issues.
And with that I have little more to say than my bottom lines:
- If you’re writing your own caching layer for some project you should seriously consider adapting Tachyon instead.
- If you’re using Spark you should seriously consider using Tachyon as well.
- I think Tachyon will be a big deal, but it’s far too early to be sure.
I chatted last night with Ion Stoica, CEO of my client Databricks, for an update both on his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:
- Databricks Cloud is:
- Currently running on Amazon only.
- Not dependent on Hadoop.
- Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
- Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
- Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
- Databricks Cloud was initially focused on ad-hoc use. A few days ago the capability was added to schedule jobs and so on.
- Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
- Databricks Cloud customers are concentrated (but not unanimously so) in the usual-suspect internet-centric business sectors.
- The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
- The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.
I do not expect all of the above to remain true as Databricks Cloud matures.
Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as:
- 15 certified distributions.
- ~40 certified applications.
- 2000 people trained last year by Databricks alone.
Please note that certification of a Spark distribution is a free service from Databricks, and amounts to checking that the API works against a test harness. Speaking of certification, Ion basically agrees with my views on ODP, although like many — most? — people he expresses himself more politely than I do.
We talked briefly about several aspects of Spark or related projects. One was DataFrames. Per Databricks:
In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
I gather this is modeled on Python pandas, and extends an earlier Spark capability for RDDs (Resilient Distributed Datasets) to carry around metadata that was tantamount to a schema.
SparkR is also on the rise, although it has the usual parallel R story to the effect:
- You can partition data, run arbitrary R on every partition, and aggregate the results.
- A handful of algorithms are truly parallel.
So of course is Spark Streaming. And then there are Spark Packages, which are — and I’m speaking loosely here — a kind of user-defined function.
- Thankfully, Ion did not give me the usual hype about how a public repository of user-created algorithms is a Great Big Deal.
- Ion did point out that providing an easy way for people to publish their own algorithms is a lot easier than evaluating every candidate contribution to the Spark project itself.
I’ll stop here. However, I have a couple of other Spark-related posts in the research pipeline.
7-10 years ago, I repeatedly argued the viewpoints:
- Relational DBMS were the right choice in most cases.
- Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
- There were a variety of specialized use cases in which non-relational data models were best.
Since then, however:
- Hadoop has flourished.
- NoSQL has flourished.
- Graph DBMS have matured somewhat.
- Much of the action has shifted to machine-generated data, of which there are many kinds.
So it’s probably best to revisit all that in a somewhat organized way.
To make the subject somewhat manageable, I’ll focus on fielded data — i.e. data that represents values of something — rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of <name, value> pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.
Important distinctions include:
- Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be space-savingly implicit as well. In other logs, XML streams, JSON streams and so on they are explicit.
- If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data.
- In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
- Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
- Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.
Some major data models can be put into a fairly strict ordering of query desirability by noting:
- The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology with in many cases great query performance.
- The next-best thing to query is another kind of data store with known field names. In such data stores:
- SQL or SQL-like SELECTs will still work, or can easily be made to do.
- Useful indexing systems can be grafted on to them (although they are typically less mature than in RDBMS).
- In the (mainly) future, perhaps JOINs can be grafted on as well.
- The worst thing to query is a data store in which you only have a schema on read. You have to do work to make the thing queryable in the first place
Unsurprisingly, that ordering is reversed when it comes to writing data.
- The easiest thing to write to is a data store with no structure.
- Next-easiest is to write to a data store that lets you make up the structure as you go along.
- The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed, notably:
- Implicit field names, governed by metadata.
- Unique field names within any one record.
- The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.
And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:
- Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
- I have in the past also advocated for a mid-range — i.e. lighter-weight — general purpose RDBMS.
- SAP really, really wants you to use HANA to run SAP’s apps.
- You might want an in-memory RDBMS (MemSQL) or a particularly cloudy one or whatever.
- Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
- They’re all immature right now, with advantages over each other.
- The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
- Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.
- You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system.
- Splunk is cool.
- Use cases for various other kinds of data stores can often be found.
- Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.
And finally, I think in-memory data grids:
- Will be widely used and important.
- Will be used to instantiate multiple data models at once.
- One reason for writing this post was for some deck-clearing before I revisit the white-hot topic of data streaming. (October, 2014)
- I’ve long mused about the challenges of getting by without joins. (November, 2010)
- In 2013 I observed that data models will be in perpetual, rapid flux.
- In 2013 I also discussed attempts to combine multiple data models (or access methods) in a single DBMS.
- I surveyed data models and access methods back in 2008.
While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:
- Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
- In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
- That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
- The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.
Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.
Further, the low-cost alternatives for analytic RDBMS are adding up.
- Amazon Redshift has considerable traction.
- Hadoop (even just with Hive) has offloaded a lot of ELT (Extract/Load/Transform) from analytic RDBMS such as Teradata.
- Now Greenplum is in the mix as well.
For many analytic RDBMS use cases, at least one of those three will be an appealing possibility.
By no means do I want to suggest those are the only alternatives.
- Smaller-vendor offerings, such as CitusDB or Infobright, may well be competitive too.
- Larger vendors can always slash price in specific deals.
- MonetDB is still around.
But the three possibilities I cited first should suffice as proof for almost all enterprises that, for most use cases not requiring high concurrency, analytic RDBMS need not cost an arm and a leg.
- Greenplum revenue at EMC was problematic from the get-go.
Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:
- An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
- A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
- An excuse for press releases.
- A source of an extra logo graphic to put on marketing slides.
Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).
The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?
Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:
- Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
- MapR’s flavor, as software from MapR.
- Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.
In saying that, I’m glossing over a few points, such as:
- There are various remote services that run Hadoop, most famously Amazon’s Elastic MapReduce.
- You could get Apache Hadoop directly, rather than using the free or paid versions of a vendor distro. But why would you make that choice, unless you’re an internet bad-ass on the level of Facebook, or at least think that you are?
- There will surely always be some proprietary stuff mixed into, for example, IBM’s BigInsights, so as to preserve at least the perception of all-important vendor lock-in.
But the main point stands — big computer companies, such as IBM, EMC (Pivotal) and previously Intel, are figuring out that they can’t bigfoot something that started out as an elephant — stuffed or otherwise — in the first place.
If you think I’m not taking this whole ODP thing very seriously, you’re right.
- It’s a bit eyebrow-raising to see Mike Olson take a “more open source than thou” stance about something, but basically his post about this news is spot-on.
- My take on Hadoop distributions two years ago might offer context. Trivia question: What’s the connection between the song that begins that post and the joke that ends it?
- Question: Why do policemen work in pairs?
- Answer: One to read and one to write.
A lot has happened in MongoDB technology over the past year. For starters:
- The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
- 7-10X improvement in write performance.
- No change in read performance (which however was boosted in MongoDB 2.6).
- ~70% reduction in data size due to compression (disk only).
- ~50% reduction in index size due to compression (disk and memory both).
- MongoDB has been adding administration modules.
- A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
- An on-premise version came out with 3.0.
- They have similar features, but are expected to grow apart from each other over time. They have different names.
*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.
To forestall confusion, let me quickly add:
- MongoDB acquired the WiredTiger product and company, and continues to sell the product on a standalone basis, as well as bundling a version into MongoDB. This could cause confusion because …
- … the standalone version of WiredTiger has numerous capabilities that are not in the bundled MongoDB storage engine.
- There’s some ambiguity as to when MongoDB first “ships” a feature, in that …
- … code goes to open source with an earlier version number than it goes into the packaged product.
I should also clarify that the addition of WiredTiger is really two different events:
- MongoDB added the ability to have multiple plug-compatible storage engines. Depending on how one counts, MongoDB now ships two or three engines:
- Its legacy engine, now called MMAP v1 (for “Memory Map”). MMAP continues to be enhanced.
- The WiredTiger engine.
- A “please don’t put this immature thing into production yet” memory-only engine.
- WiredTiger is now the particular storage engine MongoDB recommends for most use cases.
I’m not aware of any other storage engines using this architecture at this time. In particular, last I heard TokuMX was not an example. (Edit: Actually, see Tim Callaghan’s comment below.)
Most of the issues in MongoDB write performance have revolved around locking, the story on which is approximately:
- Until MongoDB 2.2, locks were held at the process level. (One MongoDB process can control multiple databases.)
- As of MongoDB 2.2, locks were held at the database level, and some sanity was added as to how long they would last.
- As of MongoDB 3.0, MMAP locks are held at the collection level.
- WiredTiger locks are held at the document level. Thus MongoDB 3.0 with WiredTiger breaks what was previously a huge write performance bottleneck.
In understanding that, I found it helpful to do a partial review of what “documents” and so on in MongoDB really are.
- A MongoDB document is somewhat like a record, except that it can be more like what in a relational database would be all the records that define a business object, across dozens or hundreds of tables.*
- A MongoDB collection is somewhat like a table, although the documents that comprise it do not need to each have the same structure.
- MongoDB documents want to be capped at 16 MB in size. If you need one bigger, there’s a special capability called GridFS to break it into lots of little pieces (default = 1KB) while treating it as a single document logically.
*One consequence — MongoDB’s single-document ACID guarantees aren’t quite as lame as single-record ACID guarantees would be in an RDBMS.
By the way:
- Row-level locking was a hugely important feature in RDBMS about 20 years ago. Sybase’s lack of it is a big part of what doomed them to second-tier status.
- Going forward, MongoDB has made the unsurprising marketing decision to talk about “locks” as little as possible, relying instead on alternate terms such as “concurrency control”.
Since its replication mechanism is transparent to the storage engine, MongoDB allows one to use different storage engines for different replicas of data. Reasons one might want to do this include:
- Fastest persistent writes (WiredTiger engine).
- Fastest reads (wholly in-memory engine).
- Migration from one engine to another.
- Integration with some other data store. (Imagine, for example, a future storage engine that works over HDFS. It probably wouldn’t have top performance, but it might make Hadoop integration easier.)
In theory one can even do a bit of information lifecycle management (ILM), by using different storage engines for different subsets of database, by:
- Pinning specific shards of data to specific servers.
- Using different storage engines on those different servers.
That said, similar stories have long been told about MySQL, and I’m not aware of many users who run multiple storage engines side by side.
The MongoDB WiredTiger option is shipping with a couple of options for block-level compression (plus prefix compression that is being used for indexes only). The full WiredTiger product also has some forms of columnar compression for data.
One other feature in MongoDB 3.0 is the ability to have 50 replicas of data (the previous figure was 12). MongoDB can’t think of a great reason to have more than 3 replicas per data center or more than 2 replicas per metropolitan area, but some customers want to replicate data to numerous locations around the world.
- I occasionally post a few notes about MongoDB use cases, e.g. last May.