WibiData is essentially on the trajectory:
- Started with platform-ish technology.
- Selling analytic application subsystems, focused for now on personalization.
- Hopeful of selling complete analytic applications in the future.
The same, it turns out, is true of Causata.* Talking with them both the same day led me to write this post.
*Differences between the companies include:
- WibiData started out with some serious HBase/Hadoop technology, whereas …
- … Causata just changed its underpinnings to HBase/Hadoop …
- … after hiring new, application-oriented leadership.
I know WibiData (client since they had <10 employees) much better than Causata (one conversation ever).
The problem for those vendors and other analytic application aspirants is that it is very hard to offer a complete analytic application. In particular:
- Suppose they want to offer a great solution for, say, website personalization.* It’s hard to do that without offering something that creates complete websites — specifically, complete unique websites. Whoops.
- OK, let’s suppose they solve that problem, drawing a clean line between the personalization and creative parts. Then is it really enough for them to just personalize websites? Shouldn’t they also personalize email? Mobile ads? In-store offers? Shouldn’t that all be tied to campaign design? And by the way, they need the capacity to incorporate almost any kind of data you can imagine, while applying any kind of modeling algorithm that can offer differentiated results.
- On the other hand, suppose they only deliver the common analytic subsystems for various functions? How do they sell that? How do they even demo it? Are they at the mercy of “last mile functionality” partners?
*There are various semantic issues as to whether the correct word is “personalization”, “customization”, etc. In this post, I’m ignoring them.
My proposed answer starts:
- Even though it’s impractical to offer across-the-board, full-featured, full-suite, highly competitive analytic applications …
- … offer something that purports to be a complete analytic app anyway.
Maybe the “complete” app is, from the customer’s standpoint, at least a “good start”. Maybe you really can deliver an awesome application for a narrow area of functionality — and the customer adopts it with confidence, knowing that she can integrate the core technology into a broader suite if she wants to.
As I’m telling the story, the real differentiation is apt to be in the subsystem, not in the finished app. So for a sanity check, let’s consider when that might be the case. Examples that come to mind include:
- Small-/mid-market, vertical-market BI. The best example of this may be Google Analytics, for website owners and administrators — but that’s most famous as a free product. Perhaps there are also examples in more conventional enterprise-adoption scenarios. (PivotLink for retailers? I’m not sure how mature their application functionality really is.)
- Any of the four scenarios I outlined in my post on third-party analytics. One notable example is stock quote services such as Bloomberg. But that’s really an information-selling business much more than an analytic-functionality one.
- Price-setting analytics — Zilliant, Vendavo, and so on. Those outfits indeed seem to focus on application fit-and-finish as much as on price optimization expertise. But I’d guess that the most successful companies in that market are still in the 10s of millions of dollars in annual revenue; for example, Zilliant recently boasted of its 100th customer.
I don’t think any of those cases are sufficient to undermine my conclusions, namely:
- Making a big business from “complete” analytic applications will in most cases require some heretofore undiscovered insights or conceptual breakthroughs (business model or technology as the case may be).
- Analytic application subsystems are where most of the near-term opportunity lies.
- It will likely be wise to offer “complete” analytic applications even so.
Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.
Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:
- Every level of storage (disk, RAM, etc.).
- Indexes, aggregates and raw data alike.
To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to:
- Big blocks of data compress better, and can also be faster to retrieve than a number of smaller blocks holding the same amount of data. Small blocks of data can be less wasteful to write. And different kinds of storage have different minimum block sizes.
- Operating on compressed data offers multiple significant efficiencies. But you have to spend cycles (de)compressing it, and it’s only practical for some compression schemes.
- Fixed-length tabular records can let you compute addresses rather than looking them up in indexes. Yay! But they also waste space.
- Tokenization can help with the fixed-/variable-length tradeoff (see the toy sketch below).
- Pointers are wonderfully efficient for some queries, at least if you’re not using spinning disk. But they can create considerable overhead to write and update.
- Indexes, materialized views, etc. speed query performance, but can be costly to write and maintain.
- Storing something as a BLOB (Binary Large OBject), key-value payload, etc. is super-fast — but if you want to look at it, you usually have to pay for retrieving the whole thing.
What’s more, different data layouts can have different implications for logging, locking, replication, backup and more.
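To make the tokenization and operate-on-encoded-data points concrete, here is a toy Python sketch of dictionary (token) encoding for one string column. It is purely illustrative; the column values and function names are invented. Variable-length strings become fixed-width integer codes, and an equality filter runs against the codes without ever reconstituting the strings.

```python
def dictionary_encode(values):
    """Return (dictionary, encoded), where encoded holds fixed-width integer tokens."""
    dictionary = {}
    encoded = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        encoded.append(dictionary[v])
    return dictionary, encoded

def filter_equals(dictionary, encoded, literal):
    """Evaluate `column = literal` directly on the encoded column."""
    token = dictionary.get(literal)
    if token is None:                    # the literal never occurs; no rows qualify
        return []
    return [i for i, t in enumerate(encoded) if t == token]

cities = ["Boston", "Austin", "Boston", "Seattle", "Boston"]
d, enc = dictionary_encode(cities)
print(enc)                               # [0, 1, 0, 2, 0] -- fixed-width tokens
print(filter_equals(d, enc, "Boston"))   # row positions [0, 2, 4]
```

Real columnar engines do this per block, pick among multiple encodings, and push far more of the query down to the encoded form, but the basic write-cheap-versus-read-fast tension is the same.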
So what would happen if somebody tried to bundle all conceivable functionality into a single DBMS, with a plan to optimize the layout of any particular part of the database as appropriate? I think the outcome would be tears – for the development effort would be huge, while the benefits would be scanty. The most optimistic cost estimates could run in the 100s of millions of dollars, with more realistic ones adding a further order of magnitude. But no matter what the investment, the architects would be on the horns of a nasty dilemma:
- If there’s much commonality among the component DBMS, each one would be sub-optimal.
- If there’s little commonality among them, then there’s also little benefit to the combination.
Adding insult to injury, all the generality would make it hard to select optimum hardware for this glorious DBMS — unless, of course, a whole other level of development effort made it work well across very heterogeneous clusters.
Less megalomaniacally, there have been many attempts to combine two or more alternate data layouts in a single DBMS, with varying degrees of success. In the relational-first world:
- Analytic DBMS have combined row and column data models so fluidly that I’ve made fun of Oracle for not being able to pull it off. SAP HANA sort of does the same thing, but perhaps with a columnar bias, and not just for analytics.
- Relational DBMS can also have a variety of index types, suitable for different relational use cases. This is especially true for analytic uses of general-purpose RDBMS.
- Oracle, DB2, PostgreSQL, and Informix have had full extensibility architectures since the 1990s. That said:
- Almost all the extensions come from the DBMS vendors themselves.
- Extensions that resemble (or are) a tabular datatype — for example geospatial or financial-date — are often technically well-regarded.
- Others are usually not so strong technically, but in a few cases sell well anyway (e.g. Oracle Text).
- While Microsoft never went to the trouble of offering full extensibility, the SQL Server story is otherwise similar.
- Sybase’s extensibility projects went badly in the 1990s, and Sybase doesn’t seem to have tried hard in that area since.
- IBM DB2, Microsoft SQL Server, and Oracle added XML capabilities around the middle of the last decade.
- Analytic platforms can wind up with all sorts of temporary data structures.
- Analytic DBMS have various ways to reach out and touch Hadoop.
- Non-relational DBMS commonly have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for MongoDB.
- Hadoop is growing into what is in effect a family of DBMS and other data stores — generic HDFS, HBase, generic Hive, Impala, and so on. At the moment, however, none of them is very mature. BDAS/Spark/Shark ups the ante further, but of course that’s less mature yet.
- Hadapt combines Hadoop and PostgreSQL.
- DataStax combines Cassandra, Hadoop, and Solr.
- Akiban fondly thinks its data layouts are well-suited for relational tables, JSON, and XML alike. (But business at Akiban may be in flux.)
- GenieDB (Version 1 only) and NuoDB are both implemented over key-value stores. GenieDB Version 2 is implemented over Berkeley DB or MySQL.
- Membase/Couchbase was first implemented over SQLite, then over (a forked version of) CouchDB.
1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).
Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it be all things for all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.
At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.
3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.
4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.
5. I talked recently with Ashish Thusoo of Qubole. Qubole’s initial offering is a Hive-in-the-cloud, started by the guys who invented Hive. Qubole’s coolest new technical feature vs. generic Hive seems to be a disk-based columnar cache that lives with the servers, to help “smooth over the jitters” between Amazon EC2 and S3. Qubole company basics include:
- Founded last year.
- 15 early adopters, generally from mid-sized internet companies. Some of the adopters are already paying.
- 12 employees.
6. In my recent When I am a VC Overlord post, I wrote:
4. I will not fund any software whose primary feature is that it is implemented in the “cloud” or via “SaaS”. A me-too product on a different platform is still a me-too product.
5. I will not fund any pitch that emphasizes the word “elastic”. Elastic is an important feature of underwear and pajamas, but even in those domains it does not provide differentiation.
Cloud/SaaS deployments give you a chance at providing superior ease of use/installation/administration, without compromising functionality — but they don’t automatically guarantee it. It’s hard work to make your customers’ lives easier.*
*This is the second consecutive post in which I’ve used a similar line. I’ll try to stop now. What’s really scary is that I was inspired by the old Frank Perdue ad “It takes a tough man to make a tender chicken.”
7. Ofir Manor of EMC is skeptical about Oracle’s claims for Hybrid Columnar Compression. But he didn’t really dig up that much dirt, except that he seems to think 10X compression is more of a ceiling than the floor that Oracle marketing suggests it is. The money quote is:
Oracle used to provide 3x compression, now it provides 10x compression, so no wonder the best references customers are seeing about 3.4x savings…
That 3X is from Oracle’s Basic Compression, which seems to be a block-level dictionary scheme.
Code generation is most beneficial for queries that execute simple expressions and for which the interpretation overhead is most pronounced. For example, a query that is doing a regular expression match over each row is not going to benefit from code generation much, because the interpretation overhead is low compared to the regex processing time.
Code generation may end up like compression — an architectural feature that DBMS just obviously should have.
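As a toy illustration of that point (not of any particular vendor's implementation; the expression, rows, and helper names are invented), here is a Python sketch contrasting an interpreter that re-walks a small expression tree for every row with a function "generated" from the tree once:

```python
import timeit

# Expression: price * quantity > 1000, as a tiny operator tree.
expr = (">", ("*", "price", "quantity"), 1000)

def interpret(node, row):
    """Re-walk the tree for every row -- the per-row interpretation overhead."""
    if isinstance(node, tuple):
        op, left, right = node
        l, r = interpret(left, row), interpret(right, row)
        if op == "*":
            return l * r
        if op == ">":
            return l > r
        raise ValueError(op)
    return row[node] if isinstance(node, str) else node

def generate(node):
    """Translate the tree to source once, then compile it."""
    def emit(n):
        if isinstance(n, tuple):
            op, left, right = n
            return f"({emit(left)} {op} {emit(right)})"
        return f"row[{n!r}]" if isinstance(n, str) else repr(n)
    return eval(compile(f"lambda row: {emit(node)}", "<generated>", "eval"))

rows = [{"price": p, "quantity": q} for p in range(200) for q in range(200)]
compiled = generate(expr)
print("interpreted:", timeit.timeit(lambda: [interpret(expr, r) for r in rows], number=5))
print("generated:  ", timeit.timeit(lambda: [compiled(r) for r in rows], number=5))
```

The same sketch illustrates the caveat: if each row's work were dominated by, say, a regular-expression match, compiling away the tree walk would not buy much.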
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start:
- There are many possibilities for the “right” way to manage analytic data. Generally, these are not the same as the “right” way to write the data, as that choice needs to be optimized for user experience (including performance), reliability, and of course cost.
- I.e., it is usually best to move data from where you write it to where you (at least in part) analyze it. (A toy sketch of that movement follows this list.)
- Vendors who suggest they have a complete solution for getting data ready to be analyzed are … optimists.
- This specifically includes “magic data stores”, such as fast analytic RDBMS (on which I’m very bullish) or in-memory analytic DBMS (about which I’m more skeptical). They’re great starting points, but they’re not the whole enchilada.
- There are many ways to help with preparing data for analysis. Some of them are well-served by the industry. Some, however, are not.
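To make the "move it to where you analyze it" point concrete, here is a deliberately trivial Python sketch. One SQLite database stands in for the operational store, a second for the analytic copy; table names, columns, and the "transformation" are all invented for illustration.

```python
import sqlite3

oltp = sqlite3.connect(":memory:")   # stand-in for the store the application writes to
dw = sqlite3.connect(":memory:")     # stand-in for the store you analyze

oltp.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "acme", 1250), (2, "acme", 300), (3, "zenith", 9900)])

dw.execute("CREATE TABLE orders_analytic (customer TEXT, amount_dollars REAL)")

# Extract from the write-optimized store, transform lightly, load into the analytic store.
for customer, amount_cents in oltp.execute("SELECT customer, amount_cents FROM orders"):
    dw.execute("INSERT INTO orders_analytic VALUES (?, ?)", (customer, amount_cents / 100.0))

print(dw.execute(
    "SELECT customer, SUM(amount_dollars) FROM orders_analytic GROUP BY customer"
).fetchall())
# e.g. [('acme', 15.5), ('zenith', 99.0)]
```

Everything difficult in real life happens in the "transform lightly" step, which is exactly the part the optimists above tend to gloss over.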
1. There are many terms for all this. I once titled a post “Data that is derived, augmented, enhanced, adjusted, or cooked”. “Data munging” and “data wrangling” are in the mix too. And I’ve heard the term data preparation used several different ways.
2. Microsoft told me last week that the leading paid-for data products in their data-for-sale business are for data cleaning. (I.e., authoritative data to help with the matching/cleaning of both physical and email addresses.) Salesforce.com/data.com told me something similar a while back. This underscores the importance of data cleaning/data quality, and more generally of master data management.
Yes, I just said that data cleaning is part of master data management. Not coincidentally, I buy into the view that MDM is an attitude and a process, not just a specific technology.
3. Everybody knows that Hadoop usage involves long-ish workflows, in which data keeps getting massaged and written back to the data store. But that point is not as central to how people think about Hadoop as it probably should be.
4. One thing people have no trouble recalling is that Hadoop is a great place to dump stuff and get it out later. Depending on exactly what you have in mind, there are various metaphors for this, most of which have something to do with liquids. Most famous is “big bit bucket”, but also used have been “data refinery”, “data lake”, and “data reservoir”.
5. For years, DBMS and Hadoop vendors have bundled low-end text analytics capabilities rather than costlier state-of-the-art ones. I think that may be changing, however, mainly in the form of Attensity partnerships.
Truth be told, I’m not wholly current on text mining vendors — but when I last was, Attensity was indeed the best choice for such partnerships. And I’m not aware of any subsequent developments that would change that conclusion.
- Merv Adrian’s contrast between Hadoop and data integration tours some of the components of ETL suites. (February, 2013)
- Part of why analytic applications are usually incomplete is the set of issues discussed in this post.
- De-anonymization is an important — albeit privacy-threatening — way of making data more analyzable. (January, 2011)
- I updated my thoughts on Gartner’s Logical Data Warehouse concept earlier this month.
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down.
How big is your database? How big is your budget?
Taken together, these questions tell you which choices are even feasible. Does your database fit into RAM, at a price you can afford? Does it fit onto a single, perhaps large, server? If both answers are “No”, then you need a real scale-out system, querying disk or flash (which itself could be hard to afford). Otherwise, you have more options.
Note that database compression has a big influence on what fits where.
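For a hypothetical worked example (sizes, ratios, and thresholds invented, not benchmarks):

```python
raw_tb = 20.0                                # raw user data, in terabytes (invented)
for compression_ratio in (1, 3, 10):
    stored_tb = raw_tb / compression_ratio
    fits_in_big_ram = stored_tb <= 2.0       # assume ~2 TB of affordable RAM
    fits_on_one_server = stored_tb <= 10.0   # assume ~10 TB of local disk/flash
    print(f"{compression_ratio}x -> {stored_tb:.1f} TB  "
          f"RAM-only: {fits_in_big_ram}  single server: {fits_on_one_server}")
# 1x  -> 20.0 TB: real scale-out territory
# 3x  ->  6.7 TB: a single large server becomes plausible; RAM-only does not
# 10x ->  2.0 TB: even an in-memory configuration starts to look imaginable
```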
How do you feel about appliances?
Depending on considerations such as database size, the choice of Oracle, Teradata, IBM Netezza, or Microsoft SQL Server may mandate or at least strongly suggest an appliance form factor. For most other analytic DBMS, an appliance is more optional. Are appliances good for you? Bad? Indifferent? Trade-offs include:
- Appliances often involve paying a premium for hardware purchase and/or support.
- Appliances often are easy(ier) to install and manage.
- Appliances are easier to upgrade in some ways (everything’s integrated), but harder in others (less ability to upgrade bottlenecked parts).
- Appliances often don’t play well in the cloud.
How do you feel about the cloud?
Analytic DBMS run better on good hardware and predictable bandwidth (hence all those appliances). These can be hard to find in the cloud. So, not coincidentally, can cloud references for analytic DBMS, although most vendors can muster a few.
If you feel you need to run your analytic RDBMS in the cloud now, check references carefully. If you are only concerned about the cloud as some indefinite future possibility, then you might want to rule out a few appliance-only vendors, but otherwise you probably shouldn’t worry. Cloud hardware and networking are getting better, and RDBMS software vendors are gaining experience in cloud deployments.
What are the size and shape of your workload?
Different analytic databases can have very different kinds of workloads. Tasks include:
- Complex, long-running queries.
- Repetitive reports of varying degrees of complexity.
- Simple queries.
- Large, scheduled loads.
- Continuous or near-continuous/micro-batch loads.
The big issue is — how many of each kind of task need to be performed concurrently, and in what combinations? If you’re refreshing 10,000 dashboards, several hundred of which might be getting drill-down queries at once, while trying to do a few scan-heavy queries in the background and some 15-way joins, most analytic DBMS might disappoint you. (Indeed, I’d ask whether you might want to split up that work among two or more systems.) Different DBMS — and different hardware/storage/networking configurations — shine in different scenarios.
How fresh does the data need to be?
Any serious analytic DBMS can be loaded daily or hourly, edge cases perhaps excepted. In most cases 15 minute intervals work as well, or even 5, but check whether those load latencies would interfere with any performance optimizations. But if you want sub-second data freshness, or even several-second — well, that has to be a top-tier architectural issue.
If your analytics are simple enough, it’s appealing to do the immediate-response ones straight from your transactional database. If not, you may need some kind of streaming-replication setup. Usually, I wind up recommending replication approaches that don’t yet have a lot of maturity or references. Tread carefully here.
Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations
To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.
Gartner seems confused about Kognitio’s products and history alike.
- Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
- Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
- Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
- Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
- Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.
Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.
In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:
- I think HP’s troubles are less relevant to HP Vertica than Gartner does.
- In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
- Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
- Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
- Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.
That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.
Two years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.
*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated
What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:
- ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
- Gartner dings ParAccel for a variety of product weaknesses.
- Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
- Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)
That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.
I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in sales (also low-priced ones), and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.
*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.
I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.
- Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
- Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
- Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
- Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
- Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofarah, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
- Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
- Gartner’s view of Exasol seems similar to mine.
- I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
- Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
- Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)
The 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems is out. I’ll split my comments into two posts — this one on concepts, and a companion on specific vendor evaluations.
- Maintaining working links to Gartner Magic Quadrants is an adventure. But as of early February, 2013, this link seems live.
- I also commented on the 2011, 2010, 2009, 2008, 2007, and 2006 Gartner Magic Quadrants for Data Warehouse DBMS.
Let’s start by again noting that I regard Gartner Magic Quadrants as a bad use of good research. On the facts:
- Gartner collects a lot of input from traditional enterprises. I envy that resource.
- Gartner also does a good job of rounding up vendor claims about user base sizes and the like. If nothing else, you should skim the MQ report for that reason.
- Gartner observations about product feature sets are usually correct, although not so consistently that they should be relied on.
When it comes to evaluations, however, the Gartner Data Warehouse DBMS Magic Quadrant doesn’t do as well. My concerns (which overlap) start:
- The Gartner MQ conflates many different use cases into one ranking (inevitable in this kind of work, but still regrettable).
- A number of the MQ vendor evaluations seem hard to defend. So do some of Gartner’s specific comments.
- Some of Gartner’s criteria seemingly amount to “parrots back our opinions to us”.
- Gartner thinks, as do I, that a vendor’s business and financial strength are important. But Gartner overdoes the matter, drilling down into picky issues it can’t hope to judge, such as assessing a vendor’s “ability to generate and develop leads.” *
- The 2012 Gartner Data Warehouse DBMS Magic Quadrant is closer to being a 1-dimensional ranking than 2-dimensional, in that entries are clustered along the line x=y. This suggests strong correlation among the results on various specific evaluation criteria.
*I may focus more on marketing communications strategy than the whole Gartner database research team combined — but the only way I’d know whether Teradata’s lead gen is better than HP Vertica’s or vice-versa would be if both vendors happened to raise the matter during consulting sessions.
Specific product feature areas Gartner seems to emphasize include:
- Alignment with a “logical data warehouse” strategy.
- Analytic platform features.
- Administrative tools, including workload management.
- “Self-tuning” performance.
- Scale-out capabilities.
Most of this makes sense. But Gartner has been talking about the “logical data warehouse” for a long time without ever seeming to firm up what it is, as evidenced for example by some dueling summaries of the concept. So let’s drill down on the LDW.
I think “logical data warehouse” will wind up like “master data management” — i.e., it will be a goal and a business process, aided but not subsumed by some characteristic software. Beyond that, I’d say that generic, functional, high-performance data federation* software is a pipedream — building it would be as hard as building the mythical single DBMS that gives great functionality and performance, in all use cases, for all kinds of data. Just as DBMS need to be at least somewhat specialized in purpose, data federation software needs to be as well.
*While I disapprove, data virtualization seems to be the term that will win for describing data federation.
When Gartner refers to the “logical data warehouse” capabilities of analytic RDBMS — and the first sentence of the MQ report indeed specifies that the subject is “relational database management systems” — it seems to be looking for two things:
- Built-in data federation/query routing capabilities; i.e., specific features that help the DBMS interoperate with other data stores. But there seems to be little reference to relational federation/external tables (which many vendors support) or text federation (which vendors with built-in search can support, although that would mainly mean Oracle, and its search is slow). Rather, this part of LDW is currently all about Hadoop interoperability, with bonus points for mentioning HCatalog.
- Management of multi-structured data. But with limited exceptions, nobody’s doing that well in an analytic RDBMS. And even when they do, that’s pretty much the opposite of the federation that the rest of the logical data warehouse concept seems to be about.
For those and other reasons, referring to the “logical data warehouse” features of an analytic RDBMS is problematic. I imagine Gartner will keep working at the “logical data warehouse” concept until it is more successfully fleshed out. But little weight should be placed on Gartner’s LDW-feature-evaluations of analytic RDBMS at this time.
In typical debates, the extremists on both sides are wrong. “SQL vs. NoSQL” is an example of that rule. For many traditional categories of database or application, it is reasonable to say:
- Relational databases are usually still a good default assumption …
- … but increasingly often, the default should be overridden with a more useful alternative.
Reasons to abandon SQL in any given area usually start:
- Creating a traditional relational schema is possible …
- … but it’s tedious or difficult …
- … especially since schema design is supposed to be done before you start coding.
Some would further say that NoSQL is cheaper, scales better, is cooler or whatever, but given the range of NewSQL alternatives, those claims are often overstated.
Sectors where these reasons kick in include but are not limited to:
- Retailing, especially online. Different kinds of products have different kinds of attributes, making a Grand Cosmic Schema rather complex. (A toy sketch of the contrast appears below.) Examples I’ve blogged about include:
- Amazon relied on an in-memory object-oriented DBMS for its used books inventory lookup back in 2005.
- A Microsoft customer managed book and DVD inventory in XML the same year.
- More recently, 10gen spoke of a wireless telco offering cell phones and service plans in the same product catalog, built over MongoDB.
- Human resources. Employee-centric applications are naturally full of hierarchies, which can be annoying to flatten. Non-relational approaches I’ve blogged about include Workday’s object model and Neo4j’s graph-based contribution.
- Web log analysis. Web logs can be particularly hard to flatten, as per my post on (that sense of) nested data structures.
- More generally, marketing and other applications that maintain detailed profiles of customers or prospects. The information in these profiles is often based on a large variety of marketing campaigns, third-party databases, and analytic exercises. As the inputs pile up, the schemas get ever hairier.
- Electronic medical records. Medical records are one area where non-relational approaches may actually have majority share. I blogged about one example in 2008.
Or to quote a 2008 post,
Conor O’Mahony, marketing manager for IBM’s DB2 pureXML, talks a lot about one of my favorite hobbyhorses — schema flexibility* — as a reason to use an XML data model. In a number of industries he sees use cases based around ongoing change in the information being managed:
- Tax authorities change their rules and forms every year, but don’t want to do total rewrites of their electronic submission and processing software.
- The financial services industry keeps inventing new products, which don’t just have different terms and conditions, but may also have different kinds of terms and conditions.
- The same, to some extent, goes for the travel industry, which also keeps adding different kinds of offers and destinations.
- The energy industry keeps adding new kinds of highly complex equipment it has to manage.
Conor also thinks market evidence shows that XML’s schema flexibility is important for data interchange. For example, hospitals (especially in the US) have disparate medical records and billing systems, which can make information interchange a chore.
*I now call that dynamic schemas.
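To make the retailing example concrete, here is a toy sketch (product names and attributes invented) of why catalog items resist a single relational schema: each document carries whatever attributes its category needs, attributes a Grand Cosmic Schema would have to anticipate as columns.

```python
# Two catalog items with mostly disjoint attributes, stored as documents.
catalog = [
    {"sku": "PH-100", "kind": "phone",
     "screen_inches": 6.1, "storage_gb": 128, "plans": ["prepaid", "family"]},
    {"sku": "BK-200", "kind": "book",
     "author": "A. Writer", "pages": 384, "format": "paperback"},
]

# A document-style query simply ignores attributes that don't apply.
big_screen_phones = [
    item["sku"] for item in catalog
    if item.get("kind") == "phone" and item.get("screen_inches", 0) >= 6.0
]
print(big_screen_phones)   # ['PH-100']

# The relational alternatives: one very wide, mostly-NULL table; a table per
# category; or an attribute/value table that needs self-joins to query.
```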
So, for fear of Frankenschemas, should we flee from RDBMS altogether? Hardly. For social proof, please note:
- Every application area I’ve cited can be and often is handled via relational techniques.
- Some of the non-relational alternatives I’ve mentioned, such as XML or object-oriented DBMS, haven’t enjoyed a lot of traction.
- Even the most successful NoSQL vendors are tiny when compared to the relational behemoths.
More conceptually, I’d say that the advantages of a relational DBMS start:
- In theory and practice alike, the advantages of normalization and joins.
- In theory and practice alike, the advantages of loose coupling between your database design and your application. (I think that’s a cleaner way of saying it than to focus on “reusing” the database, but it amounts to the same thing.)
- In practice, performance and functionality in anything using indexes, even if joins aren’t involved.
- In practice, maturity and functionality in general.
Those aren’t chopped liver.
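As a reminder of what the first two bullets above buy you, here is a minimal sqlite3 sketch (schema and data invented): customer attributes live in one place, orders in another, and a join plus an index answer a cross-cutting question without the database design having been tied to any one application's access path.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE INDEX orders_by_customer ON orders (customer_id);
""")
db.executemany("INSERT INTO customers VALUES (?, ?, ?)",
               [(1, "Acme", "East"), (2, "Zenith", "West")])
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(10, 1, 125.0), (11, 1, 30.0), (12, 2, 990.0)])

# Each fact is stored once (normalization); the join answers a question
# neither table answers alone.
print(db.execute("""
    SELECT c.region, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.region
""").fetchall())
# e.g. [('East', 155.0), ('West', 990.0)]
```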