
DBMS2

Choices in data management and analysis

Multi-model database managers

Mon, 2015-08-24 02:07

I’d say:

  • Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
  • Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
  • Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.

Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:

Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.

Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.

Merv: We are clearly in violent agreement on that one.

Around the same time I suggested that InterSystems Caché was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification — although the buzzwords of course aren’t from me — namely:

Caché supports #SQL, #NoSQL. Interchange across tables, hierarchical, document storage.

Along the way, I was reminded that some of the marketing claims around “multi-model” are absurd. For example, at the time I am writing this, the Wikipedia article on “multi-model database” claims that “The first multi-model database was OrientDB, created in 2010…” In fact, however, by the definitions used in that article, multi-model DBMS date back to the 1980s, when relational functionality was grafted onto pre-relational systems such as TOTAL and IDMS.

What’s more, since the 1990s, multi-model functionality has been downright common, specifically in major products such as Oracle, DB2 and Informix, not to mention PostgreSQL. (But not so much Microsoft or Sybase.) Indeed, there was significant SQL standards work done around datatype extensions, especially in the contexts of SQL/MM and SQL3.

I tackled this all in 2013, when I argued:

Developments since then have been in line with my thoughts. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products have progressed in a more deliberate way.

What’s new in all this is a growing desire to re-integrate short-request and analytic processing — hence Gartner’s new-ish buzzword of HTAP (Hybrid Transactional/Analytic Processing). The more sensible reasons for this trend are:

  • Operational applications have always needed to accept immediate writes. (Losing data is bad.)
  • Operational applications have always needed to serve small query result sets based on the freshest data. (If you write something into a database, you might need to immediately retrieve it to finish the business operation.)
  • It is increasingly common for predictive decisions to be made at similar speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
  • The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
  • … most such analysis should look at historical data as well.
  • Streaming technology is supplying ever more fresh data.

But here’s the catch — the best models for writing data are the worst for reading it, and vice-versa, because you want to write data as a lightly-structured document or log, but read it from a Ted-Codd-approved RDBMS or MOLAP system. And if you don’t have the time to move data among multiple stores, then you want one store to do a decent job of imitating both kinds of architecture. The interesting new developments in multi-model data management will largely be focused on that need.
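
To make that write-versus-read tension concrete, here is a purely illustrative Python sketch, with invented field names and no particular product in mind: business events are appended to a log as lightly-structured documents, and a separate step reshapes them into normalized, query-friendly tables.

```python
import json

# Write-optimized: append each business event as one lightly-structured document.
event_log = []

def record_order(order):
    event_log.append(json.dumps(order))  # one cheap sequential write, no joins needed

# Read-optimized: periodically reshape the log into normalized, query-friendly tables.
def build_tables(log):
    orders, order_lines = [], []
    for raw in log:
        doc = json.loads(raw)
        orders.append((doc["order_id"], doc["customer"], doc["ts"]))
        for line in doc["lines"]:
            order_lines.append((doc["order_id"], line["sku"], line["qty"]))
    return orders, order_lines

record_order({"order_id": 1, "customer": "acme", "ts": "2015-08-24T02:07:00",
              "lines": [{"sku": "widget", "qty": 3}, {"sku": "gadget", "qty": 1}]})
orders, order_lines = build_tables(event_log)
print(orders)       # [(1, 'acme', '2015-08-24T02:07:00')]
print(order_lines)  # [(1, 'widget', 3), (1, 'gadget', 1)]
```

A store that serves both workloads well has to blur that boundary internally, which is exactly the pressure described above.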

Related links

  • The two-policemen joke seems ever more relevant.
  • My April, 2015 post on indexing technology reminds us that one DBMS can do multiple things.
  • Back in 2009 integrating OLTP and data warehousing was clearly a bad idea.

Data messes

Mon, 2015-08-03 03:58

A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.

To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — :) — mine.)

Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:

  • Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
  • Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)

Inconsistency can take multiple forms, including: 

  • Variant names.
  • Variant spellings.
  • Variant data structures (not to mention datatypes, formats, etc.).

Addressing the first two is the province of master data management (MDM), and also of the same data cleaning technologies that might help with outright errors. Addressing the third is the province of other data integration technology, which also may be what’s needed to break down the barriers between data silos.
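
As a deliberately naive illustration of the variant-name and variant-spelling problem, here is a Python sketch; the company names are hypothetical, and real MDM and data cleaning tools use far richer matching than this.

```python
import re

def canonical(name):
    """Crude name normalization: lowercase, strip punctuation and corporate suffixes."""
    name = re.sub(r"[.,]", "", name.lower().strip())
    name = re.sub(r"\b(incorporated|inc|corporation|corp|co|ltd|llc)\b", "", name)
    return re.sub(r"\s+", " ", name).strip()

variants = ["Acme Corp.", "ACME Corporation", "Acme, Inc.", "acme inc"]
print({v: canonical(v) for v in variants})
# All four variants collapse to 'acme', so lookups and JOINs on the canonical form
# succeed where JOINs on the raw strings would fail.
```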

So far I’ve been assuming that data is neatly arranged in fields in some kind of database. But suppose it’s in documents or videos or something? Well, then there’s a needed step of data enhancement; even when that’s done, further data integration issues are likely to be present.

All of the above issues occur with analytic data too. In some cases it probably makes sense not to fix them until the data is shipped over for analysis. In other cases, it should be fixed earlier, but isn’t. And in hybrid cases, data is explicitly shipped to an operational data warehouse where the problems are presumably fixed.

Further, some problems are much greater in their analytic guise. Harmonization and integration among data silos are likely to be much more intense. (What is one table for analytic purposes might be many different ones operationally, for reasons that might span geography, time period, or application legacy.) Addressing those issues is the province of data integration technologies old and new. Also, data transformation and enhancement are likely to be much bigger deals in the analytic sphere, in part because of poly-structured internet data. Many Hadoop and now Spark use cases address exactly those needs.

Let’s now consider missing data. In operational cases, there are three main kinds of missing data:

  • Missing values, as a special case of inaccuracy.
  • Data that was only collected over certain time periods, as a special case of changing data structure.
  • Data that hasn’t been derived yet, as the main case of a need for data enhancement.

All of those cases can ripple through to cause analytic headaches. But for certain inherently analytic data sets — e.g. a weblog or similar stream — the problem can be even worse. The data source might stop functioning, or might change the format in which it transmits; but with no immediate operations compromised, it might take a while to even notice. I don’t know of any technology that does a good, simple job of addressing these problems, but I am advising one startup that plans to try.

Further analytics-mainly data messes can be found in three broad areas:

  • Problems caused by new or changing data sources hit much faster in analytics than in operations, because analytics draws on a greater variety of data.
  • Event recognition, in which most of a super-high-volume stream is discarded while the “good stuff” is kept, is more commonly a problem in analytics than in pure operations. (That said, it may arise on the boundary of operations and analytics, namely in “real-time” monitoring.)
  • Analytics has major problems with data scavenger hunts, in which business analysts and data scientists don’t know what data is available for them to examine.

That last area is the domain of a lot of analytics innovation. In particular:

  • It’s central to the dubious Gartner concept of a Logical Data Warehouse, and to the more modest logical data layers I advocate as an alternative.
  • It’s been part of BI since the introduction of Business Objects’ “semantic layer”. (See, for example, my recent post on Zoomdata.)
  • It’s a big part of the story of startups such as Alation or Tamr.
  • In a failed effort, it was part of Greenplum’s pitch some years back, as an aspect of the “enterprise data cloud”.
  • It led to some of the earliest differentiated features at Gooddata.
  • It’s implicit in some BI collaboration stories, in some BI/search integration, and in ClearStory’s “Data You May Like”.

Finally, suppose we return to the case of operational data, assumed to be accurately stored in fielded databases, with sufficient data integration technologies in place. There’s still a whole other kind of possible mess than those I cited above — applications may not be doing a good job of understanding and using it. I could write a whole series of posts on that subject alone … but it’s going slowly. :) So I’ll leave that subject area for another time.


SaaS and traditional software from the same vendor?

Mon, 2015-07-20 03:09

It is extremely difficult to succeed with SaaS (Software as a Service) and packaged software in the same company. There were a few vendors who seemed to pull it off in the 1970s and 1980s, generally industry-specific application suite vendors. But it’s hard to think of more recent examples — unless you have more confidence than I do in what behemoth software vendors say about their SaaS/”cloud” businesses.

Despite the cautionary evidence, I’m going to argue that SaaS and software can and often should be combined. The “should” part is pretty obvious, with reasons that start:

  • Some customers are clearly better off with SaaS. (E.g., for simplicity.)
  • Some customers are clearly better off with on-premises software. (E.g., to protect data privacy.)
  • On-premises customers want to know they have a path to the cloud.
  • Off-premises customers want the possibility of leaving their SaaS vendor’s servers.
  • SaaS can be great for testing, learning or otherwise adopting software that will eventually be operated in-house.
  • Marketing and sales efforts for SaaS and packaged versions can be synergistic.
    • The basic value proposition, competitive differentiation, etc. should be the same, irrespective of delivery details.
    • In some cases, SaaS can be the lower cost/lower commitment option, while packaged product can be the high end or upsell.
    • An ideal sales force has both inside/low-end and bag-carrying/high-end components.

But the “how” of combining SaaS and traditional software is harder. Let’s review why. 

Why is it hard for one vendor to succeed at both packaged software and SaaS?

SaaS and packaged software have quite different development priorities and processes. SaaS vendors deliver and support software that:

  • Runs on a single technology stack.
  • Is run only at one or a small number of physical locations.
  • Is run only in one or a small number of historical versions.
  • May be upgraded multiple times per month.
  • Can be assumed to be operated by employees of the SaaS company.
  • Needs, for customer acquisition and retention reasons, to be very easy for users to learn.

But traditional packaged software:

  • Runs on technology the customer provides and supports, at the location of the customer’s choice.
  • Runs in whichever versions customers have not yet upgraded from.
  • Should — to preserve the sanity of all concerned — have only a few releases per year.
  • Is likely to be operated by less knowledgeable or focused staff than a SaaS vendor enjoys.
  • Can sometimes afford more of an end-user learning curve than SaaS.

Thus, in most cases:

  • Traditional software creates greater support and compatibility burdens than SaaS does.
  • SaaS and on-premises software have very different release cycles.
  • SaaS should be easier for end-users than most traditional software, but …
  • … traditional software should be easier to administer than SaaS.

Further — although this is one difference that I think has at times been overemphasized — SaaS vendors would prefer to operate truly multi-tenant versions of their software, while enterprises less often have that need.

How this hard thing could be done

Most of the major problems with combining SaaS and packaged software efforts can be summarized in two words — defocused development. Even if the features are substantially identical, SaaS is developed on different schedules and for different platform stacks than packaged software is.

So can we design an approach to minimize that problem? I think yes. In simplest terms, I suggest:

  • A main development organization focused almost purely on SaaS.
  • A separate unit adapting the SaaS code for on-premises customers, with changes to the SaaS offering being concentrated in three aspects:
    • Release cadence.
    • Platform support.
    • Administration features, which are returned to the SaaS group for its own optional use.

Certain restrictions would need to be placed on the main development unit. Above all, because the SaaS version will be continually “thrown over the wall” to the sibling packaged-product group, code must be modular and documentation must be useful. The standard excuses — valid or otherwise — for compromising on these virtues cannot be tolerated.

There is one other potentially annoying gotcha. Hopefully, the SaaS group uses third-party products and lots of them; that’s commonly better than reinventing the wheel. But in this plan they need to use ones that are also available for third-party/OEM kinds of licensing.

My thoughts on release cadence start:

  • There should be a simple, predictable release cycle:
    • N releases per year, for N approximately = 4.
    • Strong efforts to adhere to a predictable release schedule.
  • A reasonable expectation is that what’s shipped and supported for on-premises use is 6-9 months behind what’s running on the SaaS service. 3-6 months would be harder to achieve.

The effect would be that on-premises software would lag SaaS features to a predictable and bounded extent.

As for platform support:

  • You have to stand ready to install and support whatever is needed. (E.g., in the conversation that triggered this post, the list started with Hadoop, Spark, and Tachyon.)
  • You have to adapt to customers’ own reasonably-current installations of needed components (but help them upgrade if they’re way out of date).
  • Writing connectors is OK. Outright porting from your main stack to another may be unwise.
  • Yes, this is all likely to involve significant professional services, at least to start with, because different customers will require different degrees of adaptation.

That last point is key. The primary SaaS offering can be standard, in the usual way. But the secondary business — on-premises software — is inherently services-heavy. Fortunately, packaged software and professional services can be successfully combined.

And with that I’ll just stop and reiterate my conclusion:

It may be advisable to offer both SaaS and services-heavy packaged software as two options for substantially the same product line.

Related link

  • Point #4 of my VC overlord post is relevant  — and Point #3 even more so. :)

Zoomdata and the Vs

Tue, 2015-07-07 17:23

Let’s start with some terminology biases:

So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.

It turns out that:

  • Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
  • Zoomdata depends heavily on Spark.
  • Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if looking at streaming data you might want to also check history. This addresses velocity.
  • Zoomdata assumes data can be in a variety of data stores, including:
    • Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
    • Files (generic HDFS — Hadoop Distributed File System — or S3).*
    • NoSQL (MongoDB and HBase were mentioned).
    • Search (Elasticsearch was mentioned among others).
  • Zoomdata also tries to detect data variability.
  • Zoomdata is OEM/embedding-friendly.

*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.

Core aspects of Zoomdata’s technical strategy include: 

  • QlikView/Tableau-style navigation, at least up to a point. (I hope that vendors with a much longer track record have more nuances in their UIs.)
  • Suitable UI for wholly or partially “real-time” data. In particular:
    • Time is an easy dimension to get along the X-axis.
    • You can select current or historical regions from the same graph, aka “data rewind”.
  • Federated query with some predicate pushdown, aka “data fusion”.
    • Data filtering and some GroupBys are pushed down to the underlying data stores — SQL or NoSQL — when it makes sense.*
    • Pushing down joins (assuming that both sides of the join are from the same data store) is a roadmap item.
  • Approximate query results, aka “data sharpening”. Zoomdata simulates high-speed query by first serving you approximate query results, ala Datameer.
  • Spark to finish up queries. Anything that isn’t pushed down to the underlying data store is probably happening in Spark DataFrames.
  • Spark for other kinds of calculations.

*Apparently it doesn’t make sense in some major operational/general-purpose — as opposed to analytic — RDBMS. From those systems, Zoomdata may actually extract and pre-cube data.

The technology story for “data sharpening” starts:

  • Zoomdata more-or-less samples the underlying data, and returns a result just for the sample. Since this is a small query, it resolves quickly.
  • More precisely, there’s a sequence of approximations, with results based on ever larger samples, until eventually the whole query is answered.
  • Zoomdata has a couple of roadmap items for making these approximations more accurate:
    • The integration of BlinkDB with Spark will hopefully result in actual error bars for the approximations.
    • Zoomdata is working itself on how to avoid sample skew.
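
To illustrate the general shape of that sequence of approximations, here is a toy Python sketch of progressive sampling. It is a sketch of the generic technique, not Zoomdata’s implementation; the data and sample fractions are invented.

```python
import random

# Hypothetical data: (region, revenue) rows. In a real system each pass would be a
# micro-query against the underlying store, not a scan of an in-memory list.
rows = [("east" if random.random() < 0.5 else "west", random.uniform(10, 100))
        for _ in range(100_000)]

def sharpen(rows, fractions=(0.001, 0.01, 0.1, 1.0)):
    """Yield ever-sharper estimates of SUM(revenue) GROUP BY region."""
    for frac in fractions:
        sample = rows if frac == 1.0 else random.sample(rows, int(len(rows) * frac))
        totals = {}
        for region, revenue in sample:
            totals[region] = totals.get(region, 0.0) + revenue
        # Scale the sampled totals up to estimate the totals over all rows.
        yield frac, {region: total / frac for region, total in totals.items()}

for frac, estimate in sharpen(rows):
    print(frac, estimate)
```

The first pass touches 0.1% of the rows and so returns almost immediately; each later pass refines the estimate until the exact answer arrives.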

The point of data sharpening, besides simply giving immediate gratification, is that hopefully the results for even a small sample will be enough for the user to determine:

  • Where in particular she wants to drill down.
  • Whether she asked the right query in the first place. :)

I like this early drilldown story for a couple of reasons:

  • I think it matches the way a lot of people work. First you get to the query of the right general structure; then you refine the parameters.
  • It’s good for exact-results performance too. Most of what otherwise might have been a long-running query may not need to happen at all.

Aka “Honey, I shrunk the query!”

Zoomdata’s query execution strategy depends heavily on doing lots of “micro-queries” and unioning their result sets. In particular:

  • Data sharpening relies on a bunch of data-subset queries of increasing size.
  • Streaming/”real-time” BI is built from a bunch of sub-queries restricted to small time slices each.

Even for not-so-micro queries, Zoomdata may find itself doing a lot of unioning, as data from different time periods may be in different stores.
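
A minimal sketch of that micro-query-and-union pattern follows; the helper names and the split of data across two stores are invented for illustration.

```python
from datetime import datetime, timedelta

def query_slice(store, start, end):
    # Stand-in for a query pushed down to whichever store holds this time range.
    return [row for row in store if start <= row["ts"] < end]

def micro_query_union(stores, start, end, slice_width=timedelta(minutes=1)):
    """Decompose [start, end) into small time slices, query each store per slice, union the results."""
    results, t = [], start
    while t < end:
        for store in stores:
            results.extend(query_slice(store, t, min(t + slice_width, end)))
        t += slice_width
    return results

# Toy data split across a "streaming" store and a "historical" store by time period.
recent = [{"ts": datetime(2015, 7, 7, 17, m), "value": m} for m in range(30, 60)]
historical = [{"ts": datetime(2015, 7, 7, 17, m), "value": m} for m in range(0, 30)]
rows = micro_query_union([recent, historical],
                         datetime(2015, 7, 7, 17, 0), datetime(2015, 7, 7, 18, 0))
print(len(rows))  # 60: the union of the per-slice, per-store result sets
```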

Architectural choices in support of all this include:

  • Zoomdata ships with Spark, but can and probably in most cases should be pointed at an external Spark cluster instead. One point is that Zoomdata itself scales by user count, while the Spark cluster scales by data volume.
  • Zoomdata uses MongoDB off to the side as a metadata store. Except for what’s in that store, Zoomdata seems to be able to load balance rather statelessly. And Zoomdata doesn’t think that the MongoDB store is a bottleneck either.
  • Zoomdata uses Docker.
  • Zoomdata is starting to use Mesos.

When a young company has good ideas, it’s natural to wonder how established or mature this all is. Well:

  • Zoomdata has 86 employees.
  • Zoomdata has (production) customers, success stories, and so on, but can’t yet talk fluently about many production use cases.
  • If we recall that companies don’t always get to do (all) their own positioning, it’s fair to say that Zoomdata started out as “Cloudera’s cheap-option BI buddy”, but I don’t think that’s an accurate characterization at this point.
  • Zoomdata, like almost all young companies in the history of BI, favors a “land-and-expand” adoption strategy. Indeed …
  • … Zoomdata tells prospects it wants to be an additional BI provider to them, rather than rip-and-replacement.

As for technological maturity:

  • Zoomdata’s view of data seems essentially tabular, notwithstanding its facility with streams and NoSQL. It doesn’t seem to have tackled much in the way of event series analytics yet.
  • One of Zoomdata’s success stories is iPad-centric. (Salesperson visits prospect and shows her an informative chart; prospect opens wallet; ka-ching.) So I presume mobile BI is working.
  • Zoomdata is comfortable handling 10s of millions of rows of data, may be strained when handling 100s of millions of rows, and has been tested in-house up to 1 billion rows. But that’s data that lands in Spark. The underlying data being filtered can be much larger, and Zoomdata indeed cites one example of a >40 TB Impala database.
  • When I asked about concurrency, Zoomdata told me of in-house testing, not actual production users.
  • Zoomdata’s list when asked what they don’t do (except through partners, of which they have a bunch) was:
    • Data wrangling.
    • ETL (Extract/Transform/Load).
    • Data transformation. (In a market segment with a lot of Hadoop and Spark, that’s not really redundant with the previous bullet point.)
    • Data cataloguing, ala Alation or Tamr.
    • Machine learning.

Related link

  • I wrote about multiple kinds of approximate query result capabilities, Zoomdata-like or otherwise, back in July, 2012.

“Chilling effects” revisited

Sun, 2015-06-14 18:55

In which I observe that Tim Cook and the EFF, while thankfully on the right track, haven’t gone nearly far enough.

Traditionally, the term “chilling effect” referred specifically to inhibitions on what in the US are regarded as First Amendment rights — the freedoms of speech, the press, and in some cases public assembly. Similarly, when the term “chilling effect” is used in a surveillance/privacy context, it usually refers to the fear that what you write or post online can later be held against you. This concern has been expressed by, among others, Tim Cook of Apple, Laura Poitras, and the Electronic Frontier Foundation, and several research studies have supported the point.

But that’s only part of the story. As I wrote in July, 2013,

… with the new data collection and analytic technologies, pretty much ANY action could have legal or financial consequences. And so, unless something is done, “big data” privacy-invading technologies can have a chilling effect on almost anything you want to do in life.

The reason, in simplest terms, is that your interests could be held against you. For example, models can estimate your future health, your propensity for risky hobbies, or your likelihood of changing your residence, career, or spouse. Any of these insights could be useful to employers or financial services firms, and not in a way that redounds to your benefit. And if you think enterprises (or governments) would never go that far, please consider an argument from the sequel to my first “chilling effects” post:

What makes these dangers so great is the confluence of two sets of factors:

  • Some basic facts of human nature and organizational behavior — policies and procedures are biased against risk of “bad” outcomes, because people (and organizations) fear (being caught) making mistakes.
  • Technological developments that make ever more precise judgments as to what constitutes risk, or deviation from “proven-safe” profiles.

A few people have figured at least some of these dangers out. ACLU policy analyst Jay Stanley got there before I did, as did a pair of European Law and Economics researchers. Natasha Lomas of TechCrunch seems to get it. But overall, the chilling effects discussion — although I’m thrilled that it’s gotten even this far — remains much too narrow.

In a tough economy, will the day come that people organize their whole lives to appear as prudent and risk-averse as possible? As extreme as it sounds, that danger should not be overlooked. Plenty of societies have been conformist with much weaker mechanisms for surveillance (i.e., little beyond the eyes and ears of nosy neighbors).

And so I return yet again to my privacy mantra — we need to regulate information use, not just information collection and retention. To quote a third post from that July, 2013 flurry:

  • Governmental use of private information needs to be carefully circumscribed, including in most aspects of law enforcement.
  • Business discrimination based on private information needs in most cases to be proscribed as well.

As for exactly what those regulations should be — that, of course, is a complex subject in itself.


Hadoop generalities

Wed, 2015-06-10 06:33

Occasionally I talk with an astute reporter — there are still a few left :) — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.

There is a group of questions going around that includes:

  • Is Hadoop overhyped?
  • Has Hadoop adoption stalled?
  • Is Hadoop adoption being delayed by skills shortages?
  • What is Hadoop really good for anyway?
  • Which adoption curves for previous technologies are the best analogies for Hadoop?

To a first approximation, my responses are: 

  • The Hadoop hype is generally justified, but …
  • … what exactly constitutes “Hadoop” is trickier than one might think, in at least two ways:
    • Hadoop is much more than just a few core projects.
    • Even the core of Hadoop is repeatedly re-imagined.
  • RDBMS are a good analogy for Hadoop.
  • As a general rule, Hadoop adoption is happening earlier for new applications, rather than in replacement or rehosting of old ones. That kind of thing is standard for any comparable technology, both because enabling new applications can be valuable and because migration is a pain.
  • Data transformation, as pre-processing for analytic RDBMS use, is an exception to that general rule. That said …
  • … it’s been adopted quickly because it saves costs. But of course a business that’s only about cost savings may not generate a lot of revenue.
  • Dumping data into a Hadoop-centric “data lake” is a smart decision, even if you haven’t figured out yet what to do with it. But of course, …
  • … even if zero-application adoption makes sense, it isn’t exactly a high-value proposition.
  • I’m generally a skeptic about market numbers. Specific to Hadoop, I note that:
    • The most reliable numbers about Hadoop adoption come from Hortonworks, since it is the only pure-play public company in the market. (Compare, for example, the negligible amounts of information put out by MapR.) But Hortonworks’ experiences are not necessarily identical to those of other vendors, who may compete more on the basis of value-added service and technology rather than on open source purity or price.
    • Hadoop (and the same is true of NoSQL) are most widely adopted at digital companies rather than at traditional enterprises.
    • That said, while all traditional enterprises have some kind of digital presence, not all have ones of the scope that would mandate a heavy investment in internet technologies. Large consumer-oriented companies probably do, but companies with more limited customer bases might not be there yet.
  • Concerns about skill shortages are exaggerated.
    • The point of distributed processing frameworks such as Spark or MapReduce is to make distributed analytic or application programming not much harder than any other kind.
    • If a new programming language or framework needs to be adopted — well, programmers nowadays love learning that kind of stuff.
    • The industry is moving quickly to make distributed systems easier to administer. Any skill shortages in operations should prove quite temporary.

Teradata will support Presto

Mon, 2015-06-08 03:32

At the highest level:

  • Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
  • Teradata is announcing support for Presto with a classic open source pricing model.
  • Presto will also become, roughly speaking, Teradata’s replacement for Hive.
  • Teradata’s Presto efforts are being conducted by the former Hadapt.

Now let’s make that all a little more precise.

Regarding Presto (and I got most of this from Teradata):

  • To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
  • … Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
  • Facebook at various points in time created both Hive and now Presto.
  • Facebook started the Presto project in 2012 and now has 10 engineers on it.
  • Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
  • Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
  • Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.

Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project: 

  • Data is pipelined between operators, with no gratuitous writing to disk the way you might have in something MapReduce-based. This is different from the sense of “pipelining” in which one query might keep an intermediate result set hanging around because another query is known to need those results as well.
  • Presto processing is vectorized; functions don’t need to be re-invoked a tuple at a time. This is different from the sense of vectorization in which several tuples are processed at once, exploiting SIMD (Single Instruction Multiple Data). Dan thinks SIMD is useful mainly for column stores, and Presto tries to be storage-architecture-agnostic.
  • Presto query operators and hence query plans are dynamically compiled, down to byte code.
  • Although it is generally written in Java, Presto uses direct memory management rather than relying on what Java provides. Dan believes that, despite being written in Java, Presto performs as if it were written in C.

More precisely, this is a checklist for interactive-speed parallel SQL. There are some query jobs long enough that Dan thinks you need the fault-tolerance obtained from writing intermediate results to disk, ala’ HadoopDB (which was of course the MapReduce-based predecessor to Hadapt).
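
To illustrate the sense of “vectorized” described above, namely operator functions invoked once per batch of tuples rather than once per tuple, here is a generic toy contrast in Python. It is not Presto code; per the notes above, Presto achieves this with dynamically compiled operators and its own memory management.

```python
def filter_tuple_at_a_time(rows, predicate):
    out = []
    for row in rows:                 # the predicate function is invoked once per tuple
        if predicate(row):
            out.append(row)
    return out

def filter_batch_at_a_time(batches, batch_predicate):
    out = []
    for batch in batches:            # the predicate function is invoked once per batch
        out.extend(batch_predicate(batch))
    return out

rows = list(range(100_000))
batches = [rows[i:i + 4096] for i in range(0, len(rows), 4096)]

evens_1 = filter_tuple_at_a_time(rows, lambda r: r % 2 == 0)
evens_2 = filter_batch_at_a_time(batches, lambda b: [r for r in b if r % 2 == 0])
assert evens_1 == evens_2            # same answer, far fewer function invocations
```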

That said, Presto is a newish database technology effort, there’s lots of stuff missing from it, and there still will be lots of stuff missing from Presto years from now. Teradata has announced contribution plans to Presto for, give or take, the next year, in three phases:

  • Phase 1 (released immediately, and hence in particular already done):
    • An installer.
    • More documentation, especially around installation.
    • Command-line monitoring and management.
  • Phase 2 (later in 2015)
    • Integrations with YARN, Ambari and soon thereafter Cloudera Manager.
    • Expanded SQL coverage.
  • Phase 3 (some time in 2016)
    • An ODBC driver, which is of course essential for business intelligence tool connectivity.
    • Other connectors (e.g. more targets for query federation).
    • Security.
    • Further SQL coverage.

Absent from any specific plans that were disclosed to me was anything about optimization or other performance hacks, and anything about workload management beyond what can be gotten from YARN. I also suspect that much SQL coverage will still be lacking after Phase 3.

Teradata’s basic business model for Presto is:

  • Teradata is selling subscriptions, for which the principal benefit is support.
  • Teradata reserves the right to make some of its Presto-enhancing code subscription-only, but has no immediate plans to do so.
  • Teradata being Teradata, it would love to sell you Presto-related professional services. But you’re absolutely welcome to consume Presto on the basis of license-plus-routine-support-only.

And of course Presto is usurping Hive’s role wherever that makes sense in Teradata’s data connectivity story, e.g. Teradata QueryGrid.

Finally, since I was on the phone with Justin Borgman and Dan Abadi, discussing a project that involved 16 former Hadapt engineers, I asked about Hadapt’s status. That may be summarized as:

  • There are currently no new Hadapt sales.
  • Only a few large Hadapt customers are still being supported by Teradata.
  • The former Hadapt folks would love Hadapt or Hadapt-like technology to be integrated with Presto, but no such plans have been finalized at this time.

IT-centric notes on the future of health care

Mon, 2015-05-25 23:02

It’s difficult to project the rate of IT change in health care, because:

  • Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
  • … health care is heavily bureaucratic, political and regulated.

Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:

  • The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
  • Large amounts of machine-generated data, including:
    • The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
      • Most tests exploit electronic technology. Progress in electronics is intense.
      • Biomedical research is itself intense.
      • In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
    • The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.

These vastly greater amounts of data cited above will allow for greatly changed analytics.

  • Right now, medical decisions are made based on research that looks at a few data points each for a specially-recruited sample of patients, then draws conclusions based on simplistic and questionable statistical methods.
  • More sophisticated analytic methods are commonly used, but almost always still to aid in the discovery and formation of hypotheses that will then be validated, if at all, using the bad old analytic techniques.
  • State of the art predictive modeling, applied to vastly more data, will surely yield greatly better results.

And so I believe that health care itself will be revolutionized.

  • Diagnosis will be much more accurate, pretty much across the board, except in those limited areas where it’s already excellent today.
  • Medication regimens will be much more personalized. (Pharma manufacturing may have to change greatly as a result.) So will other treatments. So will diet/fitness regimens.
  • The vulnerable (elderly, hospital patients) will be more accurately and comprehensively monitored. Also, their care will likely be aided by robotics.
  • Some of the same things will be true of infants and toddlers. (In other cases, they get such close attention today that I can’t imagine how it could be greatly increased. :) )

I believe that this will all happen because I believe that it will make health care vastly more successful. And if I’m right about that, no obstacles will be able to prevent it from coming into play — not cost (which will keep going down in a quasi-Moore’s-Law way), not bureaucratic inertia (although that will continue to slow things greatly), and not privacy fears (despite the challenges cited below).

So what are the IT implications of all this?

  • I already mentioned the need for new (or newly-used) kinds of predictive modeling.
  • Probably in association with those, event detection — which in many but not all cases will amount to anomaly detection — will be huge. If one goal is to let the elderly and ailing live independently, but receive help when it’s needed — well, recognizing when that help is needed will be crucial. Similar dynamics will occur in hospitals.
  • And in support of that, there will be a great amount of monitoring, and hence strong demands upon sensors and recognition. Potentially, all five human senses will be mimicked, among others. These technologies will become even more important in health care if I’m right that robotics will play a big role.
  • Data quality will be a major challenge, especially in the doctors’-notes parts of health records. Reasons start:
    • Different medical professionals might evaluate the same situation differently; diagnosis is a craft rather than a dumb, repeatable skill.
    • If entries are selected from a predefined set of options, none may be a perfect match to the doctor’s actual opinion.
    • Doctors often say what’s needful to have their decisions (care, tests, etc.) approved, whether or not it precisely matches what they really think. Thus, there are significant incentives to enter bad data.
    • Free-text data is more central to health care than to many other application areas, and text data is inherently dirty.
    • Health records are decades later than many other applications in moving from paper to IT.
  • Data integration problems will also be and indeed already are huge, because different health care providers have addressed the tough challenges of record-keeping in different ways.

As for data management — well, almost everything discussed in this blog could come into play.

  • A person’s entire medical record resembles the kind of mess increasingly often dumped these days into NoSQL — typically MongoDB, Cassandra, or HBase.
  • There are plenty of business-transaction records in the mix, of the kind that have long been managed by RDBMS.
  • There are a whole lot of diverse machines in the mix, and managing the data to keep such a menagerie running is commonly the job of Splunk or streaming-enhanced Hadoop.
  • There’s a lot of free text in medical records. Also images, video and so on.
  • Since graph analytics is used in research today, it might at some point make its way into clinical use.

Finally, let me say:

  • Data-driven medicine cannot live up to its potential unless researchers can investigate data sets comprising private information of large numbers of people.
  • Researchers will not have the appropriate permissions unless privacy law moves toward a basis in data use, rather than exclusively regulating data possession.


MemSQL 4.0

Wed, 2015-05-20 03:41

I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:

  • MemSQL started out as an in-memory OLTP (OnLine Transaction Processing) DBMS …
  • … but quickly positioned with “We also do ‘real-time’ analytic processing” …
  • … and backed that up by adding a flash-based column store option …
  • … before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
  • There’s also a JSON option.

The main new aspects of MemSQL 4.0 are:

  • Geospatial indexing. This is for me the most interesting part.
  • A new optimizer and, I suppose, query planner …
  • … which in particular allow for serious distributed joins.
  • Some rather parallel-sounding connectors to Spark, Hadoop and Amazon S3.
  • Usual-suspect stuff including:
    • More SQL coverage (I forgot to ask for details).
    • Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
    • Surely some general Bottleneck Whack-A-Mole.

There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint as well.

Before MemSQL 4.0, distributed joins were restricted to the easy cases:

  • Two tables are distributed (i.e. sharded) on the same key.
  • One table is small enough to be broadcast to each node.

Now arbitrary tables can be joined, with data reshuffling as needed. Notes on MemSQL 4.0 joins include:

  • Join algorithms are currently nested-loop and hash, and in “narrow cases” also merge.
  • MemSQL fondly believes that its in-memory indexes work very well for nested-loop joins.
  • The new optimizer is fully cost-based (but I didn’t get much clarity as to the cost estimators for JSON).
  • MemSQL’s indexing scheme, skip lists, had histograms anyway, with the cutesy name skiplistogram.
  • MemSQL’s queries have always been compiled, and of course have to be planned before compilation. However, there’s a little bit of plan flexibility built in based on the specific values queried for, aka “parameter-sensitive plans” or “run-time plan choosing”.
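
As a toy sketch of the general reshuffle-then-local-join pattern (not MemSQL’s implementation; the node count, table contents and hash scheme are invented):

```python
from collections import defaultdict

NODES = 4

def node_of(key):
    return hash(key) % NODES

def reshuffle(shards, key_index):
    """Re-partition rows across nodes on the join key, so matching rows co-locate."""
    buckets = defaultdict(list)
    for shard in shards:
        for row in shard:
            buckets[node_of(row[key_index])].append(row)
    return buckets

def local_hash_join(left_rows, right_rows, l_idx, r_idx):
    """Ordinary single-node hash join."""
    index = defaultdict(list)
    for l in left_rows:
        index[l[l_idx]].append(l)
    return [l + r for r in right_rows for l in index[r[r_idx]]]

# orders are sharded by order_id, customers by region: neither is sharded on the join key.
orders = [[(1, "cust_a"), (2, "cust_b")], [(3, "cust_a")]]   # (order_id, customer_id)
customers = [[("cust_a", "east")], [("cust_b", "west")]]     # (customer_id, region)

o_shuffled = reshuffle(orders, 1)      # reshuffle both tables on customer_id
c_shuffled = reshuffle(customers, 0)
joined = []
for node in range(NODES):              # each node joins only its own bucket
    joined += local_hash_join(o_shuffled[node], c_shuffled[node], 1, 0)
print(sorted(joined))
```

If both tables were already sharded on the join key, the reshuffle would be unnecessary, which is exactly the first of the pre-4.0 easy cases above.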

To understand the Spark/MemSQL connector, recall that MemSQL has “leaf” nodes, which store data, and “aggregator” nodes, which combine query results and ship them back to the requesting client. The Spark/MemSQL connector manages to skip the aggregation step, instead shipping data directly from the various MemSQL leaf nodes to a Spark cluster. In the other direction, a Spark RDD can be saved into MemSQL as a table. This is also somehow parallel, and can be configured either as a batch update or as an append; intermediate “conflict resolution” policies are possible as well.

In other connectivity notes:

  • MemSQL’s idea of a lambda architecture involves a Kafka stream, with data likely being stored twice (in Hadoop and MemSQL).
  • MemSQL likes and supports the Spark DataFrame API, and says financial trading firms are already using it.

Other application areas cited for streaming/lambda kinds of architectures are — you guessed it! — ad-tech and “anomaly detection”.

And now to the geospatial stuff. I thought I heard:

  • A “point” is actually a square region less than 1 mm per side.
  • There are on the order of 2^30 such points on the surface of the Earth.

Given that Earth’s surface area is a little over 500,000,000 square meters, I’d think 2^50 would be a better figure, but fortunately that discrepancy doesn’t matter to the rest of the discussion. (Edit: As per a comment below, that’s actually square kilometers, so unless I made further errors we’re up to the 2^70 range.)
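
A quick back-of-the-envelope check of that corrected arithmetic, assuming roughly 510 million square kilometers of surface and 1 mm x 1 mm cells:

```python
import math

surface_km2 = 5.1e8              # Earth's surface area, ~510 million square kilometers
mm2_per_km2 = (10**6) ** 2       # 1 km = 10^6 mm, so 1 km^2 = 10^12 mm^2
points = surface_km2 * mm2_per_km2
print(points)                    # ~5.1e20 one-millimeter "points"
print(math.log2(points))         # ~68.8, i.e. roughly 2^69, in line with the 2^70 ballpark above
```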

Anyhow, if the two popular alternatives for geospatial indexing are R-trees or space-filling curves, MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on space-filling curves start:

  • In this context, a space-filling curve is a sequential numbering of points in a higher-dimensional space. (In MemSQL’s case, the dimension is two.)
  • Hilbert curves seem to be in vogue, including at MemSQL.
  • Nice properties of Hilbert space-filling curves include:
    • Numbers near each other always correspond to points near each other.
    • The converse is almost always true as well.*
    • If you take a sequence of numbers that is simply the set of all possibilities with a particular prefix string, that will correspond to a square region. (The shorter the prefix, the larger the square.)

*You could say it’s true except in edge cases … but then you’d deserve to be punished.
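
For the curious, the textbook conversion from a grid point to its position along a Hilbert curve is short enough to show; this is the standard algorithm rather than MemSQL’s actual code, and the order-8 grid is just for illustration.

```python
def hilbert_d(order, x, y):
    """Distance along a Hilbert curve of the given order for grid point (x, y).

    The grid is 2**order by 2**order cells; this is the classic xy-to-distance conversion.
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so lower-order bits keep a consistent orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Per the properties above: numbers near each other are always points near each other,
# and nearby points almost always get nearby numbers.
for p in [(100, 100), (100, 101), (101, 100), (200, 50)]:
    print(p, hilbert_d(8, *p))
```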

Given all that, my understanding of the way MemSQL indexes geospatial stuff — specifically points and polygons — is:

  • Points have numbers assigned to them by the space-filling curve; those are indexed in MemSQL’s usual way. (Skip lists.)
  • A polygon is represented by its vertices. Take the longest prefix they share. That could be used to index them (you’d retrieve a square region that includes the polygon). But actually …
  • … a polygon is covered by a union of such special square regions, and indexed accordingly, and I neglected to ask exactly how the covering set of squares was chosen.

As for company metrics — MemSQL cites >50 customers and >60 employees.


Notes on analytic technology, May 13, 2015

Wed, 2015-05-13 20:38

1. There are multiple ways in which analytics is inherently modular. For example:

  • Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
  • The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
  • Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.

Also, analytics is inherently iterative.

  • Everything I just called “modular” can reasonably be called “iterative” as well.
  • So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”

If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.

2. In 2011, I wrote, in the context of agile predictive analytics, that

… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.

I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises. 

3. Speaking of Gartner, Mark Beyer tweeted

In data management’s future “hybrid” becomes a useless term. Data management is mutable, location agnostic and services oriented.

I replied

And that’s why I launched DBMS2 a decade ago, for “DataBase Management System SERVICES”. :)

A post earlier this year offers a strong clue as to why Mark’s tweet was at least directionally correct: The best structures for writing data are the worst for query, and vice-versa.

4. The foregoing notwithstanding, I continue to believe that there’s a large place in the world for “full-stack” analytics. Of course, some stacks are fuller than others, with SaaS (Software as a Service) offerings probably being the only true complete-stack products.

5. Speaking of full-stack vendors, some of the thoughts in this post were sparked by a recent conversation with Platfora. Platfora, of course, is full-stack except for the Hadoop underneath. They’ve taken to saying “data lake” instead of Hadoop, because they believe:

  • It’s a more benefits-oriented than geek-oriented term.
  • It seems to be more popular than the roughly equivalent terms “data hub” or “data reservoir”.

6. Platfora is coy about metrics, but does boast of high growth, and had >100 employees earlier this year. However, they are refreshingly precise about competition, saying they primarily see four competitors — Tableau, SAS Visual Analytics, Datameer (“sometimes”), and Oracle Data Discovery (who they view as flatteringly imitative of them).

Platfora seems to have a classic BI “land-and-expand” kind of model, with initial installations commonly being a few servers and a few terabytes. Applications cited were the usual suspects — customer analytics, clickstream, and compliance/governance. But they do have some big customer/big database stories as well, including:

  • 100s of terabytes or more (but with a “lens” typically being 5 TB or less).
  • 4-5 customers who pressed them to break a previous cap of 2 billion discrete values.

7. Another full-stack vendor, ScalingData, has been renamed to Rocana, for “root cause analysis”. I’m hearing broader support for their ideas about BI/predictive modeling integration. For example, Platfora has something similar on its roadmap.

Related links

  • I did a kind of analytics overview last month, which had a whole lot of links in it. This post is meant to be additive to that one.