Choices in data management and analysis

Necessary complexity

Sat, 2014-04-19 02:17

When I’m asked to talk to academics, the requested subject is usually a version of “What should we know about what’s happening in the actual market/real world?” I then try to figure out what the scholars could stand to hear that they perhaps don’t already know.

In the current case (Berkeley next Tuesday), I’m using the title “Necessary complexity”. I actually mean three different but related things by that, namely:

  1. No matter how cool an improvement you have in some particular area of technology, it’s not very useful until you add a whole bunch of me-too features and capabilities as well.
  2. Even beyond that, however, the simple(r) stuff has already been built. Most new opportunities are in the creation of complex integrated stacks, in part because …
  3. … users are doing ever more complex things.

While everybody on some level already knows all this, I think it bears calling out even so.

I previously encapsulated the first point in the cardinal rules of DBMS development:

Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.

That’s if things go extremely well.

Rule 2: You aren’t an exception to Rule 1. 

In particular:

  • Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
  • Mixed workload management is harder than you’re assuming it is.
  • Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

My recent post about MongoDB is just one example of same.

Examples of the second point include but are hardly limited to:

BDAS and Spark make a splendid example as well. :)

As to the third point:

Bottom line: Serious software has been built for over 50 years. Very little of it is simple any more.

Categories: Other

MongoDB is growing up

Thu, 2014-04-17 02:56

I caught up with my clients at MongoDB to discuss the recent MongoDB 2.6, along with some new statements of direction. The biggest takeaway is that the MongoDB product, along with the associated MMS (MongoDB Management Service), is growing up. Aspects include:

  • An actual automation and management user interface, as opposed to the current management style, which is almost entirely via scripts (except for the monitoring UI).
    • That’s scheduled for public beta in May, and general availability later this year.
    • It will include some kind of integrated provisioning with VMware, OpenStack, et al.
    • One goal is to let you apply database changes, software upgrades, etc. without taking the cluster down.
  • A reasonable backup strategy.
    • A snapshot copy is made of the database.
    • A copy of the log is streamed somewhere.
    • Periodically — the default seems to be 6 hours — the log is applied to create a new current snapshot.
    • For point-in-time recovery, you take the last snapshot prior to the point, and roll forward to the desired point.
  • A reasonable locking strategy!
    • Document-level locking is all-but-promised for MongoDB 2.8.
    • That means what it sounds like. (I mention this because sometimes an XML database winds up being one big document, which leads to confusing conversations about what’s going on.)
  • Security. My eyes glaze over at the details, but several major buzzwords have been checked off.
  • A general code rewrite to allow for (more) rapid addition of future features.
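The snapshot-plus-log backup scheme in the list above can be sketched in a few lines. This is only an illustration of the general technique, not MongoDB's or MMS's actual code; the function name `restore_to_point`, the in-memory data layout, and the use of `None` to mark a delete are all assumptions made for the example.

```python
from bisect import bisect_right


def restore_to_point(snapshots, log, target_time):
    """Rebuild the database state as of target_time.

    snapshots: list of (timestamp, state_dict), sorted by timestamp.
    log: list of (timestamp, key, value), sorted by timestamp;
         a value of None is treated as a delete (an assumption of this sketch).
    """
    snapshot_times = [ts for ts, _ in snapshots]
    # Find the last snapshot taken at or before the requested point.
    idx = bisect_right(snapshot_times, target_time) - 1
    if idx < 0:
        raise ValueError("no snapshot at or before the requested time")
    snap_time, snap_state = snapshots[idx]
    state = dict(snap_state)  # copy, so the stored snapshot stays untouched
    # Roll forward: replay log entries written after the snapshot,
    # stopping once we pass the desired point in time.
    for ts, key, value in log:
        if ts <= snap_time:
            continue
        if ts > target_time:
            break
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


snapshots = [(0, {}), (6, {"a": 1, "b": 2})]
log = [(2, "a", 1), (4, "b", 2), (7, "a", 10), (9, "b", None), (11, "c", 3)]
recovered = restore_to_point(snapshots, log, target_time=9)
print(recovered)
```

Periodically applying the log to produce a fresh snapshot simply bounds how much log has to be replayed during a restore.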

Of course, when a DBMS vendor rewrites its code, that’s a multi-year process. (I think of it at Oracle as spanning 6 years and 2 main-number releases.) With that caveat, the MongoDB rewrite story is something like:

  • Updating has been reworked. Most of the benefits are coming later.
  • Query optimization and execution have been reworked. Most of the benefits are coming later, except that …
  • … you can now directly filter on multiple indexes in one query; previously you could only simulate doing that by pre-building a compound index.
  • One of those future benefits is more index types, for example R-trees or inverted lists.
  • Concurrency improvements are down the road.
  • So are rewrites of the storage layer, including the introduction of compression.
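To make the "filter on multiple indexes in one query" point concrete, here is a minimal sketch of index intersection in general. It is not MongoDB's implementation and uses no MongoDB API; `build_index`, `find_with_indexes` and the sample documents are hypothetical names and data chosen for illustration.

```python
def build_index(documents, field):
    """Map each value of `field` to the set of document ids that carry it."""
    index = {}
    for doc_id, doc in documents.items():
        if field in doc:
            index.setdefault(doc[field], set()).add(doc_id)
    return index


def find_with_indexes(indexes, criteria):
    """Answer an equality query on several fields by intersecting
    the id sets returned by each single-field index."""
    matches = None
    for field, wanted in criteria.items():
        ids = indexes[field].get(wanted, set())
        matches = set(ids) if matches is None else matches & ids
    return matches if matches is not None else set()


documents = {
    1: {"city": "Boston", "status": "active"},
    2: {"city": "Boston", "status": "closed"},
    3: {"city": "Austin", "status": "active"},
}
# Two independent single-field indexes, no compound index needed.
indexes = {f: build_index(documents, f) for f in ("city", "status")}
hits = find_with_indexes(indexes, {"city": "Boston", "status": "active"})
print(hits)
```

A pre-built compound index on (city, status) would answer the same query with one lookup, at the cost of having to anticipate the field combination in advance.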

Also, you can now straightforwardly transform data in a MongoDB database and write it into new datasets, something that evidently wasn’t easy to do before.

One thing that MongoDB is not doing is offering any ODBC/JDBC or other SQL interfaces. Rather, there’s some other API (I don’t know the details) whereby business intelligence tools or other systems can extract views, and a few BI vendors evidently are doing just that. In particular, MicroStrategy and QlikView were named, as well as a couple of open source usual-suspects.

As of 2.6, MongoDB seems to have a basic integrated text search capability — which however does not rise to the search functionality level that was in Oracle 7.3.2. In particular:

  • 15 Western languages are supported with stopwords, tokenization, etc.
  • Search predicates can be mixed into MongoDB queries.
  • The search language isn’t very rich; for example, it lacks WHERE NEAR semantics.
  • You can’t tweak the lexicon yourself.

And finally, some business and pricing notes:

  • Two big aspects of the paid-versus-free version of MongoDB (the product line) are:
    • Security.
    • Management tools.
  • Well, actually, you can get the management tools for free, but only on a SaaS basis from MongoDB (the company).
    • If you want them on premises or in your part of the cloud, you need to pay.
    • If you want MongoDB (the company) to maintain your backups for you, you need to pay.
  • Customer counts include:
    • At least 1000 or so subscribers (counting by organization).
    • Over 500 (additional?) customers for remote backup.
    • 30 of the Fortune 100.

And finally, MongoDB did something many companies should, which is aggregate user success stories for which they may not be allowed to publish full details. Tidbits include:

  • Over 100 organizations run clusters with more than 100 nodes. Some clusters exceed 1,000 nodes.
  • Many clusters deliver hundreds of thousands of operations per second (combined read and write).
  • MongoDB clusters routinely store hundreds of terabytes, and some store multiple petabytes of data. Over 150 clusters exceed 1 billion documents in size. Many manage more than 100 billion documents.
Categories: Other

The worst database developers in the world?

Wed, 2014-04-16 00:45

If the makers of MMO RPGs (Massive Multi-Player Online Role-Playing Games) aren’t quite the worst database application developers in the world, they’re at least on the short list for consideration. The makers of Guild Wars didn’t even try to have decent database functionality. A decade later, when they introduced Guild Wars 2, the database-oriented functionality (auction house, real-money store, etc.) would crash for days at a time. Lord of the Rings Online evidently had multiple issues with database functionality. Now I’m playing Elder Scrolls Online, which on the whole is a great game, but which may have the most database screw-ups of all.

ESO has been live for less than 3 weeks, and in that time:

1. There’s been a major bug in which players’ “banks” shrank, losing items and so on. Days later, the data still hasn’t been recovered. After a patch, the problem if anything worsened.

2. Guild functionality has at times been taken down while the rest of the game functioned.

3. Those problems aside, bank and guild bank functionality are broken, via what might be considered performance bugs. Problems I repeatedly encounter include:

  • If you deposit a few items, the bank soon goes into a wait state where you can’t use it for a minute or more.
  • Similarly, when you try to access a guild — i.e. group — bank, you often find it in an unresponsive state.
  • If you make a series of updates a second apart, the game tells you you’re doing things too quickly, and insists that you slow down a lot.
  • Items that are supposed to “stack” appear in 2 or more stacks; i.e., a very simple kind of aggregation is failing. There are also several other related recurring errors, which I conjecture have the same underlying cause.

In general, it seems that what should be a collection of database records is really just a list, parsed each time an update occurs, periodically flushed in its entirety to disk, with all the performance problems you’d expect from that kind of choice.
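For contrast, here is a rough sketch of what keyed records would look like. I have no knowledge of how ESO actually stores bank data, so the names `deposit` and `stacks` and the item ids are purely hypothetical; the point is only that stacking becomes a one-record update when items are keyed by id.

```python
from collections import Counter


def deposit(bank, item_id, quantity):
    """Add items to the bank, merging them into the existing stack for
    item_id. Only the one record for that item is touched."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    bank[item_id] += quantity


def stacks(bank):
    """Return the stacks as sorted (item_id, quantity) pairs."""
    return sorted(bank.items())


bank = Counter()
deposit(bank, "iron_ore", 5)
deposit(bank, "rune", 1)
deposit(bank, "iron_ore", 7)  # merges into the existing iron_ore stack
print(stacks(bank))
```

With records keyed this way, the same item cannot show up in two stacks, and a deposit does not require re-parsing the whole bank.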

4. Even stupider are the in-game stores, where fictional items are sold for fictional money. They have an e-commerce interface that is literally 15+ years out of date — items are listed with VERY few filtering options, and there is no way to change the sort. But even that super-primitive interface doesn’t work; in particular, filter queries frequently return incorrect empty-set responses.

5. Much as in other games, over 10 minutes of state changes can be lost.

Except perhaps for #5, these are all functions that surely are only loosely coupled to the rest of the game. Hence the other difficulties of game scaling and performance should have no bearing on them. Hence there’s no excuse for doing such a terrible job of development on large portions of gameplay functionality.

Based on job listings, ESO developer Zenimax doesn’t see database functionality as a major area to fix. This makes me sad.

Categories: Other

NoSQL vs. NewSQL vs. traditional RDBMS

Fri, 2014-03-28 08:09

I frequently am asked questions that boil down to:

  • When should one use NoSQL?
  • When should one use a new SQL product (NewSQL or otherwise)?
  • When should one use a traditional RDBMS (most likely Oracle, DB2, or SQL Server)?

The details vary with context — e.g. sometimes MySQL is a traditional RDBMS and sometimes it is a new kid — but the general class of questions keeps coming. And that’s just for short-request use cases; similar questions for analytic systems arise even more often.

My general answers start:

  • Sometimes something isn’t broken, and doesn’t need fixing.
  • Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
  • Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.

In particular, migration away from legacy DBMS raises many issues: 

  • Feature incompatibility (especially in stored-procedure languages and/or other vendor-specific SQL).
  • Your staff’s programming and administrative skill-sets.
  • Your investment in DBMS-related tools.
  • Your supply of hockey tickets from the vendor’s salesman.

Except for the first, those concerns can apply to new applications as well. So if you’re going to use something other than your enterprise-standard RDBMS, you need a good reason.

Commonly, the good reason to change DBMS is one or more of:

  • Programming model. Increasingly often, dynamic schemas seem preferable to fixed ones. Internet-tracking nested data structures are just one of the reasons.
  • Performance (scale-out). DBMS written in this century often scale out better than ones written in the previous millennium. Also, DBMS with fewer features find it easier to scale than more complex ones; distributed join performance is a particular challenge.
  • Geo-distribution. A special kind of scale-out is geo-distribution, which is sometimes a compliance requirement, and in other cases can be a response time nice-to-have.
  • Other stack choices. Couchbase gets a lot of its adoption from existing memcached users (although they like to point out that the percentage keeps dropping). HBase gets a lot of its adoption as a Hadoop add-on.
  • Licensing cost. Duh.

NoSQL products commonly make sense for new applications. NewSQL products, to date, have had a harder time crossing that bar. The chief reasons for the difference are, I think:

  • Programming model!
  • Earlier to do a good and differentiated job in scale-out.
  • Earlier to be at least somewhat mature.

And that brings us to the 762-gigabyte gorilla — in-memory DBMS performance – which is getting all sorts of SAP-driven marketing attention as a potential reason to switch. One can of course put any database in memory, providing only that it is small enough to fit in a single server’s RAM, or else that the DBMS managing it knows how to scale out. Still, there’s a genuine category of “in-memory DBMS/in-memory DBMS features”, principally because:

  • In-memory database managers can and should have a very different approach to locking and latching than ones that rely on persistent storage.
  • Not all DBMS are great at scale-out.

But Microsoft has now launched Hekaton, about which I long ago wrote:

I lack detail, but I gather that Hekaton has some serious in-memory DBMS design features. Specifically mentioned were the absence of locking and latching.

My level of knowledge about Hekaton hasn’t improved in the interim; still, it would seem that in-memory short-request database management is not a reason to switch away from Microsoft SQL Server. Oracle has vaguely promised to get to a similar state one of these years as well.

Of course, HANA isn’t really a short-request DBMS; it’s an analytic DBMS that SAP plausibly claims is sufficiently fast and feature-rich for short-request processing as well.* It remains to be seen whether that difference in attitude will drive enough sustainable product advantages to make switching make sense.

*Most obviously, HANA is columnar. And it has various kinds of integrated analytics as well.

Categories: Other

DBMS2 revisited

Sun, 2014-03-23 05:52

The name of this blog comes from an August, 2005 column. 8 1/2 years later, that analysis holds up pretty well. Indeed, I’d keep the first two precepts exactly as I proposed back then:

  • Task-appropriate data managers. Much of this blog is about task-appropriate data stores, so I won’t say more about them in this post.
  • Drastic limitations on relational schema complexity. I think I’ve been vindicated on that one by, for example:
    • NoSQL and dynamic schemas.
    • Schema-on-read, and its smarter younger brother schema-on-need.
    • Limitations on the performance and/or allowed functionality of joins in scale-out short-request RDBMS, and the relative lack of complaints about same.
    • Funky database design from major Software as a Service (SaaS) vendors such as Workday and
    • A whole lot of logs.

I’d also keep the general sense of the third precept, namely appropriately-capable data integration, but for that one the specifics do need some serious rework.

For starters, let me say:

  • I’ve mocked the concept of “logical data warehouse” in the past, for its implausible grandiosity, but Gartner’s thoughts on the subject are worth reviewing even so.
  • I generally hear that internet businesses have SOAs (Service-Oriented Architectures) loosely coupling various aspects of their systems, and this is going well. Indeed, it seems to be going so well that it’s not worth talking about, and so I’m unclear on the details; evidently it just works. However …
  • … evidently these SOAs are not set up for human real-time levels of data freshness.
  • ETL (Extract/Transform/Load) is criticized for two reasons:
    • People associate it with the kind of schema-heavy relational database design that’s now widely hated, and the long project cycles it is believed to bring with it.
    • Both analytic RDBMS and now Hadoop offer the alternative of ELT, in which the loading comes before the transformation.
  • There are some welcome attempts to automate aspects of ETL/ELT schema design. I’ve written about this at greatest length in the context of ClearStory’s “Data Intelligence” pitch.
  • Schema-on-need defangs other parts of the ETL/ELT schema beast.
  • If you have a speed-insensitive problem with the cost or complexity of your high-volume data transformation needs, there’s a good chance that Hadoop offers the solution. Much of Hadoop’s adoption is tied to data transformation.

Next, I’d like to call out what is generally a non-problem: when a query can go to two or more systems for the same information, which one should it tap? In theory, that’s a much harder problem than ordinary DBMS optimization. But in practice, only the simplest forms of the challenge tend to arise, because when data is stored in more than one system, they tend to have wildly different use cases, performance profiles and/or permissions.

So what I’m saying is that most traditional kinds of data integration problems are well understood and on their way to being solved in practice. We have our silos; data is replicated as needed between silos; and everything is more or less cool. But of course, as traditional problems get solved, new ones arise, and those turn out to be concentrated among real-time requirements.

“Real-time” of course means different things in different contexts, but for now I think we can safely partition it into two buckets:

  • Human real-time — fast enough so that it doesn’t make a human wait.
  • Machine real-time — as fast as ever possible, because machines are racing other machines.

The latter category arises in the case of automated bidding, famously in high-frequency securities trading, but now in real-time advertising auctions as well. But those vertical markets aside, human real-time integration generally is fast enough.

Narrowing the scope further, I’d say that real-time transactional integration has worked for a while. I date it back to the initially clunky EAI (Enterprise Application Integration) vendors of the latter 1990s. The market didn’t turn out to be that big, but neither did the ETL market, so it’s all good. SOAs, as previously noted, are doing pretty well.

Where things still seem to be dicier is in the area of real-time analytic integration. How can analytic processing be tougher in this regard than transactional? Two ways. One, of course, is data volume. The second is that it’s more likely to involve machine-generated data streams. That said, while I hear a lot about a BI need-for-speed, I often suspect it of being a want-for-speed instead. So while I’m interested in writing a more focused future post on real-time data integration, there may be a bit of latency before it comes out.

Categories: Other

Wants vs. needs

Sun, 2014-03-23 05:51

In 1981, Gerry Chichester and Vaughan Merlyn did a user-survey-based report about transaction-oriented fourth-generation languages, the leading application development technology of their day. The report included top-ten lists of important features during the buying cycle and after implementation. The items on each list were very similar — but the order of the items was completely different. And so the report highlighted what I regard as an eternal truth of the enterprise software industry:

What users value in the product-buying process is quite different from what they value once a product is (being) put into use.

Here are some thoughts about how that comes into play today.

Wants outrunning needs

1. For decades, BI tools have been sold in large part via demos of snazzy features the CEO would like to have on his desk. First it was pretty colors; then it was maps; now sometimes it’s “real-time” changing displays. Other BI features, however, are likely to be more important in practice.

2. In general, the need for “real-time” BI data freshness is often exaggerated. If you’re a human being doing a job that’s also often automated at high speed — for example network monitoring or stock trading — there’s a good chance you need fully human real-time BI. Otherwise, how much does a 5-15 minute delay hurt? Even if you’re monitoring website sell-through — are your business volumes really high enough that 5 minutes matters much? eBay answered “yes” to that question many years ago, but few of us work for businesses anywhere near eBay’s scale.

Even so, the want for speed keeps growing stronger. :)

3. Similarly, some desires for elastic scale-out are excessive. Your website selling koi pond accessories should always run well on a single server. If you diversify your business to the point that that’s not true, you’ll probably rewrite your app by then as well.

4. Some developers want to play with cool new tools. That doesn’t mean those tools are the best choice for the job. In particular, boring old SQL has merits — such as joins! — that shiny NoSQL hasn’t yet replicated.

5. Some developers, on the other hand, want to keep using their old tools, on which they are their employers’ greatest experts. That doesn’t mean those tools are the best choice for the job either.

6. More generally, some enterprises insist on brand labels that add little value but lots of expense. Yes, there are many benefits to vendor consolidation, and you may avoid many headaches if you stick with not-so-cutting-edge technology. But “enterprise-grade” hardware failure rates may not differ enough from “consumer-grade” ones to be worth paying for.

7. Some enterprises still insist on keeping their IT operations on-premises. In a number of cases, that perceived need is hard to justify.

8. Conversely, I’ve steered clients away from data warehouse appliances and toward, say, Vertica, because they had a clear desire to be cloud-ready. However, I’m not aware that any of those companies ever actually deployed Vertica in the cloud.

Needs ahead of wants

1. Enterprises often don’t realize how much their lives can be improved via a technology upgrade. Those queries that take 6 hours on your current systems, but only 6 minutes on the gear you’re testing? They’d probably take 15 minutes or less on any competitive product as well. Just get something reasonably modern, please!

2. Every application SaaS vendor should offer decent BI. Despite their limited scope, dashboards specific to the SaaS application will likely provide customer value. As a bonus, they’re also apt to demo well.

3. If your customer personal-identity data that resides on internet-facing systems isn’t encrypted – why not? And please don’t get me started on passwords that are stored and mailed around in plain text.

4. Notwithstanding what I said above about elasticity being overrated, buyers often either underrate their needs for concurrent usage, or else don’t do a good job of testing concurrency. A lot of performance disappointments are really problems with concurrency.

5. As noted above, it’s possible to underrate one’s need for boring old SQL goodness.

Wants and needs in balance

1. Twenty years ago, I thought security concerns were overwrought. But in an internet-connected world, with customer data privacy and various forms of regulatory compliance in play, wants and needs for security seem pretty well aligned.

2. There also was a time when ease of set-up and installation tended to be underrated. Not any more, however; people generally understand its great importance.

Categories: Other

Real-time analytics for everybody, uniquely from us!!

Tue, 2014-03-18 21:54

In my latest post, I noted that

The “real-time analytics” gold rush I called out last year continues.

I also recently mocked the slogan

Analytics for everybody!

So when I saw today an email subject line

[Vendor X] to announce real-time analytics for everyone …

I laughed. Indeed, I snorted so loudly that Linda — who was on a different floor of our house — called to check that I was OK. :)

As the day progressed, I had a consulting call with a client, and what did I see on the first substantive slide? There were references to:

broader audience


real-time data analysis

The trends — real or imaginary — are melting into each other!

Categories: Other

Notes and comments, March 17, 2014

Mon, 2014-03-17 01:09

I have ever more business-advice posts up on Strategic Messaging. Recent subjects include pricing and stealth-mode marketing. Other stuff I’ve been up to includes:

The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.

Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing. :D

The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.

Basho had a major change in leadership. A Twitter exchange ensued. :) Joab Jackson offered a more sober — figuratively and literally — take.

Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.

*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.

WibiData is partnering with DataStax. WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.

Disclosure: My fingerprints are all over that deal.

In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.

I encountered another apparently-popular machine-learning term — bandit model. It seems to be glorified A/B testing, and it seems to be popular. I think the point is that it tries to optimize for just how much you invest in testing unproven (for good or bad) alternatives.
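For readers who want to see the explore/exploit trade-off concretely, here is a minimal sketch of one common bandit strategy, epsilon-greedy. It is one of several bandit algorithms, not necessarily the one any particular vendor uses; the function names, the click rates in `click_simulator`, and the default `epsilon` are assumptions for illustration.

```python
import random


def epsilon_greedy(reward_fn, n_arms, n_trials, epsilon=0.1, seed=0):
    """Run an epsilon-greedy bandit.

    With probability epsilon, try a random arm (explore); otherwise use
    the arm with the best average reward observed so far (exploit).
    Returns (pulls per arm, total reward per arm).
    """
    rng = random.Random(seed)
    pulls = [0] * n_arms
    totals = [0.0] * n_arms
    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            averages = [
                totals[a] / pulls[a] if pulls[a] else 0.0
                for a in range(n_arms)
            ]
            arm = max(range(n_arms), key=averages.__getitem__)
        reward = reward_fn(arm, rng)
        pulls[arm] += 1
        totals[arm] += reward
    return pulls, totals


def click_simulator(arm, rng):
    """Hypothetical page variants with made-up click-through rates."""
    rates = [0.02, 0.10]
    return 1.0 if rng.random() < rates[arm] else 0.0


pulls, totals = epsilon_greedy(click_simulator, n_arms=2, n_trials=5000)
print(pulls, totals)
```

Unlike a fixed 50/50 A/B split, `epsilon` directly controls how much traffic is spent on testing the alternatives that currently look worse.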

I had an awkward set of interactions with Gooddata, including my longest conversations with them since 2009. Gooddata is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix — but please don’t hold me to those details.

I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.

*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.

Categories: Other

Splunk and inverted-list indexing

Thu, 2014-03-06 06:55

Some technical background about Splunk

In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):

Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.

It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.

I also wrote:

The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.

That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.

I further wrote:

Splunk has a simple ILM (Information Lifecycle management) story based on time. I didn’t probe for details.

Splunk’s ILM story turns out to be simple indeed.

  • As data streams in, Splunk adds it to the most recent — “hot” — bucket. Once a bucket is full, it becomes immutable — “warm” — and a new hot bucket is opened to receive data.
  • Splunk executes queries against whichever of these time-slice buckets make sense, then unions results together as needed.
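The hot/warm bucket scheme can be sketched roughly as follows. This is not Splunk's code: the class `BucketStore` is hypothetical, "full" is assumed to mean a fixed event count, and events are assumed to arrive in timestamp order.

```python
class BucketStore:
    """Toy time-sliced store with one open (hot) bucket and frozen (warm) ones."""

    def __init__(self, bucket_capacity):
        self.bucket_capacity = bucket_capacity
        self.warm = []  # immutable buckets, stored as tuples
        self.hot = []   # the single bucket still receiving data

    def append(self, timestamp, event):
        """Add an event to the hot bucket; freeze it once it is full."""
        self.hot.append((timestamp, event))
        if len(self.hot) >= self.bucket_capacity:
            self.warm.append(tuple(self.hot))
            self.hot = []

    def search(self, start, end, predicate):
        """Query only the buckets whose time span overlaps [start, end],
        then union the per-bucket results."""
        results = []
        for bucket in self.warm + [self.hot]:
            if not bucket:
                continue
            if bucket[-1][0] < start or bucket[0][0] > end:
                continue  # this time slice cannot contain matches
            results.extend(
                (ts, ev)
                for ts, ev in bucket
                if start <= ts <= end and predicate(ev)
            )
        return results


store = BucketStore(bucket_capacity=3)
for ts, ev in enumerate(["ok", "error", "ok", "error", "ok", "ok", "error"], start=1):
    store.append(ts, ev)
found = store.search(2, 6, lambda ev: ev == "error")
print(found)
```

Time-based lifecycle management then amounts to dropping or archiving the oldest warm buckets whole.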

Finally, I wrote:

I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.


I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.

The point of what I in October, 2013 called

a high(er)-performance data store into which you can selectively copy columns of data

and which Splunk enthusiastically calls its “High Performance Analytic Store” is to meet that latter need.

Inverted-list indexing

Inverted list technology is confusing for several reasons, which start: 

  • It has two names that — rightly or wrongly — are used fairly interchangeably: inverted index and inverted list.
  • Inverted indexes have played different roles at different times. In particular:
    • They were the architecture of the best pre-relational general-purpose DBMS, namely ADABAS, Datacom/DB, and Model 204.
    • They are the core architecture of text search.
    • They are the architecture of certain document- or object-oriented DBMS, such as MarkLogic.
    • They are the core architecture of Splunk. :)

What’s more, inverted list technology can take several different forms.

  • In the simplest case, for each of many keywords, the inverted index lists the documents that contain it. Splunk does a form of this, where the “keyword” is the field — i.e. name — in a (field, value) pair.
  • Another option is to store, for each keyword or name, not just document_IDs, but additional information.
    • In the case of (field, value) pairs, the value can be stored. Splunk sometimes does that too.
    • In the case of text documents, the index can store the position(s) in the document at which the word occurs. This is irrelevant to Splunk.
  • When you list all the records that have a certain field in them, and the list mentions the values, you’re getting pretty close to having a column-group NoSQL DBMS (e.g. Cassandra or HBase). Indeed, you might even be on your way to a columnar RDBMS; after all, SAP HANA grew out of a text indexing system.
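The first two forms described above can be sketched as follows. This is a generic illustration of inverted lists, not Splunk's data structures; the function names and the sample events are assumptions.

```python
from collections import defaultdict


def build_field_index(events):
    """Simplest form: field name -> ids of the events that contain that field."""
    index = defaultdict(set)
    for event_id, event in events.items():
        for field in event:
            index[field].add(event_id)
    return index


def build_field_value_index(events):
    """Richer form: field name -> value -> ids, so the values themselves
    live in the index and can be filtered without reading the events."""
    index = defaultdict(lambda: defaultdict(set))
    for event_id, event in events.items():
        for field, value in event.items():
            index[field][value].add(event_id)
    return index


events = {
    1: {"host": "web1", "status": "500"},
    2: {"host": "web2"},
    3: {"host": "web1", "status": "200"},
}
by_field = build_field_index(events)
by_field_value = build_field_value_index(events)
print(sorted(by_field["status"]))
print(sorted(by_field_value["host"]["web1"]))
```

Reading `by_field_value["host"]` end to end gives every host value alongside its event ids, which is why this form starts to resemble a column store.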

Splunk, HPAS, and inverted indexes

With all that background, we can finally summarize Splunk’s “High Performance Analytic Store” story.

  • Splunk’s classic data store is an inverted list system that:
    • Tracks (field, value) pairs for a few fields that are always the same, such as Source_System.
    • Otherwise tracks fields only.
  • Splunk HPAS is an inverted list system that tracks (field, value) pairs for arbitrary fields. This gives much higher performance for queries that SELECT on or GROUP BY those fields.
  • As of Splunk 6, Splunk Classic and Splunk HPAS are tightly and almost transparently integrated.

While I haven’t probed for full specifics, I did gather:

  • Queries execute against both data stores at once, without any syntax change. At least, they do if you press some button; that’s the “almost” in the transparency.
  • HPAS time-slices the data it stores by the same time intervals that Splunk Classic does. Hence for each time range, integrated Splunk can interrogate the HPAS first and, if it can’t answer, go to the slower traditional Splunk store.
  • There are two basic ways to populate the HPAS:
    • As the data streams in.
    • Via the result sets of Splunk queries. Splunk talks as if this is the preferred way, which fits with Splunk’s long-time argument that it’s nice not to have to make any schema choices before you start streaming the data in.
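The "interrogate the HPAS first, fall back to the classic store" behavior can be sketched per time slice like this. I haven't seen Splunk's internals, so `run_query`, the two dictionary layouts and the time-slice labels are all hypothetical.

```python
def run_query(time_slices, fast_store, slow_store, field, value):
    """Find ids of events where event[field] == value.

    fast_store: time_slice -> field -> value -> list of event ids
                (only covers fields someone chose to copy in).
    slow_store: time_slice -> event id -> event dict (covers everything).
    """
    results = []
    for time_slice in time_slices:
        fast = fast_store.get(time_slice, {})
        if field in fast:
            # The fast store covers this field for this slice: direct lookup.
            results.extend(fast[field].get(value, []))
        else:
            # Otherwise scan the slower, complete store for the same slice.
            results.extend(
                event_id
                for event_id, event in slow_store.get(time_slice, {}).items()
                if event.get(field) == value
            )
    return results


slow_store = {
    "09:00": {1: {"host": "web1"}, 2: {"host": "web2"}},
    "10:00": {3: {"host": "web1"}, 4: {"host": "web1"}},
}
# Only the 10:00 slice has been populated in the fast store.
fast_store = {"10:00": {"host": {"web1": [3, 4]}}}
matches = run_query(["09:00", "10:00"], fast_store, slow_store, "host", "web1")
print(matches)
```

Because both stores use the same time slices, the choice between them can be made slice by slice without changing the query.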
Categories: Other

Analytics for everybody!

Wed, 2014-03-05 15:39

For quite some time, one of the most frequent marketing pitches I’ve heard is “Analytics made easy for everybody!”, where by “quite some time” I mean “over 30 years”. “Uniquely easy analytics” is a claim that I meet with the greatest of skepticism.*  Further confusing matters, these claims are usually about what amounts to business intelligence tools, but vendors increasingly say “Our stuff is better than the BI that came before, so we don’t want you to call it ‘BI’ as well.”

*That’s even if your slide deck doesn’t contain a picture of a pyramid of user kinds; if there actually is such a drawing, then the chance that I believe you is effectively nil.

All those caveats notwithstanding, there are indeed at least three forms of widespread analytics:

  • Fairly standalone, eas(ier) to use business intelligence tools, sometimes marketed as focusing on “data exploration” or “data discovery”.
  • Charts and graphs integrated or at least well-embedded into production applications. This technology is on a long-term rise. But in some sense, integrated reporting has been around since the invention of accounting.
  • Predictive analytics built into automated systems, for example ad selection. This is not what is usually meant by the “easy analytics” claim, and I’ll say no more about it in this post.

It would be nice to say that the first two bullet points represent a fairly clean operational/investigative BI split, but that would be wrong; human real-time dashboards can at once be standalone and operational.

Often, the message “Our BI is easy to use by everybody, unlike every other BI offering in the past 40 years” is unsupported by facts; vendors just offer me-too BI technology and falsely claim it’s something special. But sometimes there is actual substance, usually in one or more aspects of time-to-answer. For example:

  • Sometimes the BI itself has a particularly good interface for navigation.
  • I think it’s still possible to be differentiated in mobile BI delivery.
  • It’s definitely still possible to be differentiated in real-time/streaming BI interfaces.
  • Sometimes the visible BI is just part of a specialized stack, whose other elements make it much easier to set up working UI than in the traditional model.
    • Some claims along these lines are bogus, drawing false comparisons to worst-case scenarios in which enterprises take a year or two setting up their first-ever data warehouse.
    • Some of these claims, however, are more legitimate, at least to the extent that the stack includes leading-edge smart data integration, schema-on-need data management, and so on.

One item I’m leaving off the list is the capability to easily design charts, graphs or whole dashboards. When BI vendors add that functionality, they often present it as an industry innovation; but it’s been years since I saw something in that vein beyond the me-too.

Categories: Other

Confusion about metadata

Sun, 2014-02-23 00:50

A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.

“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:

  • Data about data structure. This is the classical sense of the term. But please note:
    • In a relational database, structural metadata is rather separate from the data itself.
    • In a document database, each document might carry structure information with it.
  • Other inputs to core data management functions. Two major examples are:
    • Column statistics that inform RDBMS optimizers.
    • Value ranges that inform partition pruning or, more generally, data skipping.
  • Inputs to ancillary data management functions — for example, security privileges.
  • Support for human decisions about data — for example, information about authorship or lineage.
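As an illustration of the second kind of metadata, here is a small sketch of data skipping driven by per-partition value ranges. The partition layout and the helper names are assumptions made up for this example, not any particular DBMS's design.

```python
def make_partition(rows):
    """Store rows together with their (min, max) range, i.e. metadata about the data."""
    return {"min": min(rows), "max": max(rows), "rows": rows}


def scan_with_skipping(partitions, low, high):
    """Return values in [low, high], reading only partitions that might match."""
    matches, partitions_read = [], 0
    for part in partitions:
        if part["max"] < low or part["min"] > high:
            continue  # No row in this partition can qualify; skip it unread.
        partitions_read += 1
        matches.extend(v for v in part["rows"] if low <= v <= high)
    return matches, partitions_read


partitions = [
    make_partition([1, 4, 7]),
    make_partition([10, 12, 19]),
    make_partition([20, 25, 31]),
]
matches, partitions_read = scan_with_skipping(partitions, 11, 21)
```

The range check touches only the metadata, so the first partition is never read at all. That is the sense in which this metadata is an input to core data management rather than a description of structure.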

What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.

And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:

  • Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
  • Some document databases store structural metadata right with the document data itself.
  • Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
  • Actual text documents carry the structure imposed by grammar and syntax.

Related links

Categories: Other

MemSQL 3.0

Mon, 2014-02-10 14:38

Memory-centric data management is confusing. And so I’m going to clarify a couple of things about MemSQL 3.0 even though I don’t yet have a lot of details.* They are:

  • MemSQL has historically been an in-memory row store, which as of last year scales out.
  • It turns out that the MemSQL row store actually has two table types. One is scaled out. The other — called “reference” — is replicated on every node.
  • MemSQL has now added a third table type, which is columnar and which resides in flash memory.
  • If you want to keep data in, for example, both the scale-out row store and the column store, you’d have to copy/replicate it within MemSQL. And if you wanted to access data from both versions at once (e.g. because different copies cover different time periods), you’d likely have to do a UNION or something like that.
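A rough sketch of what that UNION might look like, using Python's built-in sqlite3 module purely as a stand-in. The table names and data are hypothetical, and nothing here reflects MemSQL's actual syntax or table types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events_recent  (day TEXT, amount INTEGER);  -- stands in for the row store
    CREATE TABLE events_history (day TEXT, amount INTEGER);  -- stands in for the column store
    INSERT INTO events_recent  VALUES ('2014-02-09', 5), ('2014-02-10', 7);
    INSERT INTO events_history VALUES ('2013-12-31', 3), ('2014-01-15', 4);
""")

# One logical query over both copies, stitched together by hand.
rows = conn.execute("""
    SELECT day, amount FROM events_history
    UNION ALL
    SELECT day, amount FROM events_recent
    ORDER BY day
""").fetchall()
total = sum(amount for _, amount in rows)
conn.close()
```

The point is only that the application, not the DBMS, is responsible for knowing which table covers which time period.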

*MemSQL’s first columnar offering sounds pretty basic; for example, there’s no columnar compression yet. (Edit: Oops, that’s not accurate. See comment below.) But at least they actually have one, which puts them ahead of many other row-based RDBMS vendors that come to mind.

And to hammer home the contrast:

  • IBM, Oracle and Microsoft, which all sell row-based DBMS meant to run on disk or other persistent storage, have added or will add columnar options that run in RAM.
  • MemSQL, which sells a row-based DBMS that runs in RAM, has added a columnar option that runs in persistent solid-state storage.
Categories: Other

Distinctions in SQL/Hadoop integration

Sun, 2014-02-09 12:50

Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):

  • Are the SQL engine and Hadoop:
    • Necessarily on the same cluster?
    • Necessarily or at least most naturally on different clusters?
  • How, if at all, is Hadoop invoked by the SQL engine? Specifically, what is the role of:
    • HDFS (Hadoop Distributed File System)?
    • Hadoop MapReduce?
    • HCatalog?
  • How, if at all, is the SQL engine invoked by Hadoop?

In particular:

  • If something is called a “connector”, then Hadoop and the SQL engine are most likely on separate clusters. Good features include (but these can partially contradict each other):
    • A way of making data transfer maximally parallel.
    • Query planning that is smart about when to process on the SQL engine and when to use Hadoop’s native SQL (Hive or otherwise).
  • If something is called “SQL-on-Hadoop”, then Hadoop and the SQL engine are or should be on the same cluster, using the same nodes to store and process data. But while that’s a necessary condition, I’d prefer that it not be sufficient.

Let’s go to some examples.

Hive is the closest example of SQL/Hadoop integration known. Hive executes a somewhat low-grade dialect of SQL — HQL (Hive Query Language) — via very standard Hadoop: Hadoop MapReduce, all HDFS file formats, etc. HCatalog is an enhancement/replacement for the Hive metadata store. HQL is just another language that can be used to write (parts of) Hadoop jobs.

Impala is Cloudera’s replacement for Hive. Impala is and/or is planned to be much like Hive, but much better, for example in performance and in SQL functionality. Impala has its own custom execution engine, including a daemon on every Hadoop data node, and seems to run against a variety of but not all HDFS file formats.

Stinger is Hortonworks’ (and presumably also Apache’s) answer to Impala, but is more of a Hive upgrade than an outright replacement. In particular, Stinger’s answer to the new Impala engine is a port of Hive to the new engine Tez.

Teradata SQL-H is an RDBMS-Hadoop connector that uses HCatalog, and plans queries across the two clusters. Microsoft Polybase is like SQL-H, but it seems more willing than Teradata or Teradata Aster to (optionally) coexist on the same nodes as Hadoop.

Hadapt runs on the Hadoop cluster, putting PostgreSQL* and other software on each Hadoop data node. It has two query engines, one that invokes Hadoop MapReduce (the original one, still best for longer-running queries) and one that doesn’t (more analogous to Impala). When last I looked, Hadapt didn’t query or update against the HDFS API, but there was an interesting future in preloading data from HDFS into Hadapt PostgreSQL tables, and I think that Hadapt’s PostgreSQL tables are technically HDFS files. I don’t think Hadapt makes much use of HCatalog.

*Hacked to allow Hadapt to offer more than just SQL/Hadoop integration.

Splice Machine is a new entrant (public beta is imminent) that has put Apache Derby over an HBase back end. (Apache Derby is the former Cloudscape, an embeddable Java RDBMS that was acquired by Informix and hence later by IBM.) Splice Machine runs on your Hadoop nodes as an HBase coprocessor. Its relationship to non-HBase parts of Hadoop is arm’s-length. I wish this weren’t called “SQL-on-Hadoop”.

Related links

  • Dan Abadi and Dave Dewitt opined last June about how to categorize Hadapt and Polybase.
  • My most detailed discussions of Impala and Stinger were last June and August, respectively.
Categories: Other

Some stuff I’m thinking about (early 2014)

Sun, 2014-02-02 12:51

From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:

Other stuff on my mind includes but is not limited to:

1. Certain categories of buying organizations are inherently leading-edge.

  • Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
  • US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
  • Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
  • … as have national-security agencies …
  • … as have pharmaceutical research departments.

Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.

2. In particular, I hope to figure out where Hadoop is or soon will be getting major adoption.

  • Widespread Hadoop adoption at ordinary large enterprises is, I think, inevitable and imminent. But it hasn’t quite happened yet.
  • I think that part of the “enterprise data hub” story is a great bet to come true — Hadoop is becoming a key destination for data to land and be transformed. MapReduce was invented for data transformation; Hadoop was invented to do MapReduce; data transformation workloads have already been moving from expensive analytic RDBMS to cheaper Hadoop.
  • I also think Hadoop — enhanced with Spark or whatever — will win as a platform for sophisticated predictive modeling; Hadoop’s (and Spark’s) flexibility is at least as useful for the purpose as RDBMS’ SQL execution speed.
  • I’m still skeptical about ordinary enterprises’ adoption of Hadoop as a business intelligence platform, but it’s definitely another area to track.

3. Analytic RDBMS and data warehouse appliance pricing is always a big deal. Hadoop’s great price advantage doesn’t have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or Infobright.

Speaking of that, it turns out Teradata now publishes per-terabyte pricing. Please note that those prices are per terabyte of uncompressed data; effective prices per terabyte of user data can be assumed to be lower, at least for databases that compress well.

Analytic RDBMS prices are still shaking out.

4. As I previously noted, ensemble models have become the norm for machine learning. I want to learn more about the implications of that.

One conjecture — everything we learned in school about statistics is wrong, or at least it’s less important than we thought. Predictive modeling is not mainly about least squares, regressions, curve-fitting, etc. Rather, it’s first and foremost about data segmentation and clustering, with all the curve-fitting stuff being secondary.

Besides fitting — as it were — what I hear, this hypothesis also matches common sense. How do businesses use predictive modeling? For each customer/prospect/site-visitor/whatever, they decide which of a limited number of possible actions to take. At its core, that’s an exercise in segmentation.
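A toy sketch of that idea, in which the "model" is nothing more than an assignment of each customer to a segment, and each segment maps to one of a few actions. The segment rules, thresholds and action names are all invented for illustration; a real system would learn the segments from data.

```python
def segment(customer):
    """Assign a customer to one of a few segments (hypothetical hand-written rules)."""
    if customer["visits"] == 0:
        return "new"
    if customer["days_since_purchase"] > 90:
        return "lapsed"
    return "active"


# A limited menu of possible actions, one per segment.
ACTIONS = {
    "new": "show welcome offer",
    "lapsed": "send win-back email",
    "active": "recommend related item",
}

customers = [
    {"id": 1, "visits": 0, "days_since_purchase": 0},
    {"id": 2, "visits": 12, "days_since_purchase": 200},
    {"id": 3, "visits": 5, "days_since_purchase": 10},
]

decisions = {c["id"]: ACTIONS[segment(c)] for c in customers}
```

No curve is fitted anywhere; the decision depends entirely on which bucket a customer lands in.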

5. I think data integration is getting a lot smarter than it was. Hadoop-based transformation is the obvious example. But there’s also ClearStory’s data intelligence pitch. (And yes, I know I need to talk with Paxata. There’s been a lot of ball-dropping on that one, including by me.)

6. There’s a meta-theme in the above — stuff that’s not exactly a DBMS or DBMS-like data store. Streaming fits into that. So does smart data integration. So, arguably, does Spark. So do data grids, another of those topics I’d like to know more about but haven’t nailed down yet.

Data management is getting ever more complex.

Categories: Other

Spark and Databricks

Sun, 2014-02-02 12:50

I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.

  • Spark is very new. All Spark adoption is recent.
  • Databricks was founded to commercialize Spark. It is very much in stealth mode …
  • … except insofar as Databricks folks are going out and trying to drum up Spark adoption. :)
  • Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated. :)
  • Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
  • Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.

The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:

  • Spark is a distributed execution engine for analytic processes …
  • … which works well with Hadoop.
  • Spark is distinguished by a flexible in-memory data model …
  • … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
  • Intended analytic use cases for Spark include:
    • SQL data manipulation.
    • ETL-like data manipulation.
    • Streaming-like data manipulation.
    • Machine learning.
    • Graph analytics.

Except for certain low-latency operations,* anything you can do in Spark can also be done in straight Hadoop; Spark just can have advantages in performance and programming ease. Spark RDDs (Resilient Distributed Datasets) are immutable at this time, so Spark is not suited for short-request update workloads.

*A new Spark task requires a thread, not a whole Java Virtual Machine.
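To illustrate why immutability rules out short-request updates, here is a toy, single-machine sketch of an immutable dataset. ToyDataset is a hypothetical class written for this post; it is not the Spark API, and real RDDs are distributed and lazily evaluated, which this sketch ignores.

```python
class ToyDataset:
    """An immutable dataset: transformations return new datasets, never modify in place."""

    def __init__(self, items):
        self._items = tuple(items)  # A tuple cannot be changed after creation.

    def map(self, fn):
        return ToyDataset(fn(x) for x in self._items)

    def filter(self, predicate):
        return ToyDataset(x for x in self._items if predicate(x))

    def collect(self):
        return list(self._items)


readings = ToyDataset([3, 8, 15, 4])
doubled = readings.map(lambda x: x * 2)
large = doubled.filter(lambda x: x > 10)
```

Changing a single value means deriving a whole new dataset from the old one. That is fine for analytic pipelines, but it is a poor fit for workloads made up of many small updates.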

Everybody agrees that machine learning is a top Spark use case. In particular:

  • Cloudera sees machine learning as the major area of Spark adoption to date.
  • Ion gave me the impression machine learning is one of the major areas of Spark adoption to date.
  • Mike gave me the impression that machine learning was a core intended use case for Spark the first time we talked about it.
  • There’s a machine learning library for Spark, and also a way to use Spark to do distributed R.

I believe data transformation is a major Spark use case as well.

  • Ion gave me that impression, although Cloudera surprisingly did not. Edit: Actually, see Matt Brandwine’s comment below.
  • I have one client (ClearStory) using Spark that way, and a second that’s likely to.
  • It makes sense that the #1 Hadoop use case (to date), which is something Spark also is well-suited for, would be an important early Spark use case as well.

Spark Streaming is fairly new, but is already getting some adoption. Notes on that start:

  • The actual technology is a form of micro-batching. I plan to learn more about that in the future.
  • Cloudera sees streaming as one of the two big Spark use cases, and praises Spark Streaming for its fault tolerance and its great ease of coding.
  • Mike Franklin knows a lot about streaming.
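For readers unfamiliar with micro-batching, the general idea can be sketched as below: incoming events are chopped into small fixed-length time windows, and each window is then processed as an ordinary batch. This is a generic illustration with made-up names and data, not Spark Streaming's API or implementation.

```python
from itertools import groupby


def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events, already sorted by time, into fixed windows."""
    for window, group in groupby(events, key=lambda e: e[0] // interval_ms):
        yield window * interval_ms, [value for _, value in group]


# Hypothetical event stream: (timestamp in milliseconds, value).
events = [(200, 5), (900, 1), (1100, 7), (2500, 2), (2700, 3)]

# Each one-second window is summed as a small batch job.
batch_sums = [(start, sum(values)) for start, values in micro_batches(events, 1000)]
```

The trade-off is latency: no result can arrive sooner than the end of its window, which is why this is a form of streaming rather than true event-at-a-time processing.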

Part of that story is a sudden decline in the reputation of Storm, whose troubles seem to include:

  • Project founder and Twitter employee Nathan Marz seems no longer to be associated with Storm nor employed at Twitter.
  • I am told that in general the Storm community is not all that vibrant.
  • Various aspects of Storm’s technology are disappointing people.

Other notes on Spark use cases include:

  • Impala-loving Cloudera doesn’t plan to support Shark. Duh.
  • Cloudera also won’t at first support any Spark predictive modeling add-on.
  • Ion’s other company, Conviva, is doing some real-time decisioning in Spark.

Spark data management has been enhanced by a project called Tachyon.* The main point of Tachyon is that Spark RDDs (Resilient Distributed Datasets) now persist in memory beyond the life of a job; besides offering the RDDs to other Spark jobs, Tachyon also opens them to Hadoop via an HDFS emulator.

*If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov — i.e., the radiation that is measured to detect the passage of tachyons. :)

And finally, some metrics and so on:

  • Databricks has between 10 and 20 employees.
  • Spark has >100 individual contributors from >25 different companies.
  • There was a Spark Summit with >450 attendees (from >180 organizations), and an earlier Spark-mainly conference with >200 attendees.
  • The Spark meet-up group in San Francisco has >1500 members signed up.
  • Various Spark users and subprojects are identified on the Apache Spark pages.

Related link

  • Most of the current substance on Databricks’ website is in its blog.
Categories: Other

More on public policy

Sat, 2014-02-01 05:35

Occasionally I take my public policy experience out for some exercise. Last week I wrote about privacy and network neutrality. In this post I’ll survey a few more subjects.

1. Censorship worries me, a lot. A classic example is Vietnam, which basically has outlawed online political discussion.

And such laws can have teeth. It’s hard to conceal your internet usage from an inquisitive government.

2. Software and software related patents are back in the news. Google, which said it was paying $5.5 billion or so for a bunch of Motorola patents, turns out to really have paid $7 billion or more. Twitter and IBM did a patent deal as well. Big numbers, and good for certain shareholders. But this all benefits the wider world — how?

As I wrote 3 1/2 years ago:

The purpose of legal intellectual property protections, simply put, is to help make it a good decision to create something.

Why does “securing … exclusive Right[s]” to the creators of things that are patented, copyrighted, or trademarked help make it a good decision for them to create stuff? Because it averts competition from copiers, thus making the creator a monopolist in what s/he has created, allowing her to at least somewhat value-price her creation.

I.e., the core point of intellectual property rights is to prevent copying-based competition. By way of contrast, any other kind of intellectual property “right” should be viewed with great suspicion.

That Constitutionally-based principle makes as much sense to me now as it did then. By way of contrast, “Let’s give more intellectual property rights to big corporations to protect middle-managers’ jobs” is — well, it’s an argument I view with great suspicion.

But I find it extremely hard to think of a technology industry example in which development was stimulated by the possibility of patent protection. Yes, the situation may be different in pharmaceuticals, or for gadgeteering home inventors, but I can think of no case in which technology has been better, or faster to come to market, because of the possibility of a patent-law monopoly. So if software and business-method patents were abolished entirely – even the ones that I think could be realistically adjudicated – I’d be pleased.

3. In November, 2008 I offered IT policy suggestions for the incoming Obama Administration, especially: 

  1. Pick the right Chief Technology Officer.
  2. Fix the government technology contracting process in general.
  3. Fix the air traffic control system in particular.
  4. Generally take a businesslike approach to government IT. Obama’s focus on making government “transparent” and searchable would be just one byproduct of that effort.
  5. Continue to beef up internal search and knowledge management (remember the FBI agent who guessed the 9/11 plans, but couldn’t communicate his ideas to anybody who cared).
  6. Write privacy laws of the sort that will, for example, allow electronic health records to be adopted without great fear of misuse. (I have some strong opinions as to what form those laws should take.)
  7. Drastically beef up math education!! (Science too, but math is especially important.) This takes leadership to convince people it’s CRUCIAL to be numerate, perhaps even more than it takes specific policy initiatives. Little else is as important.


… we need an experienced technology implementation leader to:

  • Recommend major changes in government IT contracting. Right now, information technology is bought at the wrong level of granularity, too coarse and too fine at once. Private sector CIOs make broad technology architecture decisions, then make incremental purchases as needed. Public sector IT managers, however, are generally compelled to make purchases on a “project” basis, which allows neither the sanity of broad-scale planning nor the economies and adaptability of just-in-time acquisition.
  • Establish best practices in a broad range of IT areas. Obama’s “transparency” initiative involves pushing the state of the art in public-facing technology for search, query, and audio/video, at a minimum. Other areas of major technical challenge include internal search, knowledge management, and social networking; disaster robustness; planning in the face of political budgeting uncertainty; numbers-based management without the benefit of a profit/loss statement … and the list could easily be twice as long.
  • Interact with the private sector. From electronic health records to the general supply chain, there are huge opportunities for public/private interoperability, quite apart from the obvious customer/vendor relationships the government has with the IT industry.
  • Improve training, recruiting, and retention. Anywhere government needs employees whose skills are also in high demand in the private sector, government pay scales cause difficulties. IT is a top area for that problem. Outstanding leadership is needed to overcome it.

Little of that actually happened.

Kudos if you noticed the link — which I herewith repeat — to what I wrote about privacy in 2006. :)

In particular — and even after the fiasco — I think few voters or legislators understand how incredibly broken government IT contracting is. Almost all major projects go through a five-stage process:

  • Specify.
  • Bid.
  • Select.
  • Complain.
  • Adjudicate.

Re-competes usually follow as well.

And so government IT is subject to extreme forms of two inevitable project killers:

  • Waterfall methodology.
  • Delay.

Procurement cycles take years, and in the worst cases decades. Project specifications are often fixed until the next procurement, which is often 7-10 years down the road. This, to put it mildly, is the opposite of agility, and widespread project failure ensues.

Categories: Other

The report of Obama’s Snowden-response commission

Mon, 2014-01-27 14:14

In response to the uproar created by the Edward Snowden revelations, the White House commissioned five dignitaries to produce a 300-page report, released last December 12. (Official name: Report and Recommendations of The President’s Review Group on Intelligence and Communications Technologies.) I read or skimmed a large minority of it, and I found enough substance to be worthy of a blog post.

Many of the report’s details fall in the buckets of bureaucratic administrivia,* internal information security, or general pabulum. But the commission started with four general principles that I think have great merit.

*One big item — restrict the NSA to foreign intelligence, and split off domestic cyber defense into a separate organization.

The United States Government must protect, at once, two different forms of security: national security and personal privacy.

… It might seem puzzling, or a coincidence of language, that the word “security” embodies such different values. But the etymology of the word solves the puzzle; there is no coincidence here. In Latin, the word “securus” offers the core meanings, which include “free from care, quiet, easy,” and also “tranquil; free from danger, safe.”

Key point: The report rejects any idea that national security concerns should run roughshod over individual liberty.

The central task is one of risk management; multiple risks are involved, and all of them must be considered. …

  • Risks to privacy;
  • Risks to freedom and civil liberties, on the Internet and elsewhere;
  • Risks to our relationships with other nations; and
  • Risks to trade and commerce, including international commerce.

… If people are fearful that their conversations are being monitored, expressions of doubt about or opposition to current policies and leaders may be chilled, and the democratic process itself may be compromised.

… These points make it abundantly clear that if officials can acquire information, it does not follow that they should do so.

I am always pleased when policy makers recognize that the key issue is chilling effects upon the exercise of ordinary freedoms; the report made that point multiple times, footnoting both Sonia Sotomayor and the 1970s Church Commission. (Search the document for chill to see where.)

The idea of “balancing” has an important element of truth, but it is also inadequate and misleading.

… The purposes of surveillance must be legitimate. If they are not, no amount of “balancing” can justify surveillance. For this reason, it is exceptionally important to create explicit prohibitions and safeguards, designed to reduce the risk that surveillance will ever be undertaken for illegitimate ends.

Exceptionally important indeed.

The government should base its decisions on a careful analysis of consequences, including both benefits and costs (to the extent feasible).

Government officials, even more than other large-organization employees, have the tendency to avoid job failure at all costs. This goes triple when they work on life-and-death issues. Even so, sometimes security can be pursued with too much vigor, and much of the United States’ post-9/11 history directly bears that out.

And here’s the part I like best of all (emphasis mine):

We recommend that, if the government legally intercepts a communication under section 702 … and if the communication either includes a United States person as a participant or reveals information about a United States person:

(1) any information about that United States person should be purged upon detection unless it either has foreign intelligence value or is necessary to prevent serious harm to others;

(2) any information about the United States person may not be used in evidence in any proceeding against that United States person;

I’ve felt for years that a deciding issue in the preservation of liberty will be what kinds of information are admissible in court, or otherwise may be used to hurt people. All safeguards on data collection and retention notwithstanding, huge datasets will be created and maintained. Continued liberty requires careful limitation of how they may be used against us.

Related links

Categories: Other

Net neutrality and sponsored data — a middle course

Mon, 2014-01-27 08:36

Thanks to a court decision that overturned some existing regulations, network neutrality is back in the news. Most people think the key issue is whether

  • Telecommunication companies (e.g. wireless and/or broadband services providers) should be allowed to charge …
  • … other internet companies (website owners, game companies, streaming media providers, etc., collectively known as edge providers) for …
  • … shipping data to internet service consumers in particularly attractive ways.

But I think some forms of charging can be OK — albeit not the ones currently being discussed — and so the question should instead be how the charges are designed.

When I wrote about network neutrality in 2006-7, the issue was mainly whether broadband providers would be allowed to ship different kinds of data at different speeds or reliability. Now the big controversy is whether mobile data providers should be allowed to accept “sponsorship” so as to have certain kinds of data not count against mobile data plan volume caps. Either way:

  • The “anything goes” strategy has obvious free-market appeal.
  • But proponents of network neutrality regulation — such as Fred Wilson and Nilay Patel — point out a major risk: By striking deals that smaller companies can’t imitate, large, established “edge provider” services may strangle upstart competitors in their cribs.

I think the anti-discrimination argument for network neutrality has much merit. But I also think there are some kinds of payment structure that could leave the playing field fairly level. Imagine, if you will, that:

  • Consumers are charged for data, speed of connection, reliability of delivery, or anything else, but …
  • … internet companies have the ability to absorb those charges on consumers’ behalf, but can only do so …
  • one interaction at a time, with no volume discounts, via an automated system that is open to everybody.

Such a system is surely technologically feasible — indeed, it is at least as feasible as the online advertising networks that already exist. Further, it would be possible for the system to have nice features such as:

  • Telcos could implement forms of peak load pricing, for those times when their network capacity actually is under stress.
  • “Edge provider” internet companies could pay subsidies only on behalf of certain consumers, where those consumers are selected in all the complex ways that advertisements are currently targeted.

In such a setup, which discrimination fears would or would not be realized?

  • Startups that hope to get adoption first and monetize second might face the cash cost of actually paying their users to try their services. Sorry. But at least they could target their spend on whoever they viewed as being the most important potential adopters.
  • Large vendors could not negotiate preferential pricing, reciprocal deals, or anything like that. At least, they couldn’t do so directly.
  • Discrimination by type of service – for example telcos trying to hamstring communications services that compete with their own offerings – could be staved off, via fairly lightweight regulatory oversight of the ways pricing plans are structured.
  • Regulators could head off sneaky “sweetheart deals” between big “edge provider” companies and telcos in much the same way.

I have no great objections to extreme net neutrality; behemoth oligopolist telcos should be among the last companies to cry “Un-free markets, boo-hoo-sob!!” But as internet pipes are increasingly used for telephony, streaming media or even medical consultations, drawing quality-of-service distinctions could have a certain merit. And so, for reasons similar to those I outlined in 2007, I still lean toward the partial network neutrality described above.

Related links

  • Wired articulated some of the dangers of a no-net-neutrality world.
  • Tech Republic mapped part of the legal and political net neutrality morass.
Categories: Other

The games of Watson

Thu, 2014-01-09 14:57

IBM excels at game technology, most famously in Deep Blue (chess) and Watson (Jeopardy!). But except at the chip level — PowerPC — IBM hasn’t accomplished much at game/real world crossover. And so I suspect the Watson hype is far overblown.

I believe that for two main reasons. First, whenever IBM talks about big initiatives like Watson, it winds up bundling a bunch of dissimilar things together and claiming they’re a seamless whole. Second, some core Watson claims are eerily similar to the artificial intelligence (AI) over-hype of three or more decades ago. For example, the leukemia treatment advisor that IBM is now hoping to build in Watson sounds a lot like MYCIN from the early 1970s, and the idea of collecting a lot of tidbits of information sounds a lot like the Cyc project. And by the way:

  • MYCIN led to E-MYCIN, which led to the company Teknowledge, which raised a lot of money* but now has almost faded from memory.
  • Cyc is connected to the computer science community’s standard unit of bogosity.

*Much of it, I’m ashamed to say, with my help, back in my stock analyst days.

AI is something of an umbrella category, often just meaning “Computerized stuff that we don’t know how to do yet”, or “… only recently figured out how to do.” Automated decision-making is an aspect of AI, for example, but so also is natural language recognition. It used to be believed that most AI should be approached in the same way:

  • Come up with a clever way to represent knowledge.
  • Match the actual situation against the knowledge.
  • Produce a smart result.
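
As a rough illustration of that three-step template, here is a toy forward-chaining matcher. The rules and fact names are invented for the example (loosely inspired by equipment maintenance scheduling) and are not taken from any actual expert system; `forward_chain` is a hypothetical helper, not a real library call.

```python
# Step 1: represent knowledge as (conditions, conclusion) rules.
# These rules are made up purely for illustration.
RULES = [
    (frozenset({"vibration_high", "bearing_old"}), "bearing_worn"),
    (frozenset({"bearing_worn"}), "schedule_maintenance"),
]


def forward_chain(facts, rules):
    """Step 2: repeatedly match the known situation against the rules.
    Step 3: return everything that could be concluded."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its conditions are already known.
            if conclusion not in known and conditions <= known:
                known.add(conclusion)
                changed = True
    return known
```

With the facts `{"vibration_high", "bearing_old"}` the matcher concludes `schedule_maintenance`; with only `{"vibration_high"}` it concludes nothing new, because no rule covers that situation.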

But that template unfortunately proved disappointing time after time. The problem was typically that not enough knowledge could in practice be represented, and thus well-informed automated decisions could not be made. In particular, there was a “first step fallacy,” in which a demo system would solve a “toy problem”, but robust real-life systems never emerged.

Of course, there are exceptions to this general rule of disappointment; for example, Teknowledge and its fellow over-hyped expert system technology vendors of the 1980s (Intellicorp, Inference, et al.) did get a few solid production references. But the ones I remember best (e.g. American Express credit, United Airlines seat pricing, some equipment maintenance scheduling) were often for use cases that we’d now address in more straightforwardly mathematical ways.

Watson is generally promoted as helping with decision-making, but that message has to be scrutinized carefully. So far as I’ve been able to guess, the true core technology of IBM Watson is extracting knowledge from text — or primarily from text — and representing it in some way that is reasonably useful in answering natural language queries. The hope would then be to eventually achieve a rich enough knowledge base to support the Star Trek computer. But automated decision-making doesn’t just require knowledge; it also requires decision-making rules. And if Watson is significantly ahead of the 1980s decisioning state of the art (Rete, backward chaining, etc.), I’m not aware of how.
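
For readers who haven't seen 1980s-style decisioning, here is a minimal sketch of backward chaining: start from the decision you'd like to reach and work backward to the facts that would justify it. The credit-style rules and the `prove` function are hypothetical, written only for this example; real rule engines of the era were far more elaborate, and nothing here reflects how Watson works.

```python
# Invented rules: each maps a set of required conditions to one conclusion.
RULES = [
    (frozenset({"income_verified", "low_debt"}), "creditworthy"),
    (frozenset({"creditworthy", "no_fraud_flag"}), "approve"),
]


def prove(goal, facts, rules, _seen=frozenset()):
    """Return True if `goal` is a known fact or can be derived by some rule
    whose conditions can all be proved in turn."""
    if goal in facts:
        return True
    if goal in _seen:
        # We are already trying to prove this goal higher up; avoid looping.
        return False
    seen = _seen | {goal}
    return any(
        all(prove(condition, facts, rules, seen) for condition in conditions)
        for conditions, conclusion in rules
        if conclusion == goal
    )
```

The knowledge base supplies the facts, but the rules still have to come from somewhere, which is the gap between knowing things and deciding things.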

So if Watson is going to accomplish anything soon, it will probably be in areas where serious decision-making chops aren’t needed. Indeed, the application areas that I’ve seen mentioned for the past or near term are mainly:

  • Playing Jeopardy! That’s pretty simple from a decision-making standpoint.
  • Advising on treatments for a specific disease (not actually built yet). As noted above, that’s 1970s-level decisioning.
  • Knowledge extraction from medical research articles. That has very little to do with decisioning, and incidentally sounds a lot like what SPSS (before it was acquired by IBM) and Temis were already doing years ago.
  • Natural-language customer interaction. That may not involve any decisioning at all.

Returning to the point that Watson’s core technology is probably natural language, it seems fair to say that IBM these days is better at the text mining side than at speech understanding. Evidence I’m thinking of includes:

  • That seems to be what IBM itself is saying on its speech recognition page.
  • I also recall IBM’s natural language recognition projects being regarded as not going well in the late 1990s. (Project Penelope, I believe, although I can’t confirm that via googling.)
  • IBM’s LanguageWare sounded more oriented to text mining in 2008.
  • IBM bought SPSS, which had decent text mining technology.

And while this is too old to really count as evidence, IBM had a famously unsuccessful language recognition deal with Artificial Intelligence Corporation way back in 1983-4.*

*Yeah, I helped raise money for AICorp too, and also for Symbolics. As you might imagine, my investment banking trophies do not have pride of place on my desk.

One last observation: text mining has a very mixed track record. Watson will have to go far beyond predecessor text technologies to become anywhere near as big a deal as IBM is suggesting it will be.

Categories: Other

Notes on memory-centric data management

Fri, 2014-01-03 03:35

I first wrote about in-memory data management a decade ago. But I long declined to use that term — because there’s almost always a persistence story outside of RAM — and coined “memory-centric” as an alternative. Then I relented 1 1/2 years ago, and defined in-memory DBMS as

DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory)

By way of contrast:

Hybrid memory-centric DBMS is our term for a DBMS that has two modes:

  • In-memory.
  • Querying and updating (or loading into) persistent storage.

These definitions, while a bit rough, seem to fit most cases. One awkward exception is Aerospike, which assumes semiconductor memory, but is happy to persist onto flash (just not spinning disk). Another is Kognitio, which is definitely lying when it claims its product was in-memory all along, but may or may not have redesigned its technology over the decades to become more purely in-memory. (But if they have, what happened to all the previous disk-based users??)

Two other sources of confusion are:

With all that said, here’s a little update on in-memory data management and related subjects.

  • I maintain my opinion that traditional databases will eventually wind up in RAM.
  • At conventional large enterprises (as opposed to, for example, pure internet companies), production deployments of HANA are probably comparable in number and investment to production deployments of Hadoop. (I’m sorry, but much of my supporting information for that is confidential.)
  • Cloudera is emphatically backing Spark. And a key aspect of Spark is that, unlike most of Hadoop, it’s memory-centric.
  • It has become common for disk-based DBMS to persist data through a “log-structured” architecture. That’s a whole lot like what you do for persistence in a fundamentally in-memory system.
  • I’m also sensing increasing comfort with the strategy of committing writes as soon as they’ve been acknowledged by two or more nodes in RAM.
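
The last two bullets can be sketched together. What follows is a toy model, not any vendor's design: `Node`, `write`, the in-memory `log` list standing in for an append-only file, and the default of two acknowledgements are all assumptions made for illustration. A write counts as committed once enough nodes hold it in RAM; each node appends to its log later, and recovery is just replaying that log.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.ram = {}       # working copy of the data, held in memory
        self.pending = []   # acknowledged but not yet persisted
        self.log = []       # stand-in for an append-only file on disk or flash

    def accept(self, key, value):
        """Acknowledge a write once it is in RAM; persistence happens later."""
        if not self.up:
            return False
        self.ram[key] = value
        self.pending.append((key, value))
        return True

    def flush(self):
        """Append pending writes to the log; old entries are never rewritten."""
        self.log.extend(self.pending)
        self.pending.clear()

    def crash_and_recover(self):
        """Lose everything in RAM, then rebuild state by replaying the log."""
        self.pending.clear()
        self.ram = {}
        for key, value in self.log:
            self.ram[key] = value


def write(nodes, key, value, acks_needed=2):
    """Offer the write to every node and report whether enough of them
    acknowledged it from RAM. Simplification: no rollback on failure."""
    acks = sum(1 for node in nodes if node.accept(key, value))
    return acks >= acks_needed
```

If one node crashes before flushing, it comes back without its unflushed writes, but any committed write is still sitting in RAM on at least one other node. That redundancy is the bet the two-node strategy makes.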

And finally,

  • I’ve never heard a story about an in-memory DBMS actually losing data. It’s surely happened, but evidently not often.
Categories: Other