DBMS2

Choices in data management and analysis

Some stuff I’m working on

7 hours 30 min ago

1. I have some posts up on Strategic Messaging. The most recent are overviews of messaging, pricing, and positioning.

2. Numerous vendors are blending SQL and JSON management in their short-request DBMS. It will take some more work for me to have a strong opinion about the merits/demerits of various alternatives.

The default implementation — one example would be Clustrix’s — is to stick the JSON into something like a BLOB/CLOB field (Binary/Character Large Object), index on individual values, and treat those indexes just like any others for the purpose of SQL statements. Drawbacks include:

  • You have to store or retrieve the JSON a whole document at a time.
  • If you are spectacularly careless, you could write JOINs with odd results.
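
As a rough illustration of that default approach, here is a minimal sketch using SQLite from Python's standard library. It is not Clustrix's or any other vendor's implementation. The table names, the `city` field, and the helpers `put_document` and `find_by_city` are all hypothetical. The document goes into a text column whole, one value is copied out into an ordinary indexed column, and SQL uses that index like any other.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The whole JSON document lives in one CLOB-like text column.
    CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    -- One extracted value per document, indexed like any other column.
    CREATE TABLE docs_city (doc_id INTEGER REFERENCES docs(id), city TEXT);
    CREATE INDEX idx_docs_city ON docs_city (city);
""")

def put_document(doc_id, document):
    """Store the whole document and refresh its extracted-value index row."""
    conn.execute("REPLACE INTO docs (id, body) VALUES (?, ?)",
                 (doc_id, json.dumps(document)))
    conn.execute("DELETE FROM docs_city WHERE doc_id = ?", (doc_id,))
    if "city" in document:
        conn.execute("INSERT INTO docs_city (doc_id, city) VALUES (?, ?)",
                     (doc_id, document["city"]))

def find_by_city(city):
    """Use the ordinary SQL index, then hand back whole documents."""
    rows = conn.execute(
        "SELECT d.body FROM docs d JOIN docs_city c ON c.doc_id = d.id "
        "WHERE c.city = ? ORDER BY d.id",
        (city,),
    )
    return [json.loads(body) for (body,) in rows]

put_document(1, {"name": "Ann", "city": "Boston"})
put_document(2, {"name": "Bob", "city": "Austin"})
put_document(3, {"name": "Cy", "city": "Boston"})
```

Note that changing one field means rewriting the whole document, which is the first drawback listed above.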

IBM DB2 is one recent arrival to the JSON party. Unfortunately, I forgot to ask whether IBM’s JSON implementation was based on IBM DB2 pureXML when I had the chance, and IBM hasn’t gotten around to answering my followup query.

3. Nor has IBM gotten around to answering my followup queries on the subject of BLU, an interesting-sounding columnar option for DB2.

4. Numerous clients have asked me whether they should be active in DBaaS (DataBase as a Service). After all, Amazon, Google, Microsoft, Rackspace and salesforce.com are all in that business in some form, and other big companies have dipped toes in as well.

I’m skeptical that one can succeed both in that market and in selling database software, for reasons including:

  • Nobody I can think of has done so.
  • The value propositions are different.
    • DBaaS is about having administration be so easy that you, the customer, don’t need to worry about it.
    • Database software is about one or more of:
      • Development ease.
      • Price/performance/throughput.
      • Big-enterprise/legacy-vendor considerations.

I’m also skeptical about service-only DBaaS strategies, because users will naturally resist vendor lock-in.

But despite all my skepticism, DBaaS is an area I should probably learn more about.

5. I plan to spend more time looking at machine learning and other advanced analytics. I doubt they’ll soon match the past few years’ hype about “big data analytics”, but even the reality of modern analytics looks like it’s getting more interesting. Ditto if somebody has an interesting twist on more traditional predictive analytics.

6. Three years ago, I wrote:

  • It is inevitable* that governments and other constituencies will obtain huge amounts of information, which can be used to drastically restrict everybody’s privacy and freedom.
  • To protect against this grave threat, multiple layers of defense are needed, technical and legal/regulatory/social/political alike.
  • One particular layer is getting insufficient attention, namely restrictions upon the use (as opposed to the acquisition or retention) of data.

*And indeed in many ways even desirable

It is now frighteningly obvious that the US is becoming a high-surveillance society. The Boston Marathon bombing added three new elements to an already snowballing trend:

I need to write more about privacy.

Categories: Other

It’s time to change around Monash Research’s mailing lists

Fri, 2013-05-03 03:42

Email delivery of posts has been screwed up; multiple people tell me they haven’t gotten their email for months. (In the future, please tell me of such difficulties!) So it’s time for a change, and I’m asking for your advice as to what you’d suggest for our mailing list.

Yes, I’m asking via a blog post, even though the core problem is that people who want to see my posts via e-mail aren’t getting them. Please work with me on this anyway. :)

My two basic questions are:

  • What should be the frequency of delivery? To date, it’s been nightly (at least in theory).
  • What delivery technology should be used? To date, it’s been FeedBlitz.

1. The nightly scheduling has been an artifact of an RSS-to-email link that no longer seems stable. So I’m thinking of just manually pasting each post into a list email, in which case:

  • Posts could be sent without delay.
  • Every post would be delivered by separate mail. (As opposed to having only one post per night be mailed, while others just get linked to.)

It’s a bit more work for me, but probably nothing dire. Does lower latency sound good to everybody? :)

2. The main technical options seem to be:

  • Free services oriented to discussion lists, such as Yahoo Groups, but set to announce-only. These have very basic functionality.
  • Commercial services oriented to marketing email lists, such as Aweber or MailChimp. Does anybody have favorable or unfavorable experience with particular services? Most vendors surely use one or another, but it’s tough to guess which they’ve selected just based on their spam and pabulum informative communications, given the customizability those services provide.

Any thoughts would be most welcomed.

3. And while I’m at it, what should I do for social/sharing buttons? Presumably, if I included buttons that made it easy for you to tweet links to my posts, submit them to Hacker News, etc., more of you would do so. Which specific options would you like to use?

  • Twitter?
  • LinkedIn?
  • Google +?
  • Facebook?
  • Slashdot?
  • Hacker News?
  • dzone?
  • Digg?

Anything else? I’d like to omit the more dubious possibilities, as offering everything could be a lot of clutter …

Categories: Other

More on Actian/ParAccel/VectorWise/Versant/etc.

Mon, 2013-04-29 05:50

My quick reaction to the Actian/ParAccel deal was negative. A few challenges to my views then emerged. They didn’t really change my mind.

Amazon Redshift

Amazon did a deal with ParAccel that amounted to:

  • Amazon got a very cheap license to a limited subset of ParAccel’s product …
  • … so that it could launch a service called Amazon Redshift.
  • Amazon also invested in ParAccel.

Some argue that this is great for ParAccel’s future prospects. I’m not convinced.

No doubt there are and will be Redshift users, evidently including Infor. But so far as I can tell, Redshift uses very standard SQL, so it doesn’t seed a ParAccel market in terms of developer habits. The administration/operation story is similar. So outside of general validation/bragging rights, Redshift is not a big deal for ParAccel.

OEMs and bragging rights

It’s not just Amazon and Infor; there’s also a MicroStrategy deal to OEM ParAccel — I think it’s the real ParAccel software in that case — for a particular service, MicroStrategy Wisdom. But unless I’m terribly mistaken, HP Vertica, Sybase IQ and even Infobright each have a lot more OEMs than ParAccel, just as they have a lot more customers than ParAccel overall.

This OEM success is a great validation for the idea of columnar analytic RDBMS in general, but I don’t see where it’s an advantage for ParAccel vs. the columnar leaders.

Concurrency

As I admitted in the comment thread to my first Actian/ParAccel post, I’m confused about what kind of concurrent usage ParAccel can really support. The data I have, e.g. in the link immediately above, is not conclusive. Googling suggests that VectorWise was at one user per core a couple of years ago, supportive of my hypothesis that it doesn’t have some big concurrency edge on ParAccel. But to repeat — I don’t really know.

DBMS acquisitions in the past

My history blog on DBMS acquisitions yielded more favorable examples than I was expecting. (Of course, I omitted a lot of small and boring failures.) And DBMS conglomerates are the rule more than the exception, with IBM, Sybase, Teradata and Oracle all adopting acquisition-aided multi-DBMS strategies, at least to some extent.

That said, Sybase is the main example of a vendor of a slow-growth DBMS (Adaptive Server Enterprise) doing well with a faster-growing one (Sybase IQ). Perhaps not coincidentally, Actian’s latest management team draws significantly on Sybase. So yes; ParAccel is now owned by a company run by guys who know something about selling columnar DBMS.

But the whole thing would be more convincing if Ingres had shown more life under Actian’s ownership, or indeed at any point in the past 20 years. My bottom line is that Actian was floundering badly in the DBMS market 1 1/2 years ago, and not a lot of favorable news has emerged in the interim — except, quite arguably, for the management changes and acquisitions themselves.

Categories: Other

Goodbye VectorWise, farewell ParAccel?

Thu, 2013-04-25 17:59

Actian, which already owns VectorWise, is also buying ParAccel. The argument for why this kills VectorWise is simple. ParAccel does most things VectorWise does, more or less as well. It also does a lot more:

  • ParAccel scales out.
  • ParAccel has added analytic platform capabilities.
  • I don’t know for sure, but I’d guess ParAccel has more mature management/plumbing capabilities as well.

One might conjecture that ParAccel is bad at highly concurrent, single-node use cases, and VectorWise is better at them — but at the link above, ParAccel bragged of supporting 5,000 concurrent connections. Besides, if one is just looking for a high-use reporting server, why not get Sybase IQ?? Anyhow, Actian hasn’t been investing enough in VectorWise to make it a major market player, and they’re unlikely to start now that they own ParAccel as well.

But I expect ParAccel to fail too. Reasons include:

  • ParAccel’s small market share and traction.
  • The disruption of any acquisition like this one.
  • My general view of Actian as a company.

2 years after being acquired, Vertica — which conceptually has always been ParAccel’s closest competitor — has finally taken major hits on engineering staffing. Even so, I expect HP Vertica to reopen what was once a large technology and momentum gap vs. ParAccel.

My views on Actian start:

  • Actian is attempting to build a database software conglomerate on the cheap, starting with Ingres, ParAccel, VectorWise, Pervasive (itself a small conglomerate) and Versant.
  • Actian hasn’t accomplished much with Ingres, its original acquisition.
  • Actian hasn’t accomplished much with VectorWise.
  • Actian’s brief, embarrassing pivot away from database software was a joke. (The comments at that link also show VectorWise’s positioning as very different in September, 2011 than it is now.)
  • I’ve had some very bad experiences with Actian management, although it seems to have largely turned over since then.
  • I can’t identify the folks to make this work at the acquired pieces either (even though I think well of a few of them, e.g. Mike Hoskins and Rick Glick).

I.e., building a database conglomerate is hard, and Actian isn’t up to the challenge.

Actian has three main paths it can follow for synergy:

  • Acquire a lot of pieces and flip the whole thing for more money to a foolish buyer. This strategy worked splendidly for Autonomy, and to some extent for Sybase as well. But it’s a longshot, and not necessarily a win for customers even if investors do well.
  • Sell a bunch of disparate products through the same sales force. Tough to execute. And at best it raises sales coverage up to the level of that for the most successful product — and Actian doesn’t really have successful new products.
  • Integrate the technologies. Blech. You don’t integrate DBMS with wildly different architectures, as Informix died trying in the 1990s.

I don’t see enough opportunity there for the whole thing to work out, with sales synergy being the best opportunity to prove me wrong.

Categories: Other

Analytic application themes

Thu, 2013-04-25 02:41

I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.

1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. AeroSpike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.

Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.

Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:

  • There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
  • Doing analytics at short-request speeds is an obvious data-management-related challenge, and not yet comprehensively addressed.

2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:

  • Customer interaction
  • Network and sensor monitoring
  • Game and mobile application back-ends

Also arising fairly frequently are:

  • Algorithmic trading
  • Anti-fraud
  • Risk measurement
  • Law enforcement/national security
  • Healthcare
  • Stakeholder-facing analytics

I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have anyway been ebbing and flowing for decades.

3. Much of customer interaction revolves around recommendation and personalization. In connection with that I’ll remind you:

  • Multiple sources say that 5 millisecond response is a real need. Srini Srinivasan explained why in a January comment.
  • The results of the recommendation and personalization can be delivered in many different ways — product recommendations, ads, special offers, email, snail mail, call center scripts and more. This is the paradigmatic example for my skepticism about complete analytic applications.

4. Networks and sensors emit the epitome of machine-generated data. Data sources include web logs, network logs (in the IT sense), telecommunication networks, other utilities (e.g. electric), vehicle fleets, and more. Application themes include:

  • Human monitoring, via some kind of real-time business intelligence view. I hear about that a lot.
  • Various kinds of automated response. (Security is an obvious example.)
  • Integration with other kinds of application, data source, or use case.

As one example of the last point, Oliver Ratzesberger told me years ago that eBay had up-to-the-minute BI cubes integrating customer response and log data, for the purpose of quickly detecting technology problems. Acunu recently told me that similar applications are one of their sales focuses.

5. In another example, games and mobile applications can be a lot like websites in terms of the analytics that support them (all the more so if we’re talking about games with in-app purchases). Two special features come up repeatedly, however — leaderboards for games, and geospatial data sent by mobile devices.

6. Algorithmic trading is flashy because of the sums of money involved, and because of what is often hyper-low latency; I’ve even heard 50 microseconds, and that’s a slightly out-of-date figure for a sequence of several atomic operations. But otherwise it’s not one of the more interesting areas to me, for at least two reasons:

  • It depends on a lot of latency-specific stuff, such as hand-crafted hardware.
  • The participants are secretive (understandably so, as they’re literally in a race with each other) and don’t reveal much.

Another reason I don’t study it much is that high-frequency trading could be devastated at any time by some simple regulatory changes.

7. I finally figured out one of the big drivers for better risk analysis. Banks need to keep capital lying around to cover a fraction of the risk they take on. If they can estimate the risk more precisely, and come up with a lower number, then they need to keep less capital. That’s a lot like finding large bags of money.

8. Anti-fraud applications arise in many industries, with many different kinds of data and latency requirement. For example:

  • Insurers don’t want to pay bogus claims. They usually have weeks to think about that problem.
  • Telcos don’t want to provision services for customers who will defraud them. They have to decide at call-center speed.
  • Similarly, retailers don’t want to accept bogus returns.
  • Stockbrokers don’t want rogue traders to defeat their controls. A lot of data and analysis go into that mission, as billions of dollars — literally — can be at stake.

9. And finally, the recent Boston Marathon bombing has brought law-enforcement/anti-terrorism applications to the fore. The Boston Globe criticized difficulties in information sharing, but the money quote is:

The FBI followed up by checking government databases and looking for things such as “derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history,” according to FBI Supervisory Agent Jason J. Pack. “The FBI also interviewed Tamerlan Tsarnaev and family members. The FBI did not find any terrorism activity.”

Neither the telephone intercept nor the web-surfing tracking is a capability the government routinely admits, unless there was something like a wiretap order that I so far haven’t seen reported.

Categories: Other

MemSQL scales out

Tue, 2013-04-23 02:56

The third of my three MySQL-oriented clients I alluded to yesterday is MemSQL. When I wrote about MemSQL last June, the product was an in-memory single-server MySQL workalike. Now scale-out has been added, with general availability today.

MemSQL’s flagship reference is Zynga, across 100s of servers. Beyond that, the company claims (to quote a late draft of the press release):

Enterprises are already using distributed MemSQL in production for operational analytics, network security, real-time recommendations, and risk management.

All four of those use cases fit MemSQL’s positioning in “real-time analytics”. Besides Zynga, MemSQL cites penetration into traditional low-latency markets — financial services (various subsectors) and ad-tech.

Highlights of MemSQL’s new distributed architecture start:

  • There are two kinds of MemSQL node — “aggregator” and “leaf”.
    • Aggregators are a kind of head node. You can have a bunch of them.
    • Leafs run full single-server MemSQL. You can have a bunch of them too.
  • MemSQL has two query optimizers. One kind runs on the aggregator nodes, and thinks about the whole cluster. The other runs on the leafs, and only thinks about its own node.
  • Much of the join and aggregation work is done on the aggregator nodes, but I didn’t pursue that issue in much detail.
  • It is good policy — and supported — to replicate small dimension/reference tables across the cluster. These are replicated to aggregator and leaf nodes alike. (This tells us that some joins are indeed done on the leafs. ;) )
  • MemSQL replication can be synchronous or asynchronous. It can be used for high availability.
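
The aggregator/leaf split can be pictured as a scatter-gather query. The sketch below is generic and is not MemSQL code. The shard contents and the function names `leaf_partial_sum` and `aggregator_sum` are made up for illustration. Each leaf aggregates only its own rows, and the aggregator merges the partial results.

```python
from collections import Counter

# Hypothetical data: each inner list is the (customer, amount) rows on one leaf.
LEAF_SHARDS = [
    [("ann", 10), ("bob", 5)],
    [("ann", 7), ("cy", 3)],
    [("bob", 1)],
]

def leaf_partial_sum(rows):
    """Runs on a leaf: aggregate only the rows stored locally."""
    partial = Counter()
    for customer, amount in rows:
        partial[customer] += amount
    return partial

def aggregator_sum(shards):
    """Runs on an aggregator: fan the query out, then merge partial sums."""
    total = Counter()
    for shard in shards:
        total.update(leaf_partial_sum(shard))  # Counter.update adds counts
    return dict(total)

result = aggregator_sum(LEAF_SHARDS)
```

SUM merges cleanly because partial sums can simply be added together; a real system would run the leaf step in parallel over the network instead of in a loop.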

Also:

  • MemSQL writes (whether primary or replicated) go to a buffer. The buffer size can be 0 or positive, in a tradeoff of durability vs. the likelihood of a disk I/O bottleneck.
  • MemSQL has many virtual nodes on each physical (leaf) node. (This is pretty much an industry-standard best practice, as it helps with elasticity, recovery from node failure, and so on.)
  • Compression is still a future feature.
  • So is online schema change.
  • Leaf nodes have cost-based optimizers.
  • MemSQL’s aggregator (cluster-wide) optimizer is mainly heuristic, but is supposed to get more cost-based in future releases.
  • In some releases it will be possible to keep MemSQL running while upgrading the software. But that’s not a promise for releases that change how replication works.

And which not-easily-parallelized aggregate did MemSQL implement first? The same one Platfora did — COUNT DISTINCT.
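
To see why COUNT DISTINCT resists easy parallelization, consider this small sketch with made-up data spread over three hypothetical leaf nodes. It shows the general problem only, not how MemSQL or Platfora actually implemented the aggregate. Per-node distinct counts cannot just be added, so each node has to ship its set of distinct values (or some summary of it) to be merged.

```python
# Hypothetical column values as stored on three leaf nodes.
LEAF_VALUES = [
    ["a", "b", "c"],
    ["b", "c", "d"],
    ["a", "d"],
]

def naive_count_distinct(shards):
    """Wrong: adding per-leaf distinct counts double-counts shared values."""
    return sum(len(set(values)) for values in shards)

def merged_count_distinct(shards):
    """Correct: each leaf sends its distinct set; the aggregator unions them."""
    seen = set()
    for values in shards:
        seen |= set(values)
    return len(seen)

naive = naive_count_distinct(LEAF_VALUES)
merged = merged_count_distinct(LEAF_VALUES)
```

Here the naive total is 8 while the true answer is 4. The cost of the correct version is that the data sent to the merging node grows with the number of distinct values rather than staying a single number per leaf.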

Categories: Other

Notes on TokuDB and GenieDB

Mon, 2013-04-22 04:07

Last week, I edited press releases back-to-back-to-back for three clients, all with announcements at this week’s Percona Live. The ones with embargoes ending today are Tokutek and GenieDB.

Tokutek’s news is that they’re open sourcing much of TokuDB, but holding back hot backup for their paid version. I approve of this strategy — “doesn’t lose data” is an important feature, and well worth paying for.

I kid, I kid. Any system has at least a bad way to do backups — e.g. one that involves slowing performance, or perhaps even requires taking applications offline altogether. So the real points of good backup technology are:

  • To keep performance steady.
  • To make the whole thing as easy to manage as possible.

GenieDB is announcing a Version 2, which is basically a performance release. So in lieu of pretending to have much article-worthy news, GenieDB is taking the opportunity to remind folks of its core marketing messages, with catchphrases such as “multi-regional self-healing MySQL”. Good choice; indeed, I wish more vendors would adopt that marketing tactic.

Along the way, I did learn a bit more about GenieDB. In particular:

  • GenieDB is now just backed by a hacked version of InnoDB (no more Berkeley DB Java Edition).
  • Why hacked? Because GenieDB appends a Lamport timestamp to every row, which somehow leads to a need to modify how indexes and caching work.
  • Benefits of the change include performance and simpler (for the vendor) development.
  • An arguable disadvantage of the switch is that GenieDB no longer can use Berkeley DB’s key-value interface — but MySQL now has one of those too.
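
For readers unfamiliar with Lamport timestamps, here is a minimal sketch of the general idea. How GenieDB actually uses them was not covered, so the last-writer-wins rule in `resolve`, along with the names `LamportClock` and `write_row`, are assumptions for illustration only.

```python
class LamportClock:
    """A logical clock: a counter that only ever moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event, such as a row write."""
        self.time += 1
        return self.time

    def observe(self, remote_time):
        """Merge a timestamp received from another node."""
        self.time = max(self.time, remote_time) + 1
        return self.time

def write_row(clock, node_id, value):
    """Return a row with a (timestamp, node_id) version appended."""
    return {"value": value, "version": (clock.tick(), node_id)}

def resolve(row_a, row_b):
    """Assumed policy: keep the larger version; node_id breaks ties."""
    return row_a if row_a["version"] > row_b["version"] else row_b

east, west = LamportClock(), LamportClock()
row_east = write_row(east, "east", "v1")   # version (1, "east")
west.observe(row_east["version"][0])       # west's clock moves to 2
row_west = write_row(west, "west", "v2")   # version (3, "west")
winner = resolve(row_east, row_west)
```

Because every row carries this extra version field, it is at least plausible that index and cache code that assumes a stock row format would need changes.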

I also picked up some GenieDB company stats I didn’t know before — 9 employees and 2 paying customers.

Categories: Other

Notes on Teradata systems

Mon, 2013-04-15 00:53

Teradata is announcing its new high-end systems, the Teradata 6700 series. Notes on that include:

  • Teradata tends to get 35-55% (roughly speaking) annual performance improvements, as measured by its internal blended measure Tperf. A big part of this is exploiting new-generation Intel processors.
  • This year the figure is around 40%.
  • The 6700 is based on Intel’s Sandy Bridge.
  • Teradata previously told me that Ivy Bridge — the next one after Sandy Bridge — could offer a performance “discontinuity”. So, while this is just a guess, I expect that next year’s Teradata performance improvement will beat this year’s.
  • Teradata has now largely switched over to InfiniBand.

Teradata is also talking about data integration and best-of-breed systems, with buzzwords such as:

  • Teradata Unified Data Architecture.
  • Fabric-based computing, even though this isn’t really about storage.
  • Teradata SQL-H.

The upshot is that Teradata has at least 6 kinds of rack or cabinet it wants to sell you — along with software to connect them — of which it really thinks you should get at least 3:

  • The 4 main Teradata-software appliances:
    • Active Enterprise Data Warehouse (the new 6700). Teradata thinks every sufficiently large enterprise should have one of these.
    • Extreme Performance Appliance (Teradata 4xxx), based on solid-state drives (which are also used in the 6xxx systems). At least I think so; the 4xxx wasn’t in the most recent slide deck I saw.
    • Data Warehouse Appliance (Teradata 2700).
    • Extreme Data Appliance (Teradata 1650).
  • The Teradata Aster Big Analytics Appliance, running Aster and Hadoop software. Teradata basically thinks everybody should have one of these too.
  • A separate cabinet for special-purpose “Teradata Managed Servers”. While there’s some space for Managed Servers in other Teradata appliances, Teradata now offers so many such capabilities that it thinks you will likely need a separate rack for those as well. These include (partial list):
    • Viewpoint system management.
    • Backup.
    • Teradata Unity.
    • Data movement, which is not the same thing as Teradata Unity.
    • Data loading, which is yet something else.
    • Generic compute (notably, to run SAS).

Even that doesn’t exhaust the possibilities:

  • If the 36 InfiniBand ports Teradata can fit into a cabinet aren’t enough, it suggests (and presumably will sell you) free-standing Mellanox switches as an alternative.
  • That slide deck split the Big Analytics Appliance back out into Aster and Hadoop options.
  • There also seems to be a SAS-specific modeling appliance.

And you can have — or in some cases must have — Teradata Managed Server nodes in other kinds of Teradata appliance.

Finally, Teradata also offers a stand-alone single- or several-node Teradata 670 Data Mart Appliance, notes on which include:

  • The Teradata 670’s entry price is under $1/2 million, if you want to use it as your first Teradata system (something that evidently is happening, mainly outside the Americas).
  • Another use for the Teradata 670 is for physical — as opposed to virtual — data mart spin-out.
  • The primary use for the Teradata Data Mart Appliance, however, seems to be test/development for larger Teradata systems.
  • The Teradata Data Mart Appliance is one of the options for placing in a separate managed-server Teradata rack.

Categories: Other

Teradata SQL-H

Mon, 2013-04-15 00:46

As vendors so often do, Teradata has caused itself some naming confusion. SQL-H was introduced as a facility of Teradata Aster, to complement SQL-MR.* But while SQL-MR is in essence a set of SQL extensions, SQL-H is not. Rather, SQL-H is a transparency interface that makes Hadoop data responsive to the same code that would work on Teradata Aster …

*Speaking of confusion — Teradata Aster seems to use the spellings SQL/MR and SQL-MR interchangeably.

… except that now there’s also a SQL-H for regular Teradata systems as well. While it has the same general features and benefits as SQL-H for Teradata Aster, the details are different, since the underlying systems are.

I hope that’s clear. :)

Categories: Other

Introduction to Deep Information Sciences and DeepDB

Sat, 2013-04-13 22:33

I talked Friday with Deep Information Sciences, makers of DeepDB. Much like TokuDB — albeit with different technical strategies — DeepDB is a single-server DBMS in the form of a MySQL engine, whose technology is concentrated around writing indexes quickly. That said:

  • DeepDB’s indexes can help you with analytic queries; hence, DeepDB is marketed as supporting OLTP (OnLine Transaction Processing) and analytics in the same system.
  • DeepDB is marketed as “designed for big data and the cloud”, with reference to “Volume, Velocity, and Variety”. What I could discern in support of that is mainly:
    • DeepDB has been tested at up to 3 terabytes at customer sites and up to 1 billion rows internally.
    • Like most other NewSQL and NoSQL DBMS, DeepDB is append-only, and hence could be said to “stream” data to disk.
    • DeepDB’s indexes could at some point in the future be made to work well with non-tabular data.*
    • The Deep guys have plans and designs for scale-out — transparent sharding and so on.

*For reasons that do not seem closely related to product reality, DeepDB is marketed as if it supports “unstructured” data today.

Other NewSQL DBMS seem “designed for big data and the cloud” to at least the same extent DeepDB is. However, if we’re interpreting “big data” to include multi-structured data support — well, only half or so of the NewSQL products and companies I know of share Deep’s interest in branching out. In particular:

Edit: MySQL has some sort of an optional NoSQL interface, and hence so presumably do MySQL-compatible TokuDB, GenieDB, Clustrix, and MemSQL.

Also, some of those products do not today have the transparent scale-out that Deep plans to offer in the future.

Among the 10 people listed as part of Deep Information Sciences’ team, I noticed 2 who arguably had DBMS industry experience, in that they worked at virtualization vendor Virtual Iron, and stayed on for a while after Virtual Iron was bought by Oracle. One of them, Chief Scientist & Architect Tom Hazel, also was at Akiban for a few months, where he did actually work on a DBMS. Other Deep Information Sciences notes include:

  • Deep has 25 or so people in all.
  • Deep had a recent $10 million funding round.
  • Deep Information Sciences is the former Cloudtree, which as of February, 2011 was pursuing quite a different strategy. (Evidently there was a pivot.) Deep was founded in 2010.
  • There are 2 paying customers for DeepDB, even though it’s still in beta, and 8 trials. A similar number of trials and strategic partners are queued up.
  • DeepDB general availability is expected later this quarter.

Although our call was blessedly technical, we didn’t have a chance to go through the DeepDB architecture in great detail. That said, DeepDB seems to store data in all of 3 ways:

  • An in-memory row store.
  • An on-disk row store with a very different architecture.
  • Indexes, which can also serve as a column store.

Notes on that include:

  • DeepDB’s in-memory row store is designed to manage single rows as much as possible, rather than pages. Indeed, there are “aspects of tries”, although we didn’t drill down into what exactly that meant.
  • Indexes are streamed to disk no less than once every 15 seconds, by default, and perhaps with latency as low as 10 milliseconds.
  • Perhaps the most important point I didn’t grasp is “segments”. The data and indexes on disk are stored in segments, which can be of different sizes, and which may each carry some summary data/metadata/whatever. Somehow, this is central to DeepDB’s design.
  • In what is evidently a design focus, DeepDB tries to get the benefit of “in-memory data” that isn’t actually taking up RAM. B-trees can point at rows that aren’t actually in memory. Segments evicted from cache can leave some metadata or summary data behind.
  • DeepDB’s compression story seems to be a work in progress.
    • There’s prefix compression already, at least in the indexes, which Deep just calls “compaction”.
    • Other compression is working in the lab, but not scheduled for Version 1.0.
      • Block compression seems to be in play.
      • Delta compression was mentioned once.
      • Dictionary compression wasn’t mentioned at all.
    • DeepDB apparently will keep compressed data in cache, then decompress it to operate on it.
    • Different segments can be compressed/uncompressed differently.
  • DeepDB’s on-disk row store is append-only. Time-travel is being worked on. While I forgot to ask, it seems likely that DeepDB has MVCC (Multi-Version Concurrency Control). :)
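
As background on the prefix compression mentioned above, here is a generic sketch of the technique applied to sorted index keys. It is not DeepDB's "compaction" code, and the key format and function names are made up. Each key is stored as the length of the prefix it shares with the previous key, plus the remaining suffix.

```python
import os

def compress_keys(sorted_keys):
    """Encode each key as (shared_prefix_length, remaining_suffix)."""
    encoded, previous = [], ""
    for key in sorted_keys:
        # commonprefix compares character by character, which is what we want.
        shared = len(os.path.commonprefix([previous, key]))
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded

def decompress_keys(encoded):
    """Rebuild the full keys by reusing the prefix of the previous key."""
    keys, previous = [], ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys

keys = ["customer:1001", "customer:1002", "customer:1017", "order:1"]
packed = compress_keys(keys)
restored = decompress_keys(packed)
```

The technique suits indexes because their keys are already sorted, so neighbors tend to share long prefixes.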

And finally: DeepDB in its current form is a “drop-in” InnoDB replacement, but not necessarily bug-compatible.

Categories: Other

Some notes on new-era data management, March 31, 2013

Mon, 2013-04-01 02:44

Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.

Performance confusion

Discussions of DBMS performance are always odd, for starters because:

  • Workloads and use cases vary greatly.
  • In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.

But in NoSQL/NewSQL short-request processing, performance claims seem particularly confused. Reasons include but are not limited to:

  • It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
  • Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
  • In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
  • Many workloads are inherently single node (replication aside). Others are not.
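
To illustrate the "synchronously to RAM on 2 or more servers, then asynchronously to disk" model, here is a toy sketch. Everything in it is an assumption for illustration: the `Replica` class, the dicts standing in for RAM and disk, and the `min_copies` parameter do not correspond to any particular product's API. The point is only that the write is acknowledged once enough in-memory copies exist, while the disk write happens later on a background thread.

```python
import queue
import threading


class Replica:
    """Hypothetical server: one dict stands in for RAM, another for disk."""

    def __init__(self, name):
        self.name = name
        self.ram = {}
        self.disk = {}
        self._to_flush = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write_to_ram(self, key, value):
        # Synchronous part: completes before the client is acknowledged.
        self.ram[key] = value
        self._to_flush.put((key, value))
        return True

    def _flush_loop(self):
        # Asynchronous part: a background thread drains writes to "disk".
        while True:
            key, value = self._to_flush.get()
            self.disk[key] = value
            self._to_flush.task_done()

    def wait_until_flushed(self):
        # Block until every queued write has reached "disk".
        self._to_flush.join()


def replicated_write(replicas, key, value, min_copies=2):
    """Acknowledge once the value is in RAM on at least min_copies replicas."""
    copies = sum(1 for r in replicas if r.write_to_ram(key, value))
    if copies < min_copies:
        raise RuntimeError("too few in-memory copies to acknowledge the write")
    return copies


replicas = [Replica("a"), Replica("b")]
acked = replicated_write(replicas, "user:1", {"name": "Ada"})
```

In this model a single-server crash loses nothing that was acknowledged, but the disks still have to keep up with the flush queue over time, which is where the disk I/O point above comes in.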

MongoDB and 10gen

I caught up with Ron Avnur at 10gen. Technical highlights included:

  • MongoDB’s tunable consistency seems really interesting, with numerous choices available at the program-statement level.
  • All rumored performance problems notwithstanding, Ron asserts that MongoDB often “kicks butt” in actual proof-of-concept (POC) bake-offs.
  • Ron cites “12 different language bindings” as a key example of developer functionality giving 10gen an advantage vs. Ron’s previous employer MarkLogic.
  • 10gen is working hard on management tools, security, and so on.
  • Ron claims that the “MongoDB loses data” knock is a relic of the distant — i.e. 1-2 years ago — past.
  • We had the same “Who needs joins?” discussion that I used to have with MarkLogic — Ron’s former company — and which MarkLogic has since disavowed. ;)
  • There’s nothing special about MongoDB’s b-tree indexes. (I mention that because Tokutek thinks it offers a faster MongoDB indexing option.)

While this wasn’t a numbers-oriented conversation, business highlights included:

  • A lot of MongoDB’s competition is RDBMS — Oracle, SQL Server, MySQL, etc.
  • MongoDB’s top NoSQL competitor is Cassandra. 10gen sees less Couchbase than before, and also less HBase than Cassandra.
  • There’s yet another favorable MongoDB soft metric — 50,000 registrants for free online education, 2/3 outside the US.

I can add that anecdotal evidence from other industry participants suggests there’s a lot of MongoDB mindshare.

Specific traditional-enterprise use cases we discussed focused on combining data from heterogeneous systems. Specifically mentioned were:

  • Reference data/360-degree customer view.
  • Reference data about securities.
  • Aggregation of analytic results from various analytic systems across an enterprise. (For risk management).

DBAs’ roles in development

A lot of marketing boils down to “We don’t need no stinking DBAs!!!” I’m thinking in particular of:

  • NoSQL.
  • Hadoop and/or exploratory BI* messaging that positions against the alleged badness of “traditional data warehousing”.

*See in particular the comments to that post.

The worst-case data warehousing scenario is indeed pretty bad. It could feature:

  • Much internal discussion and politicking to determine the One True Way to view various data fields, with …
  • … lots of ongoing bureaucratic safeguards in the area of data governance.
  • Long additional efforts in the area of performance tuning.
  • Data integration projects up the wazoo.

But if the goal is just to grab some data from an existing data warehouse, perhaps add in some additional data from the outside, and start analyzing it — well, then there are many attempted solutions to that problem, including from within the analytic RDBMS world. The question is whether the data warehouse administrators try to help — which usually means “Here’s your data; now go away and stop bothering me!” — or whether they focus on “business prevention”.

Meanwhile, on the NoSQL side:

  • The smart folks at WibiData felt the need for schema-definition tools over HBase.
  • Per Ron Avnur, MongoDB users are clamoring for consistency-rule specification via an administrative (rather than programmatic) UI.

It’s the old loose-/tight-coupling trade-off. Traditional relational practices offer a clean interface between database and code, but bundle the database characteristics for different applications tightly together. NoSQL tends to tie the database for any one app tightly to that app, at the cost of difficulties if multiple applications later try to use the same data. Either can make sense, depending on (for example):

  • How it seems natural to organize your development and data administration talent.
  • Whether the app is likely to survive long enough that you’ll want to run many other applications against the same database.
Categories: Other

Platfora at the time of first GA

Tue, 2013-03-26 04:50

Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.

In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.

Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed:

  • Platfora incrementally batch-loads data from Hadoop into its own bare-bones SQL data store, and does BI against that. That data store:
    • Of course wants to run in-memory whenever possible …
    • … but also has a significant disk-based aspect.
    • Is true-columnar on disk and in memory alike.
    • Stores all columns from a given row on the same nodes.
  • Specifically, Platfora builds star-schema data marts, called “lenses”. To avoid data bloat on the Platfora servers:
    • Two lenses with the same data often only store it once.
    • The data for a given lens can be “evicted” if it won’t be needed for a while. (But the specifications for the lens are of course kept in case you want to rebuild it later.)

Notes on Platfora’s Hadoop ETL (Extract/Transform/Load) include:

  • The basic idea is that you periodically re-run a job to pick up incremental changes since the last load.
  • Right now that’s just a cron job or something. Platfora plans to add scheduling features imminently.*
  • Platfora is sensitive to Hive partitioning.
  • Platfora can run filters and so on to extract non-Hive data (the more common case).

*But in a sad comment on Hadoop’s workload management capabilities, Platfora doesn’t expect these features to be much used, at least at first.

Platfora’s aggregation story goes something like this:

  • If an aggregate can be updated incrementally — for example a count or sum — Platfora probably will maintain it for you and update it on load.
  • Ditto if it can be maintained almost incrementally — for example an average.
  • Platfora also does Distinct calculations, even though those have to be worked through on its own servers.
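
The distinction among those three kinds of aggregate can be sketched as follows. This is a generic illustration, not Platfora's code; the class and attribute names are hypothetical. Count and sum can be updated from each new batch alone, an average can be derived from the maintained sum and count, and a distinct count needs some record of the values already seen.

```python
class IncrementalAggregates:
    """Toy aggregates maintained across incremental batch loads."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        # Distinct counts need a record of previously seen values
        # (here an exact set), not just a running number.
        self.seen = set()

    def load(self, batch):
        # Each load touches only the new rows, never the old ones.
        for value in batch:
            self.count += 1
            self.total += value
            self.seen.add(value)

    @property
    def average(self):
        # "Almost incremental": derived from two incremental aggregates.
        return self.total / self.count if self.count else None

    @property
    def distinct_count(self):
        return len(self.seen)


agg = IncrementalAggregates()
agg.load([10, 20, 20])   # first load
agg.load([30])           # later incremental load
```

After the two loads, `agg.count` is 4, `agg.total` is 80.0, `agg.average` is 20.0, and `agg.distinct_count` is 3.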

As you would expect, Version 1 of the Platfora data store has various limitations, such as:

  • Platfora Version 1 can’t do much with arrays or (other) nested data structures — it just transforms them into JSON strings.
  • Platfora’s SQL support is limited.
  • The Platfora data store has a “fat head” master (but at least that head is multi-node).

Naturally, Platfora hopes to fix these issues down the road.

Finally, a few company notes:

  • Platfora has had 20 beta users, mainly but not entirely among online businesses.
  • Platfora has close to 50 people.
  • Platfora is currently focused on US direct sales, relying on inbound leads.
Categories: Other

Appliances, clusters and clouds

Sat, 2013-03-23 23:05

I believe:

  • The trend to clustered computing is sustainable.
  • The trend to appliances is also sustainable.
  • The “single” enterprise cluster is almost as much of a pipe dream as the single enterprise database.

I shall explain.

Arguments for hosting applications on some kind of cluster include:

  • If the workload requires more than one server — well, you’re in cluster territory!
  • If the workload requires less than one server — throw it into the virtualization pool.
  • If the workload is uneven — throw it into the virtualization pool.

Arguments specific to the public cloud include:

  • A large fraction of new third-party applications are SaaS (Software as a Service). Those naturally live in the cloud.
  • Cloud providers have efficiencies that you don’t.

That’s all pretty compelling. However, these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.

Why would you not move work into a cluster at all? First, if it ain’t broken, you might not want to fix it. Some of the cluster options make it easy for you to consolidate existing workloads — that’s a central goal of VMware and Exadata — but others only make sense to adopt in connection with new application projects. Second, you might just want device locality. I have a gaming-class PC next to my desk; it drives a couple of monitors; I like that arrangement. Away from home I carry a laptop computer instead. Arguments can be made for small remote-office servers as well.

To put all that more simply:

  • Moving existing applications to new platforms often isn’t worth the trouble.
  • Many needs can be best met by single, physically local devices.

Appliances are a natural form factor for single-purpose computing. It is reasonable to characterize as “appliances” — in the computing sense of the term — medical equipment, vehicles, cash machines, cash registers, enterprise security devices, home entertainment, exercise machines and, yes, refrigerators; computers, in some form, can be found almost anywhere. But appliances also are a convenient way to package enterprise systems — configurations will be correct, installation will be simpler, and fortunate software-centric appliance vendors may capture margins on hardware sales and support. And the idea of SaaS-like continuous updates to your enterprise systems seems much more reasonable in the case of a locked-down appliance-like configuration.

Circling back to the beginning, I’d say there are multiple reasons not to expect all your computing to be done on a single cluster:

  • You might want to use appliances that don’t fit into that cluster.
  • You might want to use SaaS offerings that don’t fit into that cluster.
  • The efficiency gains from using a single cluster aren’t that much greater than the gains from using a few of them.
  • You might want different parts of your computing work to be done in-house and in the public cloud.
  • You might want different parts of your data to be kept in different countries.
  • Different kinds of work might fit better onto differently-configured nodes, and current cloud/cluster technology doesn’t do a wonderful job with heterogeneity.
  • A lot of computing is so inherently small and local that it shouldn’t be clustered at all. :)

Ceteris paribus, fewer clusters are better than more of them. But all things are not equal, and it’s not reasonable to try to reduce your clusters to one — not even if that one is administered with splendid efficiency by low-cost workers, in a low-cost building, drawing low-cost electric power, in a low-cost part of the world.

Categories: Other

Essential features of exploration/discovery BI

Sat, 2013-03-23 22:57

If I had my way, the business intelligence part of investigative analytics — i.e., the class of business intelligence tools exemplified by QlikView and Tableau — would continue to be called “data exploration”. Exploration is what’s actually going on, and it also carries connotations of the “fun” that users report having with the products. By way of contrast, I don’t know what “data discovery” means; the problem these tools solve is that the data has been insufficiently explored, not that it hasn’t been discovered at all. Still, “data discovery” seems to be the term that’s winning.

Confusingly, the Teradata Aster library of functions is now called “Discovery” as well, although thankfully without the “data” modifier. Further marketing uses of the term “discovery” will surely follow.

Enough terminology. What sets exploration/discovery business intelligence tools apart? I think these products have two essential kinds of feature:

  • Query modification.
  • Query result revisualization.*

Here’s what I mean.

*I’d wanted to call this re-presentation. But that would have been … pun-ishing. :)

The canonical form of query modification is:

  • There’s a scatter plot or other graphical data visualization.
  • You select a rectangular area on the graph.
  • A new visualization is drawn.

That capability is much more useful in systems that allow you to change how the data is visualized, both:

  • Before you select a subset of the results (so you can choose which visualization is easiest to select from).
  • After you’ve made the selection (it would be silly to stay in a monthly bar chart if you’ve just selected a single month).
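
The rectangle-selection form of query modification, followed by a revisualization, can be sketched like this. It is a minimal illustration with hypothetical names (`Rect`, `modify_query`, `revisualize`) and made-up data, not how any of the products mentioned actually works: the selected rectangle becomes a pair of range predicates, and the filtered rows are then regrouped for a different kind of chart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Rectangle the user dragged out on a scatter plot."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def modify_query(rows, x_field, y_field, rect):
    """Turn the selected rectangle into range predicates on the plotted fields."""
    return [
        r for r in rows
        if rect.x_min <= r[x_field] <= rect.x_max
        and rect.y_min <= r[y_field] <= rect.y_max
    ]


def revisualize(rows, group_field, value_field):
    """Regroup the selected rows, e.g. to feed a bar chart instead of a scatter plot."""
    totals = {}
    for r in rows:
        totals[r[group_field]] = totals.get(r[group_field], 0) + r[value_field]
    return totals


# Made-up sales data, plotted as price (x) versus units (y).
rows = [
    {"region": "East", "price": 12.0, "units": 150},
    {"region": "West", "price": 18.0, "units": 120},
    {"region": "East", "price": 25.0, "units": 180},
    {"region": "West", "price": 15.0, "units": 90},
    {"region": "East", "price": 11.0, "units": 110},
]

selected = modify_query(rows, "price", "units", Rect(10, 20, 100, 200))
by_region = revisualize(selected, "region", "units")
```

Here three of the five points fall inside the rectangle, and regrouping them by region gives 260 units for East and 120 for West.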

Other forms of query modification, such as faceted drill-down or parameterization, don’t depend as heavily on flexible revisualization. Perhaps not coincidentally, they’ve been around longer in some form or other than have the QlikView/Tableau/Spotfire kinds of interfaces. But at today’s leading edge, query modification and query result revisualization are joined at the hip.

What else is important for these tools?

  • Good UI design, of course.
  • Speed — split seconds matter.
  • Most of the same features that matter for business intelligence tools with other kinds of UI.

Please note that speed is a necessary condition for exploratory BI, not a sufficient one; a limited UI that responds really fast is still a limited UI.

As for how the speed is achieved — three consistent themes are columnar storage, compression, and RAM. Beyond that, the details vary significantly from product to product, and I won’t try to generalize at this time.

Categories: Other

DBMS development and other subjects

Sun, 2013-03-17 23:29

The cardinal rules of DBMS development

Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.

That’s if things go extremely well.

Rule 2: You aren’t an exception to Rule 1. 

In particular:

  • Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
  • Mixed workload management is harder than you’re assuming it is.
  • Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

DBMS with Hadoop underpinnings …

… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.

But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.

MarkLogic …

… has been around long enough to make a good DBMS. It used to make a solid XML DBMS. Now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance. The JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.

As for MarkLogic’s Enterprise NoSQL messaging — it basically equates “NoSQL” to “short-request dynamic-schema”, and in 2013 I have little quarrel with that definition.

RDBMS-oriented Hadoop file formats are confusing

I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.

Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly-intended) include but are not limited to:

  • Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely?
  • What’s the nested data structure story? (It seems there is one.)
  • What’s the compression story?
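
For reference, the difference between row-oriented, truly columnar, and PAX-like layouts can be sketched as below. This illustrates the general concepts only; it says nothing about what Parquet or any other Hadoop file format actually does, and the function names and the `rows_per_page` parameter are hypothetical.

```python
# Four toy rows with three columns each.
rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0), (4, "d", 40.0)]


def row_layout(rows):
    """Row store: all values of a row are adjacent."""
    return [value for row in rows for value in row]


def columnar_layout(rows):
    """Column store: all values of a column are adjacent, across the whole table."""
    return [list(column) for column in zip(*rows)]


def pax_layout(rows, rows_per_page):
    """PAX-like: rows are split into pages; within each page, values are grouped by column."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        pages.append(columnar_layout(rows[start:start + rows_per_page]))
    return pages


as_rows = row_layout(rows)
as_columns = columnar_layout(rows)
as_pax = pax_layout(rows, rows_per_page=2)
```

In the PAX-like case, every column of a given row lives in the same page, but a scan of one column still reads contiguous values within each page.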

Come to think of it, the name “Parquet” suggests that either:

  • Rows and columns are mixed together.
  • Somebody has the good taste to be a Celtics fan.

Whither analytic platforms?

I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.

I think that problems include:

  • Analytic platforms are marketed too purely as a development play. Selling six-to-seven figure application development deals is hard.
  • But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
  • … a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
  • More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.

Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.

Categories: Other

Dataset management

Sun, 2013-03-17 23:28

I coined a new term, dataset management, for my clients at Revelytix, which they indeed adopted to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both:

  • Metadata management in a structured-file context.
  • Lineage/provenance, auditing, and similar stuff.

Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. :) Second, “metadata” could apply either to data about the file or to data about the data structures in the file or perhaps to data about data in the file, making “metadata” an even more confusing term in this context than in others.

My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:

  • A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
  • But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.

As for the specific products, both of which you might want to check out:

  • Cloudera Navigator:
    • Is one product from a leading Hadoop company.
    • Assumes you use Cloudera’s flavor of Hadoop.
    • Is generally available.
    • Starts with auditing (lineage coming soon).
  • Revelytix Loom:
    • Is the main product of a small metadata management company.
    • Is distro-agnostic.
    • Is in beta.
    • Already does lineage.
Categories: Other

Hadoop execution enhancements

Mon, 2013-03-11 04:21

Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:

  • Yahoo is a big YARN user.
  • There are other — paying — YARN users.
  • YARN general availability is now targeted for well before the end of 2013.

Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others.  The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.

This is similar to the approach of BDAS Spark:

Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.

although Tez won’t match Spark’s richer list of primitive operations.

More specifically, there will be six primitive Tez operations:

  • HDFS (Hadoop Distributed File System) input and output.
  • Sorting on input and output (I’m not sure why that’s two operations rather than one).
  • Shuffling of input and output (ditto).

A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
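
That decomposition can be sketched in a few lines. To be clear, this is a toy model and not Tez's API: the six function names, the `FAKE_HDFS` dict, and the byte-sum partitioner are all assumptions made for illustration, and the user-supplied map and reduce functions are slotted in where they would logically run.

```python
from collections import defaultdict

# Hypothetical stand-in for HDFS: path -> list of records.
FAKE_HDFS = {"in": ["b a", "a c", "b a"]}


def hdfs_input(path):
    return list(FAKE_HDFS[path])


def hdfs_output(path, records):
    FAKE_HDFS[path] = list(records)


def sort_output(pairs):
    return sorted(pairs)


def shuffle_output(pairs, n_partitions):
    # Deterministic toy partitioner: route each key by the sum of its bytes.
    partitions = [[] for _ in range(n_partitions)]
    for key, value in pairs:
        partitions[sum(key.encode()) % n_partitions].append((key, value))
    return partitions


def shuffle_input(outputs_of_all_maps, partition_id):
    merged = []
    for partitions in outputs_of_all_maps:
        merged.extend(partitions[partition_id])
    return merged


def sort_input(pairs):
    return sorted(pairs)


def map_step(path, map_fn, n_partitions):
    """Map = HDFS input + output sorting + output shuffling."""
    pairs = [kv for record in hdfs_input(path) for kv in map_fn(record)]
    return shuffle_output(sort_output(pairs), n_partitions)


def reduce_step(outputs_of_all_maps, partition_id, reduce_fn, out_path):
    """Reduce = input shuffling + input sorting + HDFS output."""
    pairs = sort_input(shuffle_input(outputs_of_all_maps, partition_id))
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    hdfs_output(out_path, [reduce_fn(k, vs) for k, vs in grouped.items()])


# Word count expressed with the compound steps.
map_outputs = [map_step("in", lambda line: [(w, 1) for w in line.split()], 2)]
for pid in range(2):
    reduce_step(map_outputs, pid, lambda k, vs: (k, sum(vs)), f"out-{pid}")
word_counts = sorted(FAKE_HDFS["out-0"] + FAKE_HDFS["out-1"])
```

The attraction of exposing the pieces separately is that a multi-stage job could chain shuffles and sorts from one stage directly into the next, without the `hdfs_output`/`hdfs_input` pair that a sequence of MapReduce jobs forces in between.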

I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:

  • The requirement for materializing (onto disk) intermediate results that you don’t want to is gone. (Yay!)
  • Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
  • Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
Categories: Other

Open source strategies

Fri, 2013-03-01 04:53

From time to time I advise a software vendor on how, whether, or to what extent it should offer its technology in open source. In summary, I believe:

  • The formal differences between “open source” and “closed source” strategies are of secondary importance.
  • The attitudinal and emotional differences between “open source” and “closed source” approaches can be large.
  • A pure closed source strategy can make sense.
  • A closed source strategy with important open source aspects can make sense.
  • A pure open source strategy will only rarely win.

Here’s why.

An “open source software” business model and strategy might include:

  • Software given away for free.
  • Demand generation to encourage people to use the free version of the software.
  • Subscription pricing for additional proprietary software and support.
  • Direct sales, and further marketing, to encourage users of the free stuff to upgrade to a paid version.

A “closed source software” business model and strategy might include:

  • Demand generation.
  • Free-download versions of the software.
  • Subscription pricing for software (increasingly common) and support (always).
  • Direct sales, and associated marketing.

Those look pretty similar to me.

Of course, there can still be differences between open and closed source. In particular:

  • Open source can help with sales to enterprises that don’t trust a new vendor to keep progressing.
  • Open source can hurt with sales to enterprises that jump at the opportunity to do what they want, themselves, for “free” and — which in some cases is important to them — in secret.
  • Open source has fewer pricing options than closed.

Summing up the story so far, then, closed source is a superior strategy to open, except and to the extent that you are forced down the open route. More precisely, any advantages to an open source strategy can also be captured by having a hybrid open/closed strategy that emphasizes the closed part.

So what part of the story haven’t I told yet? Mainly, it’s open source marketing. Open source can seem virtuous and/or cool — to users, influencers, or even your own engineers. But while that’s true of people, it’s less true of companies, which are unlikely to spend a lot of money on the basis of coolness or virtue. Rather, the strictest believers in acquiring open source software do so precisely because it’s something for which they don’t have to pay, or pay much.

Further, some people think pro bono is a business strategy, because if you build up enough users, monetization can eventually follow. In the cases of more-or-less explicit advertising, pro bono really does work. I give away the content of this blog; in return, people contact me from time to time and offer to buy my services — with “sales cycles” so short as to be unworthy of the name. Fun ensues, and profit. The connection is even clearer in the case of traditional mass media, or of internet services such as Twitter and Facebook. But when what you’re selling and giving away are both technology, the pro bono story has to be something like “We’ll get you hooked on the free stuff, then charge you for the rest.”

That may be great for games, but how does it work for professional software? There are some special cases, mainly:

  • Your product can be used by awesomely impressive internet companies that, while refusing to pay for software themselves, validate it for adoption by lesser organizations that indeed are willing to pay. This has worked for multiple projects started by those companies themselves, such as Hadoop and memcached, but only one I can think of that wasn’t — MySQL.
  • You can let users gain attachment to your free stuff, then sell your whole company to somebody who now wants to sell them other stuff, presumably closed source (or hardware), or who just is impressed by the awesomeness of your technology. This strategy has produced a very small number of great exits — XenSource, arguably Nicira (although Nicira itself disagrees), maybe a couple of others.

But in most cases, the strategy loops back to what I described at the top of this post:

  • A free core product, which may be genuinely valuable to some/most users, and which certainly offers them a great opportunity to test the technology, plus …
  • … a chargeable/proprietary add-on, which is required for the most serious work, …
  • … or else just support.

There aren’t actually a lot of major examples in the “just support” camp* — the main ones who come to mind are Red Hat, 10gen, and Hortonworks, and two of those three are for products that were open source projects long before the respective companies were founded. And so we’re right back to an Enterprise Edition/Community Edition split.

*Or “mainly just support” — as per my recent post on Hadoop distributions, almost everybody offers SOMETHING proprietary.

This all still leaves an attitudinal distinction among (in decreasing order of open source rah-rah virtue):

  1. Build and promote a great free product. One of these years, get around to building and promoting a great chargeable one as well.
  2. Build and promote both a great free product and a great chargeable one.
  3. Build and promote a great chargeable product, and give a subset of it away for free. That subset should be good too.
  4. Build and promote a great chargeable product, and give a crappy subset away for free.

I think #3 makes the most sense. #4 is bad because I don’t believe in promoting or distributing crappy products even for free. #2 is too big a challenge to tackle, in technology and marketing alike. And #1 is only for the most patient vendors with the deepest of pockets.

There’s also the possibility of open sourcing software and then making your main revenue from being the best hosting company for it. But to date that has worked mainly for Automattic.

Finally — what about open source as a development strategy? Well, there are indeed some projects with multiple sets of major contributors — Linux, R, Hadoop, Postgres and so on. But for projects that originate with a single sponsoring vendor, my general observation still stands:

  • Open source software commonly gets community contributions for connectors, adapters, and (national) language translations.
  • But useful contributions in other areas are much rarer.

Related links

  • The open/closed source distinction is central to only a few of the issues on our strategy and execution worksheets, mainly the ones influenced by pricing. However, it is at least slightly relevant to a considerable fraction of them.
  • I glossed over the free-like-speech/free-like-beer distinction a bit; hopefully my usage was clear in context.
Categories: Other

Hadoop distributions

Wed, 2013-02-27 05:41

Elephants! Elephants!
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.

Elephants! Elephants!
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.

Elephants! Elephants!
Three elephants went out to play
Etc.

–  Popular children’s song

It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:

  • Conceptually, the starting point for a “Hadoop distribution” is some version of Apache Hadoop.
    • Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog.
    • Cloudera straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production use.
    • Some of the newer distros seem to be based on Hadoop 2, if the markitecture slides are to be believed.
  • Optionally, the version numbers of different parts of Hadoop in a distribution could be a little mismatched, if the distro provider takes responsibility for testing them together.
    • Cloudera seems more willing to do that than Hortonworks.
  • Different distro providers may choose different sets of Apache Hadoop subprojects to include.
    • Cloudera seems particularly expansive in what it is apt to include. Perhaps not coincidentally, Cloudera folks started various Hadoop subprojects.
  • Optionally, distro providers’ additional proprietary code can be included, to be used either in addition to or instead of Apache Hadoop code. (In the latter case, marketing can then ensue about whether this is REALLY a Hadoop distribution.)
    • Hortonworks markets from a “more open source than thou” stance, even though:
      • It is not a purist in that regard.
      • That marketing message is often communicated by Hortonworks’ very closed-source partners.
    • Several distro providers, notably Cloudera, offer management suites as a big part of their proprietary value-add. Hortonworks, however, is focused on making open-source Ambari into a competitive management tool.
    • Performance is another big area for proprietary code, especially from vendors who look at HDFS (Hadoop Distributed File System) and believe they can improve on it.
    • I conjecture packaging/installation code is often proprietary, but that’s a minor issue that doesn’t get mentioned much.
  • Optionally, third parties’ code can be provided, open or closed source as the case may be.

Most of the same observations could apply to Hadoop appliance vendors.

Besides code, Hadoop distribution providers commonly offer support. The Hadoop support situation is confused, largely because:

That said:

  • One should distinguish between, say, Tier 1 and Tier 3 support.
  • Since most serious Hadoop development is done by Cloudera and Hortonworks, those two vendors are by far the best qualified to do Tier 3+ support.
  • Since Cloudera has the most Hadoop market share to date, it also has the most Hadoop support experience (any and all tiers).
  • Some of the other contenders are huge companies that presumably know how to support enterprise customers. This includes both distro providers and others (e.g. Oracle, which sells a Cloudera-based appliance and handles Tier 1 support for that itself).

And finally, reasons that come to mind for choosing particular distributions include:

  • Cloudera
    • Cloudera Manager is (relatively speaking) mature.
    • Cloudera Navigator seems promising.
    • Cloudera has the most experienced Hadoop services operation.
    • Cloudera has the development “axe” in some parts of Hadoop and is second only to Hortonworks in the others.
    • Cloudera has lots of partner support.
    • Cloudera is the best-funded company whose main business is Hadoop.
  • Hortonworks
    • With the arguable exception of Cloudera, Hortonworks has much more Hadoop expertise than any other outfit, including the development “axe” in a variety of areas.
    • Hortonworks has lots of partner support.
    • Hortonworks is the second-best-funded company whose main business is Hadoop.
    • Because of its low reliance on proprietary code, Hortonworks has great “escapability”, and correspondingly weak pricing power vs. its customers.
  • Intel
    • Intel’s Hadoop performance hacks may be legit.
    • Intel was evidently early in supporting Chinese Hadoop users.
  • EMC/Pivotal/Greenplum
  • MapR
    • At one point MapR seemed to have a performance advantage. I don’t know whether that’s still the case.
  • IBM
    • Some believe that IBM removes obstacles, and grants blessings of prosperity and wisdom.
Categories: Other

Greenplum HAWQ

Mon, 2013-02-25 15:40

My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.

The basic idea seems to be much like what I mentioned a few days ago: the low-level file store for Greenplum can now be something else one has heard of before, namely HDFS (Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:

When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.

and

In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.

The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
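
For readers who haven't met the term, here is a minimal sketch of what "predicate pushdown" means in general. It is a toy illustration, not HAWQ's or HBase's actual interface: `ToyStore`, `rows_shipped`, and both query functions are hypothetical names invented for the example. The point is only that when the filter runs inside the store, far fewer rows have to travel to the query engine.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, object]


class ToyStore:
    """Hypothetical stand-in for a remote store such as HBase; not a real API."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.rows_shipped = 0  # rows sent across the (imaginary) network

    def scan(self, predicate: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
        for row in self._rows:
            # With a predicate supplied, the store filters before shipping.
            if predicate is None or predicate(row):
                self.rows_shipped += 1
                yield row


def query_without_pushdown(store: ToyStore, predicate: Callable[[Row], bool]) -> List[Row]:
    # The engine pulls every row, then filters on its own side.
    return [row for row in store.scan() if predicate(row)]


def query_with_pushdown(store: ToyStore, predicate: Callable[[Row], bool]) -> List[Row]:
    # The engine hands the predicate to the store, which ships only matches.
    return list(store.scan(predicate))


def demo() -> tuple:
    # Assumed sample data: 1000 rows, one in ten tagged "EU".
    rows = [{"id": i, "region": "EU" if i % 10 == 0 else "US"} for i in range(1000)]

    def is_eu(row: Row) -> bool:
        return row["region"] == "EU"

    plain, pushed = ToyStore(rows), ToyStore(rows)
    result_plain = query_without_pushdown(plain, is_eu)
    result_pushed = query_with_pushdown(pushed, is_eu)
    return (len(result_plain), plain.rows_shipped,
            len(result_pushed), pushed.rows_shipped)


if __name__ == "__main__":
    print(demo())
```

Both queries return the same rows. The only difference is how many rows the store has to ship to get there, which is why pushdown matters when the store and the execution engine are separate systems.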

Categories: Other