I talked with Teradata about a bunch of stuff yesterday, including this week’s announcements in in-database predictive modeling. The specific news was about partnerships with Fuzzy Logix and Revolution Analytics. But what I found more interesting was the surrounding discussion. In a nutshell:
- Teradata is finally seeing substantial interest in in-database modeling, rather than just in-database scoring (which has been important for years) and in-database data preparation (which is a lot like ELT — Extract/Load/Transform).
- Teradata is seeing substantial interest in R.
- It seems as if similar groups of customers are interested in both parts of that, such as:
This is the strongest statement of perceived demand for in-database modeling I’ve heard. (Compare Point #3 of my July predictive modeling post.) And it fits with what I’ve been hearing about R.
*That’s very similar to the list of sectors for SAS HPA.
**To support their extremely high focus on product quality, semiconductor manufacturers have been using state-of-the-art analytic tools for at least 30 years.
In-database modeling is a performance feature, and performance can have several kinds of benefit, which may be summarized as “cheaper”, “better”, and “previously impractical”. My impression is that in-database modeling is pretty far toward the “previously impractical” end of the spectrum; enterprises don’t adopt a new way of predictive modeling until they want to create models that the old way can’t get done.
Basically, I think that models are increasingly:
- Richer and more diverse than before. (See for example Point #5 of my July predictive modeling post.)
- Developed in a more experimental and quickly-iterative way than before.
I think the first point pretty much implies the second, but the converse isn’t as clear; one can tweak old-style models in quick-turnaround fashion even more easily than one can develop the more complex newer styles.
And finally: I’m not hearing that modeling — even when it’s parallel and in-database fast — is commonly done on a complete many-terabyte dataset. It’s not a question I always remember to ask; for example, I didn’t bring it up with Teradata. But when I do, I rarely hear of models being trained on more than a few terabytes of data each.
I’ve posted a lot about surveillance and privacy intrusion. Even so, I have a few more things to say.
1. Surveillance and privacy intrusion do, of course, have real benefits. That’s a big part of why I advocate a nuanced approach to privacy regulation. Several of those benefits are mentioned below.
2. Nobody’s opinion about privacy rules should be based on the exact state of surveillance today, for at least two reasons:
- The disclosures keep coming.
- Technology keeps changing.
In particular, people may not realize how comprehensive surveillance will get, due largely to the “internet of things”. The most profound reason — and this will take decades to fully play out — is that we’re headed toward a medical revolution in which our vital signs are more or less continually monitored as we go about our business. Such monitoring will, of course, provide a very detailed record of people’s activities and perhaps even states of mind. Further, vehicle movements will all be tracked and our mobile devices will keep noting our location, in each case for multiple reasons.
3. I agree with the argument that better profiling would lead to less annoying mass screening — even though I also agree that much of what passes for important screening is really just wasteful security theater.
4. Indeed, there are one or two areas in which privacy protections actually go too far. The definite one is research. Medical research could benefit greatly from cross-patient analyses of medical records that are sadly prohibited due to patient information privacy rules. I conjecture that similar concerns may arise in other research domains as well.
The less definite one is general bureaucratic hassle. People are kept away from the hospital bedsides of their loved ones for fear they’ll overhear something affecting another patient’s privacy. And any officious bureaucrat who wants to stonewall an information request can usually find a privacy excuse for doing so.
5. The Eleventh Commandment states: Thou shalt not get caught. I believe that that principle, historically, has been one of the greatest protections against illegal surveillance and its consequences. There isn’t really that much the government can do to the people it snoops on before its misdeeds get too big to ignore — although 10s of 1000s of innocent people on the No-Fly list might take a less sunny view.
Unfortunately, we can’t just assume that “free” countries will stay safely free. My late mother lost her grandparents to the Nazis and a career to McCarthyism. “If you’ve done nothing wrong, you have nothing to worry about” is a very bad argument going forward, even if it’s substantiated by evidence in the recent past.
6. And finally: Some years ago, I was almost alone within the industry in raising privacy-related issues. But things sure have changed. TechCrunch — propelled by Mike Arrington — offers a good survey of the current level of concern.
It’s probably not coincidental that Arrington has a background in law or that I have one in public policy. But the time has come for everybody in the technology industry to worry about privacy, not just those of us who are oriented toward legal issues. It’s a huge mess; we helped make it; we are now responsible for helping to clean it up.
First, some quick history.
- I first heard of KXEN 7-8 years ago from Roman Bukary, then of SAP. He positioned KXEN as an easy-to-embed predictive modeling tool, which was getting various interesting partnerships and OEM deals.
- Returning to those near-roots, KXEN is being bought (Q4 expected close) by SAP.
- I say “near roots” because KXEN’s original story had something to do with SVMs (Support Vector Machines).
- But that was already old news back in 2006, and KXEN had pivoted to a simpler and more automated modeling approach. Presumably, this ease of modeling was part of the reason for KXEN’s OEM/partnership appeal.
However, I don’t want to give the impression that KXEN is the second coming of Crystal Reports. Most of what I heard about KXEN’s partnership chops, after Roman’s original heads-up, came from Teradata. Even KXEN itself didn’t seem to see that as a major part of their strategy.
And by the way, KXEN is yet another example of my observation that fancy math rarely drives great enterprise software success.
KXEN’s most recent strategies are perhaps best described by contrasting it to the vastly larger SAS.
- SAS is built around a programming language for statisticians. KXEN tries to automate away many of the steps that SAS experts would program.
- This goes to the extreme that statistically-astute businesspeople are supposed to be able to use KXEN themselves. (However, it’s a general rule — dating back to the 1970s — that marketing claims of “programmers/technologists/experts aren’t needed” tend to be more aspirational than accurate.)
- SAS tries to offer every statistical and machine learning algorithm under the sun. KXEN is pretty focused on a single statistical approach.
- KXEN has followed SAS into offering applications. (It’s also a general rule that predictive modeling “apps” tend to be more in the way of quick-starts than complete products.)
- KXEN has recently tried to sell into markets where SAS isn’t strong, for example internet companies.
That all sounds a bit like a disruption narrative, but KXEN CEO John Ball never gave me the impression he thought strongly in those terms. And indeed KXEN never disrupted much of anything.
So what will SAP do with KXEN? Integrating predictive modeling and business intelligence is both important and difficult. So I imagine they’ll try, but I won’t hold my breath for great short-term success.
The bigger win could come on the application side. I’m skeptical about “analytic applications”, because it’s so tough to build complete ones. But let’s imagine an application that had elements of:
- Database query and update.
- Reporting and perhaps other BI.
- Predictive modeling.
That would seem more plausible, because it allows the analytic aspects to be smaller and more circumscribed.
As for which specific application areas could use predictive components, the usual suspects are:
- Above all, marketing and CRM (Customer Relationship Management).
- Risk, although KXEN is nowhere near handling hardcore Basel III compliance, Monte Carlo techniques, or anything like that.
- Quality, especially if we include maintenance as part of quality.
I imagine SAP will start trying to integrate KXEN in some of those areas.
Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others. (A toy sketch of this split appears right after this list.)
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
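To make the layering idea concrete, here is a minimal Python sketch of the predicate-pushdown split mentioned above: a “database” layer that plans and aggregates, sitting over a “storage” layer that applies filters itself. It is purely illustrative, and reflects no particular vendor’s internals.

```python
# Toy illustration of a predicate-pushdown split between a "database" layer
# and a "storage" layer. Purely illustrative; not any vendor's actual design.

class StorageLayer:
    """Holds raw rows and can apply simple filter predicates itself."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate=None):
        # The pushed-down predicate runs here, so only matching rows
        # travel "up the wire" to the database layer.
        for row in self.rows:
            if predicate is None or predicate(row):
                yield row

class DatabaseLayer:
    """Plans queries and does joins/aggregation; delegates filtering downward."""
    def __init__(self, storage):
        self.storage = storage

    def count_where(self, predicate):
        # Push the filter down; only aggregate what comes back.
        return sum(1 for _ in self.storage.scan(predicate))

storage = StorageLayer([{"id": 1, "amt": 50}, {"id": 2, "amt": 500}])
db = DatabaseLayer(storage)
print(db.count_where(lambda r: r["amt"] > 100))  # -> 1
```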
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include:
- The object/relational DBMS previously mentioned.
- The new DMLs (Data Manipulation Languages) or APIs previously mentioned over key-value stores.
- The SQL interfaces offered for a considerable number of non-SQL systems — InterSystems Caché, MarkLogic, Hadoop (and thus HBase) and many more.
- Text search interfaces for a variety of DBMS.
- The JSON/MongoDB-compatibility interfaces that are popping up for multiple DBMS, e.g. DB2 or MarkLogic.
- FoundationDB, previously mentioned.
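And to illustrate the multiple-API point, here is another toy sketch, again not modeled on any particular product: one store whose data can be manipulated both through a key-value interface and through a minimal relational-style projection/filter interface.

```python
# Toy single data store exposed through two data manipulation APIs:
# a key-value get/put interface and a tiny relational-style select.
# Illustrative only; not modeled on any actual product.

class MultiAPIStore:
    def __init__(self):
        self._data = {}  # key -> dict of field/value pairs

    # --- API #1: key-value style ---
    def put(self, key, record):
        self._data[key] = dict(record)

    def get(self, key):
        return self._data.get(key)

    # --- API #2: relational-ish projection + filter over the same data ---
    def select(self, columns, where=None):
        for record in self._data.values():
            if where is None or where(record):
                yield {c: record.get(c) for c in columns}

store = MultiAPIStore()
store.put("u1", {"name": "Ann", "city": "Boston"})
store.put("u2", {"name": "Bob", "city": "Austin"})
print(list(store.select(["name"], where=lambda r: r["city"] == "Boston")))
# -> [{'name': 'Ann'}]
```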
So will these trends take over the DBMS world?
Developing a multi-purpose DBMS is extremely difficult, and even harder if it’s layered.
- Developing any kind of DBMS is very hard.
- Developing a multi-purpose DBMS is harder yet. Try, for example, to imagine a caching and memory-management subsystem that’s optimal for multiple datatypes and DMLs at once.
- Layering carries performance costs. The best-case performance scenario is when you can optimize the flow of data all the way from client-server connection down to persistent storage, and back. Layering interferes with that.
But on the plus side, it can be great to have one DBMS handle multiple kinds of data.
- Almost irrespective of product category, there are obvious benefits to buying, installing and administering one thing that can meet multiple needs.
- Further, there are major use cases for manipulating the same data in different ways. For example:
- Almost any kind of large object is likely to have tabular metadata attached.
- Many kinds of database can, at times, be usefully addressed via full-text search.
- In scenarios where you incrementally derive and enhance data, it’s natural to want to keep everything in the same place. (That also helps with lineage, security and so on.) But derived data may be structured very differently than the raw data it’s based on.
And by the way — the more different functions a DBMS performs, the more they may need to be walled off from each other. In particular, I’ve long argued that it’s a best practice for e-commerce sites to manage access control, transactions, and interaction data in at least two separate databases, and preferably in three. General interaction logs do not need the security or durability that access control and transactions do, and there can be considerable costs to giving them what they don’t need. A classic example is the 2010 Chase fiasco, in which recovery from an Oracle outage was delayed by database clutter that would have fit better into a NoSQL system anyway. Building a single DBMS that refutes my argument would not be easy.
So will these trends succeed? The foregoing caveats notwithstanding, my answers are more Yes than No.
- Layered and multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.
- Exadata-like tiering in an otherwise integrated system seems like a smart way to avoid the traditional shared-everything vs. shared-nothing tradeoffs. Tiering could also be a good way to combine the ever more numerous kinds of storage — disk, flash, multiple levels of cache, etc.
- Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …
- … strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM have been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.
Per the slide deck describing it, the Hemisphere database:
- Stores CDRs (Call Detail Records), many or all of which are collected via …
- … some kind of back door into the AT&T switches that many carriers use. (See Slide 2.)
- Has also included “subscriber information” for AT&T phones since July, 2012.
- Contains “long distance and international” CDRs back to 1987.
- Currently adds 4 billion CDRs per day.
- Is administered by a Federal drug-related law enforcement agency but …
- … is used to combat many non-drug-related crimes as well. (See Slides 21-26.)
Other notes include:
- The agencies specifically mentioned on Slide 16 as making numerous Hemisphere requests are the DEA (Drug Enforcement Administration) and DHS (Department of Homeland Security).
- “Roaming” data giving city/state is mentioned in the deck, but more precise geo-targeting is not.
I’ve never gotten a single consistent figure, but typical CDR size seems to be in the 100s of bytes range. So I conjecture that Project Hemisphere spawned one of the first petabyte-scale databases ever.
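Here is the back-of-envelope arithmetic behind that conjecture, under my own loud assumptions that a CDR averages roughly 200 bytes and that something like today’s ingest rate has held for a number of years:

```python
# Back-of-envelope sizing for the Hemisphere CDR store.
# Assumptions (mine, not from the deck): ~200 bytes/CDR on average,
# and an ingest rate in the vicinity of today's for recent years.
cdrs_per_day = 4_000_000_000
bytes_per_cdr = 200

daily_bytes = cdrs_per_day * bytes_per_cdr
yearly_bytes = daily_bytes * 365

print(f"{daily_bytes / 1e12:.1f} TB/day")    # ~0.8 TB/day
print(f"{yearly_bytes / 1e15:.2f} PB/year")  # ~0.29 PB/year
# So a petabyte accumulates in roughly 3-4 years at current rates,
# before any indexing or replication overhead.
```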
Hemisphere Project unknowns start:
- Is that “back door into AT&T switches” inference really reliable? (I’m basing it on just a few words in the deck, and such decks can have inaccuracies in them.)
- Just which calls’ metadata is currently being collected?
- How long has this approximate rate of CDR collection been going on; can we just extrapolate back from the current 4 billion calls/day?
It seems that a primary use case for Project Hemisphere is to guess what phone numbers baddies are using, especially those of disposable “burner” cell phones that are otherwise very hard to trace. (The key benefit mentioned for such analysis is that those new phones can then be tapped.) There aren’t many details as to how the phone numbers are inferred, but since almost nothing is initially known about the target phone numbers except calling patterns, those are surely a huge part of the puzzle. In particular, it doesn’t seem to have been disclosed which other databases, if any, are linked into the analysis. There is no hint in the deck that the Hemisphere program directly collects telephone call contents. Rather, it’s used to help determine which telephone numbers to tap.
The government apparently trains its people to keep Hemisphere secret, to the point of lying about it, even though Slide 2 states that Hemisphere is “an unclassified program”.
- Slides 8-12 generally emphasize the Hemisphere program’s secrecy.
- Slide 10 seems to advocate outright deception. Specifically — and this is both complicated and ironic — it seems to say that the government should get subpoenas for information it already had without subpoena, so that those subpoenas can be the claimed source of the information when applying for yet other subpoenas.
So it seems as if Hemisphere is yet another example of the pattern:
- The US government has long lied about how far it invades privacy …
- … and about the assistance it receives from the telecom/technology industry in doing so.
- Little tangible harm has been done by those invasions, except to those who clearly deserved it.
Up to a point, this is reassuring. But it still bodes badly for a future in which there are many more ways surveillance can be used to hurt us than were possible before.
The general Tokutek strategy has always been:
- Write indexes efficiently, which …
- … makes it reasonable to have more indexes, which …
- … lets more queries run fast.
But the details of “write indexes efficiently” have been hard to nail down. For example, my post about Tokutek indexing last January, while not really mistaken, is drastically incomplete.
Adding further confusion is that Tokutek now has two product lines:
- TokuDB, a MySQL storage engine.
- TokuMX, in which the parts of MongoDB 2.2 that roughly equate to a storage engine are ripped out and replaced with Tokutek code.
TokuMX further adds language support for transactions and a rewrite of MongoDB’s replication code.
So let’s try again. I had a couple of conversations with Martin Farach-Colton, who:
- Is a Tokutek co-founder.
- Stayed in academia.
- Is a data structures guy, not a database expert per se.
The core ideas of Tokutek’s architecture start:
- There’s a tree of what serve as indexes, much as in a B-tree. The ultimate leaf nodes store actual data.
- Operations to alter the database — update, insert, schema change, etc. — send messages to buffers at the appropriate nodes.
- The messages are resolved when buffers are flushed.
- The buffers are flushed just-in-time.
- The buffers of messages are themselves indexed. (Otherwise, determining which buffers contain information relevant to a particular query might require slow and tedious scans.)
A central concept is the interplay between the buffers and the write load.
- Except when buffers are flushed, writes go just to the buffers, and presumably are append-only.
- Buffers are flushed rarely — on average when they’re almost 25% full.
Early on Tokutek made the natural choice to flush buffers when they were touched by a query, but now buffers are just flushed when the total buffer pool runs out of space, fullest buffer first.
This all raises the question — what is a “message”? It turns out that there are a lot of possibilities. Four of the main ones are:
- Insert. The payload is the contents of the inserted row.
- Delete. The payload is the ID of the row being deleted. Since Tokutek is MVCC, a delete message is really an instruction to ignore a row that’s still there.
- Upsert. An upsert is an insert or update, to be determined after the system figures out if there’s a row already in place. So the payload to an upsert message is the payload to an insert, plus enough information to handle the update case. Martin stressed that Tokutek upserts do not require a query to check whether the row already exists, and hence can be 1-2 orders of magnitude faster than upserts in conventional RDBMS.
- Schema change. These are global, broadcast to every node. (And so schema changes can be done while the database is online.)
Since messages are a big part of what’s stored at a node, and they can have a variety of formats, columnar compression would be hard to implement. Instead, Tokutek offers a variety of standard block-level approaches.
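To pull the pieces above together, here is a highly simplified Python sketch, written from the description above rather than from Tokutek’s code: a toy two-level tree with indexed per-node message buffers, insert/delete/upsert messages, and reads that consult both the leaf and the pending messages.

```python
# Toy sketch of a buffered-message index, per the description above.
# My illustration, not Tokutek's actual data structures.

BUFFER_CAPACITY = 4  # toy stand-in for "the buffer pool ran out of space"

class LeafNode:
    """Ultimate leaf node: stores actual rows keyed by row id."""
    def __init__(self):
        self.rows = {}

    def apply(self, message):
        kind, row_id, payload = message
        if kind == "insert":
            self.rows[row_id] = dict(payload)
        elif kind == "delete":
            self.rows.pop(row_id, None)
        elif kind == "upsert":
            # The insert-vs-update decision is deferred until the message
            # reaches the leaf, so no query is needed at write time.
            if row_id in self.rows:
                self.rows[row_id].update(payload)
            else:
                self.rows[row_id] = dict(payload)

class InternalNode:
    """Routes by key range, but buffers messages instead of applying them."""
    def __init__(self, children, split_key):
        self.children = children   # [left LeafNode, right LeafNode]
        self.split_key = split_key
        self.buffer = []           # pending messages
        self.buffer_index = {}     # row_id -> buffer positions (the "indexed buffer")

    def send(self, message):
        self.buffer_index.setdefault(message[1], []).append(len(self.buffer))
        self.buffer.append(message)
        if len(self.buffer) >= BUFFER_CAPACITY:
            self.flush()

    def flush(self):
        for msg in self.buffer:
            child = self.children[0 if msg[1] < self.split_key else 1]
            child.apply(msg)
        self.buffer, self.buffer_index = [], {}

    def read(self, row_id):
        # A read consults the leaf plus any buffered messages for that row;
        # the buffer index avoids scanning the whole buffer.
        child = self.children[0 if row_id < self.split_key else 1]
        value = dict(child.rows.get(row_id) or {})
        for pos in self.buffer_index.get(row_id, []):
            kind, _, payload = self.buffer[pos]
            if kind == "insert":
                value = dict(payload)
            elif kind == "delete":
                value = {}
            elif kind == "upsert":
                value.update(payload)
        return value or None

root = InternalNode([LeafNode(), LeafNode()], split_key=100)
root.send(("insert", 7, {"name": "a"}))
root.send(("upsert", 7, {"name": "b"}))
print(root.read(7))  # {'name': 'b'}, even though nothing has been flushed yet
```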
A natural question to ask about OLTP (OnLine Transaction Processing) and other short-request DBMS is “When are there locks and latches?” Four cases came up:
- When a buffer is being flushed.
- When a node is being split. (As in the case of B-tree systems.)
- When a transaction requires row locks.
- When MySQL mandates table locks, for whatever arcane reasons it does so.
I forgot to ask whether the locks at buffer flushing time cause performance hiccups.
Other notes include:
- Martin believes that Tokutek read and write performance are both fairly optimal, but at the cost of some CPU cycles. By way of contrast, B-trees have optimal read performance, but can be slow to write.
- I gather Tokutek tried multiple strategies with similar characteristics, with the deciding factor being the difficulties in each approach in coding up database features such as ACID or MVCC (MultiVersion Concurrency Control), or in achieving concurrency.
- Tokutek also hacked special optimizations to be competitive in cases where B-trees are especially fast (the case mentioned was sequential insertions).
- Default node size is 4 megabytes.
- It seems that the branching factor is in line with a Bε-tree rather than a B-tree: roughly the square root of the number of keys that would fit at a node, which corresponds to ε ≈ 1/2. (I was confused by that part, but fortunately it seemed inessential.)
And finally — Tokutek has been slow to offer MySQL scale-out, but with the MongoDB version, scale-out is indeed happening. One would think that data could just be distributed among nodes in one of the usual ways, with all the indexes pertaining to that data stored at the same node as the data itself. So far as I can tell, that’s pretty close to being exactly what happens.
When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.
- The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- “Math”, which seems to mainly be through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
- Stream processing (Storm) is next in line.
- Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this well outperforms graph-on-MapReduce.
- Charles is also seeing at least POC interest in Spark.
- But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system.
HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.
Another good subject was offloading work to Hadoop, in a couple different senses of “offload”:
- From general-purpose data stores, mainly RDBMS, analytic or otherwise. This sounds similar to Hortonworks’ views about efficiency-oriented offloading; batch work can be moved to Hadoop, saving costs and/or getting more mileage from costs that are already sunk into expensive legacy installations. The top targets here are large, centralized systems, with Teradata being a clear #1 and IBM mainframes a probable #2, but anything from Oracle to newer parallel analytic RDBMS is fair game.
- From the specialized data stores associated with fuller technology stacks. The example I had in mind was Splunk; Charles added Palantir, HP Arcsight and, in the past, Endeca. The idea here is that Hadoop is used to organize and/or index data the way those products’ native data stores would, but in higher volumes than they are (cost-)effective for.
On a pickier note, I encouraged Charles to push back against Hortonworks’ arguments for ORC vs. Parquet. His first claim was that ORC at this time only works under Hive, while Parquet can also be used for Hive, MapReduce, etc. (Edit: But see Arun Murthy’s comment below.) I suspect this is a case where Hortonworks and Cloudera should just get over themselves, and either agree on a file format or wind up each supporting both of them. There’s a lot of DBMS-like tooling in Hadoop’s future, and I have to think it will work better — or at least run faster — if it can make reliable assumptions about how data is actually stored.
As for Sentry itself, it is:
- Developed by Cloudera.
- An Apache incubator project.
- Slated to be rolled into CDH — Cloudera’s Hadoop distribution — over the next couple of weeks.
- Only useful with Hive in Version 1, but planned to also work in the future with other Hadoop data access systems such as Pig, search and so on.
- Lacking in administrative scalability in Version 1, something that is also slated to be fixed in future releases.
Apparently, Hadoop security options pre-Sentry boil down to:
- Kerberos, which only works down to directory or file levels of granularity.
- Third-party products.
Sentry adds role-based permissions for SQL access to Hadoop:
- By server.
- By database.
- By table.
- By view.
for a variety of actions — selections, transformations, schema changes, etc. Sentry does this by examining a query plan and checking whether each step in the plan is permissible.
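As a toy illustration of that plan-step checking, with made-up role and privilege names rather than Sentry’s actual policy syntax:

```python
# Toy illustration of role-based checks applied to a query plan, step by step.
# Role names, privilege names, and the plan format are invented; this is not
# Sentry's actual policy model or syntax.

ROLE_GRANTS = {
    "analyst": {("SELECT", "sales.orders"), ("SELECT", "sales.orders_summary_view")},
    "etl":     {("SELECT", "sales.orders"), ("INSERT", "sales.orders_summary_view")},
}

def plan_is_permitted(role, plan_steps):
    """Each step is (action, object); the plan passes only if every step does."""
    grants = ROLE_GRANTS.get(role, set())
    return all(step in grants for step in plan_steps)

query_plan = [("SELECT", "sales.orders"), ("INSERT", "sales.orders_summary_view")]
print(plan_is_permitted("etl", query_plan))      # True
print(plan_is_permitted("analyst", query_plan))  # False -- no INSERT privilege
```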
What Sentry doesn’t have is cell-based security, for which Charles perceives relatively little demand. I agree, but also note that traditional RDBMS implementations of cell-based security — notably Oracle Label Security — can have unpleasant performance consequences. From there, I segued the discussion to Accumulo. Unlike Hortonworks, Cloudera sees Accumulo demand strictly in the Federal government, where Accumulo is baked into some major reference architectures.
Charles also walked me through the use cases for some security requests he does frequently hear:
- Encryption at rest is important for compliance, for example for credit card numbers.
- Masking is also of particular interest for credit card numbers.
- Audit arises frequently for Sarbanes-Oxley compliance, and also in financial services (not necessarily for compliance).
- View-based security — a big point of Sentry — is usually to satisfy internal (i.e. non-regulatory) policies.
Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden* didn’t bother showing up — include, in no particular order:
- Hortonworks denies advanced acquisition discussions with either Microsoft or Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer.
- As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks’ competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering fewer extra-charge items than Cloudera.
- Hortonworks used a figure of ~75 subscription customers. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
- … a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
- Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
- Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct sales force.)
- Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.
*Speaking of CEO Bearden, an interesting note from Derrick’s piece is that Bearden is quoted as saying “I started this company from day one …”, notwithstanding that the now-departed Eric Baldeschwieler was founding CEO.
In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social.
- These use cases can be any of a true new application, an enhancement to an existing application, or a general investigative analytics environment.
- This adoption is typically driven by a line-of-business group, but IT is a key influencer, and IT usually winds up running the project.
- Overall, this accounts for 70% of Hortonworks’ business by some metric.
The other 30% Hortonworks sees is efficiency-oriented — i.e., a cheaper way to store and/or process data.
- Hortonworks assigns ELT (Extract/Load/Transform) to this group. Based in part on a subsequent conversation with Cloudera, I gather that batch ELT offload — especially but not only from large Teradata installations — is a significant fraction of the total.
- “Data lake” and similar buzzwords fall into this group, as does “re-architecting”.
- Hortonworks asserts that adopters from the 70% rapidly move to this kind of use as well, while Teradata customers typically start out in this part.
- Unsurprisingly, this part is IT all the way.
One customer apparently estimates its fully burdened Hadoop costs at $900/terabyte/year.
Edit: I followed up on these efficiency-oriented use cases in a conversation with Cloudera.
And finally: One of my favorite things to ask is “When you win, why do you win?” — at least when I think the vendor won’t just reiterate their core marketing messages. Hortonworks gave a great, threefold answer:
- Its relationships with Teradata, Microsoft, et al.
- Its promise that it can get specific customer-requested features into Apache Hadoop on a specific timeframe. (Yes, the Contribution Olympics are still with us.)
- Its claim of greater experience with truly huge clusters — not just Yahoo, but I don’t know who its other examples are.
- A few weeks ago, I talked with Hortonworks at length about technology and other subjects.
Linda:
- Has been a best-selling, award-winning novelist.
- Is superbly connected in the writing world. (Two terms as a director of the Author’s Guild, past president of Novelists, Inc., etc.)
- Taught college courses on both English and neurobiology.
- Was a top-two independent expert on search engines (her only peer was Danny Sullivan).
- Wrote better SQL than I did.
In other words, she’s no dummy.
I emphasize that because she’s my source about some screw-ups at Amazon.com and other online booksellers that at first seem a little hard to believe. In no particular order:
- Publisher-submitted price changes (specifically, temporary price cuts and then reversion to usual levels) are a massive industry problem, because certain online sellers don’t propagate them promptly, and Amazon then price-matches down to other sellers’ levels.
- Barnes and Noble had a two-week (!) outage posting sales results for publishers, at least for some accounts, after which a whole lot of sales wound up being posted the same day.
- Metadata assigning books to categories on Amazon.com was recently lost (and in some cases spuriously created). Around the same time …
- … aggregate author-facing sales rankings were borked as well.
- Weeks after Linda uploaded a new book cover image to Kobo, they’re still using the old one.
My basic takeaway is — the whole thing’s a mess.
What could explain all this? Technically, I doubt it’s any one or two things. Online booksellers smaller than Barnes and Noble may generally lack development resources. Barnes and Noble evidently can’t get its data silo connectivity act together (among many other technical shortcomings). Amazon probably suffered multiple snafus — part was surely an “upgrade” gone wrong — since different kinds of raw and derived data got corrupted.
But I do have one business explanation for it all — contempt for suppliers. To these booksellers, independent author/publishers are small suppliers. And computer systems that face small suppliers are commonly awful. (The same goes for business practices.) The meta-reason that so many publisher-facing systems are so bad in the bookselling world is probably just that online booksellers don’t make it a priority for them to be any better.
For years I’ve argued three points about privacy intrusions and surveillance:
- Privacy intrusions are a huge threat to liberty. Since the Snowden revelations started last June, this view has become more widely accepted.
- Much of the problem is the very chilling effects they can have upon the exercise of day-to-day freedoms. Fortunately, I’m not as alone in saying that as I once feared. For example, Christopher Slobogin made that point in a recent CNN article, and then pointed me to a paper* citing other people echoing it, including Sonia Sotomayor.
- Liberty can’t be effectively protected just by controls on the collection, storage, or dissemination of data; direct controls are needed on the use of data as well. Use-based data controls are much more robust in the face of technological uncertainty and change than possession-based ones are.
Since that last point is still very much a minority viewpoint,** I’ll argue it one more time below.
*There are actually two papers at the same link. The first 17 pages contain the one I cited as supporting the chilling effects point, and …
**… the second paper is a rare case of somebody else making the use-based controls argument.
Whether or not you personally believe that terrorism is a Big Scary Deal, a largish fraction of your fellow citizens long will. After all:
- Terrorism strikes fear. (That’s the essence of the word’s original definition.)
- Asymmetric warfare — the other current definition of terrorism — is the most practical way for most adversaries to threaten the US and other wealthy countries. (It works even better against poorer ones, actually.)
- Broad-brush defenses against terrorism — airport checkpoints and so on — are distressingly inefficient.
- Highly targeted defenses against terrorism rely on, yes, surveillance.
The obvious conclusion is: Anti-terrorism-oriented surveillance will be with us for a long time. Privacy controls will not be accepted if they (seem to) much hamper governments’ attempts to forestall terrorist acts. That eliminates the possibility of sweeping “Keep the government in the dark” kinds of laws.
Privacy observers nonetheless hope that data-flow controls alone can strike the needed balance between:
- Anti-terrorism and other uses of official surveillance.
- Our need to avoid surveillance’s chilling effects.
But I think their hope is vain, since technology is now much too complex and fast-changing for such rules ever to be gotten right. In particular:
- There are many kinds of highly intrusive monitoring technology, and they’re changing fast.
- There are many kinds of possibly-useful analytics, and they’re being added to fast.
- The analysis process is fundamentally investigative; you don’t know what works until after you’ve tried it, and hence you don’t know what kinds of data are most useful to you.
- It’s especially hard to predict what uses will be found for which combinations of data — and there are increasingly many kinds of data to combine.
The biggest point that most privacy commentators underestimate may be this: Monitoring of our daily activities is on track to become utterly pervasive, and foregoing this monitoring would require sacrificing a large fraction of future technological progress. Most of what we do leaves electronic trails, and most of the rest will before long. For example:
- Our financial transactions are already tracked, the few remaining cash ones excepted.
- Our reading and other media consumption are increasingly tracked. Paper books and broadcast TV are giving way to e-readers, websites, and streaming video.
- Our communications are increasingly tracked. That’s been the focus of news revelations the past couple of months. Communications metadata are definitely being tracked and turned over to the government; contents of electronic communications may well be winding up in government hands as well.
- Our physical movements and responses are becoming subject to much more tracking than is widely understood. Consider, for example:
- Your cell phone knows where you are, and numerous apps share that information.
- Police car cameras, traffic light cameras and so on, when combined with automated license plate recognition, are increasingly tracking vehicle locations and movements.
- The same goes for electronic toll payments and vehicles’ onboard sensors. And when autonomous vehicles (i.e. electronic drivers) mature, everything will be centrally tracked.
- Security cameras and the like track pedestrians as well. What’s more, in-store cameras are being deployed to track details of shoppers’ movements and attention, much as attention is finely tracked online.
- Fitbit is just the beginning. Future healthcare will rely on 24×7 medical monitoring of our actions and physiological responses.
And whatever data is gathered, it all — or at least all its significant bits — will be collected and analyzed in the cloud, where nosy governments will find it easy to access.
The story gets more confusing yet. Besides the Vs of “big data” itself — volume, velocity, variety and so on — there are also the vagaries of “data science”. For the purposes of this discussion, it is reasonable to caricature modern analytics as “gather a lot of data; shake vigorously; and see what conclusions fall out”. My point in saying that is — you don’t know the consequences of letting somebody have some data until after they’ve thrown a range of machine learning techniques at it. And so, for several reasons relating to the difficulty of technological analysis, lawmakers, regulators and judges don’t have a realistic hope of establishing appropriate rules about possession of data, because they can’t predict what the consequences of those rules will turn out to be.
It’s always been the case that lawmakers are a bit slow in adapting to new technologies, while judges don’t prohibit privacy intrusions until the needed laws are (somewhat belatedly) written. I hope I’ve shown that, with the intensity of the technological change and the fears of terrorism, the gap this time will be much wider. But the story gets worse yet, because there already are instances in which legal enforcement of privacy has gone too far. First, there are the cases when privacy is used as pretext for bureaucratic or other official nonsense. I’ve vented about that in the past over the case of medical care and HIPAA; police harassment of citizen observers may be a more serious problem, although that depends on how jurisprudence eventually shakes out. Second, medical research is seriously restricted by privacy regulations. Depending on how privacy rules shake out, it is easy to imagine other forms of research — including national security or anti-terrorism! — being inhibited as well.
I don’t think that possession-based data controls can overcome these myriad challenges. So why am I hopeful that use-based ones can? Well, consider the use-based privacy control guidelines I recently offered:
- Probabilistic profiling data should rarely be admissible in court.
- Current rules against discrimination by employers, insurers, and credit granters should be strengthened.
- “Attention” data such as website visits should rarely be admissible in court.
- Private communications of all kinds should be … private.
- Criminal and other investigations should very rarely, if ever, be allowed to “look through walls”.
Maybe what I’m suggesting are exactly the right rules; maybe they aren’t. But in any case, they — or rules like them — don’t depend upon the specific kinds of data source or analytic technique covered. And so they can be robust against unforeseen developments in the collection, retention or analysis of data.
My clients at Aerospike are coming out with their Version 3 and, as several of my clients do, have encouraged me to front-run what otherwise would be the Monday embargo.
I encourage such behavior with arguments including:
- “Nobody else is going to write in such technical detail anyway, so they won’t mind.”
- “I’ve done this before. Other writers haven’t complained.”
- “In fact, some other writers like having me go first, so that they can learn from and/or point to what I say.”
- “Hey, I don’t ask for much in the way of exclusives, but I’d be pleased if you threw me this bone.”
Aerospike 2’s value proposition, let us recall, was:
… performance, consistent performance, and uninterrupted operations …
- Aerospike’s consistent performance claims are along the lines of sub-millisecond latency, with 99.9% of responses being within 5 milliseconds, and even a node outage only borking performance for some 10s of milliseconds.
- Uninterrupted operation is a core Aerospike design goal, and the company says that to date, no Aerospike production cluster has ever gone down.
The major support for such claims is Aerospike’s success in selling to the digital advertising market, which is probably second only to high-frequency trading in its low-latency demands. For example, Aerospike’s CMO Monica Pal sent along a link to what apparently is:
- a video by a customer named Brightroll …
- … who enjoy SLAs (Service Level Agreements) such as those cited above (they actually mentioned five 9s)* …
- … at peak loads of 10-12 million requests/minute.
*I haven’t watched the video, but Monica helpfully included a small amount of transcript.
Monica also updated Aerospike’s business highlights as (with some editing by me):
- Headcount – 50 and hiring.
- # of production customers – fairly high double digits, all paying.
- Biggest database – 30TB and growing.
- Most customers are at 1-4TB of unique data; most replicate 2x; many also replicate across data centers.
- Pricing – Free Community Edition with 2 servers and 200GB data. Enterprise Edition priced per terabyte and per datacenter, unlimited nodes per cluster, unlimited number of clusters, pay only for unique data, not replicas. Most start at $50k.
However, Aerospike 2 didn’t have much in the way of data manipulation options. Aerospike described its eponymous product as a key-value store, although I gather it was possible to look into the values up to a point; specifically:
- Aerospike has always had integer and string datatypes to extract.
- Aerospike has always had “bins”, which are like columns but don’t require consistent datatypes from one record to the next … or indeed any datatypes at all.
- By “bins … are like columns” I mean that you can retrieve a projection just on a bin.
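A toy rendering of the “bins” idea, in plain Python rather than actual Aerospike client calls:

```python
# Toy model of Aerospike-style "bins": per-record named values, with no
# requirement that a bin hold the same datatype (or exist at all) in every
# record. Plain Python for illustration; not the Aerospike client API.

records = {
    "user:1": {"age": 31, "city": "Austin"},
    "user:2": {"age": "thirty-one"},   # same bin, different datatype
    "user:3": {"city": "Boston"},      # bin simply absent
}

def project_bin(store, bin_name):
    """Retrieve a projection on a single bin, skipping records that lack it."""
    return {key: rec[bin_name] for key, rec in store.items() if bin_name in rec}

print(project_bin(records, "age"))  # {'user:1': 31, 'user:2': 'thirty-one'}
```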
Aerospike 3 adds more data manipulation features. Notes on that start:
- Aerospike 3 adds more datatypes, with a strong emphasis on nesting.
- Aerospike 3 adds secondary indexes.
- Aerospike 3 adds Lua UDFs (User-Defined Functions).
- Aerospike assures me that its DBMS-experienced development team consists of a lot more than the one-man Russell Sullivan acqui-hire.
Secondary indexes in Aerospike 3 work only with strings and integers, but Aerospike doesn’t dispute my opinion that it would be nice to index the (values inside the) new datatypes as well.
Specifically, Aerospike 3 adds four new datatypes, which it calls “complex”:
- Key-value pairs, for some reason called “maps”.
- Sets, of data of any datatype. As you might imagine, sets are unordered and have unique values.
- Lists, of data of any datatype. As you might imagine, lists are ordered.
- Stacks, of data of any datatype. They are lists with much better performance, but with the update functionality limitations you’d imagine from the name “stack”.
Aerospike also added what it calls “large” datatypes, which are a pointer-like way to link blocks. The point of those is to work around what is otherwise the record size limit (typically 128KB), so as to tie together all the information on, say, a single user.* Aerospike gives the impression that customers have been custom-building datatypes of these kinds all along, with stacks being especially popular. (A toy sketch of the block-linking idea follows the footnote below.)
*When I heard “collect all a user’s interaction data in one place”, the first thing I thought of was WibiData.
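Here is that toy sketch of the block-linking idea behind the “large” datatypes. The block size and layout are invented for illustration; they are not Aerospike’s actual implementation.

```python
# Toy sketch of a "large" datatype: one logical list spread across several
# small blocks chained together pointer-style, working around a per-record
# size cap. The cap and layout here are invented for illustration.

MAX_ITEMS_PER_BLOCK = 3  # stand-in for a record size limit

class LargeList:
    def __init__(self):
        self.blocks = [[]]  # each block is a small list "record"

    def append(self, item):
        if len(self.blocks[-1]) >= MAX_ITEMS_PER_BLOCK:
            self.blocks.append([])  # "allocate" a new block and chain to it
        self.blocks[-1].append(item)

    def scan(self):
        # Follow the chain of blocks to reassemble the logical list.
        for block in self.blocks:
            yield from block

events = LargeList()
for i in range(7):
    events.append({"event": i})
print(len(events.blocks))                    # 3 blocks backing one logical list
print([e["event"] for e in events.scan()])   # [0, 1, 2, 3, 4, 5, 6]
```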
Notes on the Lua UDFs include:
- They are heavily pipelined. Even so …
- … in the interest of speed, there is no real node-to-node data movement. Aggregations get finished on the client.
- Hence, even though there are primitives called map() and reduce(), which mean about what you’d think they would …
- … Aerospike was mercifully easy to persuade not to call this a form of MapReduce.
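To show what “aggregations get finished on the client” means in practice, here is a toy sketch with stand-in map/reduce functions. It is my illustration of the execution model described above, not Aerospike’s actual Lua UDF API.

```python
# Toy sketch of the described execution model: each server node maps and
# partially reduces its own records, with no node-to-node data movement;
# the client finishes the aggregation. Names and shapes are invented.

node_data = {
    "node_a": [{"amt": 10}, {"amt": 25}],
    "node_b": [{"amt": 7}, {"amt": 40}, {"amt": 3}],
}

def map_fn(record):
    return record["amt"]

def reduce_fn(values):
    return sum(values)

# Per-node work: map, then a local (partial) reduce. No cross-node shuffle.
partials = [reduce_fn(map_fn(r) for r in records) for records in node_data.values()]

# The client receives one partial result per node and finishes the reduce.
print(reduce_fn(partials))  # 85
```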
So yes — Aerospike 3 may be regarded as Aerospike’s version of support for real-time analytics.