DBMS2
New England Database Summit (January 28, 2010)
New England Database Day has now, in its third year, become a “Summit.” It’s a nice event, providing an opportunity for academics and business folks to mingle. The organizers are basically the local branch of the Mike Stonebraker research tree, with this year’s programming head being Daniel Abadi. It will be on Thursday, January 28, 2010, once again in the Stata Center at MIT. It would be reasonable to park in the venerable 4/5 Cambridge Center parking lot, especially if you’d like to eat at Legal Seafood afterwards.
So far there are two confirmed speakers — Raghu Ramakrishnan of Yahoo and me. My talk title will be something like “Database and analytic technology: The state of the union”, with all wordplay intended.
There’s more information at the official New England Database Summit website. There’s also a post with similar information on Daniel Abadi’s DBMS Musings blog.
Comments on a fabricated press release quote
My clients at Kickfire put out a press release last week quoting me as saying things I neither said nor believe. The press release is about a “Queen For A Day” kind of contest announced way back in April, in which users were invited to submit stories of their data warehouse problems, with the biggest sob stories winning free Kickfire appliances. The fabricated “quote” reads:
“As we went through the contest entries in detail, it was readily apparent that today’s data warehousing solutions are either massively expensive or non-existent,” said Curt Monash, Founder of Monash Research. “Clearly, there is major dual-market opportunity for a product such as the Kickfire appliance that can not only provide an affordable data warehousing solution to small companies; but can also target larger companies that have made an initial investment in high-end solutions, yet still need to add some affordable query processing power in other areas of the organization.”
In reality:
- I spent a few minutes reviewing summaries of eight stories selected by Kickfire from the entrants, and emailed comments back to Kickfire about them. I have no further role to play in the contest.
- The part of the “quote” that slams Kickfire’s competitors is not reflective of my views.
- The “market opportunity” is in line with the positioning I’ve encouraged Kickfire to adopt. A good shorthand for it is the “Sybase IQ market.” In essence I see Kickfire as an interesting Sybase IQ alternative. But Sybase IQ is a formidable competitor, and there are many other competitors as well. This is hardly an untapped market ripe for Kickfire’s plucking.
I’m satisfied that this is all a case of lousy marketing execution – something Kickfire has a history of — rather than deliberate deception. Kickfire has recently turned over its VP of Marketing (twice) and PR resource (at least once). Scott Humphrey, Kickfire’s new outside PR guy, says he was incorrectly told by his predecessor that the press release and quote in question had been approved, and put it out without fact-checking. I believe him. I hope Kickfire CEO Bruce Armstrong will be able to add stronger marketing leadership soon. Bruce seems aware of the need, and is making reasonable marketing strategy decisions himself in the mean time, so there’s some basis for optimism.
And by the way – I don’t let vendors write press release quotes for me anyway. I let them edit in precise product names and so on, but otherwise the words are mine. The last occasion on which I recall bending this policy was inadvertent and over a year ago, when Greenplum emailed something to me — which was genuinely similar to my opinion — while I was on the phone with Aster at a particularly frenzied time, and I didn’t immediately realize the words weren’t my own.
Boston Big Data Summit keynote outline
Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I’m posting them below.
The top two points from Q&A probably were:
- Big Data and the cloud actually have relatively little to do with each other, a few exceptions notwithstanding, especially if the data is in a shared-nothing DBMS (as opposed to, say, a MapReduce-oriented file cluster). Two principal reasons are:
- Redistributing data from node to node is a little slow, undermining some of the elasticity benefits of the cloud.
- Getting data into the cloud in the first place is a lot slow.
- The NoSQL movement is a lot like the Ron Paul campaign — it consists of people who are dissatisfied with the status quo, whose dissatisfaction has a lot to do with insufficient liberty and/or excessive expenditure, and who otherwise don’t have a whole lot in common with each other.
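To put rough numbers on the first point's "a lot slow" claim about getting data into the cloud, here is a back-of-the-envelope sketch. The data sizes, link speeds, and the 80% usable-bandwidth figure are illustrative assumptions of mine, not numbers from the talk or the Q&A.

```python
def transfer_days(terabytes: float, megabits_per_second: float,
                  efficiency: float = 0.8) -> float:
    """Days needed to push `terabytes` of data over a network link.

    `efficiency` is an assumed fraction of the nominal line rate that
    is actually usable for payload (protocol overhead, contention).
    """
    bits = terabytes * 1e12 * 8
    usable_bits_per_second = megabits_per_second * 1e6 * efficiency
    return bits / usable_bits_per_second / 86_400


# Hypothetical database sizes and uplink speeds, for illustration only.
for tb in (1, 10, 100):
    for mbps in (100, 1000):
        days = transfer_days(tb, mbps)
        print(f"{tb:>4} TB over {mbps:>5} Mbit/s: {days:8.1f} days")
```

Under these assumptions, even a modest multi-terabyte database takes days to upload over a typical office link, which is why the initial load tends to dominate any elasticity benefit.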
Anyhow, here are my notes for the talk, edited in just a couple of places for readability or linkage.
Quick introduction
- Big Data vs. cloud
- How big is Big Data?
- At the low end of that range, there’s little you can’t do with conventional technology if you have:
- An unlimited budget for hardware
- An unlimited budget for software
- An unlimited budget for people, especially Oracle DBAs
Big Data in OLTP
- Hard-core OLTP
- Focus of DBMS technology for a long time
- Big budgets because each transaction has significant value
- Tough to get users to change technologies
- Lighter-weight OLTP
- Classic example = web companies
- Big ones — retail-oriented ones (eBay, Amazon) partially excepted — rolled their own technology stacks
- Reluctant to give money to anybody
- Open source, etc.
- Difficulty finding market
- Product vs. feature
- Clustering/HA/DR/whatever
- Ditto cloud enablement
- True products haven’t found much traction yet
Analytic Big Data use cases
- Kinds of data for analytics
- More of same != big
- More detail and/or new kinds
- Complete data sets
- Transactions
- Call details
- Tick/trade history
- Web clickstreams
- Network event logs
- Other machine-generated data
- CAM bottom line
- Anything human-generated should and will be retained in its entirety
- Quantities of machine-generated data retained should and will grow roughly in line w/ computing cost reductions (Moore’s Law, etc.)
- Analytic uses of Big Data
- Analytics is mainly about three things
- Problem detection
- Customer relationship improvement
- (Those overlap when the customer relationship is bad)
- Financial statements on steroids
- Main kinds of analytics
- What BI vendors traditionally sell
- General reporting and dashboards
- Ad-hoc query (now driven from those reports and dashboards)
- Planning (allegedly integrated with BI)
- Research
- Ad hoc relational query (worth mentioning twice because it drives so much of the market)
- Data mining
- Most web search and web mining
- Operational/near-real-time
- Archiving/compliance
- What gets Big?
- Mainly research and archiving
- But when reporting or operational get Big, you have really interesting computing problems
Technology issues and trends
- Moore’s Law
- CPUs — All about cores, hence parallelism is key
- RAM
- SSDs – hence replace disks
- Sensors – hence generate lots more data
- Kryder’s Law
- But rotational speeds up only 12.5X since Eisenhower Administration
- Hence solid-state memory (or RAM) will soon take over
- In the mean time, I/O bottlenecks have had to be beaten
- Hence sequential scans
- Hence index-light architectures
- Hence columnar
- DBMS “overhead”
- Raw license and maintenance fees – software increasing fraction of total
- OLTP vestiges – locking and all that
- DBAs
- People costs = huge fraction of total
- Index-lightness addresses
- So does appliance
- Many people don’t really know how to write SQL
- Configuration
- Appliance/tightly-balanced
- Netezza
- Teradata earlier
- Greenplum/Sun
- Oracle
- IBM
- Microsoft/Madison
- Commodity/do what you want
- Vertica
- Greenplum now
- Infobright, Aster and others
- MapReduce-oriented file systems
- Extreme rigidity is silly
- Teradata, Oracle have both signaled moving to more modularity
- Big driver of that = heterogeneous storage
- Cheap disk
- Expensive disk
- Solid-state
- RAM
- CPU/storage ratio is even more of a driver
Theoretically defensible ways to segment the market
- Latency requirements
- High availability and low latency go together
- Query types
- Simultaneous users for same
- Database size
- Budget
Actual segments right now
- Utter ADW/EDW
- Data mart
- Size
- Naturally columnar vs. naturally row-based
- Operational/frontline
- Less dramatic/smaller EDW
Calpont’s InfiniDB
Since its inception, Calpont has gone through multiple management teams, strategies, and investor groups. What it hadn’t done, ever, is actually ship a product. Last week, however, Calpont introduced a free/open source DBMS, InfiniDB, with technical details somewhat reminiscent of what Calpont was promising last April. Highlights include:
- Like Infobright, Calpont’s InfiniDB is a columnar DBMS consisting of a MySQL front end and a columnar storage engine.
- Community edition InfiniDB runs on a single server.
- One of commercial/enterprise edition InfiniDB’s main claims to fame will be MPP support.
- There’s no announced time frame for commercial edition InfiniDB.
- InfiniDB’s current compression story is dictionary/token only, with decompression occurring before joins are executed. Improvement is a roadmap item.
- Indeed, InfiniDB has many roadmap items, a few of which can be found here. Also, a great overview of InfiniDB’s current state and roadmap can be found in this MySQL Performance Blog thread. (And follow the links there to find performance discussions of other free analytic DBMS.)
- One thing InfiniDB already has that is still a roadmap item for Infobright is the ability to run a query across multiple cores at once.
- One thing free InfiniDB has that Infobright only offers in its Enterprise Edition is ACID-compliant Insert/Update/Delete. (Note: I wish people would stop saying that Infobright Enterprise Edition isn’t ACID-compliant, since that point was cleared up a while ago.)
- InfiniDB has no indexes or materialized views.
- However, InfiniDB’s retrieval is expedited by something called “Extents,” which sounds a lot like Netezza’s zone maps.
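For readers who haven't run into the zone-map idea mentioned in the last bullet, here is a minimal sketch of min/max block pruning. It is a generic illustration only, not InfiniDB's or Netezza's actual implementation; the `Extent` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extent:
    """A contiguous block of column values plus its min/max summary."""
    values: List[int]
    lo: int
    hi: int


def build_extents(column: List[int], extent_size: int) -> List[Extent]:
    """Split a column into fixed-size blocks and record each block's range."""
    extents = []
    for start in range(0, len(column), extent_size):
        chunk = column[start:start + extent_size]
        extents.append(Extent(chunk, min(chunk), max(chunk)))
    return extents


def scan_range(extents: List[Extent], low: int, high: int) -> Tuple[List[int], int]:
    """Return matching values and the number of extents actually read."""
    hits: List[int] = []
    extents_read = 0
    for ext in extents:
        if ext.hi < low or ext.lo > high:
            continue  # the summary proves no value in this block can match
        extents_read += 1
        hits.extend(v for v in ext.values if low <= v <= high)
    return hits, extents_read


# Data that arrives roughly in order (e.g., a date key) prunes very well.
column = list(range(1000))
extents = build_extents(column, extent_size=100)
hits, extents_read = scan_range(extents, 250, 260)
print(len(hits), "rows found;", extents_read, "of", len(extents), "extents read")
```

The benefit depends heavily on how well the stored order correlates with the filtered column; on randomly ordered data almost every block's range overlaps the predicate and little is skipped.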
Being on vacation, I’ll stop there for now. (If it weren’t for Tropical Storm/Depression Ida, I might not even be posting this much until I get back.)
Aster Data 4.0 and the evolution of “advanced analytic(s) servers”
Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”
Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:
- Aster is offering a slick vision for integrating big-database management and general analytic processing on the same MPP cluster, under the not-so-slick name “Data-Application Server.”
- Aster is also offering a sophisticated vision for workload management.
In addition, Aster has matured nCluster in various ways, for example cleaning up a performance problem with single-row updates.
Highlights of the Aster “Data-Application Server” story include:
- At its core, the Aster “Data-Application Server” is the Aster nCluster MPP analytic DBMS, enhanced with basic application server functionality (I didn’t ask for details of that part), running on the same nCluster worker nodes that answer SQL queries.
- Thus, Aster is eliminating a lot of the data movement that plagues three-tier architectures and other less-integrated approaches.
- The Aster “Data-Application Server” further offers integrated workload management for applications and queries; more on that below.
- The Aster “Data-Application Server” requires applications to be parallelized and invoked via Aster’s SQL/MapReduce.
- As befits a MapReduce-based system, the Aster “Data-Application Server” lets you write your applications in lots of different languages (the usual suspects, and it also does .NET).
- The Aster “Data-Application Server” runs applications in their own process spaces, protecting the DBMS server from crashes and other damaging behavior.
- The Aster “Data-Application Server” allows applications to manage memory themselves, persistently, and not just via relational constructs. Thus, if you want your application to maintain a graph, mini rules engine, and/or finite state machine, you can, without doing SQL contortions.
In a compelling proof point for the Aster Data-Application Server’s slickness, Aster has leapfrogged Teradata and Netezza in the extent to which SAS functionality is integrated into Aster’s DBMS. (Aster and SAS both say that you can do full SAS modeling in parallel on Aster, but even so I wouldn’t be surprised to discover there were some parts of SAS’ system that turned out to be exceptions.) Of course, Aster is hardly the only analytic DBMS vendor to have the idea of explicitly enhancing general analytic processing; that’s why we see lots of MapReduce announcements, and it’s also why Teradata enhanced its UDFs (User-Defined Functions) to have some kind of persistent memory.* But I don’t know of anybody else whose approach is quite so elegant and general at this time.
*Unfortunately, I don’t yet know much about Teradata’s UDF enhancements. I neglected to drill down on Global Persistent Memory when it was mentioned a couple of times at Teradata Partners last week, and Teradata was unable to accommodate my request this week for a rapid follow-up briefing on the subject.
Aster’s approach to workload management is similarly stylish. The idea is:
- Lots of variables are available to be taken into account (e.g., user role, expected query duration, actual duration of a running query, etc.)
- SQL statements can be written against any of these variables.
- The SQL statements serve as rules to set query/task priorities.
- There seem to be a few different ways to measure priority, including explicit allocation of CPU or I/O resources, as well as more conventional “This group of queries gets higher priority than that one” kinds of metrics.
- The whole thing provides integrated workload management for queries, applications, load jobs, data redistribution, and so on.
Right now the interface is – well, you’re manipulating a SQL table. A more conventional workload management GUI is slated for the second quarter of 2010.
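To make the "SQL statements as rules" idea concrete, here is a toy sketch using Python's built-in `sqlite3`. It is not Aster's schema or syntax, which I haven't seen in detail; the `tasks` table, its columns, the rule predicates, and the priority labels are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of running work, one row per query/task, exposing
# the kinds of variables mentioned above (role, expected and actual duration).
conn.execute(
    "CREATE TABLE tasks (id INTEGER, user_role TEXT, "
    "expected_seconds REAL, elapsed_seconds REAL)"
)
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?)",
    [
        (1, "executive", 5, 1),
        (2, "analyst", 600, 30),
        (3, "analyst", 10, 900),   # has run far longer than expected
        (4, "etl", 3600, 100),
        (5, "analyst", 10, 2),
    ],
)

# Each rule is a SQL predicate over the task variables plus the priority
# it assigns. Rules are checked in order; the first match wins.
RULES = [
    ("elapsed_seconds > 10 * expected_seconds", "throttled"),
    ("user_role = 'executive'", "high"),
    ("expected_seconds <= 60", "medium"),
]
DEFAULT_PRIORITY = "normal"


def assign_priorities(conn: sqlite3.Connection) -> dict:
    """Map task id -> priority by evaluating each rule as a SQL query."""
    priorities = {}
    for predicate, priority in RULES:
        # Predicates come from the trusted rule list above, not user input.
        query = f"SELECT id FROM tasks WHERE {predicate}"
        for (task_id,) in conn.execute(query):
            priorities.setdefault(task_id, priority)
    for (task_id,) in conn.execute("SELECT id FROM tasks"):
        priorities.setdefault(task_id, DEFAULT_PRIORITY)
    return priorities


priorities = assign_priorities(conn)
print(priorities)
```

The appeal of the approach is that any variable the system exposes as a column can take part in a rule without a new configuration language being invented.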
Discussing subjects such as mirroring and ILM (Information Lifecycle Management) with Aster can be tricky, as Aster uses the word “partition” in confusing ways. Anyhow, Aster has a few different levels of compression, and the ability to apply different levels of compression to different partitions, to change compression levels via ALTER TABLE, and to alter (presumably increase) compression on the fly when doing online backup. Aster is also part of a growing trend to eschew RAID, instead doing mirroring in its own software. (Other examples of this strategy would be Vertica, Oracle Exadata/ASM, and Teradata Fallback.) Prior to nCluster 4.0, this caused a problem, in that the block sizes for mirroring were so large as to create a lag in transactional updating. But Aster says this problem is now solved, and indeed claims that nCluster 4.0 is superior to most rivals in transactional efficiency.
And finally, while I was talking w/ Aster Data anyway, I checked up on cloud and MapReduce customer penetration. The answers were:
- Aster has two serious production cloud users, both of which have been disclosed for a while, namely:
- ShareThis, which runs Aster nCluster on Amazon EC2
- Didit, which runs Aster nCluster on AppNexus
- Outside of those two, Aster sees some cloud use for test, development, prototyping, etc.
- Every single Aster customer uses SQL/MapReduce — i.e., they invoke MapReduce via Aster nCluster SQL queries.
- Some of those customers use MapReduce for ETL, some use it for actual analytics.
A question on MDX performance
An enterprise user wrote in with a question that boils down to:
What are reasonable MDX performance expectations?
MDX doesn’t come up in my life very much, and I don’t have much intuition about it. E.g., I don’t know whether one can slap an MDX-to-SQL converter on top of a fast analytic RDBMS and go to town. What’s more, I’m heading off on vacation and don’t feel like researching the matter myself in the immediate future.
So here’s the long form of the question. Any thoughts?
I have a general question on assessing the performance of an OLAP technology using a set of MDX queries. I would be interested to know if there are any benchmark MDX performance tests/results comparing different OLAP technologies (which may be based on different underlying DBMS’s if appropriate) on similar hardware setup, or even comparisons of complete appliance solutions. More generally, I want to determine what performance limits I could reasonably expect on what I think are fairly standard servers.
In my own work, I have set up a star schema model centered on a Fact table of 100 million rows (approx 60 columns), with dimensions ranging in cardinality from 5 to 10,000. In ad hoc analytics, is it expected that any query against such a dataset should return a result within a minute or two (i.e. before a user gets impatient), regardless of whether that query returns 100 cells or 50,000 cells (without relying on any aggregate table or caching mechanism)? Or is that level of performance only expected with a high end massively parallel software/hardware solution? The server specs I’m testing with are: 32-bit 4 core, 4GB RAM, 7.2k RPM SATA drive, running Windows Server 2003; 64-bit 8 core, 32GB RAM, 3 Gb/s SAS drive, running Windows Server 2003 (x64).
I realise that caching of query results and pre-aggregation mechanisms can significantly improve performance, but I’m coming from the viewpoint that in purely exploratory analytics, it is not possible to have all combinations of dimensions calculated in advance, in addition to being maintained.
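One way to frame the question is with simple scan arithmetic. The sketch below uses the row and column counts from the question; the 8 bytes per value, the 80 MB/second sequential read rate, and the 5-column query are assumptions I picked for illustration, not measurements of any product or of the servers described.

```python
def scan_seconds(rows: int, columns_read: int,
                 bytes_per_value: float, mb_per_second: float) -> float:
    """Seconds to read the needed data once, ignoring CPU, caching, and compression."""
    bytes_read = rows * columns_read * bytes_per_value
    return bytes_read / (mb_per_second * 1e6)


ROWS = 100_000_000        # fact table size from the question
TOTAL_COLUMNS = 60        # approximate column count from the question
BYTES_PER_VALUE = 8       # assumed average width
DISK_MB_PER_S = 80        # assumed sequential rate for a single 7.2k RPM drive

# Row store with no usable index or aggregate: every column is read.
full_scan = scan_seconds(ROWS, TOTAL_COLUMNS, BYTES_PER_VALUE, DISK_MB_PER_S)
# Column store reading only the columns the query touches (assume 5).
narrow_scan = scan_seconds(ROWS, 5, BYTES_PER_VALUE, DISK_MB_PER_S)

print(f"all 60 columns: {full_scan / 60:.1f} minutes")
print(f"5 columns only: {narrow_scan:.0f} seconds")
```

Under these assumptions, whether "a minute or two" is realistic on a single disk depends largely on how much of the fact table each query has to read, which is a property of the storage layout at least as much as of the MDX layer.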
Teradata’s nebulous cloud strategy
As the pun goes, Teradata’s cloud strategy is – well, it’s somewhat nebulous. More precisely, for the foreseeable future, Teradata’s cloud strategy is a collection of rather disjointed parts, including:
- What Teradata calls the Teradata Agile Analytics Cloud, which is a combination of previously existing technology plus one new portlet called the Teradata Elastic Mart(s) Builder. (Teradata’s Elastic Mart(s) Builder Viewpoint portlet is available for download from Teradata’s Developer Exchange.)
- Teradata Data Mover 2.0, coming “Soon”, which will ease copying (ETL without any significant “T”) from one Teradata system to another.
- Teradata Express DBMS crippleware (1 terabyte only, no production use), now available on Amazon EC2 and VMware. (I don’t see where this has much connection to the rest of Teradata’s cloud strategy, except insofar as it serves to fill out a slide.)
- Unannounced (and so far as I can tell largely undesigned) future products.
Teradata openly admits that its direction is heavily influenced by Oliver Ratzesberger at eBay. Like Teradata, Oliver and eBay favor virtual data marts over physical ones. That is, Oliver and eBay believe that the ideal scenario is that every piece of data is only stored once, in an integrated Teradata warehouse. But eBay believes and Teradata increasingly agrees that users need a great deal of control over their use of this data, including the ability to import additional data into private sandboxes, and join it to the warehouse data already there.
The Teradata Elastic Mart(s) Builder Viewpoint portlet automates the inclusion of outside data. If you’re already an authorized Teradata data warehouse user, you can fill in a very short form (three or so fields) and add authorization to import outside data, e.g. from a .CSV file. No fuss, little bother. Trivial as that sounds, when you combine it with Teradata’s pre-existing robust workload management tools, it creates a pretty good virtual data mart story.
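The underlying pattern, importing a small outside file into a private sandbox table and joining it to shared warehouse data, can be sketched with Python's built-in `sqlite3` and `csv` modules. This is only an analogy for the workflow; it does not use Teradata or the portlet, and every table name, column, and data value below is made up.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for data that already lives in the central warehouse.
conn.execute("CREATE TABLE warehouse_sales (store_id INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (1, 50.0), (3, 75.0)],
)

# Outside data a user brings along, e.g. a small .CSV file.
outside_csv = io.StringIO("store_id,region\n1,East\n2,West\n3,East\n")

# The user's private sandbox table, stored next to the warehouse data.
conn.execute("CREATE TABLE sandbox_store_regions (store_id INTEGER, region TEXT)")
reader = csv.DictReader(outside_csv)
conn.executemany(
    "INSERT INTO sandbox_store_regions VALUES (?, ?)",
    [(int(row["store_id"]), row["region"]) for row in reader],
)

# Join the imported data to the warehouse data in place, with no copy
# of the warehouse table into a separate physical mart.
rows = conn.execute(
    """
    SELECT r.region, SUM(s.revenue)
    FROM warehouse_sales AS s
    JOIN sandbox_store_regions AS r ON r.store_id = s.store_id
    GROUP BY r.region
    ORDER BY r.region
    """
).fetchall()
print(rows)
```

What the sketch leaves out is exactly what makes the real thing work at scale: authorization, space quotas, and workload management to keep sandbox queries from starving production ones.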
Spinning out and maintaining consistency with physical data marts is a different matter. Teradata doesn’t seem too sure it believes in those. And while Teradata is obviously planning to increase its capability in that regard anyway, I didn’t get a lot of detail beyond the reference to Data Mover 2.0.
Related links
- My Greenplum-inspired post on the future of data marts, outlining issues in “private cloud” data warehousing.
- eBay’s “Analytics as a Service” pitch (about 1 ½ years old)
- A post by Teradata’s Dan Graham explaining the Teradata Agile Analytics Cloud and Elastic Mart(s) Builder Viewpoint portlet
- Home page and complete screen shot for the Teradata Elastic Mart(s) Builder Viewpoint portlet
Teradata hardware strategy and tactics
In my opinion, the most important takeaways about Teradata’s hardware strategy from the Teradata Partners conference last week are:
- Teradata’s future lies in solid-state memory. That’s in line with what Carson Schmidt told me six months ago.
- To Teradata’s surprise, the solid-state future is imminent. Teradata is 6-9 months further along with solid-state drives (SSD) than it thought a year ago it would be at this point.
- Short-term, Teradata is going to increase the number of appliance kinds it sells. I didn’t actually get details on anything but the new SSD-based Blurr, but it seems there will be others as well.
- Teradata’s eventual future is to mix and match parts (especially different kinds of storage) in a more modular product line. Teradata Virtual Storage is of pretty limited value otherwise. I probably believe Teradata will go modular more emphatically than Teradata itself does, because I think doing so will meet users’ needs more effectively than if Teradata relies strictly on fixed appliance configurations.
In addition, some non-SSD componentry tidbits from Carson Schmidt include:
- Teradata really likes Intel’s Nehalem CPUs, with special reference to multi-threading, QuickPath interconnect, and integrated memory controller. Obviously, Nehalem-based Teradata boxes should be expected in the not too distant future.
- Teradata really likes Nehalem’s successor Westmere too, and expects to be pretty fast to market with it (faster than with Nehalem) because Nehalem and Westmere are plug-compatible in motherboards.
- Teradata will go to 10-gigabit Ethernet for external connectivity on all its equipment, which should improve load performance.
- Teradata will also go to 10-gigabit Ethernet to play the Bynet role on appliances. Tests are indicating this improves query performance.
- What’s more, Teradata believes there will be no practical scale-out limitations with 10-gigabit Ethernet.
- Teradata hasn’t decided yet what to do about 2.5” SFF (Small Form Factor) disk drives, but is leaning favorably. Benefits would include lower power consumption and smaller cabinets.
- Also on Carson’s list of “exciting” future technologies is SAS 2.0, which at 6 gigabits/second doubles the I/O bandwidth of SAS 1.0.
- Carson is even excited about removing uninterruptible power supplies from the cabinets, increasing space for other components.
- Teradata picked Intel’s Host Bus Adapters for 10-gigabit Ethernet. The switch supplier hasn’t been determined yet.
Let’s get back now to SSDs, because over the next few years they’re the potential game-changer. The big news on SSDs is that after last year’s Teradata Partners conference, a stealth supplier* introduced itself and convinced Teradata it offers really great SSD technology. For example, not a single SSD it has provided Teradata has ever failed. (In hardware, that is. There have of course been firmware bugs, suitably squashed.) I think SSD performance is also exceeding Teradata’s expectations. This supplier is where the 6-9 month time-to-market gain comes from.
*Based on how often the concept of “stealth” and “name is NDAed” came up, I do not believe this is the SSD company another vendor told me about that is going around claiming it has a Teradata relationship.
Teradata SSD highlights include:
- I/O speeds on “random medium blocks” are 520 megabytes/second, vs. 15 MB/second on their fastest disks. And that’s limited by SAS 1.0, load-balanced across two devices, not the hardware itself. (2 x 300+ MB/sec turns out to be 520 MB/sec in this case.) No wonder Carson is excited about SAS 2.0.
- Teradata is using SAS interfaces for its SSDs, and believes that’s unusual, in that other companies are using SATA or Fibre Channel.
- Never having had a part fail, Teradata has no real basis to make MTTF (Mean Time To Failure) estimates for its SSDs.
- Teradata’s SSD appliance design includes no array controllers. The biggest reason is that right now array controllers can’t keep up with the SSDs’ speed.
- In its SSD appliance, Teradata has abandoned RAID, doing mirroring instead via a DBMS feature called Fallback that’s been around since Teradata’s earliest days. (However, unlike Oracle in Exadata, Teradata continues to use RAID for disks.)
- Useful life for Teradata’s SSDs is estimated at 5-7 years.
- Teradata’s SSDs are SLC (Single-Level Cell), as opposed to MLC (Multi-Level Cell).
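The bandwidth figures in the first bullet can be checked with a little arithmetic. The 520 and 15 MB/second numbers and the 6 gigabit/second SAS 2.0 rate come from the text above; the 3 gigabit/second SAS 1.0 line rate and the 8b/10b encoding overhead are standard SAS facts I'm adding, and the "ceiling" ignores all other protocol overhead.

```python
def sas_payload_mb_per_s(line_rate_gbit: float) -> float:
    """Approximate payload bandwidth of one SAS link.

    SAS 1.0 and 2.0 use 8b/10b encoding, so 10 bits travel on the wire
    for every 8 bits of data.
    """
    data_bits_per_second = line_rate_gbit * 1e9 * 8 / 10
    return data_bits_per_second / 8 / 1e6


SSD_MB_PER_S = 520    # "random medium blocks" on SSD, from the text
DISK_MB_PER_S = 15    # same workload on the fastest disks, from the text
LINKS = 2             # load-balanced across two devices

sas1_ceiling = LINKS * sas_payload_mb_per_s(3.0)   # SAS 1.0
sas2_ceiling = LINKS * sas_payload_mb_per_s(6.0)   # SAS 2.0

print(f"SSD vs. disk speedup:       {SSD_MB_PER_S / DISK_MB_PER_S:.1f}x")
print(f"SAS 1.0 ceiling (2 links):  {sas1_ceiling:.0f} MB/s")
print(f"Share of ceiling achieved:  {SSD_MB_PER_S / sas1_ceiling:.0%}")
print(f"SAS 2.0 ceiling (2 links):  {sas2_ceiling:.0f} MB/s")
```

In other words, the measured figure sits fairly close to what two SAS 1.0 links can carry, which fits the claim that the interface rather than the SSD hardware is the limit.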
Reports of perfectly-balanced hardware configurations are greatly exaggerated
Data warehouse appliance and software appliance vendors like to claim that they’ve worked out just the right hardware configuration(s), and that a single configuration is correct for a fairly broad range of workloads. But there are a lot of reasons to be dubious about that. Specific vendor evidence includes:
- Teradata ascribes considerable importance to a Virtual Storage technology whose main purpose is to allow mixing of heterogeneous storage devices in a single system. And the discussion rarely suggests that these parts will be in a rigid fixed relationship.
- Netezza — as Teradata keeps reminding me — often sells boxes with the expectation that they won’t be filled with data, so as to increase spindle count and hence performance.
- Oracle/Sun have dropped some comments about Exadata being more flexibly configured going forward.
- Kickfire’s new “high-end” appliance lets you attach fairly arbitrary amounts of external storage.
- And of course, software-only analytic DBMS vendors run their software in all sorts of hardware and storage environments.
What’s more, the claim never made a lot of sense anyway. With the rarest of exceptions, even a single data warehouse’s workload will contain different queries that strain different parts of the system in different ratios. Calculating the “ideal” hardware configuration for that single workload would be forbiddingly difficult. And even if one could calculate it, it almost surely would be different than another user’s “ideal” configuration. How a single hardware configuration can be “ideally balanced” for a broad class of use cases boggles the imagination.
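A toy model makes the point. Suppose each query needs some amount of CPU work and some amount of I/O, and its runtime is set by whichever resource is the bottleneck. The model, the two queries, and the hardware numbers below are all invented for illustration; real systems have many more dimensions (RAM, interconnect, concurrency).

```python
from dataclasses import dataclass


@dataclass
class Query:
    name: str
    cpu_core_seconds: float   # total CPU work, assumed perfectly parallelizable
    io_gigabytes: float       # data that must be read from storage


def bottleneck(q: Query, cores: int, io_gb_per_s: float) -> str:
    """Which resource limits this query on the given configuration."""
    cpu_time = q.cpu_core_seconds / cores
    io_time = q.io_gigabytes / io_gb_per_s
    return "cpu" if cpu_time > io_time else "io"


def balanced_cores_per_gb_per_s(q: Query) -> float:
    """Cores needed per GB/s of I/O for CPU time to equal I/O time."""
    return q.cpu_core_seconds / q.io_gigabytes


# Two hypothetical queries from the same warehouse workload.
report = Query("big scan-and-sum report", cpu_core_seconds=40, io_gigabytes=200)
mining = Query("data mining model build", cpu_core_seconds=800, io_gigabytes=50)

for q in (report, mining):
    print(
        f"{q.name}: bound by {bottleneck(q, cores=8, io_gb_per_s=2.0)} on 8 cores "
        f"and 2 GB/s; balanced at {balanced_cores_per_gb_per_s(q):.1f} cores per GB/s"
    )
```

With these made-up numbers the two queries want core-to-bandwidth ratios that differ by a factor of 80, so any single configuration leaves one of them badly unbalanced.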
Greenplum Single-Node Edition — sometimes free is a real cool price
Greenplum is announcing today that you can run Greenplum software on a single 8-core commodity server, free. First and foremost, that’s a strong statement that Greenplum wants enterprises to pay it for Greenplum’s parallelization/“private cloud” capabilities. Second, it may be an attractive gift to a variety of folks who want to extract insight from terabyte-scale databases of various kinds.
Greenplum Single-Node Edition:
- Is free of charge, although you can buy support.
- Has no restrictions on use, production or otherwise.
- Has no restrictions on database size.
- Is closed-source.
For those who want free, terabyte-scale data warehousing software, Greenplum Single-Node Edition may be quite appealing, considering that the main available alternatives are:
- General-purpose open-source DBMS, such as PostgreSQL and MySQL (lacking analytic DBMS performance and features)
- Infobright Community Edition (the other best choice – Infobright’s commercial sales success indicates the solidity of Infobright’s technology)
- Rough research-project code and other questionable open source offerings
- Crippleware from other commercial analytic DBMS vendors (e.g., Teradata)
For example, comparing PostgreSQL-based Greenplum with PostgreSQL itself, Greenplum offers:
- The ability to scale out queries across all cores in your box (and no, pgpool is not a serious alternative)
- Storage alternatives such as columnar (I am told that EnterpriseDB recently stopped funding a project for a PostgreSQL columnar option)
Greenplum would surely also argue that its software is superior to PostgreSQL in parallel load, compression, MapReduce integration, and general fit-and-finish. I imagine that in some (perhaps not all) cases it would be right. PostgreSQL’s main technical advantages over Greenplum would probably lie in the area of datatype extensibility.
The main target users for Greenplum’s Single-Node Edition are obviously individual enterprise power users or very small analytic teams. I.e., it’s people with a data mart need that a central data warehouse isn’t meeting. Potential benefits to Greenplum include:
- Adding value to its Enterprise Data Cloud story
- Seeding the market for future enterprise sales
- Depriving competitors of revenue, perhaps at enterprises too small to ever be paying Greenplum customers
In addition, I see free Greenplum as a charity offering that could be appealing to scientists who face PostgreSQL performance limitations.
Related links
- Greenplum Free Single-Node Edition press release (I’m quoted)
- MySQL Performance blog on MonetDB and Infobright community edition
- PostgreSQL’s restriction to one core per query
- Infobright’s restriction to one core per query
This week at the Teradata Partners user conference
Teradata tells me that its press embargoes are ending at 9:00 this morning. Here are some highlights of what’s going on, although names, dates, and details will have to await conversations and press releases this week.
- Teradata is productizing “private cloud,” under names including “Teradata Enterprise Analytics Cloud,” “Teradata Agile Analytics Cloud,” and “Teradata Elastic Mart Builder.” I.e., Teradata hopes to leapfrog Greenplum in its “Enterprise Data Cloud” strategy. This is only fair, in that Greenplum lifted the idea from Teradata and eBay in the first place. It also provides major support for what I think is an extremely sensible trend. Give or take issues of who announces and ships what a couple months before or after a competitor, my early thinking is that the main differences between Greenplum and Teradata in this regard will be:
- Virtual as opposed to just physical data marts, based on robust workload management software. (Advantage: Teradata)
- Pricing, deployment options. (Advantage: Greenplum)
- Features that don’t directly relate to enterprise/private cloud. (Advantage: Either, often Teradata.)
- Teradata is generally strengthening its data movement technology, e.g. for making various appliances work in sync. I’m not too clear yet on the details of that. I think this is what Teradata’s phrase “ecosystem management” refers to.
- Teradata is (pre-)announcing – at least as a statement of direction — an appliance based on solid-state drives (SSDs). I’ve thought for a while that Teradata was a leader in thinking through the issues around solid-state memory in data warehousing, so it makes sense that they’re among the leaders in actually coming to market as well. I plan to say more after meeting with, e.g., Carson Schmidt.
- Teradata has achieved a 300%ish speed-up in geospatial processing. I gather this is largely a byproduct of the parallel analytics work Teradata did around strengthening its SAS integration. However, there don’t seem to be a lot of Teradata geospatial users yet.
- Teradata Express, Teradata’s free Windows-based crippleware, is being ported to Amazon EC2 and VMware as well. Presumably to avoid cannibalizing Teradata product sales, there are quite a few limitations on Teradata Express, including system capacity, database size, and “no production use.”
- Teradata continues to extend its optimizations to handle queries issued by business intelligence tools. Previously, the focus of what Teradata discussed in this regard was query rewrite. But soon automatic recommendation and creation of Aggregate Join Indexes (i.e., materialized views) will be included as well.
Greenplum customer notes
In a briefing about a forthcoming product announcement, Greenplum threw in a slide saying:
- Greenplum is getting 12-15 new (paying) customers per quarter, all of whom it fondly refers to as “Tier 1” enterprises.
- Greenplum will hit the 100+ customer mark this quarter (thus joining Vertica and Infobright).
- <10% of Greenplum business is now “influenced” by Sun hardware.
I asked Ben Werther to unpack that last claim for me. He quickly noted that it wasn’t his slide, but rather had been put together by colleagues. That said:
- As of the past quarter or two, <10% of Greenplum’s sales activity is on Sun, which works out to maybe one sale per quarter and at most a small number of sales cycles. (That’s down from 50%+ not that long ago.)
- Most Greenplum business is now on HP or Dell equipment. Some is on IBM. There are some interesting sales cycles on Cisco’s new UCS (Unified Computing System) blades, but no closed deals yet. EMC seems to be part of the Cisco story.
No doubt part of the reason for the move away from Sun equipment is the impending Oracle acquisition. Another may be that the Greenplum/Sun appliance is somewhat underpowered. E.g., without particularly high levels of compression, eBay puts over 60 terabytes of data on each Greenplum node, which probably isn’t ideal from the standpoint of query performance.
Greenplum also says that 50% or so of sales are subscription-priced, rather than perpetual-licensed. I don’t have a sense for how long that’s been going on. (Edit: Ben Werther tells me this has been true for over a year.)
Three big myths about MapReduce
Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:
- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm
- MapReduce is a single technology
So let’s give it a try.
When Dave DeWitt and Mike Stonebraker leveled their famous blast at MapReduce, many people thought they overstated their case. But one part of their story – one that both Mike and Dave say was most central to their case – was never effectively refuted, namely the claim that these ideas aren’t particularly new. I haven’t actually read enough computer science literature to have an independent opinion on that issue. But I’ll say this – claims from companies such as SenSage, Oracle, or Splunk that “We’ve been doing MapReduce all along” seem pretty credible to me.
True, what those companies were doing may not have looked exactly like the instant-classic MapReduce programming paradigm. But the same is true of many things almost everybody would agree count as MapReduce. In particular, it is often not the case that you alternate Map and Reduce steps, each of whose outputs is a set of simple <Key, Value> pairs, with data redistributed based on Key at every step.
Here are some examples of what I mean, drawn from my recent MapReduce webinar.
- If you do text indexing in MapReduce, your goal is to wind up with a text index. So at some point you Reduce to a pair <WordName, {all the (DocumentID, offset) pairs for the whole corpus, suitably ordered}>. That’s a heckuva compound “Value”.
- The goal of data mining is usually to estimate a rather small number of parameters based on a large overall data set, often (depending on algorithm) in the form of a single vector. When you do that in MapReduce, you partition data among nodes and calculate something on each node that is structured more or less like your final vector. So when it comes time for the Reduce, you just ship all of your vectors (one per node) to a single Reduce node, and do the appropriate math. Redistribution based on Key would be quite pointless.
- When you sessionize clickstream logs in MapReduce, you may have just as many output records as input records. However, they now are reformatted, and might have a SessionID appended. In those cases, Reduce isn’t doing much by the way of reduction.
- And as happens in some Vertica-Hadoop use cases around mortgage trading, sometimes MapReduce can even make data sets vastly larger.
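To make the text indexing example concrete, here is a minimal single-process sketch of the idea. The function names (`map_document`, `reduce_postings`, `build_index`) and the toy corpus are my own illustrative assumptions, not anybody's actual API; the point is only that the Reduce output for each word is a whole sorted posting list rather than a simple scalar Value.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Map step: emit (word, (doc_id, offset)) for every word occurrence."""
    for offset, word in enumerate(text.lower().split()):
        yield word, (doc_id, offset)

def reduce_postings(word, postings):
    """Reduce step: the "Value" is an entire sorted posting list."""
    return word, sorted(postings)

def build_index(corpus):
    """Run map, group by key (standing in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for word, posting in map_document(doc_id, text):
            grouped[word].append(posting)
    return dict(reduce_postings(w, p) for w, p in grouped.items())

# Hypothetical two-document corpus, purely for illustration.
corpus = {1: "map then reduce", 2: "reduce then map again"}
index = build_index(corpus)
```

Here `index["reduce"]` holds every (DocumentID, offset) pair for that word across the corpus, which is the compound Value described above.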
By no means do I think this is a weakness of the MapReduce programming paradigm. Rather, I think it’s a MapReduce strength. But it’s not quite the way MapReduce has been promoted and explained to the IT public.
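The data mining case can be sketched the same way. I'm assuming simple least-squares line fitting as the algorithm, and the names `map_partition` and `reduce_vectors` are hypothetical; real data mining jobs would use whatever summary vector suits their algorithm. Each node boils its partition down to one small vector, and a single Reduce step adds the vectors up and does the math, with no redistribution by Key.

```python
def map_partition(rows):
    """Per-node step: summarize a partition of (x, y) rows as one small vector."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def reduce_vectors(vectors):
    """Single-reducer step: add the per-node vectors, then solve for the line."""
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*vectors))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Three hypothetical "nodes", each holding part of a data set on the line y = 2x + 1.
partitions = [
    [(0, 1), (1, 3)],
    [(2, 5), (3, 7)],
    [(4, 9)],
]
slope, intercept = reduce_vectors([map_partition(p) for p in partitions])
```

However large the partitions get, only one five-element vector per node ever travels to the Reduce node.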
Finally: MapReduce, as commonly conceived, spans two different – albeit closely related – technology domains:
- Parallel programming
- Distributed data management
For example, I imagine Greenplum’s and Vertica’s MapReduce/SQL combined syntaxes are very similar to each other. But Vertica’s data management implementation of MapReduce, which relies on Hadoop, is very different from Greenplum’s, which is tied into the Greenplum DBMS. Similarly, non-DBMS MapReduce implementations are commonly associated with distributed file systems, notably HDFS (Hadoop Distributed File System) or Google’s internal GFS (Google File System). In those systems, the parallel language execution part should be aware of how the distributed file management part works, but perhaps that awareness can be pretty lightweight.
Right now, this is a distinction pretty much without a difference. If you choose an implementation of MapReduce — like pure Hadoop (say in the Cloudera distribution) or Hadoop-Vertica or Aster Data’s SQL/MapReduce – you’re basically picking an entire technology stack. But those stacks are going to do a whole lot of changing and maturing in the near future – and as they do, it’s likely that projects will interact or even combine in all sorts of interesting ways.
Bottom line: There are a lot of different ways to exploit MapReduce-related technology.
Introduction to SenSage
I visited with SenSage on my two most recent trips to San Francisco. Both visits were, through no fault of SenSage’s, hasty. Still, I think I have enough of a handle on SenSage basics to be worth writing up.
General SenSage highlights include:
- SenSage used to be known as Addamark.
- SenSage used to characterize itself as being in the Security Information Management (SIM) market.
- Now SenSage characterizes itself (approximately) as selling technology built around a columnar DBMS that happens to be pretty good at log analysis, compliance, and/or archiving.
- More concisely, SenSage says it is in the event data warehouse category. (The same could arguably be said of Splunk.)
- SenSage says it has >400 paying customers, of which ~200 are direct.
- SenSage has >120 employees and, like Splunk, is profitable.
- SenSage has enjoyed >50% annual revenue growth the past four years.
- Some SenSage deals are in the multiple-million dollar range.
- A major SenSage channel partner (dozens of installations) is SAP, which resells SenSage software on HP hardware as a “Compliance Log Warehouse.”
- A hot market for SenSage is CDRs (Call Detail Records).
- SenSage says that, among analytic DBMS vendors, it competes with Oracle, IBM, Teradata, Netezza and, to some extent, Vertica and Greenplum.
Technical SenSage highlights include:
- SenSage’s core technology is an append-only columnar DBMS, with no master node.
- SenSage’s DBMS uses no indexes and requires “no” database administration.
- SenSage’s database is range-partitioned, with the range-partition key always being time.
- SenSage has something it calls SQO (Sparse Query Optimization), which sounds a lot like Netezza zone maps. SQO never yields a false negative on whether data is in a block, never yields a false positive on equality predicates, and only rarely yields a false positive on range predicates.
- SenSage’s database uses large block sizes – typically 250,000 records/block, at 200-250 bytes per record. (That’s in the range of 64 megabytes/block.)
- SenSage says its software can load 10,000-50,000 records/second/node. At the 200-250 bytes per record mentioned above, and if I’m doing the arithmetic correctly, that’s roughly 7-40 gigabytes/node/hour.
- SenSage collects log data into its event data warehouse in what it characterizes as an agentless manner. Even so, it seems that for a majority of kinds of data sources one does have to write custom agents. The two other ways to get data into SenSage – and presumably most of the data volume comes through these – are:
- File transfer in the usual way
- syslog
- SenSage says its software can read 100s of data sources, and that this is a huge competitive advantage. I’m not totally sure how that jibes with the prior point.
- SenSage says it gets 5X compression on CDR data, 10-20X on other kinds of logs. That’s not too far off from Vertica’s compression figures.
- SenSage says that it has datatype-aware compression as well as more standard stuff, with VARCHAR compressing particularly well.
- In particular, SenSage uses both dictionary/token and delta compression.
- SenSage’s software is pretty agnostic with respect to storage kind – DAS (Direct Attached Storage), SAN (Storage-Area Network), or content-addressable. In particular, there’s only about a 4% performance hit for using content-addressable storage.
- When using WORM (Write Once Read Many) storage like EMC’s Centera, SenSage leaves record locator information behind on ordinary storage and otherwise queries the WORM storage just like it queries anything else.
- SenSage says it has been using MapReduce since “Day 1”.
- Probably not coincidentally, you can use Perl and other aggregates in SenSage SQL statements.
- Perhaps also not coincidentally, SenSage says it has a number of advanced built-in analytic functions, including some focused on sessionization.
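Since SQO was compared above to zone maps, here is a minimal sketch of the generic zone-map idea: keep a min/max summary per block and skip any block whose summary can't overlap the query range. This is not SenSage's actual design, which I haven't seen; `Block`, `blocks_to_scan`, and `range_query` are hypothetical names, and the tiny blocks stand in for the large ones described above.

```python
from dataclasses import dataclass

@dataclass
class Block:
    rows: list       # values of one column stored in this block
    lo: int = 0      # smallest value in the block
    hi: int = 0      # largest value in the block

    def __post_init__(self):
        # Compute the per-block summary once, when the block is written.
        self.lo, self.hi = min(self.rows), max(self.rows)

def blocks_to_scan(blocks, lo, hi):
    """Keep only blocks whose [lo, hi] summary overlaps the query range."""
    return [b for b in blocks if b.hi >= lo and b.lo <= hi]

def range_query(blocks, lo, hi):
    """Scan the surviving blocks; skipped blocks cannot hold any matches."""
    candidates = blocks_to_scan(blocks, lo, hi)
    hits = [v for b in candidates for v in b.rows if lo <= v <= hi]
    return hits, len(candidates)

blocks = [Block([1, 5, 9]), Block([10, 40, 90]), Block([100, 150, 199])]
hits, scanned = range_query(blocks, 20, 30)
```

The query for 20-30 scans the middle block and finds nothing, i.e. a false positive on a range predicate, while no block that could contain a match is ever skipped. Plain min/max summaries can give false positives on equality predicates too, so presumably SQO does something beyond this sketch to back up the no-false-positives-on-equality claim.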
In addition to all that, SenSage offers a built-in event processing engine, consisting of:
- A finite-state machine correlation engine.
- A proprietary event processing language.
- A GUI to “abstract” (i.e., generate?) the event processing language.
The SenSage event processing engine is used to generate alerts. Data that comes into SenSage actually is passed to two places at once, namely to both the event processing engine and the database itself.
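As a rough illustration of what a finite-state correlation rule plus the two-destination data flow might look like, here is a small sketch. Everything in it is hypothetical: the rule (a successful login after three or more failures), the event fields, and the names `FailedLoginCorrelator` and `ingest` are my own, and SenSage's proprietary event processing language surely looks different.

```python
from collections import defaultdict

class FailedLoginCorrelator:
    """Per-user state machine: the state is the count of consecutive failed logins."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = defaultdict(int)

    def feed(self, event):
        """Advance the machine by one event; return an alert string or None."""
        user, action = event["user"], event["action"]
        if action == "login_failed":
            self.failures[user] += 1
            return None
        if action == "login_ok":
            suspicious = self.failures[user] >= self.threshold
            self.failures[user] = 0  # a success resets the state
            if suspicious:
                return f"ALERT: {user} logged in after repeated failures"
        return None

def ingest(events, correlator, warehouse):
    """Send every event to both the warehouse and the correlation engine."""
    alerts = []
    for event in events:
        warehouse.append(event)          # path 1: the database
        alert = correlator.feed(event)   # path 2: the event processing engine
        if alert:
            alerts.append(alert)
    return alerts

events = (
    [{"user": "bob", "action": "login_failed"}] * 3
    + [{"user": "bob", "action": "login_ok"},
       {"user": "amy", "action": "login_ok"}]
)
warehouse = []
alerts = ingest(events, FailedLoginCorrelator(), warehouse)
```

Every event lands in the warehouse whether or not it triggers an alert, which matches the two-places-at-once flow described above.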
Technical introduction to Splunk
As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that’s in early days at best.
Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include:
- Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs. However, in the latter case indexing is turned off. Thus, Splunk does not portray its software as “agentless.” However, it asserts that its agent-like software runs without “material” overhead.
- The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
- Splunk tries to figure out what the individual entries are in a section of log it looks at. In particular:
- Time stamps are a big clue in this “inferencing” process, but they are not the be-all and end-all.
- Nor are line boundaries, if logs are naturally broken up into lines. (Splunk threw that latter comment in as a shot at SenSage.)
- I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.
- Splunk has a simple ILM (Information Lifecycle Management) story based on time. I didn’t probe for details.
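Here is a small sketch of the two ideas in the list above: using timestamps (rather than line boundaries alone) to find event boundaries in a log increment, and pulling out clearly marked name=value pairs at search time. The timestamp format, the regular expressions, and the function names are all my own assumptions for illustration; Splunk's actual inferencing is presumably a good deal more elaborate.

```python
import re

# Assumed timestamp shape: "2009-10-15 10:00:01" at the start of a line.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
# Assumed field shape: name=value or name="quoted value".
NAME_VALUE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def split_events(increment):
    """Group lines into events; a line without a timestamp continues the previous event."""
    events = []
    for line in increment.splitlines():
        if TIMESTAMP.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events

def extract_fields(event):
    """Search-time extraction of clearly marked name=value pairs."""
    return {k: v.strip('"') for k, v in NAME_VALUE.findall(event)}

# Hypothetical increment: whatever was appended to the log since the last look.
increment = (
    "2009-10-15 10:00:01 status=500 path=/cart\n"
    "  Traceback: connection reset\n"
    "2009-10-15 10:00:02 status=200 path=/home user=\"a b\"\n"
)
events = split_events(increment)
fields = [extract_fields(e) for e in events]
```

The three lines come out as two events, because the traceback line is folded into the event that precedes it. Nothing is extracted until `extract_fields` is called, which is the search-time flavor described above.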
Given its text search engine, Splunk does – well, it does text searches. And it stores searches, so they can be used for alerting or reporting. Indeed, Splunk persists and presumably updates results to stored searches, in a rough analog to materialized views.
Apparently, Splunk’s indexing is typically done via MapReduce jobs. I don’t know whether any actual Splunk searches are also done via MapReduce; surely they aren’t all, given the discussion of a near-real-time alerting engine and so on. Splunk fondly believes its MapReduce is an order of magnitude faster than SQL (I didn’t ask which SQL engines Splunk has in mind when they say this), and 5-10X faster than Hadoop. One efficiency trick is to look ahead and do Reduces in place where possible. This seems to be done automatically in the execution plan, à la Aster’s SQL-MapReduce, rather than having to be hand-coded. Splunk says its software can “easily” index 100-200 gigabytes of data per day on a commodity 8-core server, while maintaining an active search load, and that 300-400 gigabytes are doable.
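I read "do Reduces in place" as something like local pre-aggregation, i.e. what Hadoop calls a combiner, though that interpretation is my assumption rather than anything Splunk spelled out. A minimal sketch, with hypothetical function names and made-up log lines:

```python
from collections import Counter

def map_and_reduce_in_place(log_lines):
    """On one node: map each line to its first token and pre-aggregate counts locally."""
    return Counter(line.split()[0] for line in log_lines if line.strip())

def final_reduce(partials):
    """Merge the small per-node Counters instead of shuffling every raw record."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two hypothetical nodes, each holding part of the log.
node_a = ["ERROR disk full", "INFO started", "ERROR disk full"]
node_b = ["INFO started", "WARN slow query"]
partials = [map_and_reduce_in_place(n) for n in (node_a, node_b)]
totals = final_reduce(partials)
```

Each node ships one small Counter rather than all of its records, so the saving grows with the number of repeated keys per node.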
Splunk’s capabilities right now in tabular-style analytics seem to be limited to a command-line report builder, plus a GUI wizard that generates the command line. A few users have asked for support of third-party business intelligence tools, but Splunk hasn’t provided that yet. Nor can I find much evidence of ODBC/JDBC drivers for Splunk. But then, I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.
General introduction to Splunk
I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (whom some of you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.
Splunk’s technical stories include:
- Text search over log files.
- Business intelligence over text search. (That part sounds a lot like Attivio.)
- MapReduce with schema flexibility and smart multi-stage execution plans. (That part sounds a lot like Aster Data.)
More on those in a separate post.
Less technical Splunk highlights include:
- Splunk has ~1200 paying customers, and is adding a couple hundred more per quarter.
- Splunk has ~160 people.
- ~80% of Splunk sales are in North America.
- Typical Splunk sales prices are in the $10-50K range, with an average around $25K, or maybe that average is a bit over $30K. Some Splunk deals are six- or even seven-figure.
- Splunk is “quite profitable.”
- Splunk’s eponymous product is priced according to how much data is indexed per day. If you index half a gigabyte of logs per day or less, Splunk is completely free. So, while Splunk is closed-source, there’s something of an open-source-like Splunk adoption model.
- Splunk has been selling product for a couple of years. I gather Splunk 4 was recently released.
- Splunk’s biggest industry segments are, not too surprisingly,
- Telco
- Financial services
- Government
- “Online”
- Splunk’s paying customers seem to use it mainly for:
- Web logs and associated network event logs (this seems to be the biggest area)
- Security and perhaps other general IT log analysis
- Physical security logs (mainly in the government)
- Anti-fraud (I’m not sure how that works)
- One would think Splunk would be used to manage a lot of intelligence telemetry, but that wasn’t particularly hinted at.
- In general, the core problem Splunk is used for is log analysis for trouble-shooting purposes.
- Splunk’s nonpaying users are more diverse; examples mentioned included windmill operations and protein research.
- Splunk’s customers include Aster Data flagship accounts MySpace and LinkedIn. I bet many other top web companies are Splunk customers as well.
Kickfire capacity and pricing
Kickfire’s marketing communication efforts are still a work in progress. Kickfire did finally relax its secrecy about FPGA-vs.-custom-silicon – not coincidentally during Netezza’s recent publicity cycle. That wise choice helped Kickfire get some favorable attention recently for its technical and market strategy, e.g. from Daniel Abadi, Merv Adrian and, kicking things off — as it were — me. Weeks after a recent Kickfire product release, there’s finally a fairly accurate data sheet up, although there’s still one self-defeatingly misleading line I’ll comment on below. Pricing is a whole other area of confusion, although it seems that current list prices have been inadvertently* leaked in Merv’s post linked above, with only one inaccuracy that I can detect.**
*I gather from the company that they forgot to tell Merv pricing was NDA.
** Merv cited a price as “starting” that I believe to be top-of-the-line. No criticism of Merv is implied in that; Kickfire has not been very clear in communicating hard numbers.
All that said, if one takes Kickfire’s marketing statements literally, Kickfire list pricing is around $20-50K per terabyte for a few small, fixed, high-performance configurations. That’s all-in, for plug-and-play appliances. What’s more, that range is based on the actual published user data capacity numbers for various Kickfire models, which I think are low for several reasons:
- Kickfire doesn’t officially admit that its model with 14.4 terabytes of disk can manage more than 6 terabytes of data, even though it clearly can.
- Actually, those 14.4 terabytes of disk can be increased or lowered as you choose.
- The basic compression figures implied in those calculations seem conservative.
- Compression figures are a lot more conservative yet, in that Kickfire assumes you’ll have a lot of actual indexes on your data. I’m not sure that’s necessary for most workloads.
MapReduce webinars and annotated slides
As previously noted, I’m giving a webinar twice today — i.e., Thursday, October 15 – at 10:00 am and 1:00 pm Eastern time.
- The subject is MapReduce.
- The sponsor is Aster Data.
- Part of the webinar will be an explanation of MapReduce basics, especially the conflict between theory/propaganda and reality.
- As you might guess from the identity of the sponsor, there will be an emphasis on how MapReduce and SQL play nicely with each other.
- You can register for the webinar on Aster’s site.
- The webinar replay is supposed to be posted by the end of the week. I’ll edit this post accordingly when a link to the replay is up.
- I’ve already uploaded the slides from which I will present. (But not the ones from which Aster folks will be talking. I’ve seen those, and there’s some good technical crunch in some of them.) The “Notes” under the slides have a number of relevant URLs for follow-up, as well as a small number of explanatory comments (e.g., as to why one slide simply has a quote from and corresponding picture of Shakespeare).
Infobright notes
I had lunch w/ Bob Zurek and Susan Davis of Infobright today. This wasn’t primarily a briefing, but a few takeaways are:
- Infobright now has >100 paying customers.
- Typical database size is from the low 100s of gigabytes to the low single-digit number of terabytes.
- Agile development is at or approaching two-week release cycles.
- Like Kickfire, Infobright has a multi-year deal with MySQL that insulates it against many potential Oracle/MySQL shenanigans.
- From an industry perspective, Infobright’s customer base sounds a lot like other vendors’:
- Data mart outsourcing/online analytics
- Log files for websites
- Telecommunications
- Financial services
- OEM, especially in the markets cited above
- “Hey, we’re beginning to see the occasional energy deal”
- A few random others
- Infobright is seeing some household-name customers, who surely have big-name analytic DBMS products, but who also have a policy that open source is the default choice, and if open source can get the job done then the favorite closed-source choices aren’t used.
- Infobright has the usual open-source community story — lots of involvement and engagement in the forums, but contributions are limited mainly to connectivity, utility scripts, etc. (Maybe some national language translation too; I’m not sure.)
Greenplum is going hybrid columnar as well
Over the past summer, Vertica, VectorWise, and Oracle all announced flavors of hybrid row/columnar storage. Now it’s Greenplum’s turn. Greenplum is actually offering true columnar storage, as opposed to Oracle’s PAX-like scheme — and also as opposed to the kind of Frankencolumn storage Daniel Abadi decries. For example, you don’t have to do a join to retrieve multiple columns; you just ask for them and there they are. Similarly, Greenplum doesn’t maintain explicit row IDs – whether in row-oriented or column-oriented append-only storage – relying instead on block-level header information.
Highlights include:
- Column orientation is a special case of what Greenplum is calling Polymorphic Data Storage.*
- As per product management chief Ben Werther’s blog post, what Greenplum’s polymorphic data storage boils down to is that you can store different tables in different storage paradigms. This is transparent to the SQL or any other API; it’s just a performance choice.
- Indeed, Greenplum lets you store different partitions of the same table in different storage and/or compression schemes. So Greenplum now has a kind of ILM (Information Lifecycle Management) story, although it doesn’t offer the faster vs. cheaper storage media differentiation options of Sybase IQ or Vertica.
- Greenplum now has, depending on how one counts, three or four main types of table:
- Traditional PostgreSQL, which has been available since Day One
- Row-oriented append-only (compressible and scan-optimized), available since Greenplum 3.2 (July, 2008)
- Columnar append-only (new in Greenplum 3.3.4, shipping now)
- External, in which Greenplum treats something external – in a relational DBMS or otherwise – as if it were a Greenplum table
- Greenplum offers multiple versions of LZ (Lempel-Ziv) and gzip compression, any of which you can choose on a table-by-table or partition-by-partition basis.
- Greenplum offers the same compression algorithms for both row-oriented and column-oriented tables.
- Greenplum says that compression is typically at least 50% better (i.e., to 2/3 as much space) in columnar vs. row storage, for the same algorithm.
- Just as it doesn’t offer columnar-specific compression algorithms, Greenplum also doesn’t sport other columnar features Daniel loves, such as in-memory compression or late materialization. (But then, VectorWise doesn’t do in-memory compression either, and Daniel likes VectorWise.)
- All the Greenplum choices I’ve mentioned have to be made manually by DBAs.
- Similarly, I doubt Greenplum can match Vertica’s engineering for getting updates and trickle feeds quickly into a column store – a traditional columnar Achilles heel that Vertica has invested a lot of effort to circumvent.
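For anyone who wants to poke at the "same algorithm, different orientation" point, here is a small experiment sketch using Python's standard `zlib` module. The synthetic table, its columns, and the function names are my own assumptions; the result on this toy data says nothing about Greenplum's actual ratios, which will depend on real schemas and data.

```python
import random
import zlib

def make_rows(n, seed=0):
    """Synthetic log-like table: sequential id, low-cardinality status and country."""
    rng = random.Random(seed)
    return [
        (str(i), rng.choice(["ok", "fail", "retry"]), rng.choice(["US", "DE", "JP", "BR"]))
        for i in range(n)
    ]

def row_layout(rows):
    """Row orientation: each record's fields are stored next to each other."""
    return "\n".join(",".join(r) for r in rows).encode()

def column_layout(rows):
    """Column orientation: all values of one column are stored together."""
    return "\n".join(",".join(col) for col in zip(*rows)).encode()

rows = make_rows(10_000)
row_bytes, col_bytes = row_layout(rows), column_layout(rows)
row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))
print(f"raw {len(row_bytes)} bytes, row-compressed {row_size}, column-compressed {col_size}")
```

Both layouts hold exactly the same values and the same number of bytes before compression, so any difference in compressed size comes purely from how the values are arranged.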
*The term “polymorphic” is somewhat, shall we say, overloaded these days.


