Are analytic DBMS vendors overcomplicating their interconnect architectures?
I don’t usually spend a lot of time researching Ethernet switches. But I do think a lot about high-end data warehousing, and as I noted back in July, networking performance is a big challenge there. Among the very-large-scale MPP data warehouse software vendors, Greenplum is unusual in that its interconnect of choice is (sufficiently many) cheap 1 gigabit Ethernet switches.
A recent Network World story suggested that Greenplum wasn’t alone in this preference; other people also feel that clusters of commodity 1 gigabit Ethernet switches can be superior to higher-performing ones. So I pinged CTO Luke Lonergan of Greenplum for more comment. His response, which I got permission to publish, was:
It turns out that non-blocking bandwidth at large scale is very cheap now due to switch vendors using FAT tree internally for switches from 48 ports to 672 (See the Force10 ES1200 and others). Also as the SDSC authors point out, you can build larger non-blocking networks that scale to huge size based on FAT or CLOS. Some of us built them for supercomputers in 1998 (see http://lwn.net/2000/features/FSLCluster/), scaling even low latency supercomputers up to thousands of nodes.
So - bandwidth is cheap, latency is expensive. Data analysis is a bandwidth problem, not a latency problem.
To put it mildly, Ethernet switching is not one of my core areas of expertise. So I’m just throwing this subject out for discussion. Thoughts, anybody?
Sales figures for analytic DBMS
One of my clients asked how many new customers I thought were buying analytic DBMS each quarter. I don’t generally track such things, but hey — a client asked, so I did the best I could. And since I did the work, now I’ll share it generally. To wit:
- Teradata, Netezza, and Sybase are publicly traded companies. You can find out a lot from their SEC filings, which I prefer to find via the SEC’s own EDGAR site. (For most purposes, including this one, restrict yourself to forms 10-K, 10-Q, and S-1.) More detail may often be found in the companies’ own investor relations press releases, investor conference calls, and so on. Anyhow, the bottom line is that some form of customer count figure is public information.
- Vertica claims 50 paying customers, all within the past year. Greenplum also claims 50 paying customers, almost all within the past year.
- The other new analytic DBMS vendors one usually hears about (e.g. ParAccel, InfoBright, Kognitio, Exasol) are individually in the onesie-twosie range. Taken together, the newer companies have on the order of 10 new accounts per quarter. (OK, Kognitio isn’t really that new a company, but for most other purposes it belongs on the list.)
The client agreed with these figures.
Upon review, that figure of 10 could probably be made to look too low, just by expanding the list of vendors enough. E.g., how many copies of NeoView sell each quarter? How many commercial users are there of the open source MonetDB? Still, when assessing the “mainstream” market, it doesn’t feel misleading.
And by the way, different analytic DBMS vendors sometimes have the same new customers.
Enterprises are buying multiple brands of analytic DBMS each
Over the past few weeks I’ve had a lot of NDA discussions about analytic DBMS vendors’ specific customers. And so I’ve been acutely aware of something I already sort of knew — just as there was in prior generations of database management technology, there’s huge overlap among analytic DBMS vendors’ customer bases as well. As they always have, enterprises are investing in multiple different brands of DBMS, even in cases where those DBMS can do pretty much the same things.
For example:
- Many Teradata users are buying newer technology too. But they aren’t actually throwing out Teradata.
- The same sometimes applies to Netezza already. At least two Netezza references are also references for a rival vendor.
- One outfit is among the biggest customers for two different analytic DBMS vendors, neither of which is Teradata or Netezza.
- One corporation is using or deploying four different brands of analytic DBMS.
- TEOCO is a big user of both DATAllegro and Netezza.
BPEL as a source of application management metadata
Let’s put aside for now all the discussions about whether BPEL is an appropriate tool to capture a “true” business process, i.e. to implement the business logic understood by a business analyst (a topic that has been discussed at length already, including here, here, here, here, here, here and around the 5 minute mark of this podcast). Today, let’s look at it as simply another resource in a developer’s toolbox, alongside things like servlets and XML parsers. It’s a tool that can simplify the invocation of remote services (especially asynchronously), the parallelization of tasks, the definition of scoped compensation handlers, the transformation of XML, the encapsulation of key business logic and, most importantly, the reliable implementation of long-lived processes. If you need a few of these features, you might find BPEL a suitable programming tool. Plus, it refreshingly encourages handling of XML as XML (e.g. via XPath) rather than mindless code generation.
In addition to whatever developer productivity benefit you see in BPEL, there are other potential benefits from using it. They are the topic of this post and they relate to application management.
We all know that in an ideal world, no developer would release an application without providing a set of management capabilities that are carefully crafted to reflect the business logic of the application, so that IT administrators can monitor, configure, optimize and troubleshoot the application in ways that are related to what the application really does (as opposed to generic metrics like memory, CPU and I/O…).
Back in the real world, this is of course rarely the case. Enter BPEL. Just by virtue of using it in a reasonable way, and without any “just for the ops guys” metadata, BPEL provides a management model for the application. Sure it’s not as good as a hand-crafted management model, but at least it’s there. And it has some pretty compelling properties:
- It feeds directly from the metadata used by the runtime, so it is guaranteed to be accurate (unlike metadata that is created specifically for management but has no role in the actual runtime).
- It shows what external services the application depends on. Of course there is no guarantee that all remote invocation will be represented in the BPEL process, but since that’s a strength of BPEL it is reasonable to expect that it provides a good view of application dependencies (to be complemented, of course, by the application infrastructure dependencies like the database and the BPEL engine itself…). Remote invocations are a common point of failure and/or performance problems so they are a first class citizen of an application management model.
- It explicitly captures process instances. No more jumping from one database table to another (assuming you even know where to look) to try to get a sense of the current overall status. The BPEL instances show the number of in-flight transactions in the application. It is also easy to compare the initialization and termination rates to see the trend.
- It provides a horizontal segmentation of the processing tasks (via the BPEL activities) that is a good complement to the vertical segmentation often offered by application management tools (e.g. time spent in the database, time spent waiting on I/O, etc…).
- It makes explicit certain exception conditions.
All these only make use of very basic aspects of BPEL: the enumeration of PartnerLinks, the notion of a process instance, the existence of activities, the fault/compensation/termination handlers. A fair amount of visibility into the health of the application can be derived from this alone. I am not making fancy assumptions about the management tool being able to make sense of the routing logic in the process or of the correlation rules. I am not assuming that the BPEL engine provides ways to control individual process instances. I am not assuming that the name attributes of certain elements (e.g. PartnerLink, variable) convey semantics that could help the administrator understand some of the semantics of the application.
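As a rough illustration of how little is needed, here is a minimal Python sketch that derives in-flight instances, start and termination rates, and fault counts from nothing but instance lifecycle events. The event tuples, instance names and event labels are all hypothetical; a real BPEL engine exposes this information in its own way, and nothing here reflects a specific engine's API.

```python
from collections import Counter

# Hypothetical lifecycle events a BPEL engine might expose:
# (minute, process instance id, event name). Invented for illustration.
events = [
    (0, "p1", "started"), (0, "p2", "started"), (1, "p3", "started"),
    (1, "p1", "completed"), (2, "p4", "started"), (2, "p2", "faulted"),
    (3, "p5", "started"),
]

# Instances that were started, and instances that have ended either way.
started = {inst for _, inst, event in events if event == "started"}
ended = {inst for _, inst, event in events if event in ("completed", "faulted")}

# In-flight transactions are the started instances that have not ended.
in_flight = started - ended

# Compare initialization and termination rates over the observed window.
window_minutes = max(minute for minute, _, _ in events) + 1
start_rate = len(started) / window_minutes
end_rate = len(ended) / window_minutes

# Exception conditions made explicit: count how instances ended.
outcomes = Counter(event for _, _, event in events if event != "started")
```

In this made-up data the start rate is higher than the termination rate, which is the kind of trend an administrator would want to spot before the backlog of in-flight instances grows.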
In the end, it’s not about managing BPEL, it’s about managing an application that uses BPEL.
My point is not to push everyone to write any application as a BPEL process (or a set of them) as a way to get a great management infrastructure for free. But if BPEL is a potential choice for the application, then it’s worth considering those extra benefits in the “pros and cons” analysis. And if you have already decided to use BPEL, it may be worth looking into what management dividends you can harvest from this choice. Of course your mileage may vary depending on how manageable your BPEL infrastructure is. Hint hint…
A few related links. Todd Biske has also written about the management value of BPEL, here and here. A similar analysis can be applied to SCA, but at this point in time there are many more applications out there that use BPEL than SCA, making the former more relevant. I briefly described the SCA side of the equation in an earlier exchange with David Chappell. That discussion is summarized here (including a pointer to David’s original piece). In an earlier post, I touched on the manageability potential of other sources of application metadata, like OSGi and Spring (in addition to SCA and BPEL). Jean-Jacques Dubray provided additional context at InfoQ.
Vertica’s paying customer count
In a recent Computerworld article, Andy Ellicott of Vertica was cited as saying Vertica has 50 paying customers total. That’s very much on par with Greenplum’s figure, leaving aside any questions of deal size. (Greenplum runs a number of databases much larger than Vertica’s biggest. However, I believe Greenplum also charges a lot less per terabyte of user data.)
Previous Vertica paying customer count figures include:
Three approaches to parallelizing data transformation
Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the — probably SMP — system the data comes from.* If the data transformation is being applied on a record-by-record basis, then it’s automatically fully parallelized. Even if the transforms are more complex, considerable parallel processing may still be going on.
*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
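To make the record-by-record point concrete, here is a minimal Python sketch. It assumes a made-up `transform` function and uses a thread pool as a stand-in for the nodes of an MPP system; the round-robin `partition` helper is likewise hypothetical and much simpler than real data distribution. Because each record is transformed without looking at any other record, every "node" can work on its own slice with no coordination.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    """Hypothetical record-level cleanup: tidy the name, convert cents to dollars."""
    name, cents = record
    return (name.strip().upper(), cents / 100)

def partition(records, n_nodes):
    """Round-robin records across n_nodes slices (stand-in for MPP distribution)."""
    return [records[i::n_nodes] for i in range(n_nodes)]

def transform_partition(part):
    # Each "node" touches only its own rows; no cross-node communication.
    return [transform(r) for r in part]

def parallel_transform(records, n_nodes=4):
    parts = partition(records, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        results = list(pool.map(transform_partition, parts))
    # Flatten the per-node outputs back into one list.
    return [row for part in results for row in part]

records = [(" alice ", 1250), ("bob", 300), (" carol", 99), ("dave ", 10000), ("erin", 5)]
serial = [transform(r) for r in records]
parallel = parallel_transform(records, n_nodes=3)
```

The parallel run produces the same rows as the serial one, just in a different order. Transforms that need to see several records at once (deduplication, say) lose this property, which is where the other approaches come in.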
But depending on your needs, at least two other approaches to data transformation parallelization could also be considered. Pervasive Software, which has a big data integration software business of its own, built a new ETL tool. The foundation was a middle-tier multi-core-friendly Java dataflow engine, which has now been split out as Pervasive Datarush. The product is in the early stages of being released, which may be a good excuse for the website confusingly suggesting both of:
- You can have Datarush for free.
- If Datarush doesn’t produce a 30X speedup for you, you can get your money back.
The third approach is my Subject Of The Week: MapReduce. When I posted a list of canonical MapReduce applications, my friends at Aster Data offered one pushback — I left out the area of data transformation. As CEO Mayank Bawa puts it:
Large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential for Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Push down also enables rapid discovery and data pre-processing to create analytical data sets used for advanced analytics such as SAS and SPSS.
Some of our recent links about MapReduce
- The integration of MapReduce with SQL data warehousing
- Three major applications of MapReduce
- Sound bites about MapReduce
- Other links about MapReduce
All I know about RDF/OWL I learned in preschool
I don’t want to seem pretentious, but back in preschool I was a star student. At least when it came to potatoes. I am not sure what it’s called in US preschools, but what we meant by a potato, in my French classroom, was an oval shape in which you put objects. The typical example had two overlapping ovals, one for green things and the other for animals. A green armchair goes in the non-overlapping part of the “green” oval. A lion goes in the non-overlapping part of the “animal” oval. A green frog goes in the intersection. A non-green bus goes outside of both ovals. Etc.

As you probably remember, there are many variations on this, including cases where more than two ovals overlap. The hardest part was when we had to draw the ovals ourselves as opposed to positioning objects in pre-drawn ovals: we had to decide whether to make these ovals overlap or not. Typically they would first be drawn separately until an object that belonged to both would come up, prompting some head-scratching and, hopefully, a redrawing of the boundaries. Some ovals were even entirely contained within a larger oval! Hours of fun! I loved it.
[Side note: meanwhile, of course, the cool kids were punching one another in the face or stealing somebody's lunch money. But they are now stuck with boring million-dollar-a-year jobs as cosmetic surgeons or Wall Street bankers (respectively) while I enjoy the glamorous occupation of modeling IT systems. Who's laughing now?]
To a large extent, these potatoes really are all you need to understand about RDFS and OWL classes. OO people, especially, are worried about “multiple inheritance”. But we are not talking about programmatic objects here, in which inheritance brings methods with it. Just about intersecting potatoes. Subclassing is just putting a potato inside another one. Unions and intersections are just misshaped potatoes made by following the contours of existing potatoes. How hard can all that be?
Sure there are these “properties” you’ve heard about, but that’s just adding an arrow to show that the lion is sitting on the armchair. Or eating the frog.
Just don’t bring up the fact that these arrows can themselves be classified inside their own potatoes, or the school bully (Alex Emmel) will get you.
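For readers who prefer code to crayons, the potatoes translate almost directly into Python sets. This is only a sketch of the intuition: the `amphibian` class and the property names are made up, and plain sets are closed collections, whereas RDFS and OWL reason under an open-world assumption.

```python
# Each "potato" (class) is modeled as a plain set of its members.
green = {"armchair", "frog"}
animal = {"lion", "frog"}
everything = green | animal | {"bus"}

green_animal = green & animal         # intersection: inside both potatoes
green_or_animal = green | animal      # union: the contour around both potatoes
outside_both = everything - green_or_animal

# Subclassing is one potato sitting entirely inside another.
amphibian = {"frog"}                  # hypothetical extra class
is_subclass = amphibian <= animal

# A "property" is an arrow between two objects, written here as
# (subject, property, object) triples in the spirit of RDF.
arrows = {
    ("lion", "sitsOn", "armchair"),
    ("lion", "eats", "frog"),
}
lion_targets = {obj for subj, prop, obj in arrows if subj == "lion"}
```

Note that the frog sits in two potatoes at once without anybody worrying about which methods it inherits.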
Why MapReduce matters to SQL data warehousing
Greenplum and Aster Data have both just announced the integration of MapReduce into their SQL MPP data warehouse products. So why do I think this could be a big deal? The short answer is “Because MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.” The long answer goes something like this.
The core ideas of MapReduce are:
- For large problems, parallel computing is much more cost effective and/or feasible than the alternatives.
- If you shoehorn programs into a certain very simple framework – namely that you’re limited to only having map and reduce steps — then building a general execution engine that gives parallelism “for free” is straightforward.
- A lot more problems can be solved within that framework than one might at first expect.
In essence, you can do almost anything to a single record* — that’s a map step. But you are sharply limited in how you combine information about multiple (often intermediate) records – that’s a reduce step. Still, reduce steps let you do counts, sums, or other aggregations. That, plus the general power of map steps, makes MapReduce useful for at least three major classes of applications:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
Except for the building of entire search engines, these are all application areas that data warehouse users should and do care about. And they all still could benefit from large performance increases, as is evidenced by the routine compromises analysts make in areas such as data reduction, sampling, over-simplified models and the like.
*Technically, MapReduce doesn’t allow for records. Instead, you process key-value pairs and lists of same. But so far as I can tell, that’s a distinction without a difference. LISP long ago proved that lists are a very general construct indeed.
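Here is a minimal single-process Python sketch of that framework, using word counting as the example. The `map_reduce` function is a toy stand-in I made up for illustration, not any vendor's or Google's API; a real engine would spread the map calls and the per-key reduce calls across many machines.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy stand-in for a MapReduce engine, run in a single process."""
    groups = defaultdict(list)
    # Map phase: each input is handled independently and emits (key, value) pairs.
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)  # the "shuffle": group values by key
    # Reduce phase: each key's list of values is combined independently.
    return {key: reducer(key, values) for key, values in groups.items()}

def count_words(line):
    # Map step: do whatever you like with one record, emit key-value pairs.
    for word in line.lower().split():
        yield word, 1

def total(key, values):
    # Reduce step: a simple aggregation over the values for one key.
    return sum(values)

lines = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = map_reduce(lines, count_words, total)
```

The parallelism comes "for free" because no map call depends on another map call, and no reduce call depends on another key's values.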
MapReduce can be superior to pure SQL for these application areas, because they involve creation of data structures that are awkward to fit into a SQL rows-and-tables paradigm. Inverted-list text indexes just aren’t tables. Formally, graphs can always be fit into tables; but even so, if you want to follow a graph for numerous hops, relational structures can be problematic. Data mining can involve very high-dimensional problems with super-sparse tables. And while exhaustive text extraction into flat tables works OK, getting from there to common-sense semantic hierarchies can be a bit of a kludge.
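As an illustration of a structure that isn't naturally a table, here is a small Python sketch of building an inverted-list text index with a map step and a reduce step. The three documents and the function names are invented for the example, and the grouping loop stands in for what a real MapReduce engine would do across nodes.

```python
from collections import defaultdict

# Made-up documents: doc id -> text.
docs = {
    1: "cheap gigabit ethernet switches",
    2: "ethernet interconnect bandwidth",
    3: "cheap bandwidth",
}

def map_doc(doc_id, text):
    # Map step: emit one (term, doc_id) pair per distinct term in the document.
    for term in set(text.split()):
        yield term, doc_id

def reduce_postings(term, doc_ids):
    # Reduce step: collapse all doc ids for a term into one sorted posting list.
    return sorted(doc_ids)

# Group the mapped pairs by term (the engine's job in a real deployment).
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_doc(doc_id, text):
        grouped[term].append(d)

inverted_index = {term: reduce_postings(term, ids) for term, ids in grouped.items()}
```

Each term ends up with a variable-length list of documents, which is exactly the shape that is awkward to hold in fixed-width rows.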
Known applications of MapReduce
Most of the actual MapReduce applications I’ve heard of fall into a few areas:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
That covers all MapReduce apps I recall hearing about via commercial companies and users, and also includes most of what’s in the two big sources I found online. To wit:
1. In a slide presentation, Google offers the following applications of MapReduce:
- distributed grep
- distributed sort
- web link-graph reversal
- term-vector per host
- web access log stats
- inverted index construction
- document clustering
- machine learning
- statistical machine translation
2. The Hadoop applications page offers a rich trove of applications. Excerpts include:
- Aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences.
- Analytics
- Analyze and index textual information
- Analyzing similarities of user’s behavior.
- Build scalable machine learning algorithms like canopy clustering, k-means and many more to come (naive bayes classifiers, others)
- Charts calculation and web log analysis
- Crawl Blog posts and later process them.
- Crawling, processing, serving and log analysis
- Data mining and blog crawling
- Facial similarity and recognition across large datasets.
- Filter and index our listings, removing exact duplicates and grouping similar ones.
- Filtering and indexing listing, processing log analysis, and for recommendation data.
- Flexible web search engine software
- Gathering world wide DNS data in order to discover content distribution networks and configuration issues
- Generating web graphs
- Image based video copyright protection.
- Image content based advertising and auto-tagging for social media.
- Image processing environment for image-based product recommendation system
- Image retrieval engine
- Large scale image conversions
- Latent Semantic Analysis, Collaborative Filtering
- Log analysis, data mining and machine learning
- Natural Language Search
- Open source social search tools.
- Parses and indexes mail logs for search
- Plot the entire internet
- Process apache log, analyzing user’s action and click flow and the links click with any specified page in site and more.
- Process clickstream and demographic data in order to create web analytic reports.
- Process data relating to people on the web
- Process documents from a continuous web crawl and distributed training of support vector machines
- Process whole price data user input with map/reduce.
- Produce statistics.
- Product search indices
- Recommender system for behavioral targeting, plus other clickstream analytics
- Reduce usage data for internal metrics, for search indexing and for recommendation data.
- Research for Ad Systems and Web Search
- Retrieving and Analyzing Biomedical Knowledge
- Run Naive Bayes classifiers in parallel over crawl data to discover event information
- Search engine for chiropractic information, local chiropractors, products and schools
- Serve large Lucene indexes
- Session analysis and report generation
- Source code search engine
- Statistical analysis and modeling at scale.
- Storage, log analysis, and pattern discovery/analysis.
- Store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
- Teaching and general research activities on natural language processing and machine learning.
- Vertical search engine for trustworthy wine information
There also were some research apps and some general processing speed-up apps I found harder to excerpt.
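To show what one of the graph-oriented items looks like in practice, here is a small Python sketch of web link-graph reversal from Google's list. The page names and the crawl data are made up, and the grouping loop is a single-process stand-in for a real MapReduce engine.

```python
from collections import defaultdict

# Hypothetical crawl output: page -> pages it links to.
outlinks = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

def map_links(source, targets):
    # Map step: for every link source -> target, emit (target, source).
    for target in targets:
        yield target, source

def reduce_sources(target, sources):
    # Reduce step: gather all pages that link to the target.
    return sorted(sources)

# Group the mapped pairs by target page.
grouped = defaultdict(list)
for source, targets in outlinks.items():
    for target, src in map_links(source, targets):
        grouped[target].append(src)

inlinks = {target: reduce_sources(target, srcs) for target, srcs in grouped.items()}
```

The output answers "who links to this page?", which is the reverse of what the crawl recorded.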
MapReduce links
For whatever reason, I seem to be making the peripheral posts about MapReduce tonight before getting to the meat of the issues. So be it. There’s a rich set of links out there about MapReduce, and here are some of the best of them:
- Aster Data introduced MapReduce integrated into its SQL data warehouse DBMS tonight. Aster’s site features an excellent white paper.
- Exactly the same is true of Greenplum.
- Google Labs offers the seminal MapReduce research paper. It also has a broken link to an associated slide presentation, which fortunately is available here.
- One can get a good sense of MapReduce by reading up on the open source implementation Hadoop.
- In particular, this list of Hadoop applications is the longest list of MapReduce applications I know of (ahead even of Google’s long internal list).
- Joel Spolsky explained the core MapReduce concept a couple of years ago.
MapReduce sound bites
Last Thursday, Greenplum and Aster Data, the two most recent of my numerous data warehouse specialist customers, both told me of the same major innovation. Both were rushing to announce it first, before anybody else did. This led to considerable tap dancing, with the upshot being that both are releasing the information tonight or tomorrow morning.
What’s going on is that Aster Data and Greenplum have both integrated MapReduce into their respective MPP shared-nothing data warehouse DBMS. I’ll write about that at length very shortly, but for now let me throw up some sound bites ahead of the more detailed analysis:
- MPP shared-nothing database managers like Greenplum or Aster Data give great performance. But sometimes you need to do even better. That’s where MapReduce comes in.
- On its own, MapReduce can do a lot of important work in data manipulation and analysis. Integrating it with SQL should just increase its applicability and power.
- Google’s internal use of MapReduce is impressive. So is Hadoop’s success. Now commercial implementations of MapReduce are getting their shots too.
- At its core, most data analysis is really pretty simple – it boils down to arithmetic, Boolean logic, sorting, and not a lot else. MapReduce can handle a significant fraction of that.
- The hardest part of data analysis is often the recognition of entities or semantic equivalences. The rest is arithmetic, Boolean logic, sorting, and so forth. MapReduce is already proven in use cases encompassing all of those areas.
- MapReduce isn’t about data management, at least not primarily. It’s about parallelism.
- MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.
- MapReduce isn’t needed for tabular data management. That’s been efficiently parallelized in other ways. But if you want to build non-tabular structures such as text indexes or graphs, MapReduce turns out to be a big help.
- In principle, any alphanumeric data at all can be stuffed into tables. But in high-dimensional scenarios, those tables are super-sparse. That’s when MapReduce can offer big advantages by bypassing relational databases. Examples of such scenarios are found in CRM and relationship analytics.
Greenplum’s single biggest customer
Greenplum offered a bit of clarification regarding the usage figures I posted last night. Everything on the list is in production, except that:
- One Greenplum customer is at 400 terabytes now, and upgrading to >1 petabyte “as we speak.”
- Greenplum’s other soon-to-be >1 petabyte customer isn’t in production yet. (Greenplum previously told me that customer was in the process of loading data right now.)
Greenplum is in the big leagues
After a March, 2007 call, I didn’t talk again with Greenplum until earlier this month. That changed fast. I flew out to see Greenplum last week and spent over a day with president/co-founder Scott Yara, CTO/co-founder Luke Lonergan, marketing VP Paul Salazar, and product management/marketing director Ben Werther. Highlights – besides some really great sushi at Sakae in Burlingame – start with an eye-opening set of customer proof points, such as:
- 50 total paying Greenplum customers, over half of whom are already in production.
- 6 Greenplum users in production with >100 terabytes of user data. That may beat anybody except Teradata, among SQL data warehouse specialist vendors.
- 2 Greenplum customers expected to be in production within 60 days with >1 petabyte of user data. That may beat even Teradata. Anyhow, it looks as if Greenplum and Teradata will be 1-2 in some order crossing the 1-petabyte line. (Edit: Here’s more detail on >1 petabyte Greenplum users.)
- 5 Greenplum customers with “multiple 100s of users.” That’s not much by the standards of more mature vendors, but it suffices to show that Greenplum has some kind of a handle on concurrency.
- 3 Greenplum customers with 1000s of tables. That suffices to show that Greenplum’s claims to schema agnosticity are more than academic, even if it’s not enough to show that many enterprises care.
- Greenplum customers using tools from the following list, and I quote: SAS, Unica, Datastage, Information Builders, Informatica, Oracle BI, Microstrategy, Microsoft SSIS and SSRS, Business Objects / BODI, SAP, Talend, Pentaho
- (Again I quote) “Tier 1” customers in the following verticals:
  - Retail
  - Pharma
  - Telco
  - Internet
  - Retail Banking
  - Insurance
  - Health Care
  - Commercial Banking
  - Transportation
  - Service Providers
  - Media
  - Manufacturing
Even though the bulk of Greenplum’s revenue comes from the Sun appliance relationship, 20 paying customers run Greenplum on Linux. Another interesting demographic is that 25-40% of Greenplum’s revenue tends to come from Asia (obviously, the figure fluctuates greatly from quarter to quarter). Perhaps not coincidentally, one of Greenplum’s three salespeople last year was based in Asia. (The current total is 15, and growing fast.)
Technical highlights include:
- Greenplum is row-based, shared-nothing, MPP. It runs on standard hardware and operating systems. (But fortunately for its key partnership, Greenplum evidently does really run best, at least for now, on recommended Sun standard appliance configurations.)
- Most or all of the PostgreSQL data access methods are left intact. The big changes to PostgreSQL lie in the areas of query optimization, planning, and execution. I.e., Greenplum has its own way of breaking up a query into pieces – and of course of seeing that data gets shipped among nodes – but the low-level operators for storage and access are from PostgreSQL.
- Greenplum nodes are just connected to a group of standard switches, via standard 1 gigabit Ethernet. Greenplum insists that interconnect bandwidth is not a problem.
- Currently, there’s a boss node, with all the other nodes being peers. But by now (as opposed to in an early prototype of Greenplum), intermediate results are shipped peer-to-peer rather than back up to the boss node. In the future, compute and storage nodes will be (optionally) split out from each other.
- Compression is being introduced in the next point release, with big numbers (at least by row-based standards) out of the gate. It will initially be just for append-only tables, but that limitation will be lifted later on.
- Also in that release, Greenplum is introducing embedded parallel mathematical packages, such as linear algebra and statistics (specifically, R).
- Greenplum has no current in-the-cloud offering, but one is in the works.
- Greenplum offers an ever-growing variety of administration tools.
My current customer list among the data warehouse specialists
One of my favorite pages on the Monash Research website is the list of many current and a few notable past customers. (Another favorite page is the one for testimonials.) For a variety of reasons, I won’t undertake to be more precise about my current customer list than that. But I don’t think it would hurt anything to list the data warehouse DBMS/appliance specialists in the group. They are:
- Aster Data
- Calpont
- DATAllegro
- Greenplum
- Infobright
- Netezza
- ParAccel
- Teradata
- Vertica
All of those are Monash Advantage members.
If you care about all this, you may also be interested in the rest of my standards and disclosures.
It’s party time again for the tinkerers
Around 1995 and 1996, if you knew how to set up an HTTP server on a Solaris box, hand-write a few HTML pages and create a simple CGI script to save the content of a form into a file (extra credit if you remembered to append to the file rather than overwriting it every time), then you were a world-class web designer. At least in my neck of the woods, which wasn’t Silicon Valley at the time. These people were self-trained, of course. I made some side money back then, creating a few web sites with just these limited skills. I am sure there were already people who had really thought about web design and could create useful and attractive sites (rather than simply functional ones). But all twelve of them were busy elsewhere and I would guess that none of them spoke French anyway. They were not my competition in Paris, when talking, for example, to a large French bank who wanted to create a web site to hire college students. My only competition was a bunch of Photoshop clowns whose idea of web design was to create a brochure in Photoshop/Framemaker and make the whole web page one big JPEG file.
Compare this to utility computing (aka clouds) today. Any Linux sysadmin who has, over the last year, made the effort to read and experiment with cloud computing (typically Amazon EC2), to survey available tools and to write a few scripts to tie them together is now an IT rock star, a potential catalyst for operations as a competitive advantage.
Just like self-taught HTML dilettantes didn’t keep control of the web design playground for long, early cloud adopters among sysadmins won’t enjoy their differentiation forever. But I would guess that they do today. Does anyone have statistics on how the job market values such skills?
Of course the Photoshop crowd eventually got their Frontpage, Dreamweaver, etc to let them claim that they could create web sites. These tools were pretty bad at first because they tried to make things look familiar to graphic designers (image maps galore!). They slowly got better.
The same thing is likely to happen in utility computing. Traditional IT management tools will soon get cloud features. Like the HTML WYSIWYG tools, they’ll probably tend to be too influenced by current IT management concepts and methods. For example, all the ITIL cheerleaders out there are probably going to bend cloud features to fit ITIL rather than the other way around. Even though utility computing might well invalidate some pretty fundamental assumptions/requirements of parts of ITIL.
The productivity increases created by utility computing are probably large enough that even these tools will provide great value. And they’ll improve. In the same way that the Web was a major enough improvement that even poorly designed web sites were way ahead of the alternatives.
Today, you obviously can’t make a living as an “HTML in notepad” developer. You must either be a real graphic designer and use tools to turn your designs into Web artifacts or be deep in Web technologies. Or both. Similarly, you soon won’t be providing much value if you just know how to start and provision EC2 instances. You’ll need to either be a real IT admin who can manage the utility resources as part of a larger system (like the applications) or be a hard-core utility computing expert who tackles hard problems like optimizing your resource consumption across cloud providers or securing and ensuring the compliance of your distributed IT system.
But for now, the party is raging and the dress code is still pretty lax.
Kevin Closson doesn’t like MPP
Kevin Closson of Oracle offers a long criticism of the popularity of MPP. Key takeaways include:
- TPC-H benchmarks that show Oracle as somewhat superior to DB2 are highly significant.
- TPC-H benchmarks in which MPP vendors destroy Oracle are too unimportant to even mention.
- SMP did better than MPP the last time he was in a position to judge (which evidently was some time during the Clinton Administration), so it surely must still be superior for all purposes today.
The Explosion in DBMS Choice
If there’s one central theme to DBMS2, it’s that modern DBMS alternatives should in many cases be used instead of the traditional market leaders. So it was only a matter of time before somebody sponsored a white paper on that subject. The paper, sponsored by EnterpriseDB, is now posted along with my other recent white papers. Its conclusion — summarizing what kinds of database management system you should use in which circumstances — is reproduced below.
Many new applications are built on existing databases, adding new features to already-operating systems. But others are built in connection with truly new databases. And in the latter cases, it’s rare that a market-leading product is the best choice. Mid-range DBMS (for OLTP) or specialty data warehousing systems (for analytics) are usually just as capable, and much more cost-effective. Exceptions arise mainly in three kinds of cases:
- Small enterprises with very limited staff.
- Large enterprises that have negotiated heavily-discounted deals for a market-leading product.
- Super-high-end OLTP apps that need absolute top throughput (or security certifications, etc.)
Otherwise, the less costly products are typically the wiser choice.
In the analytics area, appliances and other specialty data warehousing products offer huge price/performance advantages over general-purpose systems. What’s more, their superior performance allows them to get by with much simpler indexing structures, greatly reducing administrative burdens. If you have a data warehouse — or just a collection of data marts – running on Oracle or Sybase Adaptive Server or Microsoft SQL Server, it’s likely you could do yourself a huge favor by moving it to a specialty system.
If you’re an ISV, selling copies of the same software to many different customers, you should not be locked into expensive market-leading DBMS. Whatever the remaining deficiencies of mid-range systems, at least one of them will surely be good enough to support your software with an acceptably low level of one-time porting effort. (In many cases, EnterpriseDB’s Postgres Plus Advanced Server will have the edge, due to its Oracle compatibility as well as its generally rich feature set.) What’s more, besides saving license and maintenance fees, a mid-range DBMS may be easier for your customers to operate than a complex market leader is.
The one area where it may be premature to port away from market-leading DBMS is in-house OLTP applications. The first rule for OLTP apps is that they Must Not Break. And so if they’re not broken, it is often advisable to be cautious about fixing them. In some cases prompt porting is a good idea anyway, but often there will be lower-hanging fruit elsewhere in the enterprise.
As you may imagine, this paper contains only a small fraction of our analysis of DBMS alternatives. Indeed, that’s the main topic of our blog DBMS2. Specific recommended links include:
- The first of an extensive series of blog posts about database diversity, containing links to many of the rest (most of which are by me, but a couple of which are by database pioneer Michael Stonebraker).
- Our coverage of data warehousing
- Our coverage of mid-range database management systems
Three happy 100 terabyte-plus customers for DATAllegro
Over on my Network World blog, I asked the question “So who are DATAllegro’s actual current customers?” As regular readers know, that’s a fairly hard question to answer. TEOCO is widely known as DATAllegro’s flagship reference, but after that the list gets thin in a hurry.
As a by-the-by to other discussions, DATAllegro’s Stuart Frost undertook to respond in part himself. Specifically, he gave me the names of two other happy customers that are or imminently will be running DATAllegro against 100+ terabytes of user data. The names are confidential, but both are enterprises whose names have long been linked to DATAllegro’s, and both are ones about which there’s been some doubt as to their DATAllegro relationship in the past. Now, I haven’t actually checked those references myself, but I’m guessing that if somebody else tries to at this point, they’ll actually find happy users.
Obviously, three’s not a very large number for an overall customer base. Indeed, one can find more than that in the men’s room at a Netezza user conference. But three customers at the 100+ terabyte level is a more serious accomplishment. Teradata is probably the only rival that’s well ahead of that figure, and I’m actually not 100% sure that anybody else has even matched it. (But then, to be technical, neither quite yet has DATAllegro itself.)
Exasol technical briefing
It took 5 ½ months after my non-technical introduction, but I finally got a briefing from Exasol’s technical folks (specifically, the very helpful Mathias Golombek and Carsten Weidmann). Here are some highlights:
- Exasol has no concept of a “head” or “master” node, with different software than the others. Instead, all nodes are peers. For example, any node’s IP address can be given to an application; that node will then parse the SQL and distribute it appropriately to the other nodes.
- Exasol is ACID-compliant, swapping blocks to disk when there’s an update. And one certainly can query data that’s on disk …
- … however, Exasol’s memory structures are totally optimized for in-memory operation. Exasol is perfectly happy to swap in different parts of the database on a scheduled basis every few hours, but sending queries straight to disk isn’t optimal. Exasol’s recommended hardware configurations always are designed so that most queries can be executed against data already in RAM. However, if for example only the last 30 days of data are in RAM and a few queries go against full-year data, that’s OK.
- Exasol has a compression story typical for a columnar DBMS vendor – heavy use of dictionary/token compression, other unspecified compression algorithms as well, data kept compressed in RAM, etc.
- Like most other MPP data warehousing vendors, Exasol partitions data among nodes via a hash key. This is the industry’s most common scheme, because it has the parallelization benefits of random/equal distribution of data, yet still lets you get a head start on some hairy hash joins for extra performance.
- Like Vertica, Exasol replicates small tables (e.g., dimension tables) across each node.
- Exasol’s optimizer creates and maintains join indexes automagically on the fly. Exasol disagreed when I said “Oh, like a materialized view?” But I suspect this is the kind of join index that Teradata says privately is a special case of materialized view, and says publicly is a lot like a materialized view.
- Generally, Exasol describes its optimizer as being “very MPP-aware.”
- Exasol mainly wrote its own code from scratch. Right now they seem to have a kind of distributed operating system called EXACluster running over Linux, but they seem to be replacing the Linux underpinnings with their own stuff. E.g., disk access is going into EXACluster.
- EXACluster already handles high availability/failover between nodes.
- Exasol replicates data among nodes to allow for failover. That sounds similar to Vertica’s approach. Also, if you add nodes and restart Exasol, the database will automagically be repartitioned.
- The biggest deployed Exasol system mentioned has 3 terabytes of user data. It is running on 5 nodes w/ 32 GB of RAM each.
- For any given amount of total RAM a user is willing to deploy, Exasol recommends more nodes with less RAM/node. I didn’t probe directly as to why.
- Exasol doesn’t have stored procedures. They assert that stored procedures would be useful mainly for ELT/ETL, and that alternatives perform well enough.
- Like many data warehouse specialists, Exasol recommends ELT (Extract/Load/Transform) over ETL (Extract/Transform/Load). I.e., they
- Exasol has user-defined functions (UDFs).
- Exasol is working on BLOB support. Geospatial data is also on the radar (no pun intended), but it didn’t sound as if there was a currently active project.
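To make the hash-partitioning and small-table-replication points above concrete, here is a minimal Python sketch. It is not Exasol's implementation, which wasn't described at that level of detail; the table names, the CRC32 hash, and the 5-node cluster size are all assumptions chosen for illustration. It shows why distributing two large tables on the same key gives a head start on joins: matching rows land on the same node, so each node can join its own slice without moving data.

```python
import zlib
from collections import defaultdict

NUM_NODES = 5  # hypothetical cluster size, not taken from any vendor spec


def node_for(key, num_nodes=NUM_NODES):
    """Map a distribution key to a node using a stable hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_column, num_nodes=NUM_NODES):
    """Hash-partition a list of row dicts across nodes on key_column."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_column], num_nodes)].append(row)
    return nodes


# Two hypothetical large tables, both distributed on customer_id.
orders = [{"order_id": i, "customer_id": i % 200} for i in range(1000)]
customers = [{"customer_id": c, "region_id": c % 2} for c in range(200)]

orders_by_node = distribute(orders, "customer_id")
customers_by_node = distribute(customers, "customer_id")

# A small dimension table is simply copied to every node.
regions = {0: "EMEA", 1: "APAC"}
regions_by_node = {n: dict(regions) for n in range(NUM_NODES)}


def local_join(node):
    """Join orders to customers and regions using only this node's data."""
    local_customers = {c["customer_id"]: c for c in customers_by_node[node]}
    local_regions = regions_by_node[node]
    result = []
    for order in orders_by_node[node]:
        customer = local_customers[order["customer_id"]]  # always co-located
        result.append((order["order_id"], local_regions[customer["region_id"]]))
    return result


# Each node joins independently; the union is the full join result.
joined = [row for n in range(NUM_NODES) for row in local_join(n)]
```

Because both large tables hash on `customer_id`, the lookup in `local_join` never misses, and the replicated `regions` table is available everywhere. A join on some other column would not have this property and would need data shipped between nodes.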
We also talked about concurrency, which is always a confusing subject. Exasol said that to date there were no more than 50 concurrent “log-ins,” which they equate to there being 1000s of named users (because queries execute so quickly). They also say they’ve tested up to 400 concurrent queries internally. I didn’t probe about what they’d do to balance short-running and long-running queries, in part because Exasol gives the impression that on their systems, there is no such thing as a long-running query. But obviously this is all somewhat fuzzy.
In a related point, Exasol says that overall throughput is higher when there is at least a certain number of concurrent users. The supporting evidence offered was, of all things, TPC-H benchmarks. Apparently (I haven’t checked this myself), Exasol (and also ParAccel, which of course has a similar architecture) chose to run the benchmark with more than the minimum number of simultaneous users required. SMP systems, Exasol believes, don’t exhibit similar behavior.
Finally, a couple of less technical highlights:
- Licensing is per-gigabyte of RAM. (This fits with the whole memory-centric orientation.) 100 gigabytes of RAM are 120,000 Euros list price. Price doesn’t scale linearly with amount of RAM.
- The partner whose name was redacted in February is now officially disclosed. Exasol is partnering in Japan with the services side of Hitachi. Exasol says Hitachi has 15-20 people working on introducing Exasol to Japan. Target customers are not primarily Hitachi’s hardware installed base.
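For a rough sense of scale, the figures quoted in this post can be combined in a few lines of Python. This is back-of-the-envelope arithmetic, not a price quote. It assumes decimal terabytes (1 TB = 1,000 GB), and the linear extrapolation at the end is exactly the thing Exasol says does not hold, so treat it only as a naive reference point.

```python
# Figures quoted in the post.
LIST_PRICE_EUR = 120_000      # list price for 100 GB of RAM
LIST_PRICE_RAM_GB = 100
NODES = 5                     # biggest deployed system mentioned
RAM_PER_NODE_GB = 32
USER_DATA_TB = 3

# Assumption: decimal terabytes.
GB_PER_TB = 1000

eur_per_gb = LIST_PRICE_EUR / LIST_PRICE_RAM_GB          # 1200.0 at the 100 GB point
total_ram_gb = NODES * RAM_PER_NODE_GB                   # 160
data_to_ram_ratio = USER_DATA_TB * GB_PER_TB / total_ram_gb  # 18.75

# Naive linear extrapolation. Actual pricing is stated to be non-linear,
# and the direction of the non-linearity was not given.
naive_linear_price_eur = eur_per_gb * total_ram_gb       # 192000.0

print(f"List price per GB of RAM at 100 GB: {eur_per_gb:.0f} EUR")
print(f"Total RAM in the 5-node system: {total_ram_gb} GB")
print(f"User data per GB of RAM: {data_to_ram_ratio:.2f}")
print(f"Naive linear price for that RAM: {naive_linear_price_eur:.0f} EUR")
```

So the largest system mentioned holds roughly 19 GB of user data per GB of RAM, which is consistent with the earlier point that only the frequently queried part of the database needs to be memory-resident.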



