My quick reaction to the Actian/ParAccel deal was negative. A few challenges to my views then emerged. They didn’t really change my mind.
Amazon did a deal with ParAccel that amounted to:
- Amazon got a very cheap license to a limited subset of ParAccel’s product …
- … so that it could launch a service called Amazon Redshift.
- Amazon also invested in ParAccel.
Some argue that this is great for ParAccel’s future prospects. I’m not convinced.
No doubt there are and will be Redshift users, evidently including Infor. But so far as I can tell, Redshift uses very standard SQL, so it doesn’t seed a ParAccel market in terms of developer habits. The administration/operation story is similar. So outside of general validation/bragging rights, Redshift is not a big deal for ParAccel.
OEMs and bragging rights
It’s not just Amazon and Infor; there’s also a MicroStrategy deal to OEM ParAccel — I think it’s the real ParAccel software in that case — for a particular service, MicroStrategy Wisdom. But unless I’m terribly mistaken, HP Vertica, Sybase IQ and even Infobright each have a lot more OEMs than ParAccel, just as they have a lot more customers than ParAccel overall.
This OEM success is a great validation for the idea of columnar analytic RDBMS in general, but I don’t see where it’s an advantage for ParAccel vs. the columnar leaders.
As I admitted in the comment thread to my first Actian/ParAccel post, I’m confused about what kind of concurrent usage ParAccel can really support. The data I have, e.g. in the link immediately above, is not conclusive. Googling suggests that VectorWise was at one user per core a couple of years ago, supportive of my hypothesis that it doesn’t have some big concurrency edge on ParAccel. But to repeat — I don’t really know.
DBMS acquisitions in the past
My history blog on DBMS acquisitions yielded more favorable examples than I was expecting. (Of course, I omitted a lot of small and boring failures.) And DBMS conglomerates are the rule more than the exception, with IBM, Sybase, Teradata and Oracle all adopting acquisition-aided multi-DBMS strategies, at least to some extent.
That said, Sybase is the main example of a vendor of a slow-growth DBMS (Adaptive Server Enterprise) doing well with a faster-growing one (Sybase IQ). Perhaps not coincidentally, Actian’s latest management team draws significantly on Sybase. So yes; ParAccel is now owned by a company run by guys who know something about selling columnar DBMS.
But the whole thing would be more convincing if Ingres had shown more life under Actian’s ownership, or indeed at any point in the past 20 years. My bottom line is that Actian was floundering badly in the DBMS market 1 1/2 years ago, and not a lot of favorable news has emerged in the interim — except, quite arguably, for the management changes and acquisitions themselves.
Actian, which already owns VectorWise, is also buying ParAccel. The argument for why this kills VectorWise is simple. ParAccel does most things VectorWise does, more or less as well. It also does a lot more:
- ParAccel scales out.
- ParAccel has added analytic platform capabilities.
- I don’t know for sure, but I’d guess ParAccel has more mature management/plumbing capabilities as well.
One might conjecture that ParAccel is bad at highly concurrent, single-node use cases, and VectorWise is better at them — but at the link above, ParAccel bragged of supporting 5,000 concurrent connections. Besides, if one is just looking for a high-use reporting server, why not get Sybase IQ?? Anyhow, Actian hasn’t been investing enough in VectorWise to make it a major market player, and they’re unlikely to start now that they own ParAccel as well.
But I expect ParAccel to fail too. Reasons include:
- ParAccel’s small market share and traction.
- The disruption of any acquisition like this one.
- My general view of Actian as a company.
2 years after being acquired, Vertica — which conceptually has always been ParAccel’s closest competitor — has finally taken major hits on engineering staffing. Even so, I expect HP Vertica to reopen what was once a large technology and momentum gap vs. ParAccel.
My views on Actian start:
- Actian is attempting to build a database software conglomerate on the cheap, starting with Ingres, ParAccel, VectorWise, Pervasive (itself a small conglomerate) and Versant.
- Actian hasn’t accomplished much with Ingres, its original acquisition.
- Actian hasn’t accomplished much with VectorWise.
- Actian’s brief, embarrassing pivot away from database software was a joke. (The comments at that link also show VectorWise’s positioning as very different in September, 2011 than it is now.)
- I’ve had some very bad experiences with Actian management, although it seems to have largely turned over since then.
- I can’t identify the folks to make this work at the acquired pieces either (even though I think well of a few of them, e.g. Mike Hoskins and Rick Glick).
I.e., building a database conglomerate is hard, and Actian isn’t up to the challenge.
Actian has three main paths it can follow for synergy:
- Acquire a lot of pieces and flip the whole thing for more money to a foolish buyer. This strategy worked splendidly for Autonomy, and to some extent for Sybase as well. But it’s a longshot, and not necessarily a win for customers even if investors do well.
- Sell a bunch of disparate products through the same sales force. Tough to execute. And at best it raises sales coverage up to the level of that for the most successful product — and Actian doesn’t really have successful new products.
- Integrate the technologies. Blech. You don’t integrate DBMS with wildly different architectures, as Informix died trying in the 1990s.
I don’t see enough opportunity there for the whole thing to work out, with sales synergy being the best opportunity to prove me wrong.
I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.
1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. AeroSpike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.
Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.
Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:
- There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
- Doing analytics at short-request speeds is an obvious data-management-related challenge, and not yet comprehensively addressed.
2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:
- Customer interaction
- Network and sensor monitoring
- Game and mobile application back-ends
Also arising fairly frequently are:
- Algorithmic trading
- Risk measurement
- Law enforcement/national security
- Stakeholder-facing analytics
I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have anyway been ebbing and flowing for decades.
3. Much of customer interaction revolves around recommendation and personalization. In connection with that I’ll remind you:
- Multiple sources say that 5 millisecond response is a real need. Srini Srinivasan explained why in a January comment.
- The results of the recommendation and personalization can be delivered in many different ways — product recommendations, ads, special offers, email, snail mail, call center scripts and more. This is the paradigmatic example for my skepticism about complete analytic applications.
4. Networks and sensors emit the epitome of machine-generated data. Data sources include web logs, network logs (in the IT sense), telecommunication networks, other utilities (e.g. electric), vehicle fleets, and more. Application themes include:
- Human monitoring, via some kind of real-time business intelligence view. I hear about that a lot.
- Various kinds of automated response. (Security is an obvious example.)
- Integration with other kinds of application, data source, or use case.
As one example of the last point, Oliver Ratzesberger told me years ago that eBay had up-to-the-minute BI cubes integrating customer response and log data, for the purpose of quickly detecting technology problems. Acunu recently told me that similar applications are one of their sales focuses.
5. In another example, games and mobile applications can be a lot like websites in terms of the analytics that support them (all the more so if we’re talking about games with in-app purchases). Two special features come up repeatedly, however — leaderboards for games, and geospatial data sent by mobile devices.
6. Algorithmic trading is flashy because of the sums of money involved, and because of what is often hyper-low latency; I’ve even heard 50 microseconds, and that’s a slightly out of date figure for a sequence of several atomic operations. But otherwise it’s not one of the more interesting areas to me, for at least two reasons:
- It depends on a lot of latency-specific stuff, such as hand-crafted hardware.
- The participants are secretive — understandably so as they're literally in a race with each other – and don't reveal much.
Another reason I don’t study it much is that high-frequency trading could be devastated at any time by some simple regulatory changes.
7. I finally figured out one of the big drivers for better risk analysis. Banks need to keep capital lying around to cover a fraction of the risk they take on. If they can estimate the risk more precisely, and come up with a lower number, then they need to keep less capital. That’s a lot like finding large bags of money.
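As a back-of-the-envelope illustration (the figures and the flat capital ratio below are invented, not an actual regulatory formula), the arithmetic looks something like this:

```python
# Toy illustration of why better risk estimation frees up capital.
# All figures are invented; real capital rules are far more complex.

exposure = 10_000_000_000          # $10 billion in loans
capital_ratio = 0.08               # capital required per dollar of risk-weighted assets

coarse_risk_weight = 1.00          # crude model: weight every loan at 100%
refined_risk_weight = 0.80         # better model: average weight drops to 80%

coarse_capital = exposure * coarse_risk_weight * capital_ratio
refined_capital = exposure * refined_risk_weight * capital_ratio

print(f"Capital under coarse model:  ${coarse_capital:,.0f}")
print(f"Capital under refined model: ${refined_capital:,.0f}")
print(f"Capital freed up:            ${coarse_capital - refined_capital:,.0f}")
# Freeing ~$160 million on a $10 billion book is indeed a large bag of money.
```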
8. Anti-fraud applications arise in many industries, with many different kinds of data and latency requirement. For example:
- Insurers don’t want to pay bogus claims. They usually have weeks to think about that problem.
- Telcos don’t want to provision services for customers who will defraud them. They have to decide at call-center speed.
- Similarly, retailers don’t want to accept bogus returns.
- Stockbrokers don’t want rogue traders to defeat their controls. A lot of data and analysis go into that mission, as billions of dollars — literally — can be at stake.
9. And finally, the recent Boston Marathon bombing has brought law-enforcement/anti-terrorism applications to the fore. The Boston Globe criticized difficulties in information sharing, but the money quote is:
The FBI followed up by checking government databases and looking for things such as “derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history,” according to FBI Supervisory Agent Jason J. Pack. “The FBI also interviewed Tamerlan Tsarnaev and family members. The FBI did not find any terrorism activity.”
Neither the telephone intercept nor the web-surfing tracking is a capability the government routinely admits, unless there was something like a wiretap order that I so far haven’t seen reported.
- Government surveillance is even more inevitable than when I wrote in 2010 that freedom can only be preserved by limiting government USES of data.
- Stakeholder-facing analytics isn’t much better understood than when I wrote about it in 2010.
- I wrote up a different list of analytic use cases back in 2006.
- The continued drop in high-frequency trading latency strengthens my 2009 contrast between the speed of a turtle and the speed of light; we’re now over a 3 * 10^10 difference between the speed of trading and the speed of generic planning, and many turtles walk well faster than 1 cm/sec.
The third of my three MySQL-oriented clients I alluded to yesterday is MemSQL. When I wrote about MemSQL last June, the product was an in-memory single-server MySQL workalike. Now scale-out has been added, with general availability today.
MemSQL’s flagship reference is Zynga, across 100s of servers. Beyond that, the company claims (to quote a late draft of the press release):
Enterprises are already using distributed MemSQL in production for operational analytics, network security, real-time recommendations, and risk management.
All four of those use cases fit MemSQL’s positioning in “real-time analytics”. Besides Zynga, MemSQL cites penetration into traditional low-latency markets — financial services (various subsectors) and ad-tech.
Highlights of MemSQL’s new distributed architecture start:
- There are two kinds of MemSQL node — “aggregator” and “leaf”.
- Aggregators are a kind of head node. You can have a bunch of them.
- Leafs run full single-server MemSQL. You can have a bunch of them too.
- MemSQL has two query optimizers. One kind runs on the aggregator nodes, and thinks about the whole cluster. The other runs on the leafs, and only thinks about its own node.
- Much of the join and aggregation work is done on the aggregator nodes, but I didn’t pursue that issue in much detail.
- It is good policy — and supported — to replicate small dimension/reference tables across the cluster. These are replicated to aggregator and leaf nodes alike. (This tells us that some joins are indeed done on the leafs.)
- MemSQL replication can be synchronous or asynchronous. It can be used for high availability.
- MemSQL writes (whether primary or replicated) go to a buffer. The buffer size can be 0 or positive, in a tradeoff of durability vs. the likelihood of a disk I/O bottleneck.
- MemSQL has many virtual nodes on each physical (leaf) node. (This is pretty much an industry-standard best practice, as it helps with elasticity, recovery from node failure, and so on.)
- Compression is still a future feature.
- So is online schema change.
- Leaf nodes have cost-based optimizers.
- MemSQL’s aggregator (cluster-wide) optimizer is mainly heuristic, but is supposed to get more cost-based in future releases.
- In some releases it will be possible to keep MemSQL running while upgrading the software. But that’s not a promise for releases that change how replication works.
And which not-easily-parallelized aggregate did MemSQL implement first? The same one Platfora did — COUNT DISTINCT.
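To see why COUNT DISTINCT is the awkward one, here is a generic sketch of two-phase distributed aggregation (my illustration, not MemSQL's actual code): a SUM can be merged from one partial number per group per leaf, but a distinct count requires each leaf to ship its whole set of distinct values (or some approximation of it) to the aggregator.

```python
# Generic two-phase aggregation sketch: leaves compute partials,
# an aggregator merges them. Not MemSQL's actual code.

leaf_shards = [
    [("widgets", 3), ("gadgets", 5), ("widgets", 2)],
    [("widgets", 4), ("gizmos", 1), ("gadgets", 5)],
]

# SUM parallelizes easily: each leaf sends one number per group.
def leaf_sum(rows):
    totals = {}
    for product, qty in rows:
        totals[product] = totals.get(product, 0) + qty
    return totals

partial_sums = [leaf_sum(rows) for rows in leaf_shards]
merged = {}
for partial in partial_sums:
    for product, subtotal in partial.items():
        merged[product] = merged.get(product, 0) + subtotal
print("SUM(qty) per product:", merged)

# COUNT(DISTINCT qty) does not: partial distinct counts can't just be added
# (qty=5 appears on both leaves). Each leaf must ship its whole distinct set.
partial_sets = [{qty for _, qty in rows} for rows in leaf_shards]
distinct_qty = set().union(*partial_sets)
print("COUNT(DISTINCT qty):", len(distinct_qty))
```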
Last week, I edited press releases back-to-back-to-back for three clients, all with announcements at this week’s Percona Live. The ones with embargoes ending today are Tokutek and GenieDB.
Tokutek’s news is that they’re open sourcing much of TokuDB, but holding back hot backup for their paid version. I approve of this strategy — “doesn’t lose data” is an important feature, and well worth paying for.
I kid, I kid. Any system has at least a bad way to do backups — e.g. one that involves slowing performance, or perhaps even requires taking applications offline altogether. So the real points of good backup technology are:
- To keep performance steady.
- To make the whole thing as easy to manage as possible.
GenieDB is announcing a Version 2, which is basically a performance release. So in lieu of pretending to have much article-worthy news, GenieDB is taking the opportunity to remind folks of its core marketing messages, with catchphrases such as “multi-regional self-healing MySQL”. Good choice; indeed, I wish more vendors would adopt that marketing tactic.
Along the way, I did learn a bit more about GenieDB. In particular:
- GenieDB is now just backed by a hacked version of InnoDB (no more Berkeley DB Java Edition).
- Why hacked? Because GenieDB appends a Lamport timestamp to every row, which somehow leads to a need to modify how indexes and caching work. (A sketch of the Lamport-timestamp idea follows this list.)
- Benefits of the change include performance and simpler (for the vendor) development.
- An arguable disadvantage of the switch is that GenieDB no longer can use Berkeley DB’s key-value interface — but MySQL now has one of those too.
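For those who haven't run into them, a Lamport timestamp is just a logical counter that a node bumps on each local event and fast-forwards when it sees a larger value from elsewhere, which lets rows be ordered consistently across regions without synchronized clocks. A minimal sketch (mine, not GenieDB's implementation):

```python
# Minimal Lamport clock sketch -- my illustration, not GenieDB's implementation.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event (e.g. writing a row): bump the counter."""
        self.time += 1
        return self.time

    def observe(self, remote_time):
        """Saw a timestamp from another node: fast-forward past it."""
        self.time = max(self.time, remote_time) + 1
        return self.time

# Two regions writing the "same" row; the larger stamp wins on merge.
us, eu = LamportClock(), LamportClock()
row_us = {"id": 42, "value": "blue",  "lamport": us.tick()}
row_eu = {"id": 42, "value": "green", "lamport": eu.observe(row_us["lamport"])}
winner = max(row_us, row_eu, key=lambda r: r["lamport"])
print("Surviving version:", winner)
```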
I also picked up some GenieDB company stats I didn’t know before — 9 employees and 2 paying customers.
Teradata is announcing its new high-end systems, the Teradata 6700 series. Notes on that include:
- Teradata tends to get 35-55% (roughly speaking) annual performance improvements, as measured by its internal blended measure Tperf. A big part of this is exploiting new-generation Intel processors.
- This year the figure is around 40%.
- The 6700 is based on Intel’s Sandy Bridge.
- Teradata previously told me that Ivy Bridge — the next one after Sandy Bridge — could offer a performance “discontinuity”. So, while this is just a guess, I expect that next year’s Teradata performance improvement will beat this year’s.
- Teradata has now largely switched over to InfiniBand.
Teradata is also talking about data integration and best-of-breed systems, with buzzwords such as:
- Teradata Unified Data Architecture.
- Fabric-based computing, even though this isn’t really about storage.
- Teradata SQL-H.
The upshot is that Teradata has at least 6 kinds of rack or cabinet it wants to sell you — along with software to connect them — of which it really thinks you should get at least 3:
- The 4 main Teradata-software appliances:
- Active Enterprise Data Warehouse (the new 6700). Teradata thinks every sufficiently large enterprise should have one of these.
- Extreme Performance Appliance (Teradata 4xxx), based on solid-state drives (which are also used in the 6xxx systems). At least I think so; the 4xxx wasn’t in the most recent slide deck I saw.
- Data Warehouse Appliance (Teradata 2700).
- Extreme Data Appliance (Teradata 1650).
- The Teradata Aster Big Analytics Appliance, running Aster and Hadoop software. Teradata basically thinks everybody should have one of these too.
- A separate cabinet for special-purpose “Teradata Managed Servers”. While there’s some space for Managed Servers in other Teradata appliances, Teradata now offers so many such capabilities that it thinks you will likely need a separate rack for those as well. These include (partial list):
- Viewpoint system management.
- Teradata Unity.
- Data movement, which is not the same thing as Teradata Unity.
- Data loading, which is yet something else.
- Generic compute (notably, to run SAS).
Even that doesn’t exhaust the possibilities:
- The 36 InfiniBand ports Teradata can fit into a cabinet aren't enough, it suggests, and it presumably will sell you free-standing Mellanox switches as an alternative.
- That slide deck split the Big Analytics Appliance back out into Aster and Hadoop options.
- There also seems to be a SAS-specific modeling appliance.
And you can have — or in some cases must have — Teradata Managed Server nodes in other kinds of Teradata appliance.
Finally, Teradata also offers a stand-alone single- or several-node Teradata 670 Data Mart Appliance, notes on which include:
- The Teradata 670's entry price is under $1/2 million, if you want to use it as your first Teradata system (something that evidently is happening, mainly outside the Americas).
- Another use for the Teradata 670 is for physical — as opposed to virtual — data mart spin-out.
- The primary use for the Teradata Data Mart Appliance, however, seems to be test/development for larger Teradata systems.
- The Teradata Data Mart Appliance is one of the options for placing in a separate managed-server Teradata rack.
As vendors so often do, Teradata has caused itself some naming confusion. SQL-H was introduced as a facility of Teradata Aster, to complement SQL-MR.* But while SQL-MR is in essence a set of SQL extensions, SQL-H is not. Rather, SQL-H is a transparency interface that makes Hadoop data responsive to the same code that would work on Teradata Aster …
*Speaking of confusion — Teradata Aster seems to use the spellings SQL/MR and SQL-MR interchangeably.
… except that now there’s also a SQL-H for regular Teradata systems as well. While it has the same general features and benefits as SQL-H for Teradata Aster, the details are different, since the underlying systems are.
I hope that’s clear.
I talked Friday with Deep Information Sciences, makers of DeepDB. Much like TokuDB — albeit with different technical strategies — DeepDB is a single-server DBMS in the form of a MySQL engine, whose technology is concentrated around writing indexes quickly. That said:
- DeepDB’s indexes can help you with analytic queries; hence, DeepDB is marketed as supporting OLTP (OnLine Transaction Processing) and analytics in the same system.
- DeepDB is marketed as “designed for big data and the cloud”, with reference to “Volume, Velocity, and Variety”. What I could discern in support of that is mainly:
- DeepDB has been tested at up to 3 terabytes at customer sites and up to 1 billion rows internally.
- Like most other NewSQL and NoSQL DBMS, DeepDB is append-only, and hence could be said to “stream” data to disk.
- DeepDB’s indexes could at some point in the future be made to work well with non-tabular data.*
- The Deep guys have plans and designs for scale-out — transparent sharding and so on.
*For reasons that do not seem closely related to product reality, DeepDB is marketed as if it supports “unstructured” data today.
Other NewSQL DBMS seem “designed for big data and the cloud” to at least the same extent DeepDB is. However, if we’re interpreting “big data” to include multi-structured data support — well, only half or so of the NewSQL products and companies I know of share Deep’s interest in branching out. In particular:
- Akiban definitely does. (Note: Stay tuned for some next-steps company news about Akiban.)
- Tokutek has planted a small stake there too.
- Key-value-store-backed NuoDB and GenieDB probably lean that way. (And SanDisk evidently shut down Schooner's RDBMS while keeping its key-value store.)
- VoltDB, Clustrix, ScaleDB and MemSQL seem more strictly tabular, except insofar as text search is a requirement for everybody. (Edit: Oops; I forgot about Clustrix’s approach to JSON support.)
Edit: MySQL has some sort of an optional NoSQL interface, and hence so presumably do MySQL-compatible TokuDB, GenieDB, Clustrix, and MemSQL.
Also, some of those products do not today have the transparent scale-out that Deep plans to offer in the future.
Among the 10 people listed as part of Deep Information Sciences’ team, I noticed 2 who arguably had DBMS industry experience, in that they worked at virtualization vendor Virtual Iron, and stayed on for a while after Virtual Iron was bought by Oracle. One of them, Chief Scientist & Architect Tom Hazel, also was at Akiban for a few months, where he did actually work on a DBMS. Other Deep Information Sciences notes include:
- Deep has 25 or so people in all.
- Deep had a recent $10 million funding round.
- Deep Information Sciences is the former Cloudtree, which as of February, 2011 was pursuing quite a different strategy. (Evidently there was a pivot.) Deep was founded in 2010.
- There are 2 paying customers for DeepDB, even though it’s still in beta, and 8 trials. A similar number of trials and strategic partners are queued up.
- DeepDB general availability is expected later this quarter.
Although our call was blessedly technical, we didn’t have a chance to go through the DeepDB architecture in great detail. That said, DeepDB seems to store data in all of 3 ways:
- An in-memory row store.
- An on-disk row store with a very different architecture.
- Indexes, which can also serve as a column store.
Notes on that include:
- DeepDB’s in-memory row store is designed to manage single rows as much as possible, rather than pages. Indeed, there are “aspects of tries”, although we didn’t drill down into what exactly that meant.
- Indexes are streamed to disk no less than once every 15 seconds, by default, and perhaps with latency as low as 10 milliseconds.
- Perhaps the most important point I didn’t grasp is “segments”. The data and indexes on disk are stored in segments, which can be of different sizes, and which may each carry some summary data/metadata/whatever. Somehow, this is central to DeepDB’s design.
- In what is evidently a design focus, DeepDB tries to get the benefit of “in-memory data” that isn’t actually taking up RAM. B-trees can point at rows that aren’t actually in memory. Segments evicted from cache can leave some metadata or summary data behind.
- DeepDB’s compression story seems to be a work in progress.
- There’s prefix compression already, at least in the indexes, which Deep just calls “compaction”. (There’s a sketch of the general idea after this list.)
- Other compression is working in the lab, but not scheduled for Version 1.0.
- Block compression seems to be in play.
- Delta compression was mentioned once.
- Dictionary compression wasn’t mentioned at all.
- DeepDB apparently will keep compressed data in cache, then decompress it to operate on it.
- Different segments can be compressed/uncompressed differently.
- DeepDB’s on-disk row store is append-only. Time-travel is being worked on. While I forgot to ask, it seems likely that DeepDB has MVCC (Multi-Version Concurrency Control).
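Prefix compression of sorted index keys is a standard trick: adjacent keys tend to share leading bytes, so you store only a shared-prefix length plus the differing suffix. A toy sketch of the idea (mine, not DeepDB's on-disk format):

```python
# Toy prefix compression of sorted index keys -- illustrative only,
# not DeepDB's actual format.

def prefix_compress(sorted_keys):
    compressed, prev = [], ""
    for key in sorted_keys:
        shared = 0
        for a, b in zip(prev, key):
            if a != b:
                break
            shared += 1
        compressed.append((shared, key[shared:]))  # (shared-prefix length, suffix)
        prev = key
    return compressed

def prefix_decompress(compressed):
    keys, prev = [], ""
    for shared, suffix in compressed:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

keys = ["customer:1001", "customer:1002", "customer:1017", "customers:2"]
packed = prefix_compress(keys)
print(packed)                      # short suffixes instead of repeated prefixes
assert prefix_decompress(packed) == keys
```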
And finally: DeepDB in its current form is a “drop-in” InnoDB replacement, but not necessarily bug-compatible.
I have been using the analogy that sometimes getting WebCenter projects started, progressed or completed is like climbing a mountain. Customers aren’t always sure where to begin, how to stay on path, or what obstacles may lie ahead. Most customers seem to want to evolve their WebCenter use cases, say from standard content management to an enterprise portal, but not knowing such things as the amount of effort required, technical complexities, and deployment options tends to keep such projects at the base of the proverbial WebCenter mountain.
What better place to start your trek up that mountain than Denver, Colorado – site of Collaborate 13? Fishbowl Solutions will be there, and we would enjoy discussing your WebCenter projects and how we might assist in helping those projects get started, progressed and completed – avoiding the cliffs and jagged rocks along the way. We would also like to share with you some new and exciting ways that your trek can be made easier through our value-add WebCenter solutions. Here is a quick description of the solutions we will highlight at Collaborate 13:
- Mobile Applications: Access WebCenter Content on Apple and Android mobile devices
- Google Search Appliance Connector: Improve the relevancy of search results across your WebCenter-based systems
- Intranet In A Box: Framework to Build a Next-Generation Intranet in 60 Days
- WebCenter Upgrade Package: Comprehensive plan to move to 11g
These solutions will be demonstrated in our booth – #1277 – and will be discussed across our six presentations. Be sure to check out our Collaborate 13 page for all the details on our Collaborate activities. We look forward to helping you start your WebCenter ascent at Collaborate 13.
Last year back in February we had PS5, and now, with PS6 of the WebCenter Suite released yesterday, I can say it's all just getting better and better!
A rundown of the new JDeveloper enhancements can be seen here.
Here are some of the items that caught my eye; you may have seen them on the Twitter stream, with a couple of early tweets before the official release.
First, the new Skyros skin (I'm presuming it's named after the Greek island).
It's a very clean and great-looking skin; it uses a lot of CSS3 properties instead of hundreds of images to structure components. Tabs degrade nicely for older browsers (IE8 and below), e.g. rounded corners become square.
There are also a few new skin selector properties that tidy up the structure for better skinning development – I'll try to post some updates later on with a rundown of some of the new skinning enhancements.
You can see it in action in the new PS6 ADF Faces Rich Client here.
There are a few DVT extras like Sunburst, although I'm sorry to say I've never been too impressed with the DVTs you get out of the box.
PanelGridLayout makes it across from R2 into R1 PS6, and it looks promising.
It follows the CSS3 specs for grid layout, so it can be optimized for layout performance, and it's also the recommended UI layout component for most pages.
Runtime code editor: finally, colour-coded goodness!
I believe it's using CodeMirror – great job.
The File Uploader also looks interesting; I haven't tried it out yet. Drag-and-drop support, with Java support for older browsers, is interesting – looking forward to seeing it in action.
Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.
- It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy. (There’s a sketch of this write path after this list.)
- Many workloads are inherently single node (replication aside). Others are not.
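Here is a deliberately simplified sketch of that write path (not any particular vendor's implementation): the client is acknowledged once the write sits in RAM on two replicas, while a background thread on each replica drains the buffer to disk.

```python
# Simplified sketch of "ack once in RAM on 2 nodes, persist to disk later".
# Not any specific product's implementation.
import queue
import threading
import time

class Replica:
    def __init__(self, name):
        self.name = name
        self.memtable = []                 # writes living in RAM
        self.flush_queue = queue.Queue()   # waiting to be written to disk
        threading.Thread(target=self._flusher, daemon=True).start()

    def write_to_ram(self, record):
        self.memtable.append(record)
        self.flush_queue.put(record)       # durability happens asynchronously

    def _flusher(self):
        while True:
            record = self.flush_queue.get()
            time.sleep(0.01)               # pretend this is a disk write
            print(f"{self.name}: persisted {record}")

def synchronous_write(replicas, record):
    for r in replicas:                     # block until RAM copies exist...
        r.write_to_ram(record)
    return "ack"                           # ...but not until disk catches up

replicas = [Replica("node-a"), Replica("node-b")]
print(synchronous_write(replicas, {"key": 1, "value": "hello"}))
time.sleep(0.1)                            # give the background flushes a moment
```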
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included:
- MongoDB’s tunable consistency seems really interesting, with numerous choices available at the program-statement level. (See the sketch after this list.)
- All rumored performance problems notwithstanding, Ron asserts that MongoDB often “kicks butt” in actual proof-of-concept (POC) bake-offs.
- Ron cites “12 different language bindings” as a key example of developer functionality giving 10gen an advantage vs. Ron’s previous employer MarkLogic.
- 10gen is working hard on management tools, security, and so on.
- Ron claims that the “MongoDB loses data” knock is a relic of the distant — i.e. 1-2 years ago — past.
- We had the same “Who needs joins?” discussion that I used to have with MarkLogic — Ron’s former company — and which MarkLogic has since disavowed.
- There’s nothing special about MongoDB’s b-tree indexes. (I mention that because Tokutek thinks it offers a faster MongoDB indexing option.)
While this wasn’t a numbers-oriented conversation, business highlights included:
- A lot of MongoDB’s competition is RDBMS — Oracle, SQL Server, MySQL, etc.
- MongoDB’s top NoSQL competitor is Cassandra. 10gen sees less Couchbase than before, and also less HBase than Cassandra.
- There’s yet another favorable MongoDB soft metric — 50,000 registrants for free online education, 2/3 outside the US.
I can add that anecdotal evidence from other industry participants suggests there’s a lot of MongoDB mindshare.
Specific traditional-enterprise use cases we discussed focused on combining data from heterogeneous systems. Specifically mentioned were:
- Reference data/360-degree customer view.
- Reference data about securities.
- Aggregation of analytic results from various analytic systems across an enterprise (for risk management).
DBAs’ roles in development
A lot of marketing boils down to “We don’t need no stinking DBAs!!!” I’m thinking in particular of:
- Hadoop and/or exploratory BI* messaging that positions against the alleged badness of “traditional data warehousing”.
*See in particular the comments to that post.
The worst-case data warehousing scenario is indeed pretty bad. It could feature:
- Much internal discussion and politicking to determine the One True Way to view various data fields, with …
- … lots of ongoing bureaucratic safeguards in the area of data governance.
- Long additional efforts in the area of performance tuning.
- Data integration projects up the wazoo.
But if the goal is just to grab some data from an existing data warehouse, perhaps add in some additional data from the outside, and start analyzing it — well, then there are many attempted solutions to that problem, including from within the analytic RDBMS world. The question is whether the data warehouse administrators try to help — which usually means “Here’s your data; now go away and stop bothering me!” — or whether they focus on “business prevention”.
Meanwhile, on the NoSQL side:
- The smart folks at WibiData felt the need for schema-definition tools over HBase.
- Per Ron Avnur, MongoDB users are clamoring for consistency-rule specification via an administrative (rather than programmatic) UI.
It’s the old loose-/tight-coupling trade-off. Traditional relational practices offer a clean interface between database and code, but bundle the database characteristics for different applications tightly together. NoSQL tends to tie the database for any one app tightly to that app, at the cost of difficulties if multiple applications later try to use the same data. Either can make sense, depending on (for example):
- How it seems natural to organize your development and data administration talent.
- Whether the app is likely to survive long enough that you’ll want to run many other applications against the same database.
Upgrading to Oracle WebCenter Content 11g: Use Upgrade Assistant or Migrate Content to a New Instance?
We held a webinar recently to introduce our 11g Upgrade Package. One good question was asked a couple of times and really merits a better answer than could be delivered in the Q&A session, so I thought we could elaborate a bit more here. The question is: Is it better to use the 11g Upgrade Assistant to upgrade in place, or to migrate everything to a fresh new 11g instance? The short answer is: it depends. :) Another good answer is: neither.
You may now understand why I wanted to expand on the short answers given at the end of the webinar. I’ll describe the two approaches below as well as a potentially better 3rd way. Whichever approach is selected, the process needs to be well thought out, clearly defined, and tested before doing anything in production.
Migrating to New Instances
Migrating to new instances isn’t really an upgrade, but it is very common. For a long time, this was the standard recommended way that we’d execute upgrades. In this scenario, wholly new instances of Content Server are installed and all configuration, status information, and content are migrated to the new instances. This is necessary if you are planning to change the type of hardware or database to be used going forward.
Care must be taken to get all configuration migrated. The Configuration Migration Utility (CMU) is used to migrate the bulk of the configuration, but archiver is used to migrate custom table data. Also, many add-ons or optional components require special steps to migrate other relevant information.
The migration of content is typically performed with the Archiver applet. This process is error prone and needs to be monitored carefully. Migrating content in a workflow lies somewhere between difficult and not supported. If a lot of content needs to be migrated, a custom process is likely to be recommended. If the instance manages a lot of content, a migration can take a long time and this needs to be accounted for in the planning. Fishbowl has created the Content Migrator to support the migration of content from one instance to another without needing to use Archiver.
It’s all doable, but migrations are usually more complicated than the other options.
Using the Upgrade Assistant
Use of the Upgrade Assistant allows for ‘in place’ upgrades. The high-level process involves:
- Installing WebLogic Server (or the supported container of your choice)
- Installing the WCC installation files
- Then running the Upgrade Assistant pointing to the existing instance of Content Server.
The Upgrade Assistant will take over the existing instance and upgrade the file system and database schema to support 11g. It will disable any older patch components and enable the 11g versions of any standard components. It will also disable any custom components and give you the list of these. For this reason, it is very important to do this in a quality dev environment first, as it will take time to enable, test, and likely fix any custom components in use in your system. Once the dev instance is upgraded and tested, the production instance is upgraded in exactly the same way, but after the Upgrade Assistant finishes, you can install the upgraded components from the dev instance, restart and do your final system testing before turning it back over to your users.
The benefit of leveraging the Upgrade Assistant is that you don’t have to migrate the configuration and content to the new instance. If you aren’t changing your search index (Verity is no longer supported in any way), you don’t have to immediately rebuild your search index either, though it is highly recommended that a rebuild be completed after the upgrade.
The downside of this approach is that you are directly working in the production instance. There is no way around significant downtime and if something goes wrong you’re under the gun to rectify it immediately or you have to restore everything from tape and try again later.
Copy & Upgrade in Place
The third option is to make a copy of the production instance and upgrade it in place. As long as you’re not changing the type of operating system or database, this has worked well for us. If there’s a lot of content, the copying itself can take a fair amount of time, but then the upgrade can be executed on new and refreshed hardware and the production system need only be down while the copy is being made. After the copy, the old Content Server can be restarted in read only mode until you can switch over to the new system. Sometimes this is only done to make a new dev or stage instance if the old instances were out of sync. After having run the upgrade in those two environments, customers may be comfortable running the production upgrade in place.
If you’d like us to assist you with an upgrade to WCC 11g we’ll ask that you complete an upgrade questionnaire. We put this together to most efficiently estimate the level of effort for an upgrade. Some of the questions included on the questionnaire help us determine which of these approaches might best fit your needs.
If you’d like to see what we think, please email any of us here or firstname.lastname@example.org and request a copy of the upgrade questionnaire.
Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.
In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.
Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed:
- Platfora incrementally batch-loads data from Hadoop into its own bare-bones SQL data store, and does BI against that. That data store:
- Of course wants to run in-memory whenever possible …
- … but also has a significant disk-based aspect.
- Is true-columnar on disk and in memory alike.
- Stores all columns from a given row on the same nodes.
- Specifically, Platfora builds star-schema data marts, called “lenses”. To avoid data bloat on the Platfora servers:
- Two lenses with the same data often only store it once.
- The data for a given lens can be “evicted” if it won’t be needed for a while. (But the specifications for the lens are of course kept in case you want to rebuild it later.)
Notes on Platfora’s Hadoop ETL (Extract/Transform/Load) include:
- The basic idea is that you periodically re-run a job to pick up incremental changes since the last load.
- Right now that’s just a cron job or something. Platfora plans to add scheduling features imminently.*
- Platfora is sensitive to Hive partitioning.
- Platfora can run filters and so on to extract non-Hive data (the more common case).
*But in a sad comment on Hadoop’s workload management capabilities, Platfora doesn’t expect these features to be much used, at least at first.
Platfora’s aggregation story goes something like this:
- If an aggregate can be updated incrementally — for example a count or sum — Platfora probably will maintain it for you and update it on load.
- Ditto if it can be maintained almost incrementally — for example an average.
- Platfora also does Distinct calculations, even though those have to be worked through on its own servers.
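My own toy illustration of the distinction (nothing to do with Platfora's code): counts and sums can absorb a new batch of rows without touching the old ones, an average falls out if you keep the sum and count behind it, but an exact distinct count needs the full set of values.

```python
# Toy incremental aggregate maintenance -- illustrative, not Platfora's code.

state = {"count": 0, "sum": 0.0, "distinct": set()}

def load_batch(state, values):
    """Fold a new batch of loaded values into the running aggregates."""
    state["count"] += len(values)                  # purely incremental
    state["sum"] += sum(values)                    # purely incremental
    state["distinct"].update(values)               # needs the whole value set
    return state

for batch in ([3, 5, 5], [5, 7], [3, 9]):
    load_batch(state, batch)

print("count:", state["count"])                    # 7
print("sum:", state["sum"])                        # 37.0
print("avg:", state["sum"] / state["count"])       # "almost incremental": sum/count
print("count distinct:", len(state["distinct"]))   # 4 -- had to keep every value
```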
As you would expect, Version 1 of the Platfora data store has various limitations, such as:
- Platfora Version 1 can’t do much with arrays or (other) nested data structures — it just transforms them into JSON strings.
- Platfora’s SQL support is limited.
- The Platfora data store has a “fat head” master (but at least that head is multi-node).
Naturally, Platfora hopes to fix these issues down the road.
Finally, a few company notes:
- Platfora has had 20 beta users, mainly but not entirely among online businesses.
- Platfora has close to 50 people.
- Platfora is currently focused on US direct sales, relying on inbound leads.
- The trend to clustered computing is sustainable.
- The trend to appliances is also sustainable.
- The “single” enterprise cluster is almost as much of a pipe dream as the single enterprise database.
I shall explain.
Arguments for hosting applications on some kind of cluster include:
- If the workload requires more than one server — well, you’re in cluster territory!
- If the workload requires less than one server — throw it into the virtualization pool.
- If the workload is uneven — throw it into the virtualization pool.
Arguments specific to the public cloud include:
- A large fraction of new third-party applications are SaaS (Software as a Service). Those naturally live in the cloud.
- Cloud providers have efficiencies that you don’t.
That’s all pretty compelling. However, these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.
Why would you not move work into a cluster at all? First, if it ain’t broken, you might not want to fix it. Some of the cluster options make it easy for you to consolidate existing workloads — that’s a central goal of VMware and Exadata — but others only make sense to adopt in connection with new application projects. Second, you might just want device locality. I have a gaming-class PC next to my desk; it drives a couple of monitors; I like that arrangement. Away from home I carry a laptop computer instead. Arguments can be made for small remote-office servers as well.
To put all that more simply:
- Moving existing applications to new platforms often isn’t worth the trouble.
- Many needs can be best met by single, physically local devices.
Appliances are a natural form factor for single-purpose computing. It is reasonable to characterize as “appliances” — in the computing sense of the term — medical equipment, vehicles, cash machines, cash registers, enterprise security devices, home entertainment, exercise machines and, yes, refrigerators; computers, in some form, can be found almost anywhere. But appliances also are a convenient way to package enterprise systems — configurations will be correct, installation will be simpler, and fortunate software-centric appliance vendors may capture margins on hardware sales and support. And the idea of SaaS-like continuous updates to your enterprise systems seems much more reasonable in the case of a locked-down appliance-like configuration.
Circling back to the beginning, I’d say there are multiple reasons not to expect all your computing to be done on a single cluster:
- You might want to use appliances that don’t fit into that cluster.
- You might want to use SaaS offerings that don’t fit into that cluster.
- The efficiency gains from using a single cluster aren’t that much greater than the gains from using a few of them.
- You might want different parts of your computing work to be done in-house and in the public cloud.
- You might want different parts of your data to be kept in different countries.
- Different kinds of work might fit better onto differently-configured nodes, and current cloud/cluster technology doesn’t do a wonderful job with heterogeneity.
- A lot of computing is so inherently small and local that it shouldn’t be clustered at all.
Ceteris paribus, fewer clusters are better than more of them. But all things are not equal, and it’s not reasonable to try to reduce your clusters to one — not even if that one is administered with splendid efficiency by low-cost workers, in a low-cost building, drawing low-cost electric power, in a low-cost part of the world.
If I had my way, the business intelligence part of investigative analytics — i.e., the class of business intelligence tools exemplified by QlikView and Tableau — would continue to be called “data exploration”. “Exploration” suggests finding out what’s actually going on, and it also carries connotations of the “fun” that users report having with the products. By way of contrast, I don’t know what “data discovery” means; the problem these tools solve is that the data has been insufficiently explored, not that it hasn’t been discovered at all. Still, “data discovery” seems to be the term that’s winning.
Confusingly, the Teradata Aster library of functions is now called “Discovery” as well, although thankfully without the “data” modifier. Further marketing uses of the term “discovery” will surely follow.
Enough terminology. What sets exploration/discovery business intelligence tools apart? I think these products have two essential kinds of feature:
- Query modification.
- Query result revisualization.*
Here’s what I mean.
*I’d wanted to call this re-presentation. But that would have been … pun-ishing.
The canonical form of query modification is:
- There’s a scatter plot or other graphical data visualization.
- You select a rectangular area on the graph.
- A new visualization is drawn.
That capability is much more useful in systems that allow you to change how the data is visualized, both:
- Before you select a subset of the results (so you can choose which visualization is easiest to select from).
- After you’ve made the selection (it would be silly to stay in a monthly bar chart if you’ve just selected a single month).
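In code, the whole interaction amounts to turning the selected rectangle into a filter on the underlying result set and then re-aggregating or re-plotting. A bare-bones pandas sketch of the pattern (my illustration, not how any particular BI tool implements it):

```python
# Bare-bones "select a rectangle, redraw" sketch using pandas.
# Illustrates the pattern only; real BI tools do this against their own engines.
import pandas as pd

orders = pd.DataFrame({
    "order_value": [12, 250, 87, 430, 55, 610, 140, 95],
    "discount_pct": [0, 15, 5, 30, 2, 25, 10, 8],
    "region": ["East", "West", "East", "West", "South", "West", "East", "South"],
})

# The user drags a rectangle on a scatter plot of order_value vs. discount_pct.
x_min, x_max = 50, 300      # selected order_value range
y_min, y_max = 0, 20        # selected discount_pct range

selection = orders[
    orders["order_value"].between(x_min, x_max)
    & orders["discount_pct"].between(y_min, y_max)
]

# A new visualization is drawn from just the selected points --
# here, re-aggregated by region instead of re-plotting the scatter.
print(selection.groupby("region")["order_value"].agg(["count", "mean"]))
```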
Other forms of query modification, such as faceted drill-down or parameterization, don’t depend as heavily on flexible revisualization. Perhaps not coincidentally, they’ve been around longer in some form or other than have the QlikView/Tableau/Spotfire kinds of interfaces. But at today’s leading edge, query modification and query result revisualization are joined at the hip.
What else is important for these tools?
- Good UI design, of course.
- Speed — split seconds matter.
- Most of the same features that matter for business intelligence tools with other kinds of UI.
Please note that speed is a necessary condition for exploratory BI, not a sufficient one; a limited UI that responds really fast is still a limited UI.
As for how the speed is achieved — three consistent themes are columnar storage, compression, and RAM. Beyond that, the details vary significantly from product to product, and I won’t try to generalize at this time.
- The importance of data exploration flexibility (July, 2012)
- QlikView architecture (June, 2010)
- A cool QlikView feature that isn’t particularly tied to data exploration (November, 2011)
- Endeca’s underlying technology (April, 2011)
Oracle Universal Content Management 10gR3 was released in May 2007. Since that time, Oracle WebCenter Content 11g has been released, and Oracle WebCenter 12c is on the horizon. For 10gR3 customers, the next step down the WebCenter path is to upgrade to 11g. However, some customers don’t know where to begin in terms of an upgrade – not when their current version is supporting numerous business processes, contains thousands of high-value content items, and has been customized numerous times to meet business requirements.
Join Jason Lamon, Senior Marketing Associate, and Alan Mackenthun, Technical Program Manager at Fishbowl Solutions as they discuss Fishbowl’s path, package and promise for WebCenter Content 11g upgrades. They are also privileged to be joined by Mike Kohorst – IT Application Manager at Ryan Companies, who will discuss their recent 11g upgrade success, as well as their future plans for the system. We hope you will be able to join us!
Date: Thursday, March 21st
Time: 1 pm EST, Noon CST
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.
But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.
MarkLogic …
… has been around long enough to make a good DBMS. It used to make a solid XML DBMS. Now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance. The JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.
RDBMS-oriented Hadoop file formats are confusing
I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.
Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly-intended) include but are not limited to:
- Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely?
- What’s the nested data structure story? (It seems there is one.)
- What’s the compression story?
Come to think of it, the name “Parquet” suggests that either:
- Rows and columns are mixed together.
- Somebody has the good taste to be a Celtics fan.
Whither analytic platforms?
I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.
I think that problems include:
- Analytic platforms are marketed too purely as a development play. Selling six-to-seven figure application development deals is hard.
- But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
- … a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
- More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.
Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.
I coined a new term, dataset management, for my clients at Revelytix, which they indeed adopted to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both:
- Metadata management in a structured-file context.
- Lineage/provenance, auditing, and similar stuff.
Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. Second, “metadata” could apply either to data about the file or to data about the data structures in the file or perhaps to data about data in the file, making “metadata” an even more confusing term in this context than in others.
My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:
- A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
- But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.
As for the specific products, both of which you might want to check out:
- Cloudera Navigator:
- Is one product from a leading Hadoop company.
- Assumes you use Cloudera’s flavor of Hadoop.
- Is generally available.
- Starts with auditing (lineage coming soon).
- Revelytix Loom:
- Is the main product of a small metadata management company.
- Is distro-agnostic.
- Is in beta.
- Already does lineage.
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
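To make the composition concrete, here is a purely schematic sketch (mine, not Tez's actual API) in which Map and Reduce steps are just particular stacks of those primitives, and a DAG engine is free to wire vertices together and skip the intermediate HDFS materialization:

```python
# Schematic composition of Tez-style primitives -- my illustration, not Tez's API.

HDFS_IN, HDFS_OUT = "hdfs-input", "hdfs-output"
SORT_IN, SORT_OUT = "sorted-input", "sorted-output"
SHUFFLE_IN, SHUFFLE_OUT = "shuffled-input", "shuffled-output"

# As described above: a Map step and a Reduce step are fixed stacks of primitives.
MAP = [HDFS_IN, SORT_OUT, SHUFFLE_OUT]
REDUCE = [SORT_IN, SHUFFLE_IN, HDFS_OUT]

# A two-stage query as chained MapReduce jobs: the first job's HDFS output
# becomes the second job's HDFS input.
two_mapreduce_jobs = MAP + REDUCE + MAP + REDUCE

# The same query as one Tez DAG: the middle HDFS round trip disappears,
# because vertex outputs feed the next vertex directly.
one_tez_dag = MAP + [SORT_IN, SHUFFLE_IN,              # reduce-like vertex...
                     SORT_OUT, SHUFFLE_OUT] + REDUCE   # ...feeding a map-like vertex

print("HDFS writes, chained MapReduce:", two_mapreduce_jobs.count(HDFS_OUT))  # 2
print("HDFS writes, single Tez DAG:   ", one_tez_dag.count(HDFS_OUT))         # 1
```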
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement for materializing (onto disk) intermediate results that you don’t want to is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)