Other

Notes on the analysis of large graphs

DBMS2 - Sun, 2012-05-13 21:35

This post is part of a series on managing and analyzing graph data.

My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.

How big can a graph be? That of course depends on:

  • The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
  • The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
  • The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.

*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.

The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:

  • An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
  • A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.

Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the average number of edges per node seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
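A quick back-of-envelope calculation makes these sizes concrete. The sketch below checks the 34-bit figure for 10 billion nodes and estimates raw storage at 100 bytes per triple. The specific node counts and the figure of 1,000 edges per node are illustrative assumptions picked from the ranges quoted above, and terabytes are decimal (10^12 bytes).

```python
def bits_for_ids(node_count: int) -> int:
    """Smallest number of bits that gives every node a distinct integer token."""
    return (node_count - 1).bit_length()


def raw_size_tb(edge_count: int, bytes_per_triple: int) -> float:
    """Raw storage in decimal terabytes, ignoring indexes and replication."""
    return edge_count * bytes_per_triple / 10**12


# 10 billion nodes fit in 34 bits, since 2**33 < 10**10 <= 2**34.
node_id_bits = bits_for_ids(10_000_000_000)

# Assumed intelligence-style graph: 1 billion nodes, 1,000 edges per node.
intelligence_tb = raw_size_tb(1_000_000_000 * 1_000, 100)

# Assumed telecom-style graph: 100 million nodes, 1,000 edges per node.
telecom_tb = raw_size_tb(100_000_000 * 1_000, 100)

print(node_id_bits, intelligence_tb, telecom_tb)
```

Under those assumptions the larger graph is on the order of 100 TB and the smaller one on the order of 10 TB, which is why neither fits comfortably into RAM on a single server.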

Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be:

  • Store a table with one row per edge.
  • Do an (n-1)-way join, where n is the number of edges in the path or subgraph.

In many cases the cardinality of intermediate result sets would be high, and you’d basically be doing a series of full table scans. Those could take a while.
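To make the naive approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `edges`, its columns `src` and `dst`, and the toy data are all assumptions for illustration. Finding a path of three edges takes two joins of the edge table against itself, and each extra edge in the pattern adds another join.

```python
import sqlite3

# Hypothetical edge table: one row per directed edge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("b", "d")],
)


def paths_of_three_edges(conn):
    """Return every 3-edge path as a (n1, n2, n3, n4) tuple.

    Three copies of the edge table are chained with two joins.
    """
    sql = """
        SELECT e1.src, e1.dst, e2.dst, e3.dst
        FROM edges AS e1
        JOIN edges AS e2 ON e2.src = e1.dst
        JOIN edges AS e3 ON e3.src = e2.dst
        ORDER BY 1, 2, 3, 4
    """
    return conn.execute(sql).fetchall()


print(paths_of_three_edges(conn))
```

Without a useful index on `src`, each join step has to scan the whole edge table, and the intermediate result can grow by a factor of the average out-degree at every step.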

There are various approaches to dealing with this challenge. For example:

  • Graph analysis has been around long enough that much of it has surely been done relationally.
  • I wrote about some specific relational strategies for graph analysis five years ago.
  • A lot of graph analysis these days is being done in Hadoop (or other MapReduce, notably Aster Data’s).
  • Objectivity Infinite Graph and Google Pregel emphasize pre-fetching (or pre-shipping) edges that might soon be needed.
  • Yarcdata, with its Cray genes, tries to optimize hardware (single RAM image across a cluster, with a whole lot of multithreading) for in-memory Apache Jena performance. Unfortunately, I’m not clear as to which data structure(s) Jena uses.

When trying to figure out which of these techniques is likely to win in the most demanding cases, I run into the key controversy around analytic graph data management — how successfully can graphs be partitioned? Opinions vary widely, with the correct answers in each case surely depending on:

  • The topology of the graph.
  • The size of the graph.
  • The length of the paths that need to be examined.

But in the interest of getting this posted tonight, I’ll leave further discussion of graph partitioning to another time.
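In the meantime, one common way to put a number on "how successfully can a graph be partitioned" is the edge cut: the fraction of edges whose two endpoints land on different servers. The sketch below is only an illustration of that metric, not a method taken from any of the products above; the toy graph and both assignment functions are assumptions.

```python
def cut_fraction(edges, assign):
    """Fraction of edges whose endpoints are assigned to different partitions."""
    cut = sum(1 for u, v in edges if assign(u) != assign(v))
    return cut / len(edges)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]


def by_block(node):
    """Topology-aware assignment: each triangle gets its own partition."""
    return node // 3


def by_parity(node):
    """Topology-blind assignment, standing in for a hash partitioner."""
    return node % 2


print(cut_fraction(edges, by_block))   # only the bridge is cut
print(cut_fraction(edges, by_parity))  # most edges are cut
```

The same graph goes from one cut edge in seven to five in seven depending on how nodes are assigned, which is one reason the answer depends so heavily on topology.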

Categories: Other

We’re back

DBMS2 - Sun, 2012-05-13 21:11

Our blogs have been moved to a new hosting company, and everything should be working. Ditto our business site.

If you notice any counterexamples, please be so kind as to ping me.

Categories: Other

Collaborate 12 Recap – 12 Brief Takeaways

 

Fishbowl Solutions wrapped up a successful Collaborate conference on April 26th. Here are my 12 takeaways from the event – in no particular order.

  1. A lot of user group crossover. Badges denoting the various user groups – IOUG, OAUG, Quest – were marked by different colors, and I saw quite a mix of these colors within WebCenter sessions. I think the organizers and user groups that make up Collaborate have always wanted to see this user group crossover, and it appears that it is finally happening. This has evolved pretty much in parallel with WebCenter releases. As WebCenter added more integrations with Oracle applications (E-Business Suite, PeopleSoft) and leveraged more products in the Middleware layer (Identity Management, Business Intelligence), organizations using those applications and middleware components have had almost 5 years now to hear the WebCenter story. It appears that organizations with a heavy Oracle footprint are finally understanding the value of including WebCenter as a core piece of their infrastructure, which includes being able to surface high-value content to business applications. The story is resonating.

  2. SharePoint. “We have SharePoint. Help!” OK, I didn’t actually hear anyone say that, but that is how I would summarize the overall angst of customers that have both WebCenter and SharePoint, or are Oracle shops that have seen Microsoft encroachment. It isn’t surprising that we talked to a lot of attendees that have SharePoint, because most organizations are using SharePoint in one form or another. However, organizations are still trying to figure out and separate their use cases for SharePoint and Oracle WebCenter. One discussion we had was with a gentleman from an organization that had just recently been exposed to SharePoint and was planning to roll it out and use it for content management. They were a smaller organization – about 150 employees – and they decided on SharePoint because it was going to be so much cheaper than other content management systems. As we discussed their use case, the other Oracle products they were using (Oracle Database and JD Edwards), as well as the technical staff they had on hand (mostly Java developers), the conversation shifted to the ongoing costs of SharePoint. We helped make them aware that their 3-5 year costs to manage, maintain and integrate SharePoint with their other Oracle systems were probably going to be higher than with WebCenter. The long-term and ongoing costs of SharePoint are typically not considered when it first gets rolled out, and Oracle and Fishbowl have some great material available that bears this out, all backed by industry findings. Check out these white papers:

    SharePoint 2010 Cost of Ownership: Expect the Unexpected
    SharePoint is NOT an ECM System – Reasons Why
  3. Document Imaging. This 25-plus-year-old technology is still relevant, and the reason should be obvious: ROI. Imaging systems enable organizations to reduce the costs associated with handling and processing paper, especially paper tied to a financial process such as accounts payable. All organizations have to pay invoices, and organizations with hundreds or thousands of invoices struggle to get them paid on time and as efficiently as possible. Oracle WebCenter Imaging provides robust imaging capabilities and has been integrated with Oracle Applications to provide end-to-end imaging within business processes. With this system, organizations using E-Business Suite, PeopleSoft, and JD Edwards are able to quickly automate invoice processing and achieve a clear and measurable ROI.

  4. Sites. Lots of excitement and uncertainty around WebCenter Sites. Organizations using Oracle’s web content management (WCM) system, commonly known as SiteStudio, have been anxiously awaiting the next generation of WCM. However, WebCenter Sites is listed as a Web Experience Management platform that includes such features as segmentation and analytics. For organizations looking to increase web site retention and conversion, have access to real-time data on web content effectiveness, and create online catalogs, WebCenter Sites really packs a punch. However, the questions we received from attendees were about what was going to happen to SiteStudio, what to do if they didn’t need all the features that WebCenter Sites offers, and whether there was a migration path from SiteStudio to WebCenter Sites. I do know that there are some integrations/migrations available between SiteStudio and WebCenter Sites, but I think Oracle WCM customers simply want to know if WebCenter Sites will ultimately be the system they use to manage their website.

  5. Mobility. By far, mobile content management was the most popular topic at Fishbowl’s booth. We had a few iPads showing off our application for WebCenter Content, and attendees were initially just excited to see that it was indeed possible to surface WebCenter Content images, documents and videos to iPads. There were also many questions around the technologies used to surface this content. My colleague John Sim did a great job explaining these technologies during his session “Exposing WebCenter Data on Mobile & Desktop Devices through the REST API”. Check out the Slides and White Paper for more details.

  6. User Experience & Interaction. Social applications like Facebook, Twitter and LinkedIn have helped put more focus on the user interface design of enterprise business systems. Users of these social media applications enjoy how easy it is to interactively share information with others. If systems used in the workplace don’t offer at least some level of this intuitiveness and interaction, they may not get used as much as the organization would like, if at all. Technologies like HTML5, Adobe AIR, and touch gestures make it possible to enhance the end-user experience and make the process of contributing and interacting with content not only easy but also fun. I would again like to give a plug to my colleague John Sim by pointing you to his presentation and white paper on the topic.

  7. Lack of Customer-Led Sessions. Collaborate is a user group conference. Unfortunately, of the 70+ WebCenter sessions, I think only around 10 were given by actual WebCenter customers. That doesn’t mean that customers don’t have great stories and use cases to share; my Fishbowl colleagues and I talked to many of them throughout the week whose stories would have made for great presentations. This lack of customer presentations is also probably the most common negative feedback regarding Collaborate. I think the way to change this is for WebCenter partners to step up and encourage their customers to speak at Collaborate. Partners can help out by doing some of the heavy lifting – help draft the white paper and presentation and coordinate attendee logistics. Fishbowl Solutions partnered with three of our customers to deliver presentations at Collaborate. These sessions all had higher attendance than sessions given solely by Fishbowl representatives.

  8. Social. I heard great things about the Oracle Social preview given at the Oracle WebCenter Customer Advisory Board meeting. I wasn’t able to attend, but I heard the presenters actually collected attendee feedback through the Social interface. Cool stuff. Other social/collaborative tools from Oracle have underwhelmed, but Oracle Social looks and sounds very promising.

  9. Enterprise Portals. Still a lot of interest in Portals. My Fishbowl colleagues and I heard from many customers that were still on older portal offerings from Oracle and wanted to discuss getting to WebCenter Portal 11g. Additionally, I think organizations in general are still looking to have in place an enterprise portal that delivers that true, personalized, one-stop shop for all relevant enterprise information – newsfeeds, financial dashboards, links to content, blogs, chat, etc. Pulling all this content together using JSR-286 or ADF-based portlets is possible and there have been successes, but it seems organizations want their portals to have more features and functionality (bells and whistles) and haven’t quite got there yet.

  10. Toga Party!? The “Back to the 80s Party” on Wednesday night was totally rad. Sorry, couldn’t resist. I did see 3 dudes wearing togas, and I had to stop and make sure I was at the right party.

  11. Customer attendance is growing. It seemed attendance overall was up from past Collaborate events. Lots and lots of buzz around Exadata, Cloud, Social, Fusion Applications, and WebCenter Sites and Social. There are a lot of exciting things happening at Oracle.

  12. WebCenter interest is growing. This was my 4th Collaborate event, and I had become accustomed to seeing the same faces year after year. It is always good to see and talk to those same people, but I was very happy to see and meet many new Oracle WebCenter customers this year. The word is getting out that Collaborate is the user group conference for WebCenter customers, and no other event offers the breadth and depth of presentations, demonstrations, and peer-to-peer engagement.

By the way, if you missed any of Fishbowl’s 14 presentations at Collaborate, you can download them and their associated white papers here. See you in Denver for Collaborate 13!


Filed under: 11g, Collaborate, Collaboration, Content Management, Event, Portal, UCM, UX, WebCenter
Categories: Fusion Middleware, Other

Comments are briefly being turned off

DBMS2 - Wed, 2012-05-09 08:56

I need to move web hosts, and am initiating the process now. This involves a large file copy, a recopy of same, and a variety of manual steps. So until the process is complete, updating site databases is a bad idea.

A comment is, of course, an update. So we’re closing off comments across DBMS 2, Strategic Messaging, Text Technologies, Software Memories, and the Monash Report. I hope to turn them back on shortly.

The sites should remain readable all the way through — unless, of course, there are more hosting company outages.

Categories: Other

Site reliability has been ghastly

DBMS2 - Mon, 2012-05-07 11:53

Unfortunately, we’ve had serious site outages over the past few days, as well as an increased frequency of shorter-term problems. My ordinarily excellent hosting company is going through a bad stretch, and I’ll have to move away from them. (As usual, I’ll rely on http://www.webhostingtalk.com for recommendations.)

When I pull the trigger on the move, there will be a short period when I turn off comments across all my blogs. I’ll post again here to announce when that is happening.

I apologize for the inconvenience.

Categories: Other

Relationship analytics application notes

DBMS2 - Mon, 2012-05-07 08:06

This post is part of a series on managing and analyzing graph data.

In my recent post on graph data models, I cited various application categories for relationship analytics. For most applications, it’s hard to get a lot of details. Reasons include:

  • In adversarial domains such as national security, anti-fraud, or search engine ranking, it’s natural to keep algorithms secret.
  • The big exception, influencer analytics (aka social network analysis), is obscured by a major hype/reality gap (so, come to think of it, is a lot of other predictive modeling).

Even so, it’s fairly safe to say:

  • Much of relationship analytics is about subgraph pattern matching.
  • Much of relationship analytics is about identifying subgraph patterns that are predictive of certain characteristics or outcomes.
  • An important kind of relationship analytics challenge is to identify influential individuals.

Notes on that middle point include:

  • Pattern identification could be done through trial-and-error visualization, through predictive modeling, or through any form of investigative analytics in between.
  • I presume what’s hardest about all this from a processing-performance standpoint would often be enumerating the subgraphs of a particular candidate pattern.

So I’m tempted to say “it’s all about subgraphs.” But it might be more accurate yet to say “It’s about paths”. Arguably, that’s saying the same thing; paths are subgraphs, and subgraphs are made up of paths, so a way of finding one is also a way of finding the other. But referring to paths nods to such standard tasks as:

  • Finding the shortest path between two nodes.
  • Calculating centrality metrics.

Paths are also simpler than subgraphs, and hence also simpler to think about.
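Both of those standard tasks are easy to sketch for a small in-memory graph. The code below uses breadth-first search for an unweighted shortest path and degree centrality as the simplest centrality metric; other metrics such as betweenness are costlier to compute. The adjacency-list layout and the toy graph are assumptions for illustration, and nothing here addresses graphs too large for one machine.

```python
from collections import deque


def shortest_path(adj, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (assumes >= 2 nodes)."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


# Toy undirected graph stored as symmetric adjacency lists.
adj = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

print(shortest_path(adj, "a", "e"))
print(degree_centrality(adj)["d"])
```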

Let’s drill down a bit more on the cases of influencer analysis and centrality. Telecom service providers around the world compete with relatively few of their peers (because they’re so geographically bound), and hence are pretty good about sharing technical ideas with each other. One application that has spread like wildfire is influencer analysis for churn control. The idea is to identify influential subscribers who, if they left your service, would be particularly likely to take other people with them, so that you can make great efforts to retain them. The key data used is CDRs (call detail records).

As in many things, it’s tough to separate influencer analysis adoption fact from fiction.

  • The telecom case is surely real; I’ve heard of many examples.
  • Social networking is a harder call. Top-down, the story sounds good; but bottom-up, I’m not so sure.*
  • I’m quite dubious about attempts to use influencer analysis based on, say, credit card records; the detailed information about person-to-person connections isn’t there.
  • National security clearly uses similar kinds of techniques, albeit for slightly different purposes.

Specific conclusions I’ve heard include:

  • Who calls you is a better predictor of whether you influence cellular subscribers to churn along with you than who you call.
  • Length of calls is an indicator of involvement or influence in terrorist networks (short ones suggest there’s serious business being done).
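As a rough illustration of the "who calls you" idea, the sketch below turns call detail records into a directed graph and ranks subscribers by how many distinct people call them. The record layout `(caller, callee, seconds)` is hypothetical, real CDRs carry many more fields, and counting inbound callers is only a crude proxy; it is not the model any telecom provider is known to use.

```python
from collections import defaultdict


def distinct_callers(cdrs):
    """Map each subscriber to the set of distinct subscribers who called them."""
    callers = defaultdict(set)
    for caller, callee, _seconds in cdrs:
        callers[callee].add(caller)
    return callers


def rank_by_inbound(cdrs):
    """Subscribers ordered by number of distinct inbound callers, then by name."""
    callers = distinct_callers(cdrs)
    return sorted(callers, key=lambda s: (-len(callers[s]), s))


# Hypothetical call detail records: (caller, callee, duration in seconds).
cdrs = [
    ("ann", "bob", 120),
    ("cat", "bob", 30),
    ("dan", "bob", 45),
    ("bob", "ann", 60),
    ("cat", "ann", 15),
    ("ann", "bob", 10),  # repeat call; counts once toward distinct callers
]

print(rank_by_inbound(cdrs))
```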

*For example my Klout profile asserts I’m more influential about Airlines than about Databases or Software. A bit of manual intervention could surely change that — which just serves to underscore my doubts about the effectiveness of social network analytic automation.

One more thing: relationship analytics on social networks rarely works unless you take out a few spurious highly-connected nodes. The paradigmatic example is the local pizza parlor, which receives many phone calls, but is neither a terrorist mastermind nor a major influence upon telecom service churn. More on that point when I write about the partitioning of large graphs.
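The pizza-parlor cleanup can be sketched as a simple degree filter applied before any further analysis. The threshold and the toy data are assumptions; in practice the cutoff would need tuning, and a plain degree test could also discard genuinely important nodes.

```python
from collections import Counter


def drop_hubs(edges, max_degree):
    """Remove every node whose degree exceeds max_degree, plus its edges.

    Returns the surviving edges and the set of removed hub nodes.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    hubs = {node for node, d in degree.items() if d > max_degree}
    kept = [(u, v) for u, v in edges if u not in hubs and v not in hubs]
    return kept, hubs


# Hypothetical call graph in which everybody phones the same pizza parlor.
edges = [
    ("ann", "pizza"), ("bob", "pizza"), ("cat", "pizza"), ("dan", "pizza"),
    ("ann", "bob"), ("cat", "dan"),
]

kept, hubs = drop_hubs(edges, max_degree=3)
print(kept, hubs)
```

With the hub gone, the remaining edges show two separate pairs of people rather than one cluster apparently held together by pizza orders.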

Categories: Other

Terminology: Relationship analytics

DBMS2 - Mon, 2012-05-07 08:05

This post is part of a series on managing and analyzing graph data.

In late 2005, I encountered a company called Cogito that was using a graphical data manager to analyze relationships. They called this “relational analytics”, which I thought was a terrible name for something that they were trying to claim should NOT be done in a relational DBMS. On the spot, I coined relationship analytics as an alternative. A business relationship ensued, which included a short white paper. Cogito didn’t do so well, however, and for a while the term “relationship analytics” faltered too. But recently it’s made a bit of a comeback, having been adopted by Objectivity, Qlik Tech, Yarcdata and others.

“Relationship analytics” is not a perfect name, both because it’s longish and because it might over-connote a social-network focus. But then, no other term would be perfect either. So we might as well stick with it.

In that case, “relationship analytics” could use an actual definition, preferably one a little heftier than just:

Analytics on graphs.

At the risk of sounding circular, I’ll try:

Relationship analytics is analytics that focuses upon relationships encoded in data.

Notes on that proposed definition include:

  • The more directly the relationships are encoded — for example by a node-edge-node graph data model — the more applicable the term is likely to be.
  • It can still be relationship analytics if the nodes of the graph are ultimately more important than the edges. The edges just have to be central — no pun intended — to the analytics.
  • “Analytics” is a vague term, and “relationship analytics” inherits the vagueness. That said, I think of relationship analytics as being more about investigative analytics than the operational kind.

So what do you think — does this definition of “relationship analytics” work?

Categories: Other

Notes on graph data management

DBMS2 - Fri, 2012-05-04 02:07

This post is part of a series on managing and analyzing graph data.

Interest in graph data models keeps increasing. But it’s tough to discuss them with any generality, because “graph data model” encompasses so many different things. Indeed, just as all data structures can be mapped to relational ones, it is also the case that all data structures can be mapped to graphs.

Formally, a graph is a collection of (node, edge, node) triples. In the simplest case, the edge has no properties other than existence or maybe direction, and the triple can be reduced to a (node, node) pair, unordered or ordered as the case may be. It is common, however, for edges to encapsulate additional properties, the canonical examples of which are:

  • Weight. Usually, the intuition here is that the weight is a number indicating the strength of the connection. This is generally derived from more basic data.
  • Kind. The edge can encapsulate one or more descriptors indicating the kind of relationship between the nodes.
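One way to picture that definition is as a small record type. The sketch below stores each (node, edge, node) triple with optional weight and kind properties plus a direction flag; the class and field names are assumptions for illustration rather than any particular product's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """One (node, edge, node) triple with optional edge properties."""
    source: str
    target: str
    weight: Optional[float] = None  # strength of the connection, if known
    kind: Optional[str] = None      # descriptor of the relationship
    directed: bool = True           # False reduces to an unordered pair


def neighbors(edges, node):
    """Nodes reachable from `node` in one hop, respecting edge direction."""
    out = set()
    for e in edges:
        if e.source == node:
            out.add(e.target)
        elif not e.directed and e.target == node:
            out.add(e.source)
    return out


edges = [
    Edge("ann", "bob", weight=0.8, kind="calls"),
    Edge("ann", "acme", kind="works_at"),
    Edge("bob", "cat", directed=False),
]

print(neighbors(edges, "ann"))
print(neighbors(edges, "cat"))
```

In the simplest case described above, `weight` and `kind` stay unset and the record collapses to a plain pair of nodes.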

Many of the graph examples I can think of fit into four groups:

  • Networks of people, aka social networks. Three (overlapping) areas of particular importance are:
    • People/communications.
      • One canonical example is influencer-finding in telecommunications customer bases. The nodes are subscribers; the edges are call details (raw or aggregated).
      • Other examples may be found as subgraphs of our next category, namely …
    • … people/places/things.
      • This is the classic structure for anti-terrorism, law enforcement, or anti-fraud use cases.
      • Nodes are people, buildings/addresses, cars, businesses, etc., except that …
      • … nodes can actually be ordered pairs (tangible thing, timestamp). After all, it’s more interesting if two people were, not just in the same place, but in the same place at the same time.
    • People/connections/recommendations.
      • Similarly, there are use cases in which various people have social network connections, and then also recommend products of some kind.
      • Edges can carry information about the evident strength of the social network connection …
      • … but also about apparent similarities in taste.
  • Graphs of IT objects. Various sets of conceptual IT objects can be viewed as graphs. For example:
    • I visited Workday recently. They refer to their Java object model as a “graph.”
    • Neo Technology (the neo4j guys) started out doing a content management system, and eventually decided that what they really wanted underneath it was a graph-oriented DBMS.
    • Now one of Neo’s major application areas is MDM (Master Data Management).
    • Most dramatically, there’s Tim Berners-Lee’s “Semantic Web”, which is built on RDF, which models things as “a directed, labeled graph”. SPARQL, OWL and so on are in the mix as well. To date, the Semantic Web has been a lot of hot air, only without the hot aspect; still, it’s obviously influenced many people’s thinking about graphs.
    • Edit: Please see Marie’s comment below for a rather major example I left out. :)
  • Taxonomies, ontologies, and/or semantic networks.
    • To a large extent this overlaps with my previous category …
    • … but I’m particularly fond of the example of straightforward taxonomies of words, e.g. WordNet. The nodes are the words themselves, or more precisely word senses (i.e., specific meanings of a word); edges are typically chosen from a limited set of alternatives such as is_a, is_part_of, or entails.
  • Finally, there are representations of physical graphs. Examples might include telecom networks, utility grids, or locations and routes for physical deliveries.
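To illustrate the taxonomy case, here is a minimal sketch of a word-sense graph whose edges are labeled with a small fixed set of relations such as is_a and is_part_of. The triples are toy data with made-up sense labels, and the code does not use WordNet's own files or API.

```python
def ancestors(triples, sense, relation="is_a"):
    """All senses reachable from `sense` by following `relation` edges upward."""
    parents = {}
    for child, rel, parent in triples:
        if rel == relation:
            parents.setdefault(child, set()).add(parent)
    seen = set()
    stack = [sense]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy (sense, relation, sense) triples; labels are illustrative only.
triples = [
    ("dog", "is_a", "canine"),
    ("canine", "is_a", "mammal"),
    ("paw", "is_part_of", "dog"),
]

print(ancestors(triples, "dog"))
print(ancestors(triples, "paw", relation="is_part_of"))
```

Restricting the traversal to a single edge kind is what makes the edge label do real work here, since following is_part_of and is_a edges answers different questions.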

My main reason for reciting these diverse examples is to illustrate that, for any really interesting technical discussion, it is necessary to focus on a subset of the possible use cases.

This post is intended to start a short series. When the next one goes up — focusing on a particular set of use cases :) — this footer will be edited accordingly.

Categories: Other

Big Data hype?

DBMS2 - Thu, 2012-05-03 04:17

A reporter wrote in to ask whether investor interest in “Big Data” was justified or hype. (More precisely, that’s how I reinterpreted his questions. :) ) His examples were Splunk’s IPO, Teradata’s stock price increase, and Birst’s financing. In a nutshell:

  • My comments, lightly edited, are in plain text below.
  • Further thoughts are in italics.
  • Of course I also linked him to my post “Big Data” has jumped the shark.
  • Overall, my responses boil down to “Of course there’s some hype.”

1. A great example of hype is that anybody is calling Birst a “Big Data” or “Big Data analytics” company. If anything, Birst is a “little data” analytics company that claims, as a differentiating feature, that it can handle ordinary-sized data sets as well.

When I checked Birst’s website, “Big Data” was nowhere to be found. On the other hand, the term was all over its press pitch for the financing.

2. The great growth in database sizes is both caused and balanced out by Moore’s Law. The net effect is healthy but not enormous growth in the overall data management and analytics markets.

I’ve made versions of that point many times before.

3. Incumbent data and analytic technology vendors such as Oracle, IBM, and Microsoft are vulnerable, but are competing very hard. Favorable exits have ensued for companies such as Netezza, DATAllegro, Vertica, and Aster Data.

The connection between those two points is that the big companies will hold a lot of share, but part of how they’ll hold it is through acquisitions. For example, IBM, Microsoft, HP, Teradata, and EMC all bought newish analytic RDBMS vendors, at an aggregate cost of several billion dollars. And SAP bought Sybase.

But while there have been billions of dollars in fairly recent analytics-related acquisitions, the pace of acquisition would have to accelerate much further yet to justify current valuations.

Upon reflection, I may have overestimated the acquisition/IPO total-value-created ratio somewhat. Even so, what’s the last enterprise technology vendor to create huge investor value by going public, continuing to prosper, and so on? Red Hat and Autonomy may be as good as it gets. VMware isn’t really an example, because of its ownership structure.

4. I’m worried that people may be overestimating the business benefit of accurate analytics, great though that value truly is. For example, it’s not plausible that all enterprises in the world can use better analytics to improve their respective market shares at the same time.

Yes, it’s great to be an arms dealer to all sides. But “Big Data” technology is just another chapter in the ever-growing importance of IT.

Categories: Other

Thinking about market segments

DBMS2 - Tue, 2012-05-01 05:00

It is a reasonable (over)simplification to say that my business boils down to:

  • Advising vendors what/how to sell.
  • Advising users what/how to buy.

One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let’s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.

Last June I wrote:

In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.

I’d further say that it matters whether the buyer:

  • Is a large central IT organization.
  • Is the well-staffed IT organization of a particular business department.
  • Is a small, frazzled IT organization.
  • Has strong engineering or technical skills, but less in the way of IT specialists.
  • Is trying to skate by without much technical knowledge of any kind.

Now let’s map those considerations (and others) to some specific market segments.

  • Traditional large enterprises’ central IT organizations commonly:
    • Favor large, proven vendors and well-accepted IT methodologies.
    • Would like to consolidate their IT vendors as much as possible.
    • Have major challenges with legacy systems and data integration …
    • … which are often exacerbated by mergers.
    • Spend a lot of cycles on bureaucracy and company politics.
    • Notwithstanding the foregoing, have resources to invest in some “sizzle” initiatives.
  • The very largest enterprises are more likely than their slightly smaller counterparts to:
    • View IT as a potential area of competitive differentiation.
    • Believe much of what they do should be custom, due to their unique needs and resources.
    • Experiment with unproven technologies.
  • Smaller enterprises may:
    • Have small, generalist, overwhelmed staffs.
    • Hope for turnkey application solutions (SaaS or otherwise).
    • Get very committed to/reliant on a small number of vendors.
  • In particular, IBM or Microsoft loyalists can be:
    • Extremely locked into their preferred vendor’s strategies.
    • Not very fruitful for rival vendors to attempt to sell to.
  • Humongous consumer internet companies tend to:
    • Have very high opinions of themselves and their technical abilities.
    • Be open source zealots, for reasons both of free-like-beer and free-like-speech.
    • In particular, not want to buy anybody else’s software.
    • Not be big fans of relational database designs.
  • Other large consumer internet companies tend to:
    • Be like the humongous ones they look up to, but maybe not to the same extremes.
    • In particular, be more willing to pay for software.
    • Be mired in company politics only/mainly to the extent they are both large and old(er).
  • Smaller consumer internet companies tend to:
    • Be like the large ones they look up to, but …
    • … be quite short on traditional IT skills, and work around that shortage by reinventing various wheels.
  • Business-oriented SaaS (Software as a Service) companies commonly:
    • Are drawn to the cool open source technologies consumer internet companies use …
    • … but may wind up using more traditional kinds of DBMS, for the same reasons those DBMS are used in other business applications.
    • Are more primitive in the analytic capabilities they offer their customers than I think they should be (analytics-only vendors sometimes excepted).
    • Are refreshingly free of traditional IT politics, because technology is too important to them to mess around with too badly. (Of course, any other kinds of company politics may still come into play.)
  • Internet operations of traditional enterprises:
    • Sometimes are just like stand-alone internet businesses.
    • Sometimes are just like — and part of — the rest of the enterprise’s IT operations.
    • More commonly are somewhere in between.
  • Marketing departments of traditional enterprises sometimes:
    • Want to do their own data acquisition, management, and/or analysis …
    • … without having great IT resources of their own.
    • Invest in departmental analytics efforts or even …
    • … have line executives who are analytically proficient.
    • Make heavy use of SaaS, as an alternative to relying on central IT, or as a natural byproduct of acquiring third-party data.
  • Large investment firms commonly:
    • Have numerous departments, each with its own IT experts.
    • Care about sub-millisecond latency …
    • … and sub-week time-to-value.
    • Experience return-on-investment in a very different way than most businesses do.
  • Telecom service companies commonly differ from other similarly-sized enterprises in that:
    • They are more aggressive about using innovative technology to manage (and analyze) data.
    • Somewhat resemble investment firms in having multiple departments that each have broad engineering discretion.
  • National security customers often:
    • Want the best, cutting-edge, sometimes custom technology, yet …
    • … make themselves very cumbersome to sell to and support.
    • Are not forthcoming about how they use what they buy.

I could keep going for quite a while — but for now I won’t. Vertical markets I’m thus omitting include but are not limited to:

Finally, for yet another omission — in my original outline I contemplated distinguishing among various geographical areas, with my first-pass segmentation being:

  • North America
  • Europe
  • Japan
  • China
  • Smaller geographies
Categories: Other

Notes on the Hadoop and HBase markets

DBMS2 - Tue, 2012-04-24 02:40

I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:

  • Cloudera now has 220 employees.
  • Cloudera now has over 100 subscription customers.
  • Over the past year, Cloudera has more than doubled in size by every reasonable metric.
  • Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
  • Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
  • Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
  • Cloudera has trained over 12,000 people.
  • Hortonworks is training people too.
  • Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
  • A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
  • Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
  • There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
  • I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
  • There should be more HBase information at HBaseCon on May 22.
  • If MapR still has momentum, nobody I talked with has noticed.
Categories: Other

Three quick notes about derived data

DBMS2 - Tue, 2012-04-24 01:43

I had one of “those” trips last week:

  • 20 meetings, a number of them very multi-hour.
  • A broken laptop.
  • Flights that arrived 10:30ish Sunday night and left 7:00 Saturday morning.

So please pardon me if things are a bit disjointed …

I’ve argued for a while that:

Here are a few notes on the derived data trend.

He doesn’t generally use the term, but a big proponent these days of the derived data story is Hortonworks founder/CTO Eric Baldeschwieler, aka Eric 14. Eric likes to position Hadoop as a “data refinery”, where — among other things — you transform data and do “iterative analytics” on it. And he’s getting buy-in; for example, that formulation was prominent in the joint Teradata/Hortonworks vision announcement.

The KXEN guys don’t use the term “derived data” much either, but they tend to see the idea as central to predictive modeling even so. The argument in essence is that traditional predictive modeling consists of three steps:

  1. Think hard about exactly which variables you want to model on.
  2. Do transformations on those variables so that they fit into your favored statistical algorithm (commonly linear regression, although KXEN favors nonlinear choices).
  3. Press a button to run the algorithm.

#3 is the most automated part, and #1 is what KXEN thinks its technology makes unnecessary. Hence #2, they suggest, is often the bulk of the modeling effort — except now they want to automate that away too.

And then there are my new clients at MarketShare, a predictive modeling consulting company focused on marketing use cases, which also has a tech layer (accelerated via the acquisition of JovianDATA). It turns out that a typical MarketShare model is fed by a low double-digit number of other models, each of which is doing some kind of data transformation. The final step is typically a linear regression, yielding coefficients of the sort that marketers recognize and (think they) understand. Earlier steps are typically transformations on individual variables. I didn’t see many examples, but the transformations clearly go beyond the traditional rescaling — log, log (x/(1-x)), binning, whatever — to involve multiplication by what could be construed as other variables. I.e., there seemed to be a polynomial flavor to the whole thing.
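To make the "transform first, regress last" idea concrete, here is a minimal Python sketch. It is not MarketShare's or KXEN's actual pipeline; the function names, the bin edges, and the toy spend/sales numbers are all hypothetical, chosen only to show the kinds of transformations named above (log, log(x/(1-x)), binning, multiplication by another variable) feeding a final linear fit.

```python
import math

def log_transform(x):
    """Classic rescaling: natural log of a positive variable."""
    return math.log(x)

def logit(p):
    """log(x / (1 - x)) for a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def bin_index(x, edges):
    """Return the index of the first bin whose upper edge exceeds x."""
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

def interaction(x, z):
    """Multiply one variable by another (the 'polynomial flavor')."""
    return x * z

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical toy data: sales respond to the log of spend, not to raw spend.
spend = [10.0, 100.0, 1000.0, 10000.0]
sales = [3.0, 5.0, 7.0, 9.0]

derived_spend = [log_transform(s) for s in spend]   # the derived variable
slope, intercept = fit_line(derived_spend, sales)   # final linear regression
```

The point of the sketch is only that the regression at the end stays simple and readable, while the modeling effort lives in the derived variables that feed it.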

Categories: Other

Now Available on iTunes: Free Fishbowl Solutions Collaborate12 Mobile Application


Fishbowl is pleased to announce our official Collaborate12 mobile application is now available for iPad and iPhone from the iTunes Store. This free App includes information about all things Fishbowl at Collaborate12, the Oracle user group conference being held April 22-26, at Mandalay Bay in Las Vegas.

Download today to check out: 

  • Session Schedule
  • Booth Info
  • Presentations Downloads
  • Whitepapers Downloads
  • Expo Hall Map
  • Featured Solutions
Download Now
Filed under: Uncategorized
Categories: Fusion Middleware, Other

@JRSim_UIX at Collaborate 2012 Vegas #C12LV

Flying on out tomorrow from sunny Minneapolis.
It’ll be my first time over at Collaborate, presenting to an external audience, and in Vegas.

And as it’s Vegas, I hear you don’t do things small out there… So I have 3 presentations ready to roll out as part of the 14 we have at Fishbowl. That’s a lot to keep track of, so to help you we’ve created a mobile app that gives you an overview of the sessions, options to download and view our whitepapers and presentations, and a map to help you navigate around the booths and find us. Download the Collaborate Mobile App

 

I’ll be over at booth #1178, so don’t be shy, come say “hello”. I’ll be there talking about:

  • User eXperience and the journey through multiple devices, from handhelds to desktops.
  • Responsive design pros and cons, and how you can transform a WebCenter ADF Portal template to make it responsive.
  • Mobility: ADF Mobile, native apps through to HTML5, and other integration opportunities with PhoneGap-Callback Cordova to help create adaptive cross-device and cross-OS apps, plus what are currently the good and bad options when developing with these techniques and technologies.
  • Using WebSockets with WebCenter to create a true realtime collaborative environment, integrating with BPM, UCM workflow or any system for messaging, and getting away from email clutter by bringing in collaborative XMPP services or Lync via its XMPP gateway.

I’ll also be discussing future innovative concepts, such as voice enablement via speech recognition engines for your web apps; integration of multitouch events, current requirements for touch gestures, and where we can take interaction to the next level; how we can now leverage the Kinect for spatial gesture interaction; and bringing in 3D elements to innovate how we interact with content, especially with Digital Asset Management systems like WebCenter Content.

If you’re interested in seeing which sessions we are presenting, check out this earlier post.

A great deep dive will also be happening tomorrow, so be sure to jump in and check it out.


Filed under: Collaboration, Content Management, Enterprise Search, Labs, Mobile, UCM, UX, WebCenter
Categories: Fusion Middleware, Other

WebCenter Performance Tip: Bug fix for file lock contention on /cs/data/alerts directory

At a current customer of ours running a WebCenter Spaces-based intranet with WebCenter Content (UCM) as the backend, we started running into some strange performance issues. The customer is running on Patch Set 4 (11.1.1.5.0). The site would slow way down at specific times during the day and become unusable for 5-minute periods. It would then clear itself up and be usable again.

During this time the user load on the site would not change so something strange had to be going on.  To diagnose the problem we turned up the system audit tracing in UCM to Full Verbose and looked at the following sections: systemdatabase, searchquery, requestaudit, file*.

The main symptom that we were first seeing was that during the times when the server became unresponsive there would be upwards of 40-50 active database connections for UCM, while the average is usually less than 5. We also saw a lot of long-running queries on the System Audit page that the DBAs told us were happening in milliseconds on the database side.

Looking at the tracing output we were able to confirm that the database queries were only taking milliseconds or even fractions of milliseconds.  So what was keeping the connections open?

A closer look at the output yielded the following for each and every request:

>filelock/7 04.09 09:48:06.221 IdcServer-140003 Reserving /proj/wcsshare/ucm/cs/data/alerts/
>filelock/6 04.09 09:48:06.227 IdcServer-140003 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Locked directory
>filelock/6 04.09 09:48:06.227 IdcServer-140003 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Released

So there was a process that was locking a directory for writing on every request.  Continuing on it became apparent this is what was keeping the connections open.

>filelock/7 04.09 09:48:07.652 IdcServer-140025 Reserving /proj/wcsshare/ucm/cs/data/alerts/
>filelock/7 04.09 09:48:07.653 IdcServer-140015 Lock bounce on loop 2
>filelock/6 04.09 09:48:07.653 IdcServer-140015 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Next sleep interval is 720
>filelock/7 04.09 09:48:07.654 IdcServer-140025 Lock bounce on loop 0
>filelock/6 04.09 09:48:07.654 IdcServer-140025 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Next sleep interval is 120
>filelock/6 04.09 09:48:07.705 IdcServer-139986 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Locked directory
>filelock/6 04.09 09:48:07.707 IdcServer-139986 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Released

As you can see in the above output, there was contention between threads on the same UCM managed server (not to mention the other 3 nodes in the cluster) for locking on that directory. Since the lock attempt was happening on a filter in the service call, it was keeping the database connection open until it could get the lock and release it.
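If you want to measure this kind of contention rather than eyeball it, a small script can tally the "Lock bounce" messages per thread. The sketch below is a hypothetical helper, not part of UCM; it assumes the trace layout seen in the excerpt above (section/level, what looks like a date, a time, the thread name, then the message).

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt:
#   >section/level date time thread message
TRACE_LINE = re.compile(
    r"^>(?P<section>\w+)/(?P<level>\d+)\s+(?P<date>\S+)\s+(?P<time>\S+)"
    r"\s+(?P<thread>\S+)\s+(?P<message>.*)$"
)

def count_lock_bounces(lines):
    """Count 'Lock bounce' filelock messages per thread."""
    bounces = Counter()
    for line in lines:
        match = TRACE_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a trace line
        if match.group("section") == "filelock" and "Lock bounce" in match.group("message"):
            bounces[match.group("thread")] += 1
    return bounces

# The second trace excerpt from the post, verbatim.
sample_trace = [
    ">filelock/7 04.09 09:48:07.652 IdcServer-140025 Reserving /proj/wcsshare/ucm/cs/data/alerts/",
    ">filelock/7 04.09 09:48:07.653 IdcServer-140015 Lock bounce on loop 2",
    ">filelock/6 04.09 09:48:07.653 IdcServer-140015 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Next sleep interval is 720",
    ">filelock/7 04.09 09:48:07.654 IdcServer-140025 Lock bounce on loop 0",
    ">filelock/6 04.09 09:48:07.654 IdcServer-140025 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Next sleep interval is 120",
    ">filelock/6 04.09 09:48:07.705 IdcServer-139986 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Locked directory",
    ">filelock/6 04.09 09:48:07.707 IdcServer-139986 /proj/wcsshare/ucm/cs/data/alerts/--<no-agent>--Released",
]

bounce_counts = count_lock_bounces(sample_trace)
```

Run against a full trace file, a high bounce count concentrated in the slow periods would point at the same directory lock.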

We reached out to some contacts in Oracle development and it turned out that the issue was already fixed and in an available OPatch!  The patch in question for PS4 is 13502977: UCM 11.1.1.5.0 patch (20111230-1105). Looking at the README reveals the following:

2011-05-31 15:34:45 - Core - idcsvn:92934 - Bug 11781402-Moved locking of alerts directory into the if condition. This reduces the unncessary locking of the directory when there is no change in alerts.hda file.
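The general shape of that fix, taking the lock only when something has actually changed, can be sketched in a few lines of Python. This is purely illustrative and is not Oracle's code: the class, the method names, and the in-process `threading.Lock` (standing in for UCM's directory lock) are all assumptions.

```python
import threading

class AlertsCache:
    """Hypothetical stand-in for state guarded by the alerts directory lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._loaded_version = None
        self.alerts = None
        self.lock_acquisitions = 0  # instrumentation for the demo only

    def refresh_always_locking(self, current_version, load):
        # "Before" pattern: every request takes the lock, changed or not.
        with self._lock:
            self.lock_acquisitions += 1
            if current_version != self._loaded_version:
                self.alerts = load()
                self._loaded_version = current_version

    def refresh_lock_if_changed(self, current_version, load):
        # "After" pattern: the lock moves inside the if condition.
        if current_version != self._loaded_version:
            with self._lock:
                self.lock_acquisitions += 1
                # Re-check under the lock in case another thread got here first.
                if current_version != self._loaded_version:
                    self.alerts = load()
                    self._loaded_version = current_version

def load_alerts():
    return ["example alert"]  # hypothetical payload

before = AlertsCache()
after = AlertsCache()
for _ in range(100):  # 100 requests, alerts file unchanged throughout
    before.refresh_always_locking("v1", load_alerts)
    after.refresh_lock_if_changed("v1", load_alerts)
```

With an unchanged file, the first variant takes the lock on all 100 requests and the second takes it once, which is the reduction in unnecessary locking the README describes.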

Moral of the story:  If you are on PS4 or earlier of WebCenter Content (UCM), make sure you apply the latest OPatch with this fix.  If you are on PS5 or later, the fix is already included.


Filed under: 11g, Content Management, Portal, tips, UCM, WebCenter Tagged: performance bug fix opatch
Categories: Fusion Middleware, Other

Many kinds of memory-centric data management

DBMS2 - Sat, 2012-04-07 19:33

I’m frequently asked to generalize in some way about in-memory or memory-centric data management. I can start:

Getting more specific than that is hard, however, because:

  • The possibilities for in-memory data storage are as numerous and varied as those for disk.
  • The individual technologies and products for in-memory storage are much less mature than those for disk.
  • Solid-state options such as flash just confuse things further.

Consider, for example, some of the in-memory data management ideas kicking around.

  • In many cases there is essentially an in-memory DBMS, trying for as much ACIDity as RAM reasonably allows, then (usually) also copying data synchronously to persistent storage. These can have many different architectures. For example:
    • SAP HANA is an in-memory columnar DBMS, with text indexing/inverted-list antecedents, except when it uses one of a couple of approaches to in-memory row-based data management.
    • solidDB, now an IBM product, is an RDBMS that relies on Patricia tries. It is actually a hybrid memory/disk product, but optimized for in-memory operation.
    • eXtremeDB is an OODBMS, but relies on B-trees.
    • H-Store and its commercialization VoltDB are row-based RDBMS that make drastic assumptions about the nature of your workload, but in return get to drop much of the overhead other DBMS need.
    • Oracle TimesTen is a row-based RDBMS, oriented to OLTP (OnLine Transaction Processing), which stores its data persistently via another RDBMS. (MySQL was the default choice before Oracle bought the company.)
    • Oracle’s answer to SAP HANA is to take TimesTen and do analytics on it, via the Exalytics appliance.
  • Some disk-based DBMS just happen to be architected in ways so that for good performance you’re going to want to keep all the data in RAM. Often, their in-memory architecture is a lot like their on-disk architecture, with memory mapping for I/O. This is done in very different kinds of DBMS.
    • MongoDB is one visible example. In general, scale-out web databases (whether NoSQL or MySQL) often keep all their data in RAM, whether or not that plan is baked into the DBMS architecture.
    • Various analytic DBMS vendors have at times been memory-oriented. At the moment, I think:
  • My last technical briefing on Applix TM1 (now an IBM Cognos product) was in September, 2005. (The product itself dates back to 1984.) At the time TM1 had an interesting sparse MOLAP (Multi-Dimensional OnLine Analytic Processing) story, the point being that the system worked hard to isolate what was actually non-zero. Loading of raw data seemed to be batch, but you could update models with derived data, and there was a transaction log for confident persistence.
  • Alternatively, you can use a caching layer, typically on a separate set of servers from your DBMS, which has no responsibility for managing data persistence. For example:
    • TimesTen and solidDB are used, respectively, as relational caches for Oracle and DB2.
    • Peter Zencke told me years ago that SAP had a purpose-built caching layer that kept over 99% of requests from touching disk.
    • The key-value store memcached is central to many of the world’s largest web sites, typically backed by a MySQL cluster.
    • ScaleArc has a key-value cache that stores — rather than individual records — the entire TCP string sent by an RDBMS in response to a particular SQL query.
  • Some systems manage data in memory in one kind of structure, then ensure persistence via a very different structure on disk. Examples include:
    • Workday’s architecture — object-oriented in RAM, MySQL (really key-value) on disk. Edit: Workday thinks “key-value” is a slightly misleading way to put it. Stay tuned for more.
    • Oracle Coherence (formerly Tangosol) — object-oriented in RAM, Oracle on disk. Edit: Actually, Coherence isn’t really a write-through ORM (Object-Relational Mapper). It functions more like memcached, albeit with a very different data model.
    • Couchbase — memcached (key-value) in-memory, evolving from SQLite to CouchDB on disk.
  • Similarly, business intelligence suites can manage data in-memory that comes from some other kind of data store (usually an RDBMS, sometimes Hadoop or whatever). I haven’t had a lot of luck in getting details, with one exception — QlikView, which uses a simple tabular data structure.
  • Stream processors — i.e. CEP engines — are a whole other sort of in-memory engine, doing something that’s a lot like data management.
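
As a rough illustration of the caching-layer idea in the list above (a cache with no responsibility for persistence, keyed by the query itself rather than by individual records), here is a minimal Python sketch. It is not how ScaleArc or memcached is implemented; the class name is hypothetical, and an in-memory SQLite database stands in for the backing RDBMS.

```python
import sqlite3

class QueryResultCache:
    """Hypothetical cache in front of a DBMS, keyed by SQL text plus parameters."""

    def __init__(self, connection):
        self._conn = connection
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key in self._cache:
            self.hits += 1
            return self._cache[key]           # served from RAM, DBMS untouched
        self.misses += 1
        rows = self._conn.execute(sql, params).fetchall()
        self._cache[key] = rows               # store the whole result set
        return rows

    def invalidate(self):
        """Drop everything; persistence stays the backing DBMS's job."""
        self._cache.clear()

# SQLite stands in for the "real" RDBMS behind the cache.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada')")

cache = QueryResultCache(conn)
first = cache.query("SELECT name FROM users WHERE id = ?", (1,))   # miss
second = cache.query("SELECT name FROM users WHERE id = ?", (1,))  # hit
```

A real deployment would also need an invalidation policy for writes; the sketch leaves that as a blunt `invalidate()` call.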

And that, kiddies, is why I hesitate to generalize in too much detail about “in-memory database management.”

Despite its length, this is still a very partial list of memory-centric data management approaches. I encourage you to use the comments to add other examples that I might have left out.

Related link

  • I did a simpler overview of memory-centric alternatives in 2005.
Categories: Other

Human real-time

DBMS2 - Thu, 2012-04-05 21:12

I first became an analyst in 1981. And so I was around for the early days of the movement from batch to interactive computing, as exemplified by:

  • The rise of minicomputers as mainframe alternatives (first VAXen, then the ‘nix systems that did largely supplant mainframes).
  • The move from batch to interactive computing even on mainframes, a key theme of 1980s application software industry competition.

Of course, wherever there is interactive computing, there is a desire for interaction so fast that users don’t notice any wait time. Dan Fylstra, when he was pitching me the early windowing system VisiOn, characterized this as response so fast that the user didn’t tap his fingers waiting.* And so, with the move to any kind of interactive computing at all came a desire that the interaction be quick-response/low-latency.

*That was well put. Unfortunately, VisiOn didn’t meet Dan’s standard, which is a big part of why VisiCorp wound up on the ash heap of history.

Once again, we’re in an era that features:

  • A move from batch to interactive computing.
  • Users’ desire for zero-wait interactions.

The two big examples I have in mind for a batch-to-interactive trend are:

My top examples for zero-wait interactions are:

  • “Speed of thought” business intelligence.
  • Anything to do with consumer web page response times.

Let me be clear about confessing something — I’m conflating two different kinds of low latency, namely database freshness and user interface response. My two main reasons are:

  • If you want to make decisions based on fresh data, you probably don’t want to take a long time making them.
  • If you care enough about an analytic problem to repeatedly query a database, then you probably would like the database to be as fresh as possible.

I’ve been conflating those two things at least since I first came up with the speed of a turtle vs. speed of light analogy.

But how should we refer to more-or-less-immediate computing? The term “interactive” has long been played out. “Real-time” has definitional issues, as captured in the Wikipedia passage:

Real-time programs must guarantee response within strict time constraints. Often real-time response times are understood to be in the order of milliseconds and sometimes microseconds. In contrast, a non-real-time system is one that cannot guarantee a response time in any situation, even if a fast response is the usual result.

The use of this word should not be confused with the two other legitimate uses of real-time, which in the domain of simulations, means real-clock synchronous, and in the domain of data transfer, media processing and enterprise systems, the term is intended to mean without perceivable delay.

Similar problems adhere to a term I nonetheless sometimes use, namely “quasi real-time”.

The Sumo Logic guys propose an interesting alternative: human real-time. Billy Bosworth recently emailed me with a similar idea, from a conference panel that obviously struck a nerve. I like it, because it conveys the impression:

  • Effectively real-time from a human perspective …
  • … but not necessarily from a machine standpoint.

So am I overlooking some drawback to the term? If not, I’m going to start using “human real-time” to mean something like fast enough that humans don’t perceive an annoying lag.

Categories: Other

Clarifying IBM DB2 Express-C crippleware

DBMS2 - Wed, 2012-04-04 20:43

When Conor O’Mahony briefed me about DB2 10, he kept commenting that cool features he was talking about could be found in all editions of DB2, even the free one. So I asked what the limitations were on free DB2. He researched the matter and got back to me — and they sounded like what appeared to have been the limits when free DB2 was first introduced, over 6 years ago.

I tweeted about this, and was very fortunate that Ian Bjorhovde spoke up and said it wasn’t correct. Some scrambling ensued. It seems that the main sources of error were:

  • People tend to confuse DB2 Express and DB2 Express-C; only the latter is free.
  • What IBM said about the limitations of DB2 Express-C upon its introduction 6 years ago should not be interpreted in line with what a plain reading might suggest.

In particular, we shouldn’t take IBM’s repeated 2006 statements that

DB2 Express-C may be deployed on …  on AMD or Intel x86 systems with up to 2 dual-core chips. 4 GB of memory is the maximum supported.

to mean that you were ever allowed to use DB2 Express-C with 4 cores, nor with 4 GB of RAM.

To clarify things, Conor sent over an email with permission to quote, as follows:

DB2 Express-C (C for Community) was introduced in 2006. There is no charge for DB2 Express-C. When originally introduced, DB2 Express-C could use up to 2 processor cores and 2GB of RAM. As of 3 April 2012, memory entitlements were increased and DB2 Express-C can now use up to 4GB of RAM. There is no database size limit, no limit on the number of instances or databases per server, and no restriction on the number of users. The supported platforms include: Linux [x86 (32-bit), x86_64 (32- and 64-bit), PPC64 (POWER 64-bit)] and Windows [x86 (32-bit), x86_64 (64-bit)].

If you want to increase processors, memory, or supported features, DB2 Express is available at a cost that is either based on processors or users. When originally introduced, DB2 Express could use up to 2 processor cores and 4GB of RAM. In 2007, DB2 Express processor entitlements were increased to 4 processor cores. As of 3 April 2012, memory entitlements were increased and DB2 Express can now use up to 8GB of RAM.

So, to summarize: the “no charge” version is called DB2 Express-C can use up to 2 processor cores and 4GB of memory; the lowest chargeable edition is DB2 Express which can use up to 4 processor cores and 8GB of memory.

DB2 Express-C includes features like Time Travel Query, pureXML, Graph Store, Spatial Extender, SQL Compatibility, and Backup Compression. DB2 Express adds features like Label-Based Access Control and Row & Column Access Control. In addition, you can purchase the High Availability Disaster Recovery (HADR) feature for use with DB2 Express.

So if I’m reading that correctly, the real story is:

  • For over 6 years, the ceiling on DB2 Express-C didn’t go up at all.
  • As of this week, you can use twice as much RAM as you could 6 years ago, but still the same number of cores.

“Generous” is not the word.

Categories: Other

IBM DB2 10

DBMS2 - Tue, 2012-04-03 22:46

Shortly before Tuesday’s launch of DB2 10, IBM’s Conor O’Mahony checked in for a relatively non-technical briefing.* More precisely, this is about DB2 for “distributed” systems, aka LUW (Linux/Unix/Windows); some of the features have already been in the mainframe version of DB2 for a while. IBM is graciously permitting me to post the associated DB2 10 announcement slide deck.

*I hope any errors in interpretation are minor.

Major aspects of DB2 10 include new or improved capabilities in the areas of:

  • Compression.
  • Analytic query performance.
  • Data ingest.
  • Multi-temperature data management.
  • Workload management.
  • Graph management/relationship analytics.
  • Time-travel, bitemporal features, and bitemporal time-travel.

Of course, there are various other enhancements too, including to security (fine-grained access control), Oracle compatibility, and DB2 pureScale. Everything except the pureScale part is also reflected in IBM InfoSphere Warehouse, which is a near-superset of DB2.*

*Also, the data ingest part isn’t in base DB2.

The most remarkable claims Conor made were in the area of compression. Previously, IBM claimed 2.2-3X compression as typical, with 7X as best case. But as is (approximately) illustrated by Slide 12, IBM now says 7X is typical, with 4-10X being a realistic range and 45X having been the best case to date. Apparently, the DB2 compression strategy is now:

  • Keep the old DB2 compression scheme, which is dictionary compression across the top 4096 values in a table or range partition. Notably, that compression …
  • … extends to indexes, temp space, and so on, as well as the data itself.
  • Add a similar page-level compression scheme. Other than saying it too was dictionary-based, Conor didn’t give details.
  • Have some automation determining which values are compressed table-wide and which are compressed at the page level.
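
For readers who want the mechanics, table-wide dictionary compression can be sketched as follows. This is a toy Python version, not DB2's implementation; the function names and sample data are hypothetical. The one number taken from the description above is the 4096-entry dictionary, which means a code fits in 12 bits.

```python
from collections import Counter

def build_dictionary(values, max_entries=4096):
    """Map the most frequent values to small integer codes (4096 = 2**12)."""
    counts = Counter(values)
    return {
        value: code
        for code, (value, _count) in enumerate(counts.most_common(max_entries))
    }

def encode(values, dictionary):
    """Replace dictionary values with codes; leave everything else raw."""
    return [
        ("code", dictionary[v]) if v in dictionary else ("raw", v)
        for v in values
    ]

def decode(encoded, dictionary):
    """Invert encode() using the same dictionary."""
    reverse = {code: value for value, code in dictionary.items()}
    return [
        reverse[payload] if kind == "code" else payload
        for kind, payload in encoded
    ]

# Hypothetical column of city names; a tiny dictionary keeps the demo readable.
cities = ["Boston", "Cambridge", "Boston", "Boston", "Austin", "Cambridge"]
dictionary = build_dictionary(cities, max_entries=2)
encoded = encode(cities, dictionary)
restored = decode(encoded, dictionary)
```

A page-level scheme would presumably build a similar dictionary per page instead of per table, so that values common only within one page still get short codes.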

Those numbers are pretty bold claims for dictionary compression, especially in a row-based system.* The two special features I can think of in IBM’s compression that might allow it to outdo other dictionary schemes are:

  • You can compress multiple columns at once. (The canonical example is different fields in an address.)
  • (If I understood Conor correctly) You can also compress substrings within a column, or across columns.

*Row-based vs. columnar doesn’t matter for table-wide dictionary compression, but it does for page-level; the more comparable values you have per page, the better your chances to compress.

IBM claims consistent 3X query performance improvements on a variety of (non-published) benchmarks, with occasional examples of much higher figures. If the compression claims are really true, they could explain much of the query speed-up right there. Beyond that, the associated feature list is on Slides 7 and 8. The feature Conor called out was pre-fetching of indexes, which makes good index organization less important (Slide 9), which hence means DBAs have to worry less about index maintenance.

Prior to DB2 10, it appears that data ingest was through a single core, and it required the core to be dedicated. Now data ingest is just one more task that can be parallelized, workload-managed, and so on. It would seem that the biggest relevance of this feature is when data is being streamed from a transactional system — which is of course what you want to do whenever practical, versus the batch-load alternative.*

*My first clue for that was the feature name “real-time data warehousing.”

IBM DB2 10 introduces the beginnings of multi-temperature data management. That is, you can have different ranges in the same range-partitioned table be on different classes of storage — solid-state, faster disks, slower disks, whatever.

DB2’s workload management as described by Conor sounds more primitive than what Tim Vincent told me about a year and a half ago. Probably it’s just a difference in emphasis or something. Anyhow, DB2 workload management:

  • Newly sets limits on CPU consumed by certain workloads, rather than just divvying up CPU resources.
  • Doesn’t manage I/O or RAM.
  • Newly works on its own, rather than relying on underlying operating systems.
  • Takes the “temperature” of data (or type of storage it’s on?) into account as part of workload prioritization.

IBM is introducing both time-travel and bitemporal capabilities, but we didn’t spend much, um, time on them. “Time-travel” means you can do queries on the state of the database as of some previous date. “Bitemporal” means data can have effective dates, i.e., dates on which the recorded fact (e.g. insurance coverage) begins or ceases to be true.
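
To see how the two time dimensions interact, here is a small Python sketch. It is not DB2 syntax or DB2's storage model; the `PolicyVersion` record, its field names, and the insurance history are hypothetical, loosely following the insurance-coverage example above.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # open-ended period marker (an assumption of this sketch)

@dataclass(frozen=True)
class PolicyVersion:
    policy_id: int
    coverage: str
    effective_from: date  # business time: when the fact starts being true
    effective_to: date    # exclusive
    recorded_from: date   # system time: when the database held this version
    recorded_to: date     # exclusive

def as_of(rows, policy_id, business_date, system_date):
    """What did the database say on system_date about business_date?"""
    return [
        r for r in rows
        if r.policy_id == policy_id
        and r.effective_from <= business_date < r.effective_to
        and r.recorded_from <= system_date < r.recorded_to
    ]

history = [
    # Recorded 2012-01-01: basic coverage for all of 2012.
    PolicyVersion(1, "basic", date(2012, 1, 1), date(2013, 1, 1),
                  date(2012, 1, 1), date(2012, 3, 1)),
    # Correction recorded 2012-03-01: basic through June, premium from July.
    PolicyVersion(1, "basic", date(2012, 1, 1), date(2012, 7, 1),
                  date(2012, 3, 1), FOREVER),
    PolicyVersion(1, "premium", date(2012, 7, 1), date(2013, 1, 1),
                  date(2012, 3, 1), FOREVER),
]

# Same business date (1 August 2012), two different "as of" system dates.
believed_in_february = as_of(history, 1, date(2012, 8, 1), date(2012, 2, 1))
believed_in_april = as_of(history, 1, date(2012, 8, 1), date(2012, 4, 1))
```

Filtering on the recorded period alone is the time-travel query; filtering on both periods at once is the bitemporal time-travel case.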

IBM is also introducing some graph data features, and is showing the good taste to use my term relationship analytics.* Mainly, this is SPARQL 1.0 support, implemented via a variety of relational tables. We’re planning a follow-up briefing for me to learn more.  An internal benchmark — 3.5X speed-up — is memorialized on Slide 17.

*Contrary to my complaints of several years ago, I think the term relationship analytics — which I coined in 2005 — is finally becoming mainstream.
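
For a feel of what "SPARQL over relational tables" can look like, here is a minimal Python sketch using SQLite. It is not IBM's schema, which I haven't seen; the single `triples` table, its column names, and the sample data are all assumptions. The idea is just that a two-part graph pattern turns into a self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed layout: one row per (subject, predicate, object) triple.
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("alice", "knows", "bob"),
        ("bob", "knows", "carol"),
        ("bob", "worksFor", "acme"),
        ("carol", "worksFor", "acme"),
    ],
)

# SPARQL-style pattern:  ?x knows ?y .  ?y worksFor ?company
# Each triple pattern becomes one alias of the table; the shared variable ?y
# becomes the join condition.
rows = conn.execute(
    """
    SELECT t1.subject, t1.object, t2.object
    FROM triples AS t1
    JOIN triples AS t2 ON t2.subject = t1.object
    WHERE t1.predicate = 'knows'
      AND t2.predicate = 'worksFor'
    ORDER BY t1.subject
    """
).fetchall()
```

Longer patterns mean more self-joins, which is one reason a real implementation might spread triples across "a variety of relational tables" rather than one.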

Categories: Other

Our clients, and where they are located

DBMS2 - Sat, 2012-03-31 14:36

From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:

  • This is a list of Monash Advantage members.
  • All our vendor clients are Monash Advantage members, unless …
  • … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
  • We do not usually disclose our user clients.
  • We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
  • Excluded from this round of disclosure is one vendor I have never written about.
  • Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.

For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me.

City of San Francisco

  • KXEN
  • Metamarkets
  • PivotLink
  • salesforce.com
  • Splunk
  • WibiData

Other San Francisco area

  • 10gen
  • ClearStory
  • Cloudera
  • Couchbase
  • DataStax
  • Hortonworks
  • MarketShare
  • MarkLogic
  • Schooner
  • Sybase, an SAP company
  • VMware
  • Yarcdata, a division of Cray

Boston and Cambridge

  • Akiban
  • Cloudant
  • Hadapt
  • Vertica, an HP company

Other Boston area

  • Netezza, an IBM company
  • StreamBase

Everywhere else

  • CodeFutures
  • Infobright
  • SAND
  • Syncsort
  • Tableau
  • Teradata

For most of the companies listed above, you can find coverage here, and specifically a blog category in the list on the right. The exceptions, for now, are:

  • Cloudant
  • MarketShare
  • Metamarkets
  • PivotLink
  • VMware
  • Yarcdata

The main reason I threw in the geographical notes is to support the idea that there’s a real suburb-to-urban shift in the startup tech industry. Mike Arrington made the point recently about the San Francisco area, primarily with respect to the mass/consumer tech areas he focuses on, and of course it’s echoed by the rise of the New York City tech sector. My point is to add that it’s also true for system and enterprise technology, at least in the areas I cover.

In particular, the re-urbanization of the Boston-area software industry is striking:

  • Akiban and Cloudant are in the same office complex in the city of Boston. I was surprised to find even one tech startup in the city of Boston proper, but it seems there is a two-building complex packed with them.
  • Hadapt moved from Connecticut to the Kendall Square area of Cambridge. Edit: Actually, see the comments below.
  • Vertica moved from Burlington to the Alewife area of Cambridge. That’s (evidently deliberately) at the boundary of what might be regarded as the “urban” and “suburban” parts of metro Boston.

The cluster in the city of San Francisco — which also half-includes Cloudera — is relatively new as well.

Categories: Other