
Other

Datameer at the time of Datameer 5.0

DBMS2 - Sun, 2014-10-26 02:42

Datameer checked in, having recently announced general availability of Datameer 5.0. So far as I understood, Datameer is still clearly in the investigative analytics business, in that:

  • Datameer does business intelligence, but not at human real-time speeds. Datameer query durations are sometimes sub-minute, but surely not sub-second.
  • Datameer also does lightweight predictive analytics/machine learning — k-means clustering, decision trees, and so on.

Key aspects include:

  • Datameer runs straight against Hadoop.
  • Like many other analytic offerings, Datameer is meant to be “self-service”, for line-of-business business analysts, and includes some “data preparation”. Datameer also has had some data profiling since Datameer 4.0.
  • The main way of interacting with Datameer seems to be visual analytic programming. However, Datameer has evolved somewhat away from its original spreadsheet metaphor.
  • Datameer’s primitives resemble those you’d find in SQL (e.g. JOINs, GROUPBYs). More precisely, that would be SQL with a sessionization extension; e.g., there’s a function called GROUPBYGAP. (A toy sketch of gap-based sessionization follows this list.)
  • Datameer lets you write derived data back into Hadoop.
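
For readers who haven’t run into sessionization before, here’s a minimal sketch of the gap-based idea: group a user’s events into sessions, starting a new session whenever the time since the previous event exceeds some threshold. This is a pandas illustration of the concept only, with made-up column names and a made-up 30-minute gap; it is not Datameer’s GROUPBYGAP implementation.

```python
# Toy gap-based sessionization in pandas; not Datameer's implementation.
import pandas as pd

events = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime([
        "2014-10-26 10:00", "2014-10-26 10:05", "2014-10-26 11:30",
        "2014-10-26 09:00", "2014-10-26 09:02",
    ]),
}).sort_values(["user", "ts"])

gap = pd.Timedelta(minutes=30)  # start a new session after 30 minutes of silence

# A session boundary occurs wherever the time since the same user's previous
# event exceeds the gap; a cumulative sum of boundaries yields a session id.
boundary = events.groupby("user")["ts"].diff() > gap
events["session"] = boundary.groupby(events["user"]).cumsum()
print(events)  # user "a" gets sessions 0, 0, 1; user "b" gets 0, 0
```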

Datameer use cases sound like the usual mix, consisting mainly of a lot of customer analytics, a bit of anti-fraud, and some operational analytics/internet-of-things. Datameer claims 200 customers and 240 installations, the majority of which are low-end/single-node users, but at least one of which is a multi-million dollar relationship. I don’t think those figures include OEM sell-through. I forgot to ask for any company size metrics, such as headcount.

In a chargeable add-on, Datameer 5.0 has an interesting approach to execution. (The lower-cost version just uses MapReduce.)

  • An overall task can of course be regarded as a DAG (Directed Acyclic Graph).
  • Datameer automagically picks an execution strategy for each node. Administrator hints are allowed. (A purely hypothetical sketch of the idea appears after the notes below.)
  • There are currently three choices for execution: MapReduce, clustered in-memory, or single-node. This all works over Tez and YARN.
  • Spark is a likely future option.

Datameer calls this “Smart Execution”. Notes on Smart Execution include:

  • Datameer sees a lot of tasks that look at 10-100 megabytes of data, especially in malware/anomaly detection. Datameer believes there can be a huge speed-up from running those on a single node rather than in a clustered mode that requires data (re)distribution, with at least one customer reporting a >20X speedup on at least one job.
  • Yes, each step of the overall DAG might look to the underlying execution engine as a DAG of its own.
  • Tez can fire up processes ahead of when they’re needed, so you don’t have to wait for all the process start-up delays in series.
  • Datameer has had a sampling/preview engine from the get-go that runs outside of Hadoop MapReduce. That’s the basis for the non-MapReduce options now.
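
As promised above, here is a purely hypothetical sketch of what per-node engine selection might look like. The thresholds, engine labels, and node names are all invented for illustration; I have no visibility into Datameer’s actual cost model.

```python
# Purely hypothetical sketch of picking an execution engine per DAG node.
# Thresholds, engine labels, and node names are invented for illustration.
SINGLE_NODE_LIMIT = 100 * 1024**2   # ~100 MB: small enough for one machine
IN_MEMORY_LIMIT = 50 * 1024**3      # assumed to fit comfortably in cluster RAM

def pick_engine(estimated_input_bytes, admin_hint=None):
    """Choose an execution strategy for one node of the overall DAG."""
    if admin_hint:                              # administrator hints win
        return admin_hint
    if estimated_input_bytes <= SINGLE_NODE_LIMIT:
        return "single-node"                    # skip data (re)distribution entirely
    if estimated_input_bytes <= IN_MEMORY_LIMIT:
        return "clustered-in-memory"
    return "mapreduce"                          # batch engine that spills to disk

dag = {"parse": 40 * 1024**2, "join": 2 * 1024**4, "aggregate": 10 * 1024**3}
print({node: pick_engine(size) for node, size in dag.items()})
# {'parse': 'single-node', 'join': 'mapreduce', 'aggregate': 'clustered-in-memory'}
```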

Strictly from a BI standpoint, Datameer seems clunky.

  • Datameer doesn’t have drilldown.
  • Datameer certainly doesn’t let you navigate from one visualization to the next ala QlikView/Tableau/et al. (Note to self: I really need to settle on a name for that feature.)
  • While Datameer does have a bit in the way of event series visualization, it seems limited.
  • Of course, Datameer doesn’t have streaming-oriented visualizations.
  • I’m not aware of any kind of text search navigation.

Datameer does let you publish BI artifacts, but doesn’t seem to have any collaboration features beyond that.

Last and also least: In an earlier positioning, Datameer made a big fuss about an online app store. Since analytic app stores never amount to much, I scoffed.* That said, they do have it, so I asked which apps got the most uptake. Most of them seem to be apps which boil down to connectors, access to outside data sets, and/or tutorials. Also mentioned were two more substantive apps, one for path-oriented clickstream analysis, and one for funnel analysis combining several event series.

*I once had a conversation with a client that ended:

  • “This app store you’re proposing will not be a significant success.”
  • “Are you sure?”
  • “Almost certain. It really just sounds like StreamBase’s.”
  • “I’m not familiar with StreamBase’s app store.”
  • “My point exactly.”
Categories: Other

The Benefits of Integrating a Google Search Appliance with an Oracle WebCenter or Liferay Portal

This month, the Fishbowl team presented two webinars on integrating a Google Search Appliance with a WebCenter or Liferay Portal. Our new product, the GSA Portal Search Suite, makes integration simple and also allows for customization to create a seamless, secure search experience. It brings a powerful, Google-like search experience directly to your portal.

The first webinar, “The Benefits of Google Search for your Oracle WebCenter or Liferay Portal”, focused on the Google Search Appliance and the positive experiences users have had with incorporating Google search in the enterprise.

 

The second webinar, “Integrating the Google Search Appliance with a WebCenter or Liferay Portal”, dove deeper into the GSA Portal Search Suite and how it improves the integration process.

 

The following is a list of questions and answers from the webinar series. If you have any other questions, please feel free to reach out to the Fishbowl team!

Q. What version of SharePoint does this product work with?

A. This product is not designed to work with SharePoint. Google has a SharePoint connector that indexes content from SharePoint and pulls it into the GSA, and then the GSA Portal Search Suite would allow any of that content to be served up in your portal.

Fishbowl also has a product called SharePoint Connector that connects SharePoint with Oracle WebCenter Content.

Q. Is Fishbowl a reseller of the GSA? Where can I get a GSA?

A. Yes, we sell the GSA, as well as add-on products and consulting services for the GSA. Visit our website for more information about our GSA services.

Q. What is the difficulty level of customizing the XSLT front end? How long would it take to roll out?

A. This will depend on what you’re trying to customize. If it’s just colors, headers, etc., you could do it pretty quickly because the difficulty level is fairly low. If you’re looking at doing a full-scale customization and entirely changing the look and feel, that could take a lot longer – I would say upwards of a month. The real challenge is that there isn’t a lot of documentation from Google on how to do it, so you would have to do a lot of experimentation.

One of the reasons we created this product is because most customers haven’t been able to fully customize their GSA with a portal, partly because Google didn’t design it to be customizable in this way.

Q. What versions of Liferay does this product support?

A. It supports version 6.2. If you have another version you’d like to integrate with, you can follow up with our team and we can discuss the possibility of working with other versions.

Q. Do you have a connector for IBM WCM?

A. Fishbowl does not have a connector, but Google has a number of connectors that can integrate with many different types of software.

Q. Are you talking about WebCenter Portal or WCM?

A. This connector is designed for WebCenter Portal. If you’re talking about WCM as in SiteStudio or WebCenter Content, we have done a number of projects with those programs. This particular product wouldn’t apply to those situations, but we have other connectors that would work with programs such as WebCenter Content.

Q. Where is the portlet deployed? Is it on the same managed node?

A. The portlets are deployed on the portlet server in WebCenter Portal.

Q. Where can we get the documentation for this product?

A. While the documentation is not publicly available, we do have a product page on the website that includes a lot of information on the Portal Search Suite. Contact your Fishbowl representative if you’d like to learn more about it.

Q. What are the server requirements?

A. WebCenter Portal 11g or Liferay 6.2 and Google Search Appliance 7.2.

Q. Does this product include the connector for indexing content?

A. No, this product does not include a connector. We do have a product called GSA Connector for WebCenter that indexes content and then allows you to integrate that content with a portal. Depending on how your portal is configured, you could also crawl the portal just like you would in a regular website. However, this product focuses exclusively on serving and not on indexing.

Q. How many portals will a GSA support? I have several WebCenter Content domains on the same server.

A. The GSA is licensed according to number of content items, not number of sources. You purchase a license for a certain number of content items and then it doesn’t matter how many domains the content is coming from.

The post The Benefits of Integrating a Google Search Appliance with an Oracle WebCenter or Liferay Portal appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Is analytic data management finally headed for the cloud?

DBMS2 - Wed, 2014-10-22 02:48

It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:

  • Amazon Redshift appears to be prospering.
  • So are some SaaS (Software as a Service) business intelligence vendors.
  • Amazon Elastic MapReduce is still around.
  • Snowflake Computing launched with a cloud strategy.
  • Cazena, with vague intentions for cloud data warehousing, destealthed.*
  • Cloudera made various cloud-related announcements.
  • Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
  • The general argument for cloud-or-at-least-colocation has compelling aspects.
  • Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.

Also — although the specifics on this are generally vague and/or confidential — I sense a narrowing of the gap between:

  • The hardware + networking required for performant analytic data management.
  • The hardware + networking available in the cloud.

*Cazena is proud of its team of advisors. However, the only person yet announced for a Cazena operating role is Prat Moghe, and his time period in Netezza’s mainstream happens not to have been one in which Netezza had much technical or market accomplishment.

On the other hand:

  • If you have processing power very close to the data, then you can avoid a lot of I/O or data movement. Many cloud configurations do not support this.
  • Many optimizations depend upon controlling or at least knowing the hardware and networking set-up. Public clouds rarely offer that level of control.

And so I’m still more confident in SaaS/colocation analytic data management, or in Redshift, than I am in true arm’s-length cloud-based systems.

Categories: Other

Snowflake Computing

DBMS2 - Wed, 2014-10-22 02:45

I talked with the Snowflake Computing guys Friday. For starters:

  • Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
  • The Snowflake DBMS is built from scratch (as opposed to, for example, being based on PostgreSQL or Hadoop).
  • The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
  • Snowflake claims excellent SQL coverage for a 1.0 product.
  • Snowflake, the company, has:
    • 50 people.
    • A similar number of current or past users.
    • 5 referenceable customers.
    • 2 techie founders out of Oracle, plus Marcin Zukowski.
    • Bob Muglia as CEO.

Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*

*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.

In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:

  • Ingest formats are JSON, XML or AVRO for now.
  • I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves.

I don’t know enough details to judge whether I’d call that an example of schema-on-need.

A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather:

  • A lateral view is something like a join on a table function, inner or outer join as the case may be.
  • “Lateral view” is an Oracle term, while “Cross apply” is the term for the same thing in Microsoft SQL Server.
  • Lateral views are one of the ways of making SQL handle hierarchical data structures (others evidently are WITH and CONNECT BY).

Lateral views seem central to how Snowflake handles nested data structures. I presume Snowflake also uses or plans to use them in more traditional ways (subqueries, table functions, and/or complex FROM clauses).
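
Pending a better explanation, here is my own minimal illustration of the idea, using pandas rather than SQL. The gist: apply a table function (here, “unnest this array”) to each row, and join the resulting rows back to that row’s other columns, which is how nested structures get flattened into something relational. The table and column names are invented.

```python
# Conceptual stand-in for a lateral view / CROSS APPLY, using pandas.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2],
    "customer": ["alice", "bob"],
    "items": [["apple", "banana"], ["carrot"]],  # nested array per document
})

# The lateral-view idea: expand each row's nested array into rows, keeping
# (effectively joining back) the parent row's other columns.
flattened = orders.explode("items").rename(columns={"items": "item"})
print(flattened)
#    order_id customer    item
# 0         1    alice   apple
# 0         1    alice  banana
# 1         2      bob  carrot
```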

If anybody has a good link explaining lateral views, please be so kind as to share! Elementary googling isn’t turning much up, and the Snowflake folks didn’t send over anything clearer than this and this.

Highlights of Snowflake’s cloud/elastic/simple/inexpensive story include:

  • Snowflake’s product is SaaS-only for the foreseeable future.
  • Data is stored in compressed 16 megabyte files on Amazon S3, and pulled into Amazon EC2 servers for query execution on an as-needed basis. Allegedly …
  • … this makes data storage significantly cheaper than it would be in, for example, an Amazon version of HDFS (Hadoop Distributed File System).
  • When you fire up Snowflake, you get a “virtual data warehouse” across one or more nodes. You can have multiple “virtual data warehouses” accessing identical or overlapping sets of data. Each of these “virtual data warehouses” has a physical copy of the data; i.e., this is not related to the Oliver Ratzesberger concept of a virtual data mart defined by workload management.
  • Snowflake has no indexes. It does have zone maps, aka data skipping; a toy sketch of that idea follows this list. (Speaking of simple/inexpensive — both those aspects remind me of Netezza.)
  • Snowflake doesn’t distribute data on any kind of key. I.e. it’s round-robin. (I think that’s accurate; they didn’t have time to get back to me and confirm.)
  • This is not an in-memory story. Data pulled onto Snowflake’s EC2 nodes will commonly wind up in their local storage.
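
For anyone who hasn’t met the term, here’s a toy sketch of zone maps / data skipping, as referenced a couple of bullets up: keep per-file minimum and maximum values for a column, and skip any file whose range can’t possibly satisfy the predicate. This illustrates the general technique, not Snowflake’s implementation; the file names and dates are made up.

```python
# Toy zone maps: per-file min/max statistics let a scan skip files outright.
files = {
    "part-0001": {"min_date": "2014-01-01", "max_date": "2014-03-31"},
    "part-0002": {"min_date": "2014-04-01", "max_date": "2014-06-30"},
    "part-0003": {"min_date": "2014-07-01", "max_date": "2014-09-30"},
}

def files_to_scan(lo, hi):
    """Return only the files whose [min, max] range overlaps [lo, hi]."""
    return [
        name for name, stats in files.items()
        if stats["max_date"] >= lo and stats["min_date"] <= hi
    ]

# A predicate like BETWEEN '2014-05-15' AND '2014-08-15' touches 2 of 3 files.
print(files_to_scan("2014-05-15", "2014-08-15"))  # ['part-0002', 'part-0003']
```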

Snowflake pricing is based on the sum of:

  • Per EC2 server-hour, for a couple classes of node.
  • Per S3 terabyte-month of compressed storage.

Right now the cheaper class of EC2 node uses spinning disk, while the more expensive uses flash; soon they’ll both use flash.

DBMS 1.0 versions are notoriously immature, but Snowflake seems — or at least seems to think it is — further ahead than is typical.

  • Snowflake’s optimizer is fully cost-based.
  • Snowflake thinks it has strong SQL coverage, including a large fraction of SQL 2003 Analytics. Apparently Snowflake has run every TPC-H and TPC-DS query in-house, except that one TPC-DS query relied on a funky rewrite or something like that.
  • Snowflake bravely thinks that it’s licked concurrency from Day 1; you just fire up multiple identical virtual DWs if needed to handle the query load. (Note: The set of Version 1 DBMS without concurrent-usage bottlenecks has cardinality very close to 0.)
  • Similarly, Snowflake encourages you to fire up a separate load-only DW instance, and load mainly through trickle feeds.
  • Snowflake’s SaaS-only deployment obviates — or at least obscures :) — a variety of management, administration, etc. features that often are lacking in early DBMS releases.

Other DBMS technology notes include:

  • Compression is columnar (various algorithms, including file-at-a-time dictionary/token); a toy dictionary-encoding sketch follows this list.
  • Joins and other database operations are performed on compressed data. (Yay!)
  • Those 16-megabyte files are column-organized and immutable. This strongly suggests which kinds of writes can or can’t be done efficiently. :) Note that adding a column — perhaps of derived data — is one of the things that could go well.
  • There’s some kind of conflict resolution if multiple virtual DWs try to write the same records — but as per the previous point, the kinds of writes for which that’s an issue should be rare anyway.
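
To make the “dictionary/token” part concrete, here’s a minimal sketch of dictionary encoding for one column of one file: each distinct value is stored once, and the column itself becomes a list of small integer tokens. It also shows why some operations can run on the compressed form. This is the generic technique only; I don’t know the details of Snowflake’s formats.

```python
# Minimal file-at-a-time dictionary/token encoding of a single column.
def dictionary_encode(column):
    """Return (dictionary, tokens): each value stored once, rows become ints."""
    dictionary, codes, tokens = [], {}, []
    for value in column:
        if value not in codes:
            codes[value] = len(dictionary)
            dictionary.append(value)
        tokens.append(codes[value])
    return dictionary, tokens

dictionary, tokens = dictionary_encode(["US", "US", "DE", "US", "FR", "DE"])
print(dictionary)  # ['US', 'DE', 'FR']
print(tokens)      # [0, 0, 1, 0, 2, 1]

# Operating on compressed data: an equality filter only needs the tokens.
wanted = dictionary.index("DE")
print([i for i, t in enumerate(tokens) if t == wanted])  # rows [2, 5]
```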

In the end, a lot boils down to how attractive Snowflake’s prices wind up being. What I can say now is:

  • I don’t actually know Snowflake’s pricing …
  • … nor the amount of work it can do per node.
  • It’s hard to imagine that passing queries from EC2 to S3 is going to give great performance. So Snowflake is more likely to do well when whatever parts of the database wind up being “cached” in the flash of the EC2 servers suffice to answer most queries.
  • In theory, Snowflake could offer aggressive loss-leader pricing for a while. But nobody should make a major strategic bet on Snowflake’s offerings unless it shows it has a sustainable business model.
Categories: Other

Cloudera’s announcements this week

DBMS2 - Thu, 2014-10-16 09:05

This week being Hadoop World, Cloudera naturally put out a flurry of press releases. In anticipation, I put out a context-setting post last weekend. That said, the gist of the news seems to be:

  • Cloudera continued to improve various aspects of its product line, especially Impala with a Version 2.0. Good for them. One should always be making one’s products better.
  • Cloudera announced a variety of partnerships with companies one would think are opposed to it. Not all are Barney. I’m now hard-pressed to think of any sustainable-looking relationship advantage Hortonworks has left in the Unix/Linux world. (However, I haven’t heard a peep about any kind of Cloudera/Microsoft/Windows collaboration.)
  • Cloudera is getting more cloud-friendly, via a new product — Cloudera Director. Probably there are or will be some cloud-services partnerships as well.

Notes on Cloudera Director start:

  • It’s closed-source.
  • Code and support are included in any version of Cloudera Enterprise.
  • It’s a management tool. Indeed, Cloudera characterized it to me as a sort of manager of Cloudera Managers.

What I have not heard is any answer for the traditional performance challenge of Hadoop-in-the-cloud, which is:

  • Hadoop, like most analytic RDBMS, tightly couples processing and storage in a shared-nothing way.
  • Standard cloud architectures, however, decouple them, thus mooting a considerable fraction of Hadoop performance engineering.

Maybe that problem isn’t — or is no longer — as big a deal as I’ve been told.

Categories: Other

Context for Cloudera

DBMS2 - Mon, 2014-10-13 02:02

Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which given Cloudera’s Hadoop market stature is in my opinion an interesting thing to analyze.

So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):

  • CDH (Cloudera Distribution … Hadoop) contains a lot of Apache open source code.
  • Cloudera has a much longer list of Apache projects that it thinks comprise “Core Hadoop” than, say, Hortonworks does.
    • Specifically, that list currently is: Hadoop, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sentry, Sqoop, Whirr, ZooKeeper.
    • In addition to those projects, CDH also includes HBase, Impala, Spark and Cloudera Search.
  • Cloudera Manager is closed-source code, much of which is free to use. (I.e., “free like beer” but not “free like speech”.)
  • Cloudera Navigator is closed-source code that you have to pay for (free trials and the like excepted).
  • Cloudera Express is Cloudera’s favorite free subscription offering. It combines CDH with the free part of Cloudera Manager. Note: Cloudera Express was previously called Cloudera Standard, and that terminology is still reflected in parts of Cloudera’s website.
  • Cloudera Enterprise is the umbrella name for Cloudera’s three favorite paid offerings.
  • Cloudera Enterprise Basic Edition contains:
    • All the code in CDH and Cloudera Manager, and I guess Accumulo code as well.
    • Commercial licenses for all that code.
    • A license key to use the entirety of Cloudera Manager, not just the free part.
    • Support for the “Core Hadoop” part of CDH.
    • Support for Cloudera Manager. Note: Cloudera is lazy about saying this explicitly, but it seems obvious.
    • The code for Cloudera Navigator, but that’s moot, as the corresponding license key for Cloudera Navigator is not part of the package.
  • Cloudera Enterprise Data Hub Edition contains:
    • Everything in Cloudera Basic Edition.
    • A license key for Cloudera Navigator.
    • Support for all of HBase, Accumulo, Impala, Spark, Cloudera Search and Cloudera Navigator.
  • Cloudera Enterprise Flex Edition contains everything in Cloudera Basic Edition, plus support for one of the extras in Data Hub Edition.

In analyzing all this, I’m focused on two particular aspects:

  • The “zero, one, many” system for defining the editions of Cloudera Enterprise.
  • The use of “Data Hub” as a general marketing term.

Given its role as a highly influential yet still small “platform” vendor in a competitive open source market, Cloudera even more than most vendors faces the dilemma:

  • Cloudera wants customers to adopt its views as to which Hadoop-related technologies they should use.
  • However, Cloudera doesn’t want to be in the position of trying to ram some particular unwanted package down a customer’s throat.

The Flex/Data Hub packaging fits great with that juggling act, because Cloudera — and hence also Cloudera salespeople — get paid exactly as much when customers pick 2 Flex options as when they use all 5-6. If you prefer Cassandra or MongoDB to HBase, Cloudera is fine with that. Ditto if you prefer CitusDB or Vertica or Teradata Hadapt to Impala. Thus Cloudera can avoid a lot of religious wars, even if it can’t entirely escape Hortonworks’ “More open source than thou” positioning.

Meanwhile, so far as I can tell, Cloudera currently bets on the “Enterprise Data Hub” as its core proposition, as evidenced by that term being baked into the name of Cloudera’s most comprehensive and expensive offering. Notes on the EDH start:

  • Cloudera also portrays “enterprise data hub” as an architectural/reference architecture concept.
  • “Enterprise data hub” doesn’t really mean anything very different from “data lake” + “data refinery”; Cloudera just thinks it sounds more important. Indeed, Cloudera claims that the other terms are dismissive or disparaging, at least in some usages.

Cloudera’s long-term dream is clearly to make Hadoop the central data platform for an enterprise, while RDBMS fill more niche (or of course also legacy) roles. I don’t think that will ever happen, because I don’t think there really will be one central data platform in the future, any more than there has been in the past. As I wrote last year on appliances, clusters and clouds,

Ceteris paribus, fewer clusters are better than more of them. But all things are not equal, and it’s not reasonable to try to reduce your clusters to one — not even if that one is administered with splendid efficiency by low-cost workers, in a low-cost building, drawing low-cost electric power, in a low-cost part of the world.

and earlier in the same post

… these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.

One system is not going to be optimal for all computing purposes.

Categories: Other

Notes on predictive modeling, October 10, 2014

DBMS2 - Fri, 2014-10-10 02:40

As planned, I’m getting more active in predictive modeling. Anyhow …

1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.

2. The most controversial part of that post was probably the claim:

I think the predictive modeling state of the art has become:

  • Cluster in some way.
  • Model separately on each cluster. (A minimal sketch of this recipe appears after the bullets below.)

In particular:

  • It is always possible to instead go with a single model formally.
  • A lot of people think accuracy, ease-of-use, or both are better served by a true single-model approach.
  • Conversely, if you have a single model that’s pretty good, it’s natural to look at the subset of the data for which it works poorly and examine that first. Voila! You’ve just done a kind of clustering.
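
As promised, here’s a minimal scikit-learn rendering of the two-step recipe: cluster first, fit a separate model per cluster, and route new points through the same cluster assignment. The data is synthetic and the model choices are arbitrary; this shows the shape of the approach, not any particular vendor’s product.

```python
# Sketch of "cluster in some way, then model separately on each cluster".
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                                  # synthetic features
y = np.where(X[:, 0] > 0, 2.0, -5.0) * X[:, 1] + rng.normal(scale=0.1, size=300)

clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
models = {}
for label in np.unique(clusterer.labels_):
    mask = clusterer.labels_ == label
    models[label] = LinearRegression().fit(X[mask], y[mask])   # one model per cluster

def predict(x_new):
    """Assign the point to a cluster, then score with that cluster's model."""
    label = clusterer.predict(x_new)[0]
    return models[label].predict(x_new)[0]

print(predict(X[:1]))
```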

3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start:

  • While KXEN was distinguished by how limited its choice of model templates was, Nutonian is distinguished by its remarkable breadth. Is the best model for your data a quadratic polynomial in which some of the terms are trigonometric functions? Nutonian is happy to find that for you.
  • Nutonian is starting out as a SaaS (Software as a Service) vendor.
  • A big part of Nutonian’s goal is to find a simple/parsimonious model, because — although this is my phrasing rather than theirs — the simpler the model, the more likely it is to have robust explanatory power.

With all those possibilities, what do Nutonian models actually wind up looking like? In internet/log analysis/whatever kinds of use cases, I gather that:

  • The model is likely to be a polynomial — of multiple variables of course — of order no more than 3 or 4.
  • Variables can have time delays built into them (e.g., sales today depend on email sent 2 weeks ago). Indeed, some of Nutonian’s flashiest early modeling successes seem to be based around the ease with which they capture time-delayed causality.
  • In each monomial, all variables except 1 are likely to be “control”/”capping”/”transition-point”/”on-off switch”/logical/conditional/whatever variables — i.e., variables whose range is likely to be either {0,1} or perhaps [0,1] instead. (An invented illustration of such terms follows this list.)
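
To illustrate those two kinds of terms, here’s an invented example of building a time-delayed variable and combining it with a {0,1} switch variable into one low-order monomial. The column names, the 2-period lag, and the data are all made up; this is not taken from any actual Nutonian model.

```python
# Invented example of a lagged variable times an on-off switch variable.
import pandas as pd

df = pd.DataFrame({
    "sales":       [100, 120, 90, 150, 160, 170, 140, 180, 200, 210],
    "emails_sent": [  5,   0,  8,   2,   0,   9,   1,   0,   4,   3],
    "promo_on":    [  0,   0,  1,   0,   0,   0,   1,   0,   0,   0],  # {0,1} switch
})

# Time-delayed causality: e.g., sales today depend on emails sent 2 periods ago.
df["emails_lag2"] = df["emails_sent"].shift(2)

# One candidate monomial: switch variable times the lagged "real" variable.
df["promo_x_emails_lag2"] = df["promo_on"] * df["emails_lag2"]
print(df.head())
```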

Nutonian also serves real scientists, however, and their models can be all over the place.

4. One set of predictive modeling complexities goes something like this:

  • A modeling exercise may have 100s or 1000s of potential variables to work with. (For simplicity, think of a potential variable as a column or field in the input data.)
  • The winning models are likely to use only a small fraction of these variables.
  • Those may not be variables you’re thrilled about using.
  • Fortunately, many variables have strong covariances with each other, so it’s often possible to exclude your disfavored variables and come out with a model almost as good.

I pushed the Nutonian folks to brainstorm with me about why one would want to exclude variables, and quite a few kinds of reasons came up, including:

  • (My top example.) Regulatory compliance may force you to exclude certain variables. E.g., credit scores in the US mustn’t be based on race.
  • (Their top example.) Some data is just expensive to get. E.g., a life insurer would like to come up with a way to avoid using blood test results in their decision making, because they’d like to drop the expense of the blood tests.
  • (Perhaps our joint other top example.) Clarity of explanation is an important goal. Some models are black boxes, and that’s that. Others are also supposed to uncover causality that helps humans make all kinds of better decisions. Regulators may also want clear models. Note: Model clarity can be affected by model structure and variable(s) choice alike.
  • Certain variables can simply be more or less trusted, in terms of the accuracy of the data.
  • Certain variables can be more or less certain to be available in the future. However, I wonder how big a concern that is in a world where models are frequently retrained anyway.

5. I’m not actually seeing much support for the theory that Julia will replace R except perhaps from Revolution Analytics, the company most identified with R. Go figure.

6. And finally, I don’t think it’s wholly sunk in among predictive modeling folks that Spark both:

  • Has great momentum.
  • Was designed with machine learning in mind.
Categories: Other

Upcoming Webinar Series: Using Google Search with your Oracle WebCenter or Liferay Portal

Fishbowl will host a series of webinars this month about integrating the Google Search Appliance with an Oracle WebCenter or Liferay Portal. Our new product, the GSA Portal Search Suite, fully exposes Google features within portals while also maintaining the existing look and feel.

The first webinar, “The Benefits of Google Search for your Oracle WebCenter or Liferay Portal”, will be held on Wednesday, October 15 from 12:00-1:00 PM CST. This webinar will focus on the benefits of using the Google Search Appliance, which has best-in-class relevancy and the impressive search features, such as spell check and document preview, that Google users are used to.

Register now

The second webinar, “Integrating the Google Search Appliance and Oracle WebCenter or Liferay Portal”, further explains how Fishbowl’s GSA Portal Search Suite helps improve the process of setting up a GSA with a WebCenter or Liferay Portal. This product uses configurable portlets so users can choose which Google features to enable and provides single sign-on between the portal and the GSA. The webinar will be held on Wednesday, October 22 from 12:00-1:00 PM CST.

Register now

For more information on the GSA Portal Search Suite, read our previous blog post on the topic.

The post Upcoming Webinar Series: Using Google Search with your Oracle WebCenter or Liferay Portal appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Spark vs. Tez, revisited

DBMS2 - Sun, 2014-10-05 02:59

I’m on record as noting and agreeing with an industry near-consensus that Spark, rather than Tez, will be the replacement for Hadoop MapReduce. I presumed that Hortonworks, which is pushing Tez, disagreed. But Shaun Connolly of Hortonworks suggested a more nuanced view. Specifically, Shaun tweeted thoughts including:

Tez vs Spark = Apples vs Oranges.

Spark is general-purpose engine with elegant APIs for app devs creating modern data-driven apps, analytics, and ML algos.

Tez is a framework for expressing purpose-built YARN-based DAGs; its APIs are for ISVs & engine/tool builders who embed it

[For example], Hive embeds Tez to convert its SQL needs into purpose-built DAGs expressed optimally and leveraging YARN

That said, I haven’t yet had a chance to understand what advantages Tez might have over Spark in the use cases that Shaun relegates it to.

Related link

Categories: Other

Streaming for Hadoop

DBMS2 - Sun, 2014-10-05 02:56

The genesis of this post is that:

  • Hortonworks is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of Hadoop.
  • Cloudera is talking up what I would call its human real-time strategy, which includes but is not limited to Flume, Kafka, and Spark Streaming. Cloudera also sees a few use cases for Storm.
  • This all fits with my view that the Current Hot Subject is human real-time data freshness — for analytics, of course, since we’ve always had low latencies in short-request processing.
  • This also all fits with the importance I place on log analysis.
  • Cloudera reached out to talk to me about all this.

Of course, we should hardly assume that what the Hadoop distro vendors favor will be the be-all and end-all of streaming. But they are likely to at least be influential players in the area.

In the parts of the problem that Cloudera emphasizes, the main tasks that need to be addressed are:

  • Getting data into the plumbing from whatever systems it’s being generated in. This is the province of Flume, one of Cloudera’s earliest projects. I’d add that this is also one of the core competencies of Splunk.
  • Getting data where it needs to go. Flume can do this. Kafka, a publish/subscribe messaging system, can do it in a more general way, because streams are sent to a Kafka broker, which then re-streams them to their ultimate destination.
  • Processing data in flight. Storm can do this. Spark Streaming can do it more easily (a minimal sketch follows this list). Spark Streaming is or soon will be a part of every serious Hadoop distribution. Flume can do some lightweight processing as well.
  • Serving up data for further query. Cloudera would like you to do this via HBase or Impala. But Oracle is a fine choice too, and indeed a popular choice among Cloudera customers.
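
For the “processing data in flight” step, here’s a minimal Spark Streaming sketch using the classic DStream API of that era: count events per key in small micro-batches read from a TCP socket. The socket source, the 10-second batch interval, and the word-count logic are arbitrary choices for the example, just to show the programming model.

```python
# Minimal Spark Streaming (DStream API) sketch: per-key counts over
# 10-second micro-batches read from a plain TCP socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=10)      # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # arbitrary demo source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts

ssc.start()
ssc.awaitTermination()
```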

I guess there’s also a step of receiving data out of the plumbing system. Cloudera and I glossed over that aspect when we talked, but I’ll say:

  • Spark commonly lives over HDFS (Hadoop Distributed File System).
  • Flume feeds HDFS. Flume was also hacked years ago — rah-rah open source! — to feed Kafka instead, and also to be fed by it.

Cloudera has not yet decided whether to make Kafka part of CDH (which stands for Cloudera Distribution yada yada Hadoop). Considerations in that probably include:

  • Kafka has impressive adoption among high-profile internet companies, but not so much among conventional enterprises.
  • Surely not coincidentally, Kafka is missing features in areas such as security (e.g. it lacks Kerberos integration).
  • Kafka lacks cool capabilities to let you configure rather than code, although Cloudera thinks that in some cases you can work around this problem by marrying Kafka and Flume.

I still find it bizarre that a messaging system would be named after an author famous for writing about depressingly inescapable situations. Also, I wish that:

  • Kafka had something to do with transformations.
  • The name Kafka had been used by a commercial software company, which could offer product trials.

Highlights from the Storm vs. Spark Streaming vs. Samza part of my discussion with Cloudera include:

  • Storm has a companion project Trident that makes Storm somewhat easier to program and/or configure. But Trident only has some of the usability advantages of Spark Streaming.
  • Cloudera sees no advantages to Samza, a Kafka companion project, when compared with whichever of Spark Streaming or Storm + Trident is better suited to a particular use case.
  • Cloudera likes the rich set of primitives that Spark Streaming inherits from Spark. Cloudera also notes that, if you learn to program over Spark for any reason, then you will in particular have learned how to program over Spark Streaming.
  • Spark Streaming lets you join Spark Streaming data to other data that Spark can get access to. I agree with Cloudera that this is an important advantage.
  • Cloudera sees Storm’s main advantages as being in latency. If you need 10-200 millisecond latency, Storm can give you that today while Spark Streaming can’t. However, Cloudera notes that to write efficiently to your persistent store — which Cloudera fondly hopes but does not insist will be HBase or Impala — you may need to micro-batch your writes anyway.

Also, Spark Streaming has a major advantage over bare Storm in whether you have to manually configure your topology, but I wasn’t clear as to how far Trident closes that particular gap.

Cloudera and I didn’t particularly talk about data-consuming technologies such as BI, predictive analytics, or analytic applications, but we did review use cases a bit. Nothing too surprising jumped out. Indeed, the discussion reminded me of a 2007 list I did of applications — other than extreme low-latency ones — for CEP (Complex Event Processing).

  • Top-of-mind were things that fit into one or more of the buckets “internet”, “retail”, “recommendation/personalization”, “security” or “anti-fraud”.
  • Transportation/logistics got mentioned, to which I replied that the CEP vendors had all seemed to have one trucking/logistics client each.
  • At least in theory, there are potentially huge future applications in health care.

In general, candidate application areas for streaming-to-Hadoop match those that involve large volumes of machine-generated data.

Edit: Shortly after I posted this, Storm creator Nathan Marz put up a detailed and optimistic post about the history and state of Storm.

Categories: Other

Fishbowl’s GSA Portal Search Suite introduces JSR-286 portlet integration that brings Google search to Oracle WebCenter and Liferay Portal

Integrated Google search has arrived for Oracle WebCenter and Liferay Portal. Last week, Fishbowl Solutions announced the GSA (Google Search Appliance) Portal Search Suite. This is Fishbowl’s fourth product for the Google Search Appliance, and introduces a productized integration that exposes Google search features like spelling suggestions, dynamic navigation, and document previews directly within the portal.

Previous integrations between the GSA and WebCenter or Liferay Portal had to be heavily customized to expose similar search features and functionality. In most cases, extensive customization was needed even when adding only one new search feature, such as autocomplete query suggestions, to portal search pages. Additionally, such customization had to be done by someone with specialized technical expertise, including portal development, familiarity with the GSA response format, and XML transformation. Alternately, some organizations have used the GSA’s built-in stylesheet, typically directing users to search functions outside of the portal, either as an iframe or a completely separate search page. This disconnect devalues the portal as being the single, universal location to access enterprise information, and detracts from the overall portal user experience.

Fishbowl’s GSA Portal Search Suite seamlessly integrates the GSA with WebCenter and Liferay portals. The integration is made possible by a collection of JSR-286 portlets that provide a search box and search results layout directly within the portal. These configurable portlets let customers choose which Google search features to expose, and lets them mix and match portlets for specific pages. The GSA Portal Search Suite also includes an authentication mechanism to provide single-sign-on between the portal and the GSA when performing secure searches. All these features help ensure that searches conducted from the portal return results with higher relevancy, and that search pages match the look and feel of the portal, leading to an enhanced user experience.

Customers with WebCenter or Liferay Portal that are looking to improve relevancy and provide search features that users have come to expect can do so with the GSA. And now with Fishbowl’s GSA Portal Search Suite, a seamless and flexible integration is available, decreasing time to value and helping to maximize your WebCenter, Liferay and GSA investment.

Fishbowl will be demonstrating GSA Portal Search Suite, as well as our other GSA value-add products, at Oracle OpenWorld from September 29th through October 1st. You can see us in booth #2036 Moscone South. To read the brochure, click here.

GSA Portal Search Screen

 

The post Fishbowl’s GSA Portal Search Suite introduces JSR-286 portlet integration that brings Google search to Oracle WebCenter and Liferay Portal appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Some stuff on my mind, September 28, 2014

DBMS2 - Sun, 2014-09-28 18:21

1. I wish I had some good, practical ideas about how to make a political difference around privacy and surveillance. Nothing else we discuss here is remotely as important. I presumably can contribute an opinion piece to, more or less, the technology publication(s) of my choice; that can have a small bit of impact. But I’d love to do better than that. Ideas, anybody?

2. A few thoughts on cloud, colocation, etc.:

  • The economies of scale of colocation-or-cloud over operating your own data center are compelling. Most of the reasons you outsource hardware manufacture to Asia also apply to outsourcing data center operation within the United States. (The one exception I can think of is supply chain.)
  • The arguments for cloud specifically over colocation are less persuasive. Colo providers can even match cloud deployments in rapid provisioning and elastic pricing, if they so choose.
  • Surely not coincidentally, I am told that Rackspace is deemphasizing cloud, reemphasizing colocation, and making a big deal out of Open Compute. In connection with that, Rackspace has pulled back from its leadership role in OpenStack.
  • I’m hearing much more mention of Amazon Redshift than I used to. It seems to have a lot of traction as a simple and low-cost option.
  • I’m hearing less about Elastic MapReduce than I used to, although I imagine usage is still large and growing.
  • In general, I get the impression that progress is being made in overcoming the inherent difficulties in cloud (and even colo) parallel analytic processing. But it all still seems pretty vague, except for the specific claims being made for traction of Redshift, EMR, and so on.
  • Teradata recently told me that in colocation pricing, it is common for floor space to be everything, with power not separately metered. But I don’t think that trend is a big deal, as it is not necessarily permanent.
  • Cloud hype is of course still with us.
  • Other than the above, I stand by my previous thoughts on appliances, clusters and clouds.

3. As for the analytic DBMS industry:

  • Concurrency is still a challenge. But otherwise …
  • … great SQL query performance isn’t something to get excited about any more, especially in immature systems.
  • Be careful about systems that have great performance when intermediate result sets fit into RAM, but not when they spill to disk. In particular, watch for this problem in the Hadoop/Spark world.
  • Vendors are getting better about ANSI SQL coverage (SQL 99 Analytics, windowing, etc. …)
  • “Runs on Hadoop” isn’t an exciting claim unless you can mix and match SQL and generic Hadoop processing in the same jobs against the same data, even though lesser forms of SQL/Hadoop integration might also help with some aspects of TCO (Total Cost of Ownership).
  • More generally, what’s needed is:
    • The ability to mix SQL and other kinds of analytic processing.
    • The ability to mix traditional tabular data, JSON, and log data.
    • The ability to mix data in place with data that’s trickling/streaming in.

4. Meanwhile, the analytic ease of use story remains popular, in business intelligence and predictive analytics/data science alike. Marketers typically oversimplify it to their own detriment, however, just as they do performance stories.

5. On the short-request side:

  • NoSQL is still going gangbusters.
  • NewSQL still isn’t, except that I haven’t talked with MemSQL for a while and they were doing well when I did.
  • Transparent sharding has stagnated as a business, good technology notwithstanding, and the vendors are pivoting.

6. Finally, one vendor note — Sharmila assures me by brief email that things are going gangbusters at ClearStory. This is unsurprising, as ClearStory exemplifies several trends I believe in, including robust analytic stacks, strong data navigation, Spark, and the incorporation of broad varieties of data.

And of course ClearStory also empowers business analysts to make do without IT involvement, like the other cool analytic kids also do.

Categories: Other

Meet Fishbowl’s WebCenter Experts at OpenWorld

Oracle OpenWorld will be held from September 28-October 2 in San Francisco.

Fishbowl Solutions will once again be at Oracle OpenWorld this year to connect with fellow WebCenter users! The event is now only a few days away, and our team is really looking forward to discussing how our value-add solutions can help your organization.

Our booth in the exhibition hall will be located at 2036 Moscone South, and will feature demos of Mobile ECM, the Google Search Appliance, Portal Solution Accelerator, SharePoint integration, and a free iPad giveaway. We will also have many representatives on hand to answer your WebCenter content, portal, or imaging questions. All exhibition halls will be open from 10:00 a.m. – 6:00 p.m. on Monday and Tuesday, and from 9:30 a.m. – 3:30 p.m. on Wednesday.

Other activities at this year’s event include:

  • Sunday, September 28
    A Successful Oracle WebCenter Upgrade: What You Need to Know
    12:00 PM-12:45 PM, Moscone South 305
    This session’s speakers share facts and use cases that you will be able to apply to your Oracle WebCenter 11g upgrade. You will also learn tips and best practices from successful upgrades to Release 11g. The session includes a fact-sharing discussion on upgrades; use case stories from Oracle WebCenter customers; and a roundtable forum during which attendees will be able to ask questions specific to their Oracle WebCenter Content, Oracle WebCenter Portal, or Oracle WebCenter Imaging upgrade.
  • Wednesday, October 1
    Automate Financial Processes for PeopleSoft and Oracle E-Business Suite
    12:45 PM-1:30 PM, Moscone West 3018
  • Wednesday, October 1
    Oracle WebCenter for Education and Research
    2:00 PM-2:45 PM, Marriott Marquis Golden Gate C3
    Digital, social, and mobile technologies are creating new and transformational education experiences to engage students, faculty, parents, and administrators in their collective pursuit of student success. This session features case studies from higher education and K–12 that illustrate the power of Oracle WebCenter in enabling twenty-first-century learning.
  • Monday, September 29
    Oracle WebCenter and Oracle BPM Customer Appreciation Reception
    6:30 PM-8:30 PM, Old Mint, Old Mint Plaza
    Register for the reception here.

If you’d like to meet with any of Fishbowl’s representatives at the event, feel free to email info@fishbowlsolutions.com. To learn more about what we’ll be doing at OpenWorld this year, download our Focus On guide. See you in San Francisco!

The post Meet Fishbowl’s WebCenter Experts at OpenWorld appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Data as an asset

DBMS2 - Sun, 2014-09-21 21:49

We all tend to assume that data is a great and glorious asset. How solid is this assumption?

  • Yes, data is one of the most proprietary assets an enterprise can have. Any of the Goldman Sachs big three* — people, capital, and reputation — are easier to lose or imitate than data.
  • In many cases, however, data’s value diminishes quickly.
  • Determining the value derived from owning, analyzing and using data is often tricky — but not always. Examples where data’s value is pretty clear start with:
    • Industries which long have had large data-gathering research budgets, in areas such as clinical trials or seismology.
    • Industries that can calculate the return on mass marketing programs, such as internet advertising or its snail-mail predecessors.

*”Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.” I love that motto, even if Goldman Sachs itself eventually stopped living up to it. If nothing else, my own business depends primarily on my reputation and information.

This all raises the idea – if you think data is so valuable, maybe you should get more of it. Areas in which enterprises have made significant and/or successful investments in data acquisition include: 

  • Actual scientific, clinical, seismic, or engineering research.
  • Actual selling of (usually proprietary) data, with the straightforward economic proposition of “Get once, sell to multiple customers more cheaply than they could get it themselves.” Examples start:
    • This is the essence of the stock quote business. And Michael Bloomberg started building his vast fortune by adding additional data to what the then-incumbents could offer, for example by getting fixed-income prices from Cantor Fitzgerald.*
    • Multiple marketing-data businesses operate on this model.
    • Back when there was a small but healthy independent paper newsletter and directory business, its essence was data.
    • And now there are many online data selling efforts, in niches large and small.
  • Internet ad-targeting businesses. Making money from your great ad-targeting technology usually involves access to lots of user-impression and de-anonymization data as well.
  • Aggressive testing by internet businesses, of substantive offers and marketing-display choices alike. At the largest, such as eBay, you’ll rarely see a page that doesn’t have at least one experiment on it. Paper-based direct marketers take a similar approach. Call centers perhaps should follow suit more than they do.
  • Surveys, focus groups, etc. These are commonly expensive and unreliable (and the cheap internet ones commonly irritate people who do business with you). But sometimes they are, or seem to be, the only kind of information available.
  • Free-text data. On the whole I’ve been disappointed by the progress in text analytics. Still — and this overlaps with some previous points — there’s a lot of information in text or narrative form out there for the taking.
    • Internally you might have customer emails, call center notes, warranty reports and a lot more.
    • Externally there’s a lot of social media to mine.

*Sadly, Cantor Fitzgerald later became famous for being hit especially hard on 9/11/2001.

And then there’s my favorite example of all. Several decades ago, especially in the 1990s, supermarkets and mass merchants implemented point-of-sale (POS) systems to track every item sold, and then added loyalty cards through which they bribed their customers to associate their names with their purchases. Casinos followed suit. Airlines of course had loyalty/frequent-flyer programs too, which were heavily related to their marketing, although in that case I think loyalty/rewards were truly the core element, with targeted marketing just being an important secondary benefit. Overall, that’s an awesome example of aggressive data gathering. But here’s the thing, and it’s an example of why I’m confused about the value of data — I wouldn’t exactly say that grocers, mass merchants or airlines have been bastions of economic success. Good data will rarely save a bad business.

Related links

Categories: Other

Misconceptions about privacy and surveillance

DBMS2 - Mon, 2014-09-15 11:07

Everybody is confused about privacy and surveillance. So I’m renewing my efforts to consciousness-raise within the tech community. For if we don’t figure out and explain the issues clearly enough, there isn’t a snowball’s chance in Hades our lawmakers will get it right without us.

How bad is the confusion? Well, even Edward Snowden is getting it wrong. A Wired interview with Snowden says:

“If somebody’s really watching me, they’ve got a team of guys whose job is just to hack me,” he says. “I don’t think they’ve geolocated me, but they almost certainly monitor who I’m talking to online. Even if they don’t know what you’re saying, because it’s encrypted, they can still get a lot from who you’re talking to and when you’re talking to them.”

That is surely correct. But the same article also says:

“We have the means and we have the technology to end mass surveillance without any legislative action at all, without any policy changes.” The answer, he says, is robust encryption. “By basically adopting changes like making encryption a universal standard—where all communications are encrypted by default—we can end mass surveillance not just in the United States but around the world.”

That is false, for a myriad of reasons, and indeed is contradicted by the first excerpt I cited.

What privacy/surveillance commentators evidently keep forgetting is:

  • There are many kinds of privacy-destroying information. I think people frequently overlook just how many kinds there are.
  • Many kinds of organization capture that information, can share it with each other, and gain benefits from eroding or destroying privacy. Similarly, I think people overlook just how pervasive the incentive is to snoop.
  • Privacy is invaded through a variety of analytic techniques applied to that information.

So closing down a few vectors of privacy attack doesn’t solve the underlying problem at all.

Worst of all, commentators forget that the correct metric for danger is not just harmful information use, but chilling effects on the exercise of ordinary liberties. But in the interest of space, I won’t reiterate that argument in this post.

Perhaps I can refresh your memory why each of those bulleted claims is correct. Major categories of privacy-destroying information (raw or derived) include:

  • The actual content of your communications – phone calls, email, social media posts and more.
  • The metadata of your communications — who you communicate with, when, how long, etc.
  • What you read, watch, surf to or otherwise pay attention to.
  • Your purchases, sales and other transactions.
  • Video images, via stationary cameras, license plate readers in police cars, drones or just ordinary consumer photography.
  • Monitoring via the devices you carry, such as phones or medical monitors.
  • Your health and physical state, via those devices, but also inferred from, for example, your transactions or search engine entries.
  • Your state of mind, which can be inferred to various extents from almost any of the other information areas.
  • Your location and movements, ditto. Insurance companies also want to put monitors in cars to track your driving behavior in detail.

Of course, these categories overlap. For example, information about your movements can be derived not just from your mobile phone, but also from your transactions, from surveillance cameras, and from the health-monitoring devices that are likely to become much more pervasive in the future.

So who has reason to invade your privacy? Unfortunately, the answer boils down to “just about everybody”. In particular:

  • Any internet or telecom business would like to know, in great detail, what you are doing with their offerings, along with any other information that might influence what you’re apt to buy or do next.
  • Anybody who markets or sells to consumers wants to know similar things.
  • Similar things are true of anybody who worries about credit or insurance risk.
  • Anybody who worries about fraud wants to know who you’re connected to, and also wants to match you against any known patterns of fraud-related behavior.
  • Anybody who hires employees wants to know who might be likely to work hard, get sick or quit.
  • Similarly, they’d like to know who does or might engage in employee misconduct.
  • Medical researchers and caregivers have some of the most admirable reasons for wanting to violate privacy.

And that’s even without mentioning the most obvious suspects — law enforcement and national security of many kinds, who can be presumed to in at least certain cases be able to get any information that’s available to any other organization.

Finally, my sense is:

  • People appreciate the potential of fancy-schmantzy language and image recognition.
  • The graph analysis done on telecom metadata is so simple that people generally “get” what’s going on.
  • Despite all the “big data analytics” hype, commentators tend to forget just how powerful machine learning/predictive analytics privacy intrusions could be. Those psychographic clustering techniques devised to support advertising and personalization could be applied in much more sinister ways as well.

Categories: Other

Webinar: 21st Century Education Goes Digital with Oracle WebCenter

Learn how The Digital Campus with WebCenter can address top-of-mind issues: creating exceptional digital learning experiences, putting content in context for the user, and optimizing business processes.

The global education market is undergoing a fundamental transformation, from the printed textbook and physical classroom to newer digital, online and mobile experiences. Today, students can learn anywhere, anytime, from anyone on any device, bridging administrative and academic systems into a single universal view.

Oracle WebCenter is at the center of innovation and engagement for any digital enterprise looking to empower exceptional experiences for students, faculty, administrators and researchers. It powerfully connects people, processes, and information with the most complete portfolio of portal, content management, Web experience management and collaboration technologies to enable student success.

Join this special event featuring the University of Pretoria, Fishbowl Solutions and Oracle, whose experts will illustrate successful design patterns and solution delivery for:

  • Student Portals. Create rich, interactive student experiences
  • Digital Repository. Deliver advanced content capture, tagging and sharing while securing enterprise data
  • Admissions. Leverage image capture and business process design to enable improved self-service

Attendees will benefit from the use-case insights and strategies of a world-renowned university, as well as a pre-built solution approach from Oracle and solutions partner Fishbowl, to enable a truly modern digital campus.

Audio information:

Dial-in numbers: U.S./Canada: 877-698-7943 (toll free); International: 706-679-0060 (chargeable)
Passcode: solutions2

Date: Sep 11, 2014, 10:00 AM PT / 01:00 PM ET


The post Webinar: 21st Century Education Goes Digital with Oracle WebCenter appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

An idealized log management and analysis system — from whom?

DBMS2 - Sun, 2014-09-07 06:38

I’ve talked with many companies recently that believe they are:

  • Focused on building a great data management and analytic stack for log management …
  • … unlike all the other companies that might be saying the same thing :)
  • … and certainly unlike expensive, poorly-scalable Splunk …
  • … and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.

At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.

Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.

A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottom-up approach might start with:

  • Agents for any kind of machine that emits streams of data.
  • Parsers that:
    • Immediately identify explicit name-value pairs in popular formats such as JSON or XML.
    • Also immediately extract a significant fraction of all implicit fields in text strings — timestamps for sure, but also a lot else. (Splunk is the current gold standard for such capabilities.)
    • Allow you to easily write rules for more such extractions. (A minimal parsing sketch follows this list.)
  • Immediate indexing in line with everything the parsers do.
  • Easy import of log files, relational tables, and other relevant data structures.
  • Queries that can exploit all the indexes, at least up to the functionality level of SQL 2003 analytics (including windowing) and StreamSQL, of course with …
  • … blazing scalable performance.
  • Strong workload management and concurrent performance support. (Teradata is the gold standard for such capabilities in the analytic sphere.)
  • Various other mature-DBMS features, e.g. in backup, manageability, and uptime.
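
To make the parsing bullets above a bit more concrete, here is a minimal sketch in Python of the kind of extraction being described: explicit name-value pairs are taken from JSON lines, while a timestamp and any key=value tokens are pulled out of free text. The regexes, field names, and sample line are illustrative assumptions, not any vendor’s actual rules.

```python
import json
import re

# Illustrative patterns only; a real parser would ship with far more rules.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")
KEY_VALUE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_line(line: str) -> dict:
    """Turn one raw log line into a dict of indexable fields."""
    line = line.strip()
    # Explicit name-value pairs: JSON lines can be indexed as-is.
    if line.startswith("{"):
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            pass  # fall through and treat it as free text
    # Implicit fields in free text: a timestamp plus any key=value tokens.
    fields = {"raw": line}
    match = TIMESTAMP.search(line)
    if match:
        fields["timestamp"] = match.group(0)
    for key, value in KEY_VALUE.findall(line):
        fields[key] = value.strip('"')
    return fields

print(parse_line('2014-09-07 06:38:12 level=WARN user=jdoe latency_ms=947'))
# -> includes 'timestamp', 'level', 'user', 'latency_ms' and the raw line
```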

Further, there would be numerous styles of business intelligence interface, at least including:

  • Generic BI like we generally see for tabular data.
  • Constantly-changing displays of streaming data.
  • BI with an event-series orientation (see the sessionization sketch after this list).
  • Strong alerting.
  • Mobile versions of everything.
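
As a hedged illustration of the event-series orientation mentioned above, the sketch below groups timestamped events into sessions whenever the gap between consecutive events exceeds a threshold. The function name and the 30-minute default gap are assumptions for illustration; this is the sort of primitive an event-series BI layer would build on, not any particular product’s implementation.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Split a list of datetimes into sessions separated by more than `gap`."""
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)  # close the previous session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

events = [datetime(2014, 9, 7, 6, 0), datetime(2014, 9, 7, 6, 10),
          datetime(2014, 9, 7, 9, 0)]
print(len(sessionize(events)))  # 2 sessions: the three-hour gap splits them
```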

And there would be good support for quick-turnaround, easily-operationalized predictive analytics, of the sort that’s fairly central to the visions for Kiji and Spark.

The data management part of that is particularly hard, in that:

  • Different architectures seem naturally well-suited for different parts of the problem.
  • Maturing a new data management product is always difficult, costly and slow.

My thoughts on strengths and weaknesses of some obvious log data management contenders start:

  • Oracle, IBM, and Microsoft have a lot of heft in all things database. But while each of those vendors has great resources and occasionally impressive pieces of new database engineering, none shows much evidence of framing, let alone solving, the problem in the right way(s).
  • SAP owns Sybase, HANA, several old CEP companies, and Business Objects. Add them to the Oracle/IBM/Microsoft list.
  • Teradata has a lot going for them. Their core analytic data management strengths are obvious. They’ve owned Aster for a while, and Aster innovated nPath quite some time ago. They recently added Hadapt, a leader in schema-on-need, as well as Revelytix, which has some good ideas in dataset management. Like most other DBMS vendors, however, Teradata doesn’t yet have much of a story for streaming data, and anyhow the most optimistic case for Teradata involves the difficult task of stitching together disparate data management technologies.
  • HP Vertica has a decent position as well. Probably more proven in general concurrent, scalable performance than others in their peer group (Netezza, Greenplum, et al.), Vertica also was relatively early in innovations relevant to log analysis, including a range of time series/event series features and its own schema-on-need effort. Vertica was also founded by people who were streaming pioneers (there were heavily overlapping groups of academics behind StreamBase, Vertica and VoltDB), but it’s not clear how that background is reflected in the present Vertica product.
  • Splunk, of course, has a complete stack. At the data acquisition and parsing layers, it’s second to none, and it has a considerable set of log-appropriate BI capabilities as well. And for data management it in effect is stitching together two different inverted-list data stores, plus Hadoop.
  • Hadoop distribution vendors such as Cloudera, MapR or Hortonworks typically bundle a range of relevant capabilities. HDFS (Hadoop Distributed File System) is the default place to dump entire logs. In most distros, Spark offers a new approach to streaming. Impala, Drill and so on offer query. Flume gathers the log data in the first place. But a lot of the cooler capabilities are immature or unproven, and in some cases that’s putting it mildly.

In the interest of length, I’ll omit discussion of smaller vendors, except to say that Platfora’s integrated-stack event series analytics story deserves attention, and I’m disappointed that I never hear about Sumo Logic. And I don’t know a lot about companies positioned as SIEM (Security Information and Event Management), especially now that SenSage has left the scene.

Categories: Other

Migrating Existing PeopleSoft Attachments into the Managed Attachments Solution

This post comes from Fishbowl’s Mark Heupel. Mark is an Oracle WebCenter consultant, and he has worked on a few different projects over the last year helping customers integrate WebCenter with Oracle E-Business Suite and PeopleSoft. One of WebCenter’s strengths is that it provides these integrations out-of-the-box, including a document imaging integration to automate invoice processing with WebCenter’s capture, forms recognition and imaging capabilities, as well as workflows leveraging Oracle Business Process Management. Mark discusses WebCenter’s integration with PeopleSoft and its managed attachments solution below.

Application Integration

Oracle’s Managed Attachments solution enables business users in PeopleSoft to attach, scan, and retrieve document attachments stored in an Oracle WebCenter Content Server repository.

One of the issues that our clients face when moving to Oracle’s Managed Attachments solution is determining what to do with the attachments that already exist in PeopleSoft. We at Fishbowl have come up with a method to migrate these attachments into WebCenter Content in bulk while still maintaining the attachments’ context within PeopleSoft.

A high-level view of the solution is as follows. Queries are written on the PeopleSoft side to export each of the attachments, as well as a file containing each attachment’s metadata and PeopleSoft contextual information, to a network share. This is a task done by a PeopleSoft administrator. We then use our Enterprise Batchloader product to bulk load these files into WebCenter Content. We’ve written a customization that overrides the set of services that qualify for Managed Attachments to include our Enterprise Batchloader service. Since the context of the attachments is included in the metadata file, the Enterprise Batchloader check-ins work in the same way that a normal check-in from Managed Attachments would and the attachments retain their PeopleSoft context. Let’s get into the details of how this works.

Managed Attachments Overview

In order to understand the migration strategy, we first need to understand how Managed Attachments works under the covers. The important piece to know for this migration is that the table that stores the Managed Attachment object information on the WebCenter side is the AFObjects table. This table stores the PeopleSoft context information as well as the dDocName of each of the attachments currently being stored in WebCenter. Here is an example of what the AFObjects table looks like:

[Image: sample rows from the AFObjects table]

Each row in this table represents one PeopleSoft attachment being managed in WebCenter Content. The dAFApplication, dAFBusinessObjectType, and dAFBusinessObject fields make up the context for where the attachment is located in PeopleSoft. The dAFApplication field represents the application, the dAFBusinessObjectType field represents the page, and the dAFBusinessObject field is a pipe-delimited list of the primary key values from the page where the attachment is located in PeopleSoft. The dDocName field is simply the dDocName of the content item in WebCenter.

When a user clicks the Managed Attachments link on the PeopleSoft screen, a request is made over to WebCenter that contains the contextual page information from PeopleSoft (dAFApplication, dAFBusinessObjectType, and dAFBusinessObject). Using this contextual information, a query is then made against the AFObjects table to find the content IDs of the attachments that should be returned to the user. A similar request is made when a user checks in a document through the Managed Attachments screen in PeopleSoft. The PeopleSoft context information is sent to WebCenter, the document is checked in, and then a row is inserted into the AFObjects table that contains the PeopleSoft contextual information as well as the dDocName of the newly checked-in document.
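
As a rough sketch of that lookup (not Fishbowl’s or Oracle’s actual code, and with made-up application and key values), the retrieval step amounts to filtering AFObjects rows on the three context fields and returning the matching dDocNames:

```python
from typing import Dict, List

def attachments_for_context(af_objects: List[Dict[str, str]],
                            application: str,
                            business_object_type: str,
                            business_object: str) -> List[str]:
    """Return dDocNames of content items attached to one PeopleSoft context."""
    return [
        row["dDocName"]
        for row in af_objects
        if row["dAFApplication"] == application
        and row["dAFBusinessObjectType"] == business_object_type
        and row["dAFBusinessObject"] == business_object
    ]

# Hypothetical rows; the real AFObjects table lives in the WebCenter schema.
rows = [
    {"dAFApplication": "PSFT", "dAFBusinessObjectType": "PERSONAL_DATA",
     "dAFBusinessObject": "KE0001|12345", "dDocName": "FB_000123"},
]
print(attachments_for_context(rows, "PSFT", "PERSONAL_DATA", "KE0001|12345"))
```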

Loading Content into WebCenter

In order to be able to successfully load a large number of content items into WebCenter, while still maintaining the correct PeopleSoft context, we had to write a customization to hook into the existing Managed Attachments check-in functionality. The AppAdapterCore component, one of the two components installed on WebCenter for Managed Attachments, contains the core Managed Attachments code. This component contains a list of services such as CHECKIN_NEW that, when called with the PeopleSoft contextual information in the binder (dAFApplication, dAFObjectType, and dAFObject), executes the query that inserts a row into the AFObjects table. The customization that we wrote overrides the list of services specified in the AppAdapterCore component to include our Enterprise Batchloader check-in services. By doing so, we’re able to hook into the same insert query that Managed Attachments already uses, assuming we have placed the correct PeopleSoft context information in the binder.

Here is an example of what a standard Enterprise Batchloader blf (batch load file) would look like:

[Image: a standard Enterprise Batchloader batch load file]
As you can see, the file simply contains the action to take (insert), the location of the primary file, and the required metadata fields for WebCenter. In order to assign the correct PeopleSoft context we simply need to specify the dAFApplication, dAFObjectType, and dAFObject fields in the blf file:

[Image: batch load file with the PeopleSoft context fields added]
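
Since the screenshot isn’t reproduced here, the following is a hedged sketch of what such an entry might look like. It follows the name=value, <<EOD>>-terminated convention of standard WebCenter Content batch load files; the exact syntax Enterprise Batchloader expects, and every value shown, are assumptions for illustration only.

```
Action=insert
dDocName=FB_000123
dDocTitle=Offer Letter
dDocType=Document
dSecurityGroup=Public
dDocAuthor=sysadmin
primaryFile=//fileshare/psft_export/offer_letter_12345.pdf
dAFApplication=PSFT
dAFObjectType=PERSONAL_DATA
dAFObject=KE0001|12345
<<EOD>>
```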

This effectively places each of those fields into the binder in WebCenter. When Enterprise Batchloader is run and performs its check-ins into WebCenter, the Managed Attachments functionality gets called and a row is inserted into the AFObjects table for each attachment that specifies the PeopleSoft context information. As long as the correct PeopleSoft contextual information is placed into the Enterprise Batchloader blf file, we’re able to bulk load as many attachments as needed into WebCenter while still retaining the correct PeopleSoft context information for use with the Managed Attachments solution.

I hope this provides you with an example of how your existing PeopleSoft Managed Attachments content could be migrated to WebCenter. After all, getting this content into WebCenter has many additional benefits, such as version control, renditions, retention management and the ability to surface this content to WebCenter-based mobile apps and portals. If you have questions or would like to engage with Fishbowl on such projects, please email info@fishbowlsolutions.com.

 

The post Migrating Existing PeopleSoft Attachments into the Managed Attachments Solution appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Notes from a visit to Teradata

DBMS2 - Sun, 2014-08-31 03:17

I spent a day with Teradata in Rancho Bernardo last week. Most of what we discussed is confidential, but I think the non-confidential parts and my general impressions add up to enough for a post.

First, let’s catch up with some personnel gossip. So far as I can tell:

  • Scott Gnau runs most of Teradata’s development, product management, and product marketing, the big exception being that …
  • … Darryl McDonald runs the apps part (Aprimo and so on), and is no longer head of marketing.
  • Oliver Ratzesberger runs Teradata’s software development.
  • Jeff Carter has returned to his roots and runs the hardware part, in place of Carson Schmidt.
  • Aster founders Mayank Bawa and Tasso Argyros have left Teradata (perhaps some earn-out period ended).
  • Carson is temporarily running Aster development (in place of Mayank), and has some sort of evangelism role waiting after that.
  • With the acquisition of Hadapt, Teradata gets some attention from Dan Abadi. Also, they’re retaining Justin Borgman.

The biggest change in my general impressions about Teradata is that they’re having smart thoughts about the cloud. At least, Oliver is. All details are confidential, and I wouldn’t necessarily expect them to become clear even in October (which once again is the month for Teradata’s user conference). My main concern about all that is whether Teradata’s engineering team can successfully execute on Oliver’s directives. I’m optimistic, but I don’t have a lot of detail to support my good feelings.

In some quick-and-dirty positioning and sales qualification notes, which crystallize what we already knew before:

  • The Teradata 1xxx series is focused on cost-per-bit.
  • The Teradata 2xxx series is focused on cost-per-query. It is commonly Teradata’s “lead” product, at least for new customers.
  • The Teradata 6xxx series is supposed to be able to do “everything”.
  • The Teradata Aster “Discovery Analytics” platform is sold mainly to customers who have a specific high-value problem to solve. (Randy Lea gave me a nice round dollar number, but I won’t share it.) I like that approach, as it obviates much of the concern about “Wait — is this strategic for us long-term, given that we also have both Teradata database and Hadoop clusters?”

Also:

  • 1xxx and 2xxx systems are meant to be I/O-constrained. 6xxx systems are meant to be constrained mainly by CPU, but every system will be I/O-constrained at some point.
  • There is at least one example of a Very Well Known organization buying Teradata’s Hadoop-only appliance despite not otherwise being a Hadoop customer. Teradata concedes, however, that this is not a common occurrence.
  • Customers are increasingly using co-location rather than their own data centers. Many colo organizations charge more or less strictly by floor space. Hence, there’s a push for maximum processing density per rack, power density and weight be damned.

Speaking of not being CPU-constrained — I heard 7-10% as an estimate for typical Hadoop utilization, and also 10-15%. While I didn’t ask, I presume these figures assume traditional MapReduce types of Hadoop workloads. I’m not sure why these figures are yet lower than eBay’s long-ago estimates of Hadoop “parallel efficiency”.

Like Carson used to do, Jeff shared a variety of hardware and networking tidbits with me. In particular:

  • Jeff is confident in Moore’s Law continuing for at least 5 more years. (I think that’s a near-consensus; the 2020s, however, are another matter.)
  • Teradata still uses SAS rather than SATA for all disk (spinning or solid-state) controllers. They’re now seeing 600-700 MB/sec/device on SSDs (Solid State Disks), up from 300-400.
  • SSD prices are down 60% over the past 6 months, vs. much slower declines previously.
  • Formerly a SanDisk/Pliant partisan, Teradata now thinks there are multiple vendors of good SSDs. (I’m not sure whether they’d be happy if I said which one they currently like best.)
  • Jeff foresees InfiniBand and Ethernet more or less merging. Right now Teradata is using a lot of 56 Gb/sec InfiniBand.

Since Oliver is now a Teradata mucky-muck, I asked about virtual data marts, an idea that he pretty much invented or at least popularized back in his eBay days. Comments included:

  • Teradata now calls them Data Labs.
  • Adoption is very high.
  • One major feature is “time boxing” — they expire after a period of time unless you renew them.
  • Analysis of virtual data mart usage is a good guide as to what you might want to add to your permanent data warehouse.

And I’ll stop here, although I hope that a couple more-focused posts will also eventually flow from the visit.

Categories: Other

Subscription Notifier Version 4.0 Enables WebCenter Users to Create Custom Content Email Notifications

Fishbowl Solutions’ Subscription Notifier has been used by many of our customers for years to manage business content stored in Oracle WebCenter Content. Subscription Notifier automatically sends email notifications based on scheduled queries. Fishbowl released version 4.0 of the product last week, and it includes several significant updates.

Now, users of Subscription Notifier can:

  • Attach native or web-viewable files to notification emails
  • Send individual notification emails for each content item
  • Configure hourly notification schedules
  • Run subscription side effects without sending emails

In addition to the latest updates, the product also offers a host of other features that enable WebCenter users to keep track of their high-value content.

You begin by naming the subscription and specifying whether emails should be sent for items matching the query. The scheduler lets you specify exactly when you want email notifications to go out (note the hourly option, new with version 4.0).

 

[Image: Subscription Notifier general settings]

The email settings specify who you want to send emails to and how they should appear to recipients. The new “Attach Content” feature gives you the option of sending web-viewable or native files, which provides a way for recipients who don’t use Oracle WebCenter to still see important files. Using the query builder is very simple and determines what content items are included in the subscription. Advanced users also have the option to write more complex queries using SQL.

[Image: Subscription Notifier email settings]

The Current Subscription Notifications page gives a summary of all subscriptions. In Version 4.0, simple changes such as enabling, disabling, or deleting subscriptions can be done here.

[Image: the Current Subscription Notifications page]

Subscription Notifier is a very useful tool for any organization that needs to keep tabs on a large amount of business content. It is part of Fishbowl’s Administration Suite, which also includes Advanced User Security Mapping, Workflow Solution Set, and Enterprise BatchLoader. This set of products works together to simplify the most common administrative tasks in Oracle WebCenter Content.

To learn more about Subscription Notifier, visit Fishbowl’s website or read the press release announcing Version 4.0.

The post Subscription Notifier Version 4.0 Enables WebCenter Users to Create Custom Content Email Notifications appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other